phoenix-user mailing list archives

From Amit Shah <amits...@gmail.com>
Subject Re: Speeding Up Group By Queries
Date Mon, 28 Mar 2016 07:32:52 GMT
Thanks, Mujtaba and James, for the replies.

Mujtaba, below are answers to your follow-up questions.

> 1. How wide is your table?

The TRANSACTIONS table has 26 columns, with a couple of columns combined
into a composite primary key.

> 2. How many region servers is your data distributed on and what's the heap
> size?

When I posted the initial readings of the query taking around 2 minutes, I
had one region server hosting 4 regions of the 10-million-row TRANSACTIONS
table. The heap size on the master server is 1 GB, while the region server
has a 3.63 GB heap.

Later I added 2 more nodes to the cluster, configured as data nodes and
region servers. After this step, the regions ended up split across two
region servers, 2 on one and 6 on the other. I didn't follow what action
caused this redistribution, or whether HBase did it automatically (the load
balancer?).
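For reference, the HBase shell can show the region distribution and whether the automatic balancer is active; a sketch, assuming shell access to the cluster (command names as in the HBase 1.x shell):

```
hbase shell
> status 'simple'    # region count per region server
> balancer_enabled   # whether the automatic load balancer is on
> balancer           # manually trigger a balancing run
```

If `balancer_enabled` reports true, the 2/6 split was most likely the balancer's doing rather than anything I did.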

> 3. Do you see lots of disk I/O on region servers during aggregation?

Yes, I see disk reads of around 607 IOs/second on the host that stores the
6 regions. Kindly find the disk I/O statistics attached as images.

> 4. Can you try your query after major compacting your table?

I triggered a compaction of the table from the HBase web UI. For some
reason, the compaction attribute on the web UI is still shown as NONE.
After these changes, the query time is down to *42 secs*.

Is compression different from compaction? Would query performance improve
by compressing the data with one of the supported algorithms? Logically it
doesn't sound right, though.
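For what it's worth, a major compaction can also be triggered from the HBase shell, and compression is indeed a separate, per-column-family setting. A sketch, assuming the Phoenix default column family name '0' and that Snappy is installed on the cluster:

```
hbase shell
> major_compact 'TRANSACTIONS'
> disable 'TRANSACTIONS'
> alter 'TRANSACTIONS', {NAME => '0', COMPRESSION => 'SNAPPY'}
> enable 'TRANSACTIONS'
```

Compression trades CPU for disk I/O, so on a disk-bound scan like this one it might actually help; note, though, that existing HFiles are only rewritten with compression on the next major compaction.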

> Can you also replace log4j.properties with the attached one and reply back
> with phoenix.log created by executing your query in sqlline?

After replacing log4j.properties, I captured the logs for the GROUP BY
query execution; they are attached.


James,
If I run the exact statements you pasted, I see the index being used, but
if I EXPLAIN the query against the pre-loaded TRANSACTIONS table, the index
is not used. Perhaps the query plan changes based on whether the table has
data or not.
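To double-check, Phoenix's index hint can force the optimizer onto the covered index, and a filter on the leading index column avoids the full index scan. The hint syntax below is standard Phoenix; the country values in the WHERE clause are only placeholders:

```sql
-- Force the covered index and verify the plan with EXPLAIN
EXPLAIN SELECT /*+ INDEX(TRANSACTIONS TRANSACTIONS_COUNTRY_INDEX) */
    SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
FROM TRANSACTIONS
GROUP BY T_COUNTRY;

-- A WHERE clause on the leading index column turns the full index
-- scan into a range scan (placeholder values)
SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
FROM TRANSACTIONS
WHERE T_COUNTRY IN ('US', 'UK')
GROUP BY T_COUNTRY;
```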

The query time is down to 42 secs right now. Let me know if you have more
suggestions to improve it further.

Thanks,
Amit.

On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestaylor@apache.org>
wrote:

> Hi Amit,
> Using 4.7.0-HBase-1.1 release, I see the index being used for that query
> (see below). An index will help some, as the aggregation can be done in
> place as the scan over the index is occurring (as opposed to having to hold
> the distinct values found during grouping in memory per chunk of work and
> sorting each chunk on the client). It's not going to prevent the entire
> index from being scanned though. You'll need a WHERE clause to prevent that.
>
> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
> key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
> No rows affected (1.32 seconds)
> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
> TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
> No rows affected (6.452 seconds)
> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>
> +--------------------------------------------------------------------------+
> |                                   PLAN                                   |
> +--------------------------------------------------------------------------+
> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
> |     SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]         |
> | CLIENT MERGE SORT                                                        |
> +--------------------------------------------------------------------------+
> 3 rows selected (0.028 seconds)
>
> Thanks,
> James
>
>
> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujtaba@apache.org>
> wrote:
>
>> That seems excessively slow for 10M rows, which should take on the order
>> of a few seconds at most without an index. 1. How wide is your table? 2.
>> How many region servers is your data distributed on and what's the heap
>> size? 3. Do you see lots of disk I/O on region servers during
>> aggregation? 4. Can you try your query after major compacting your table?
>>
>> Can you also replace log4j.properties with the attached one and reply
>> back with phoenix.log created by executing your query in sqlline?
>>
>> Thanks,
>> Mujtaba
>>
>>
>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits.84@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>> (version 4.6), deployed through Cloudera, for our OLAP workload. I have
>>> a table that has 10 million rows. I try to execute the below roll-up
>>> query and it takes around 2 mins to return 1,850 rows.
>>>
>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
>>> T_COUNTRY;
>>>
>>> I tried applying the "joining with indices" example given on the website
>>> <https://phoenix.apache.org/joins.html> on the TRANSACTIONS table by
>>> creating an index on the grouped by column as below but that doesn't help.
>>>
>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>
>>> This index is not getting used when the query is executed. The query
>>> plan is as below
>>>
>>> +-------------------------------------------------------------+
>>> |                            PLAN                             |
>>> +-------------------------------------------------------------+
>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>> |     SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]      |
>>> | CLIENT MERGE SORT                                           |
>>> +-------------------------------------------------------------+
>>>
>>> Theoretically can secondary indexes help improve the performance of
>>> group by queries?
>>>
>>> Any suggestions on what are different options in phoenix I could try out
>>> to speed up GROUP BY queries?
>>>
>>> Thanks,
>>> Amit.
>>>
>>
>>
>
