phoenix-user mailing list archives

From Amit Shah <amits...@gmail.com>
Subject Re: Speeding Up Group By Queries
Date Tue, 29 Mar 2016 13:45:05 GMT
Hi Mujtaba,

I did try the two optimization techniques, recreating the table and then
loading it again with 10 million records. They do not seem to help much in
terms of timings. Kindly find the Phoenix log file attached. Let me know if
I am missing anything.

Thanks,
Amit.

On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujtaba@apache.org> wrote:

> Here's a chart of the time each of the parallel scans takes after the
> split. On the RS where data is not read from disk, the scan returns in ~20
> secs, but for the RS that hosts 6 regions it's ~45 secs.
>
> [image: Inline image 2]
>
>> Yes, I see disk reads at 607 IOs/second on the host that stores 6
>> regions
>>
>
> Two things you should try to reduce disk reads, or maybe a combination of
> both:
>
> 1. Keep only the columns used in your GROUP BY query in a separate column
> family:
>
>   CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>   GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>
> 2. Turn on Snappy compression for your table:
>
>   ALTER TABLE T SET COMPRESSION='SNAPPY'
>
> followed by a major compaction.
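>
> To verify that the compression setting took effect, you can check the
> table descriptor from the HBase shell; the output should include something
> like the following (abbreviated):
>
>   hbase> describe 'T'
>   ... {NAME => 'GRPBYCF', COMPRESSION => 'SNAPPY', ...}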
>
> I tried to compact the table from the hbase web UI
>>
>
> You need to run *major_compact* from the HBase shell. From the UI it's a
> minor compaction.
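>
> For example, using your table name:
>
>   hbase> major_compact 'TRANSACTIONS'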
>
> - mujtaba
>
> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits.84@gmail.com> wrote:
>
>> Thanks Mujtaba and James for replying back.
>>
>> Mujtaba, Below are details to your follow up queries
>>
>>> 1. How wide is your table?
>>
>> I have 26 columns in the TRANSACTIONS table, with a couple of columns
>> combined into a composite primary key.
>>
>>> 2. How many region servers is your data distributed on, and what's the
>>> heap size?
>>
>> When I posted the initial readings, with the query taking around 2
>> minutes, I had one region server storing the 4 regions of the
>> 10-million-record TRANSACTIONS table. The heap size on the master server
>> is 1 GB, while the region server has a 3.63 GB heap.
>>
>> Later I added 2 more nodes to the cluster and configured them as data
>> nodes and region servers. After this step, the regions were split across
>> two region servers, with 2 regions on one region server and 6 on the
>> other. I am not sure which action caused this region split, or whether it
>> was done automatically by HBase (the load balancer?).
>>
>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>
>> Yes, I see disk reads at 607 IOs/second on the host that stores 6
>> regions. Kindly find the disk I/O statistics attached as images.
>>
>>> 4. Can you try your query after major compacting your table?
>>
>> I tried to compact the table from the HBase web UI. For some reason, the
>> compaction attribute for the table on the web UI is still shown as NONE.
>> After these changes, the query time is down to *42 secs*.
>> Is compression different from compaction? Would query performance improve
>> by compressing the data with one of the algorithms? Logically it doesn't
>> seem right, though.
>>
>>> Can you also replace log4j.properties with the attached one and reply
>>> back with phoenix.log created by executing your query in sqlline?
>>
>>
>> After replacing log4j.properties, I captured the logs for the GROUP BY
>> query execution and attached them.
>>
>>
>> James,
>> If I run the queries that you pasted, I see the index getting used, but
>> if I explain the query plan on the pre-loaded TRANSACTIONS table, I do
>> not see the index being used. Perhaps the query plan changes based on
>> whether the table has data or not.
>>
>> The query time is down to 42 secs right now. Let me know if you have more
>> suggestions on how to improve it further.
>>
>> Thanks,
>> Amit.
>>
>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestaylor@apache.org>
>> wrote:
>>
>>> Hi Amit,
>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for that
>>> query (see below). An index will help some, as the aggregation can be
>>> done in place while the scan over the index is occurring (as opposed to
>>> having to hold the distinct values found during grouping in memory per
>>> chunk of work and sort each chunk on the client). It's not going to
>>> prevent the entire index from being scanned, though. You'll need a WHERE
>>> clause to prevent that.
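>>>
>>> For instance, a query along these lines (the country value is just an
>>> illustration) filters on the leading column of the index, so only the
>>> matching range of the index is scanned:
>>>
>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>> WHERE T_COUNTRY = 'US' GROUP BY T_COUNTRY;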
>>>
>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
>>> key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>>> No rows affected (1.32 seconds)
>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
>>> TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>> No rows affected (6.452 seconds)
>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>
>>> +--------------------------------------------------------------------------+
>>> |                                   PLAN                                   |
>>> +--------------------------------------------------------------------------+
>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>> |     SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]         |
>>> | CLIENT MERGE SORT                                                        |
>>> +--------------------------------------------------------------------------+
>>> 3 rows selected (0.028 seconds)
>>>
>>> Thanks,
>>> James
>>>
>>>
>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujtaba@apache.org>
>>> wrote:
>>>
>>>> That seems excessively slow for 10M rows, which should take on the
>>>> order of a few seconds at most without an index.
>>>>
>>>> 1. How wide is your table?
>>>> 2. How many region servers is your data distributed on, and what's the
>>>> heap size?
>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>> 4. Can you try your query after major compacting your table?
>>>>
>>>> Can you also replace log4j.properties with the attached one and reply
>>>> back with phoenix.log created by executing your query in sqlline?
>>>>
>>>> Thanks,
>>>> Mujtaba
>>>>
>>>>
>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits.84@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>> (version 4.6), deployed through Cloudera, for our OLAP workload. I
>>>>> have a table with 10 million rows. When I execute the roll-up query
>>>>> below, it takes around 2 mins to return 1,850 rows.
>>>>>
>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
>>>>> T_COUNTRY;
>>>>>
>>>>> I tried applying the "joining with indices" example given on the
>>>>> website <https://phoenix.apache.org/joins.html> to the TRANSACTIONS
>>>>> table by creating an index on the grouped-by column, as below, but
>>>>> that doesn't help.
>>>>>
>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>>>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>
>>>>> This index is not getting used when the query is executed. The query
>>>>> plan is as below
>>>>>
>>>>> +--------------------------------------------------------------+
>>>>> |                             PLAN                             |
>>>>> +--------------------------------------------------------------+
>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS  |
>>>>> |     SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]       |
>>>>> | CLIENT MERGE SORT                                            |
>>>>> +--------------------------------------------------------------+
>>>>>
>>>>> Theoretically, can secondary indexes help improve the performance of
>>>>> GROUP BY queries?
>>>>>
>>>>> Any suggestions on other options in Phoenix I could try out to speed
>>>>> up GROUP BY queries?
>>>>>
>>>>> Thanks,
>>>>> Amit.
>>>>>
>>>>
>>>>
>>>
>>
>
