phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From NaHeon Kim <honey.and...@gmail.com>
Subject Re: Ask performance advice
Date Fri, 07 Apr 2017 02:39:09 GMT
Thank you all.

1.
The other query is:

select keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and *obj_id in ( obj_ids comes here... )*
and obj_grp = '100'
group by keyword
order by count desc;

This query is pretty fast because of the first two PKs.
The reason obj_id comes at second PK is, we first assumed 'select with
obj_id' will be called much more often, and then 'select with obj_grp' is
also required.

2.
4.8.0-HBase-1.1 version is used.
I'm testing on HBase server for our product, so that upgrading looks not
easy. : (
If there definitely exist index bug fixes, I could try though.

3.
With Jonathan's reply and optional PK, this schema looks worthy to try:
create table my_table (
   obj_grp varchar(50),
   obj_id varchar(50) not null,
   regymdt timestamp not null,
   keyword varchar not null,
   cnt integer,
   post_tp varchar(20)
   constraint pk primary key (obj_grp, obj_id, regymdt, keyword)
)

Thanks,
NaHeon




2017-04-05 15:17 GMT+09:00 Jonathan Leech <jonathaz@gmail.com>:

> Also, your schema as defined seems to have a pretty common hbase
> anti-pattern, using a steadily increasing value as the first part of the
> rowkey (assuming your data is also loaded in order of time). This will lead
> to region hotspotting on the load, region splitting, etc. Additionally,
> your queries may not take advantage of the parallelism of the cluster if
> they end up on a small number of regions. If the queries are more likely to
> be on the most recent data, as is often the case, you'll end up with a
> single server doing the bulk the work in the cluster.
>
> - Jonathan
>
> On Apr 4, 2017, at 10:10 PM, James Taylor <jamestaylor@apache.org> wrote:
>
> What other queries do you want to be fast? What version of Phoenix are you
> using? Have you seen our new Tuning Guide [1]?
>
> You could try moving OBJ_ID to end of PK and adding OBJ_GRP as the
> leading PK column (it's fine to include it in PK even if it's optional).
>
> Your index should be used, though. Can you try doing an explain on the
> query with 4.10 and file a  JIRA if the index isn't being used?
>
> Thanks,
> James
>
> [1] http://phoenix.apache.org/tuning_guide.html
>
> On Tue, Apr 4, 2017 at 8:01 PM NaHeon Kim <honey.and.sw@gmail.com> wrote:
>
>> Hi all,
>>
>> My team has a Phoenix table containing over 30,000,000 rows and try to
>> speed up its query performance.
>> It'll be perfect if all queries could be done within 1~2 seconds.
>>
>> ----- table schema -----
>>
>> CREATE TABLE MY_TABLE (
>>    REGYMDT timestamp not null,
>>    OBJ_ID varchar(50) not null,
>>    KEYWORD varchar not null,
>>    OBJ_GRP  varchar(50),
>>    CNT integer,
>>    POST_TP varchar(20),
>>    CONSTRAINT PK PRIMARY KEY (REGYMDT, OBJ_ID, KEYWORD)
>> ) IMMUTABLE_ROWS = true;
>>
>> create index objgrp_reg_kwd on my_table(obj_grp, regymdt, keyword)
>> include (cnt);
>>
>> ----- tuning-needed query -----
>>
>> select keyword, sum(cnt) as count
>> from my_table
>> where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
>> to_timestamp('2017-04-03 23:00')
>> and obj_grp = '100'
>> group by keyword
>> order by count desc;
>>
>>
>> The reason why obj_grp isn't PK is that it's optional value.
>> I could use obj_grp's default value and propagate it to PK, but still not
>> think it's desired pattern.
>>
>> The query above takes more than 20 seconds!
>> It does *not use* objgrp_reg_kwd index, doing just regymdt range scan.
>>
>> Thanks in advance!
>> NaHeon
>>
>>

Mime
View raw message