phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Leech <jonat...@gmail.com>
Subject Re: Select date range performance issue
Date Fri, 24 Feb 2017 04:25:32 GMT
1. No, I am not confused. A skip scan would "skip" over entire ranges of obj_id and all create_dt
values for it. This will only be effective if there are many less distinct values of obj_id
than there are total rows. If there are too many distinct obj_ids then it either wont speed
the query up at all, or not enough, but it's simple to try it and see.

2. Your index isnt used because it doesn't contain the other columns used in the query; e.g
your query is isn't "covered". You get the column(s) defined in the index + anything in the
rowkey. You can also use the "include" keyword to add other columns to the index. Alternatively,
you can look at "local" indexes, or it may be possible to nest the query with a sub-select
to fetch the desired primary key values from the index by create_dt and the others from the
main table.

3. No, not all. Phoenix will assign the the internal hbase timestamp of the row to whatever
you set to create_dt. It can also automatically set it to the current time when you create
the row, if you want it to. This has other implications; e.g if you set a TTL, versions, etc
in hbase. It can speed up queries, especially those that execute on the most recent data written,
but prior to hbase compaction. Advanced stuff and performance is highly dependent on your
specific use case and hbase compaction settings...

> On Feb 23, 2017, at 7:59 PM, NaHeon Kim <honey.and.sw@gmail.com> wrote:
> 
> Thanks for suggestion.
> 
> Here's further questions:
> 1. create_dt (not obj_id, I think you confused) would have large sets of date, so SKIP_SCAN
hint might be not useful.
> 
> 2. I created secondary index on create_dt
>        create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT;
> 
> However, EXPLAIN still shows query plan of FULL SCAN.
> Giving index hint on SELECT doesn't work as well.
> 
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
>    SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND CREATE_DT
< TIMESTAMP '2017-04-01 00:00:00.000')
>    SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
> 
> 3. ROW_TIMESTAMP is time of current query execution time, right?
> Then it's not a right choice. :-(
> 
> 
> 2017-02-24 1:54 GMT+09:00 Jonathan Leech <jonathaz@gmail.com>:
>> If there are not a large number of distinct values of obj_id, try a SKIP_SCAN hint.
Otherwise, the secondary index should work, make sure it's actually used via explain. Finally,
you might try the ROW_TIMESTAMP feature if it fits your use case.
>> 
>>> On Feb 22, 2017, at 11:30 PM, NaHeon Kim <honey.and.sw@gmail.com> wrote:
>>> 
>>> Hi all,
>>> I've seen performance problem when selecting rows within date range.
>>> 
>>> My table schema is:
>>> 
>>> CREATE TABLE MY_TABLE (
>>>    OBJ_ID varchar(20) not null,
>>>    CREATE_DT timestamp not null,
>>>    KEYWORD varchar(100) not null,
>>>    COUNT integer,
>>>    CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
>>> );
>>> 
>>> MY_TABLE has almost 5,200,000 rows,
>>> CREATE_DT has about 6 months range.
>>> 
>>> And executed query:
>>> 
>>> SELECT KEYWORD, SUM(COUNT)
>>> FROM MY_TABLE
>>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>>> GROUP BY KEYWORD;
>>> 
>>> It tooks 46 seconds, too slow than expected, cause CREATE_DT is one of row key.
>>> I created a secondary index on CREATE_DT but there's no improvement.
>>> 
>>> Query plan looks weird:
>>> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
>>>      SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>>>      SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
>>> CLIENT MERGE SORT
>>> CLIENT 100 ROW LIMIT
>>> 
>>> BUT If CREATE_DT comes first of row key, plan says range scan will be done.
>>> 
>>> Any suggestion? : )
>>> 
>>> Thanks,
>>> NaHeon
>>> 
> 

Mime
View raw message