phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From NaHeon Kim <honey.and...@gmail.com>
Subject Re: Select date range performance issue
Date Fri, 24 Feb 2017 10:57:36 GMT
Jonathan,

So much helpful explanations!
Skip scan and index including any column in a query make a big difference.
I'll try row timestamp more and see what happens.

Thanks,
NaHeon

2017-02-24 13:25 GMT+09:00 Jonathan Leech <jonathaz@gmail.com>:

> 1. No, I am not confused. A skip scan would "skip" over entire ranges of
> obj_id and all create_dt values for it. This will only be effective if
> there are many less distinct values of obj_id than there are total rows. If
> there are too many distinct obj_ids then it either wont speed the query up
> at all, or not enough, but it's simple to try it and see.
>
> 2. Your index isnt used because it doesn't contain the other columns used
> in the query; e.g your query is isn't "covered". You get the column(s)
> defined in the index + anything in the rowkey. You can also use the
> "include" keyword to add other columns to the index. Alternatively, you can
> look at "local" indexes, or it may be possible to nest the query with a
> sub-select to fetch the desired primary key values from the index by
> create_dt and the others from the main table.
>
> 3. No, not all. Phoenix will assign the the internal hbase timestamp of
> the row to whatever you set to create_dt. It can also automatically set it
> to the current time when you create the row, if you want it to. This has
> other implications; e.g if you set a TTL, versions, etc in hbase. It can
> speed up queries, especially those that execute on the most recent data
> written, but prior to hbase compaction. Advanced stuff and performance is
> highly dependent on your specific use case and hbase compaction settings...
>
> On Feb 23, 2017, at 7:59 PM, NaHeon Kim <honey.and.sw@gmail.com> wrote:
>
> Thanks for suggestion.
>
> Here's further questions:
> 1. create_dt (not obj_id, I think you confused) would have large sets of
> date, so SKIP_SCAN hint might be not useful.
>
> 2. I created secondary index on create_dt
>        create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT;
>
> However, EXPLAIN still shows query plan of FULL SCAN.
> Giving index hint on SELECT doesn't work as well.
>
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
> NEWS_KEYWORD_COUNT
>    SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>    SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
>
> 3. ROW_TIMESTAMP is time of current query execution time, right?
> Then it's not a right choice. :-(
>
>
> 2017-02-24 1:54 GMT+09:00 Jonathan Leech <jonathaz@gmail.com>:
>
>> If there are not a large number of distinct values of obj_id, try a
>> SKIP_SCAN hint. Otherwise, the secondary index should work, make sure it's
>> actually used via explain. Finally, you might try the ROW_TIMESTAMP feature
>> if it fits your use case.
>>
>> On Feb 22, 2017, at 11:30 PM, NaHeon Kim <honey.and.sw@gmail.com> wrote:
>>
>> Hi all,
>> I've seen performance problem when selecting rows within date range.
>>
>> My table schema is:
>>
>> CREATE TABLE MY_TABLE (
>>    OBJ_ID varchar(20) not null,
>>    CREATE_DT timestamp not null,
>>    KEYWORD varchar(100) not null,
>>    COUNT integer,
>>    CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
>> );
>>
>> MY_TABLE has almost 5,200,000 rows,
>> CREATE_DT has about 6 months range.
>>
>> And executed query:
>>
>> SELECT KEYWORD, SUM(COUNT)
>> FROM MY_TABLE
>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>> GROUP BY KEYWORD;
>>
>> It tooks 46 seconds, too slow than expected, cause CREATE_DT is one of
>> row key.
>> I created a secondary index on CREATE_DT but there's no improvement.
>>
>> Query plan looks weird:
>> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
>> NEWS_KEYWORD_COUNT
>>      SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000'
>> AND CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>>      SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
>> CLIENT MERGE SORT
>> CLIENT 100 ROW LIMIT
>>
>> BUT If CREATE_DT comes first of row key, plan says range scan will be
>> done.
>>
>> Any suggestion? : )
>>
>> Thanks,
>> NaHeon
>>
>>
>

Mime
View raw message