phoenix-user mailing list archives

From James Taylor <jamestaylor@apache.org>
Subject Re: high cardinality aggregation query performance
Date Tue, 24 Feb 2015 08:29:32 GMT
Based on your query plan, the skip scan is keyed only on your salt bucket,
while the rest of the filtering is done by a server-side filter. That means
you're not filtering on the leading part of your primary key. We'll know
more once you post your schema, but if NETWORK, KEYWORD_ID, and CUSTOMER_ID
formed your primary key constraint, then the skip scan would work well.
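
For illustration, something along these lines (just a sketch, since we
haven't seen your schema yet; IMPRESSIONS and CLICKS are placeholders for
your actual value columns):

CREATE TABLE PERF.BIG_OLAP_DOC (
    NETWORK      VARCHAR NOT NULL,
    KEYWORD_ID   BIGINT NOT NULL,
    CUSTOMER_ID  INTEGER NOT NULL,
    IMPRESSIONS  BIGINT,   -- placeholder measure column
    CLICKS       BIGINT,   -- placeholder measure column
    CONSTRAINT pk PRIMARY KEY (NETWORK, KEYWORD_ID, CUSTOMER_ID)
) SALT_BUCKETS = 10;

With that key, the filters on NETWORK, KEYWORD_ID, and CUSTOMER_ID all land
on leading primary key columns, so the skip scan can jump directly to the
matching keys instead of filtering rows server-side.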

Thanks,
James

On Mon, Feb 23, 2015 at 5:24 PM, James Taylor <jamestaylor@apache.org>
wrote:

> Hi Gary,
> Would you mind posting your schema and query as well?
> Thanks,
> James
>
> On Mon, Feb 23, 2015 at 5:08 PM, Gary Schulte <gschulte@marinsoftware.com>
> wrote:
>
>> I am currently evaluating Phoenix 4.2.2 for an analytic workload.
>> Several of the test cases are simple aggregations over sets of varying
>> cardinality.  For example, for one of the test aggregation queries I am
>> seeing response times along the lines of:
>>
>>   fact query cardinality    response time using primary index
>>   1k                        0.3
>>   3k                        0.3
>>   5k                        0.4
>>   10k                       0.4
>>   15k                       0.7
>>   20k                       0.7
>>   50k                       2.2
>>   100k                      2.7
>>   200k                      5.3
>>   400k                      13.8
>>   800k                      20.1
>>   1.5 mil                   48.8
>>
>> where the plan looks like:
>> +------------------------------------------+
>> |                   PLAN                   |
>> +------------------------------------------+
>> | CLIENT 81-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER
>> PERF.BIG_OLAP_DOC [0,10724,3600,1] - [9,10724,3685,1] |
>> |     SERVER FILTER BY (NETWORK = 'SEARCH' AND true AND CUSTOMER_ID = 545
>> AND KEYWORD_ID IN
>> (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,61
>> |
>> |     SERVER AGGREGATE INTO SINGLE ROW     |
>> +------------------------------------------+
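>>
>> For reference, the query under test is shaped roughly like the following
>> (reconstructed from the plan's filter above; the SUM columns are
>> hypothetical, since the actual select list isn't shown):
>>
>> SELECT SUM(IMPRESSIONS), SUM(CLICKS)
>> FROM PERF.BIG_OLAP_DOC
>> WHERE NETWORK = 'SEARCH'
>>   AND CUSTOMER_ID = 545
>>   AND KEYWORD_ID IN (613213594, 613214433, 613216496, ... )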
>>
>>
>> The performance is great for low-cardinality sets, but as the cardinality
>> of the set being aggregated grows, the response time appears to scale
>> essentially linearly with the size of the set being queried/aggregated.
>> This is a roughly 100-million-row, 10-bucket salted table on a 10-node
>> cluster.
>>
>> I had presumed that value lookup might be the major cost factor, but if I
>> remove from the projection and criteria all columns that are not part of
>> the key, I see only about a 20% reduction in response time.
>>
>> +------------------------------------------+
>> |                   PLAN                   |
>> +------------------------------------------+
>> | CLIENT 101-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER
>> PERF.BIG_OLAP_DOC [0,10724,3000,1] - [9,10724,3685,1] |
>> |     SERVER FILTER BY FIRST KEY ONLY AND (true AND KEYWORD_ID IN
>> (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,613239663,613239668,61323969
>> |
>> |     SERVER AGGREGATE INTO SINGLE ROW     |
>> +------------------------------------------+
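>>
>> A key-only variant of the query might look like the following (again a
>> sketch; COUNT(1) stands in for an aggregate that reads no non-key
>> columns):
>>
>> SELECT COUNT(1)
>> FROM PERF.BIG_OLAP_DOC
>> WHERE KEYWORD_ID IN (613213594, 613214433, 613216496, ... )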
>>
>> So it seems the parallel index skip scan itself is the major factor.  Is
>> this expected?
>>
>> Any insight into the expected performance and behavior of high-cardinality
>> aggregation queries, and tips for optimizing them, would be greatly
>> appreciated.
>>
>>
>> TIA
>>
>
>
