phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: high cardinality aggregation query performance
Date Tue, 24 Feb 2015 01:24:18 GMT
Hi Gary,
Would you mind posting your schema and query as well?
Thanks,
James

On Mon, Feb 23, 2015 at 5:08 PM, Gary Schulte <gschulte@marinsoftware.com>
wrote:

> I am currently evaluating Phoenix 4.2.2 for an analytic workload.  Several
> of the test cases are simple aggregation over varying cardinality sets.
> For example, for one of the test aggregation queries I am seeing response
> times along the lines of:
>
>   fact query cardinality    response time using primary index (s)
>   1k                        0.3
>   3k                        0.3
>   5k                        0.4
>   10k                       0.4
>   15k                       0.7
>   20k                       0.7
>   50k                       2.2
>   100k                      2.7
>   200k                      5.3
>   400k                      13.8
>   800k                      20.1
>   1.5 mil                   48.8
> where the plan looks like:
> +------------------------------------------+
> |                   PLAN                   |
> +------------------------------------------+
> | CLIENT 81-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER
> PERF.BIG_OLAP_DOC [0,10724,3600,1] - [9,10724,3685,1] |
> |     SERVER FILTER BY (NETWORK = 'SEARCH' AND true AND CUSTOMER_ID = 545
> AND KEYWORD_ID IN
> (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,61
> |
> |     SERVER AGGREGATE INTO SINGLE ROW     |
> +------------------------------------------+
>
>
> The performance is great for low cardinality sets, but as the cardinality
> of the set to be aggregated grows, the response time appears to scale
> essentially linearly with the size of the set being queried/aggregated.
> This is a roughly 100-million-row, 10-bucket salted table on a 10-node
> cluster.
>
> I had presumed the value lookup might be the major cost factor, but if I
> remove all columns from the projection and criteria that are not part of
> the key, I only see about a 20% reduction in response time.
>
> +------------------------------------------+
> |                   PLAN                   |
> +------------------------------------------+
> | CLIENT 101-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER
> PERF.BIG_OLAP_DOC [0,10724,3000,1] - [9,10724,3685,1] |
> |     SERVER FILTER BY FIRST KEY ONLY AND (true AND KEYWORD_ID IN
> (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,613239663,613239668,61323969
> |
> |     SERVER AGGREGATE INTO SINGLE ROW     |
> +------------------------------------------+
>
> So it seems like the parallel index skip scan is the major factor.  Is
> this expected?
>
> Any insight into the expected performance and behavior of high cardinality
> queries, and tips on optimizing for them, would be greatly appreciated.
>
>
> TIA
>
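[Editor's note: the actual schema and query were requested above but not posted. A minimal sketch of the kind of salted table and aggregation query that would produce plans of the shape shown — all table columns and literal values here are assumptions inferred from the EXPLAIN output, not Gary's real DDL:]

```sql
-- Hypothetical 10-bucket salted fact table; column names are guesses
-- based on the key ranges and filters visible in the plans above.
CREATE TABLE PERF.BIG_OLAP_DOC (
    ORG_ID       INTEGER NOT NULL,
    DAY_ID       INTEGER NOT NULL,
    KEYWORD_ID   BIGINT  NOT NULL,
    CUSTOMER_ID  INTEGER,
    NETWORK      VARCHAR,
    IMPRESSIONS  BIGINT,
    CLICKS       BIGINT
    CONSTRAINT pk PRIMARY KEY (ORG_ID, DAY_ID, KEYWORD_ID)
) SALT_BUCKETS = 10;

-- An aggregation of the shape in the plans: a skip scan over the
-- leading key columns plus a large KEYWORD_ID IN list, with a server
-- filter on the non-key columns.
SELECT SUM(IMPRESSIONS), SUM(CLICKS)
FROM PERF.BIG_OLAP_DOC
WHERE ORG_ID = 10724
  AND DAY_ID BETWEEN 3600 AND 3685
  AND NETWORK = 'SEARCH'
  AND CUSTOMER_ID = 545
  AND KEYWORD_ID IN (613213594, 613214433, 613216496 /* , ... */);
```

With SALT_BUCKETS = 10, Phoenix prepends a salt byte to each row key, which is why the plans show 10 parallel ranges ([0,...] through [9,...]) and a 10-WAY scan.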
