phoenix-user mailing list archives

From Gary Schulte <gschu...@marinsoftware.com>
Subject high cardinality aggregation query performance
Date Tue, 24 Feb 2015 01:08:24 GMT
I am currently evaluating Phoenix 4.2.2 for an analytic workload.  Several
of the test cases are simple aggregations over sets of varying cardinality.
For one of the test aggregation queries, for example, I am seeing response
times along these lines:

  fact query cardinality    response time using primary index
  1k                        0.3
  3k                        0.3
  5k                        0.4
  10k                       0.4
  15k                       0.7
  20k                       0.7
  50k                       2.2
  100k                      2.7
  200k                      5.3
  400k                      13.8
  800k                      20.1
  1.5 mil                   48.8
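
The query shape is roughly the following.  This is a sketch: SPEND is a
placeholder measure column, the leading-key criteria (ORG_ID, DAY_ID,
DOC_TYPE) are placeholder names inferred from the scan ranges in the plan
below, and the long KEYWORD_ID IN list is elided after the first few values:

  SELECT SUM(SPEND)                    -- SPEND is a placeholder measure column
  FROM PERF.BIG_OLAP_DOC
  WHERE ORG_ID = 10724                 -- placeholder leading-key criteria,
    AND DAY_ID BETWEEN 3600 AND 3685   -- inferred from the scan ranges below
    AND DOC_TYPE = 1
    AND NETWORK = 'SEARCH'
    AND CUSTOMER_ID = 545
    AND KEYWORD_ID IN (613213594, 613214433, 613216496 /* ... ids elided */);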
where the plan looks like:
+------------------------------------------+
|                   PLAN                   |
+------------------------------------------+
| CLIENT 81-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER PERF.BIG_OLAP_DOC [0,10724,3600,1] - [9,10724,3685,1] |
|     SERVER FILTER BY (NETWORK = 'SEARCH' AND true AND CUSTOMER_ID = 545 AND KEYWORD_ID IN (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,61 |
|     SERVER AGGREGATE INTO SINGLE ROW     |
+------------------------------------------+


The performance is great for low-cardinality sets, but as the cardinality
of the aggregated set grows, the response time appears to scale essentially
linearly with the size of the set being queried/aggregated.  The table is
roughly 100 million rows, salted into 10 buckets, on a 10-node cluster.
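
For reference, the table DDL is along these lines.  Again a sketch: the
column names and types are placeholders (matching the placeholder names
above); the salient part is SALT_BUCKETS = 10:

  -- Sketch of the table definition; only SALT_BUCKETS = 10 and the general
  -- key-then-measures layout matter here, the names/types are placeholders.
  CREATE TABLE PERF.BIG_OLAP_DOC (
      ORG_ID      INTEGER NOT NULL,
      DAY_ID      INTEGER NOT NULL,
      DOC_TYPE    INTEGER NOT NULL,
      KEYWORD_ID  BIGINT  NOT NULL,
      NETWORK     VARCHAR,
      CUSTOMER_ID INTEGER,
      SPEND       DECIMAL,
      CONSTRAINT PK PRIMARY KEY (ORG_ID, DAY_ID, DOC_TYPE, KEYWORD_ID)
  ) SALT_BUCKETS = 10;  -- Phoenix prepends a salt byte, pre-splitting into 10 buckets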

I had presumed that value lookups might be the major cost factor, but if I
remove from the projection and criteria all columns that are not part of
the key, I see only about a 20% reduction in response time.
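
The stripped-down query looks roughly like this (same caveats as above,
with the IN list elided):

  SELECT COUNT(1)                      -- no non-key columns projected or filtered
  FROM PERF.BIG_OLAP_DOC
  WHERE ORG_ID = 10724                 -- placeholder leading-key criteria,
    AND DAY_ID BETWEEN 3000 AND 3685   -- inferred from the scan ranges below
    AND DOC_TYPE = 1
    AND KEYWORD_ID IN (613213594, 613214433, 613216496 /* ... ids elided */);

with this plan: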

+------------------------------------------+
|                   PLAN                   |
+------------------------------------------+
| CLIENT 101-CHUNK PARALLEL 10-WAY SKIP SCAN ON 10 RANGES OVER PERF.BIG_OLAP_DOC [0,10724,3000,1] - [9,10724,3685,1] |
|     SERVER FILTER BY FIRST KEY ONLY AND (true AND KEYWORD_ID IN (613213594,613214433,613216496,613216861,613217357,613218757,613218770,613218833,613218871,613218937,613218938,613220926,613239657,613239663,613239668,61323969 |
|     SERVER AGGREGATE INTO SINGLE ROW     |
+------------------------------------------+

So it seems the parallel skip scan itself is the major cost factor.  Is
this expected?

Insight into the expected performance and behavior of high-cardinality
aggregation queries, and any tips on optimizing for them, would be greatly
appreciated.


TIA
