phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gary Schulte <gschu...@marinsoftware.com>
Subject Re: high cardinality aggregation query performance
Date Tue, 24 Feb 2015 21:49:47 GMT
On Tue, Feb 24, 2015 at 12:29 AM, James Taylor <jamestaylor@apache.org>
 wrote:

> Based on your query plan, the skip scan is being done solely based on your
> salt bucket while the rest of the filtering is being done by a filter,
> which means that you're not filtering based on the leading part of your
> primary key. We'll know more once you post your schema, but if NETWORK,
> KEYWORD_ID and CUSTOMER_ID formed your primary key constraint, then the
> skip scan would work well.
>
>
Thanks for your response James.  Sorry for the slow reply - I had
difficulty finding the exact set of test queries I was using for timings.

The relevant portion of the olap doc schema is:

create table PERF.BIG_OLAP_DOC (
client_id                           integer not null
,customer_id                        integer
,time_id                            integer not null
,conversion_type_id                 integer not null
,device_type                        varchar(16)
,keyword_id                         bigint not null
,creative_id                        bigint not null
,placement_id                       bigint not null
,product_target_id                  bigint not null
,network                            varchar(7)
,impressions                        decimal(18, 4)
,publisher_clicks                   decimal(18, 4)
,publisher_cost                     decimal(18, 4)
,conversions                        decimal(18, 4)
,revenue                            decimal(18, 4)

[ ...additional metric and dimensional colums ... ]

    constraint perf_fact_pk primary key (client_id, time_id,
conversion_type_id, device_type, keyword_id, creative_id, placement_id,
product_target_id))SALT_BUCKETS=10;


I am evaluating a 'stitch' case where results from an external system are
injected either via table or (as in this case) an in-list.  An example of
one of these test agg queries I am using is:

SELECT count(1) cnt,
       coalesce(SUM(impressions), 0.0) AS "impressions",
       coalesce(SUM(publisher_clicks), 0.0) AS "pub_clicks",
       coalesce(SUM(publisher_cost), 0.0) AS "pub_cost",
       coalesce(SUM(conversions), 0.0) AS "conversions",
       coalesce(SUM(revenue), 0.0) AS "revenue"
  FROM perf.big_olap_doc
 WHERE time_id between 3000 and 3700
   AND network in ('SEARCH')
   AND conversion_type_id = 1
   AND client_id = 10724
--   AND device_type in ('MOBILE','DESKTOP','OTHER','TABLET')
   AND keyword_id in (
613214369, 613217307, 613247509, 613248897, 613250382, 613250387,
613252322, 613260252, 613261753, 613261754, 613261759,
613261770, 613261873, 613261884, 613261885, 613261888, 613261889,
613261892, 613261897, 613261913, 613261919, 613261927,
614496021, 843606367, 843606967, 843607021, 843607033, 843607089,
1038731600, 1038731672, 1038731673, 1038731675,
1038731684, 1038731693, 1046990487, 1046990488, 1046990499, 1046990505,
1046990506, 1049724722, 1051109548, 1051311275,
1051311904, 1060574377, 1060574395, 1060574506, 1060574562, 1115915938,
1115915939, 1115915941, 1116310571, 1367495544,
1367495545, 1367497297, 1367497298, 1367497299, 1367497300, 1367497303,
1367497313, 1367497813, 1367497816, 1367497818,
1367497821, 1367497822, 1367497823, 1624976423, 1624976451, 1624976457,
3275636061, 3275640505, 3275645765, 3275645807,
3275649138, 3275651456, 3275651460, 3275651478, 3275651479, 3275654566,
3275654568, 3275654570, 3275654575, 3275659612,
3275659616, 3275659620, 3275668880, 3275669693, 3275675627, 3275675634,
3275677479, 3275677504, 3275678855, 3275679524,
3275679532, 3275680014, 3275682307, 3275682308, 3275682309, 3275682310,
3275682420, 3275682423, 3275682436, 3275682448,
3275682460, 3275682462, 3275682474, 3275684831, 3275688903, 3275694023,
3275694025, 3275694027, 3275695054, 3275695056,
3275695062, 3275699512, 3275699514, 3275699518, 3275701682, 3275701683,
3275701685, 3275701688, 3275703633, 3275703634,
3275703635, 3275703636, 3275703638, 3275703639, 3275704860, 3275704861,
3275764577, 3275797149, 3275798566, 3275798567,
3275798568, 3275798592, 3275931147, 3275942728, 3275945337, 3275945338,
3275945339, 3275945340, 3275945342, 3275945344,
3275946319, 3275946322, 3275946324, 3275946643, 3275949495, 3275949498,
3275949500, 3275950250, 3275955128, 3275955129,
3275955130, 3427017435, 3427017450, 3438304254, 3438304257, 3447068169,
3505227849, 3505227890, 3505556908, 3506351285,
3506351389, 3506351398, 3506351468, 3510037138, 3510038610, 3545590644,
3545594378, 3545595073, 3545595318, 3545595506,
3545597841, 3545598818, 3545599658, 3545599663, 3545601215, 3556080898,
3556080980, 3556080999, 3556081323, 3565122663,
3565122679, 3565122801, 3565122858, 3565122908, 3565122929, 3565122952,
3565122984, 3565123028, 3565123047, 3565123048,
3565123203, 3565123230, 3949988054, 3949988056, 3949988070, 3972992248,
3972992252, 3972992254, 3972992257, 3972992263,
3972992267, 3972992268, 3972992269, 3972992270, 3972992274, 3972992275,
3972992277, 3972992281, 3972992293, 3972992298,
3972992299, 3972992305, 3972992307, 3972992313, 3972992316, 3972992322,
3972992338, 3978471261, 3978471272, 4266318185,
4298107404, 4308853119, 4308853123, 4308853500, 4451174646, 4451174656,
4451174701, 4569827278, 4569827284, 4569827287,
4569827379, 4569827523, 4569827524, 4896589676, 4979049725, 5054587609,
5136433884, 5362640372, 5393109964, 5393405364,
5393405365, 5393405620, 5393405625, 5393405675, 5393405677, 5393405858,
5393405970)


Reading your interpretation of the skip scan, I see that the plan is
indicating it is only using the salt and the first three columns of the
index, client_id, and time_id and conversion_type.  I hadn't considered the
salt - that bit of detail in the plan makes more sense to me now.  It looks
now like the lackluster performance for higher cardinality aggregations is
related to scanning a much larger portion of the key space.  For
aggregations where I am not relying on filtering, I am seeing much better
performance.

So to tune this particular stitch case / skip scan, it looks like I need to
get the 4th index column into the criteria.  There are only four distinct
values in the fourth index column (these can/should probably be something
other than varchar, but this is what I have loaded currently).  In order to
use the keyword_id portion of the index I tried explicitly specifying all
device_types via in-list (the commented portion of the query above), but I
get a peculiar error:

java.lang.IndexOutOfBoundsException: end index (1) must not be less than
start index (2)
at
com.google.common.base.Preconditions.checkPositionIndexes(Preconditions.java:388)
at com.google.common.collect.ImmutableList.subList(ImmutableList.java:362)
at com.google.common.collect.ImmutableList.subList(ImmutableList.java:62)
at
org.apache.phoenix.filter.SkipScanFilter.intersect(SkipScanFilter.java:291)
at
org.apache.phoenix.filter.SkipScanFilter.intersect(SkipScanFilter.java:177)
at org.apache.phoenix.compile.ScanRanges.intersectScan(ScanRanges.java:316)
at
org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(BaseResultIterators.java:464)
at
org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(BaseResultIterators.java:394)
at
org.apache.phoenix.iterate.BaseResultIterators.<init>(BaseResultIterators.java:184)
at
org.apache.phoenix.iterate.ParallelIterators.<init>(ParallelIterators.java:54)
at
org.apache.phoenix.execute.AggregatePlan.newIterator(AggregatePlan.java:173)
at org.apache.phoenix.execute.BaseQueryPlan.iterator(BaseQueryPlan.java:227)
at org.apache.phoenix.execute.BaseQueryPlan.iterator(BaseQueryPlan.java:154)
at
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:226)
at
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:217)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at
org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:216)
at
org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1057)
at sqlline.SqlLine$Commands.execute(SqlLine.java:3673)
at sqlline.SqlLine$Commands.sql(SqlLine.java:3584)
at sqlline.SqlLine.dispatch(SqlLine.java:821)
at sqlline.SqlLine.begin(SqlLine.java:699)
at sqlline.SqlLine.mainWithInputRedirection(SqlLine.java:441)
at sqlline.SqlLine.main(SqlLine.java:424)


I thought perhaps I was hitting an upper limit on the number of elements in
an in-list for a skip scan, and so tried removing the 250 element keyword
in-list entirely and leaving only the device_type in-list, but I still get
the same error.  It happens immediately, even for an explain, so I presume
this is a query parsing problem.  Is there a bug or limitation of skip
scans and/or sub lists involving varchar?

Thx

Mime
View raw message