phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: Secondary index row explosion due to N key combos to handle ad-hoc queries?
Date Fri, 28 Mar 2014 23:56:01 GMT

On Fri, Mar 28, 2014 at 7:49 PM, James Taylor <>wrote:

> On Thu, Mar 27, 2014 at 9:47 PM, Otis Gospodnetic <
>> wrote:
>> Hi,
>> On Thu, Mar 27, 2014 at 10:12 PM, James Taylor <>wrote:
>>> On Thu, Mar 27, 2014 at 3:02 PM, Otis Gospodnetic <
>>>> wrote:
>>>> Hi,
>>>> I wanted to extract the following in a separate thread:
>>>>  I was going to ask about partitioning as a way to handle (querying
>>>>> against) large volumes of data.  This is related to my Q above about
>>>>> date-based partitioning.  But I'm wondering if one can go further.
>>>>>  Partitioning by date, partitioning by tenant, but then also partitioning
>>>>> by some other columns, which would be different for each type of data
>>>>> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
>>>>> but then also customerCountry, customerGender, etc.  For performance
>>>>> metrics data maybe it would be date, tenantID, but then also environment
>>>>> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
>>>>> metrics vs. my Tomcat performance metrics), and so on.
>>>> > Essentially, a secondary index is declaring a partitioning. The
>>>> indexed columns make up the row > key which in HBase determines the
>>>> partitioning.
>>>> Aha!  Hmmm.  But, as far as I know, how one constructs the key is....
>>>> the key.  That is, doesn't one typically construct the key based on access
>>>> patterns?
>>>> How would that work in the the scenario I described in my other email -
>>>> unknown number of columns and ad-hoc SQL queries?
>>>> How do you handle the above without having to create all possible
>>>> combinations of columns (to anticipate any sort of query) and having to
>>>> insert N rows in the index table for each 1 row in the primary table?
>>>>  Don't you have to do that in order to handle any ad-hoc query one may
>>>> choose to run?
>>> That's true - you'd want to selectively add indexes, based on
>>> anticipated access patterns. It's similar to the RDBMS world in that regard.
>> Aha!  So that's the problem I was after! :(
>> We have a solution like this already.  But in order to handle any ad-hoc
>> query it requires this row explosion.  And that "doesn't work" if you allow
>> any number of attributes and if you don't control how data will be queried.
>>  To cover all cases you have to create all possible combinations of
>> attributes for composite keys and that means that for every 1 row inserted
>> in the primary table you have many rows inserted in the index table.
>> By "doesn't work" I mean that it's really hard to scale, that it's
>> expensive.
>> Please correct me if you have a feeling I'm wrong or if you see a way to
>> handle this ad-hoc query situation without the massive key explosion.
> One potential solution would be to only add a secondary index when the
> data size for a particular tenant goes beyond a threshold. There's no
> requirement to add an index - queries over non indexed data are still very
> fast until the data sizes get bigger. Not sure what your requirements would
> be for response time, but perhaps a threshold of 10-50M rows. One area were
> working on for our next release is tracking table stats: row counts and
> cardinalities. We'll use these to guide optimizer decisions, but we could
> potentially expose them through some kind of API for use in this scenario
> as well.

Didn't get to write earlier, but that's exactly what I was going to ask
about - data stats and whether they are being tracked.  Although I was
wondering if you could use that to figure out *which* columns to use to
build the keys.  Maybe some columns have low and some high cardinality and
maybe that means that for some one should not create extra keys for the
index?  Is this the right thinking?

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support *

View raw message