phoenix-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Secondary index row explosion due to N key combos to handle ad-hoc queries?
Date Fri, 28 Mar 2014 04:47:39 GMT

On Thu, Mar 27, 2014 at 10:12 PM, James Taylor <> wrote:

> On Thu, Mar 27, 2014 at 3:02 PM, Otis Gospodnetic <> wrote:
>> Hi,
>> I wanted to extract the following in a separate thread:
>>> I was going to ask about partitioning as a way to handle (querying
>>> against) large volumes of data.  This is related to my Q above about
>>> date-based partitioning.  But I'm wondering if one can go further.
>>> Partitioning by date, partitioning by tenant, but then also partitioning
>>> by some other columns, which would be different for each type of data being
>>> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
>>> but then also customerCountry, customerGender, etc.  For performance
>>> metrics data maybe it would be date, tenantID, but then also environment
>>> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
>>> metrics vs. my Tomcat performance metrics), and so on.
>> > Essentially, a secondary index is declaring a partitioning. The indexed
>> > columns make up the row key, which in HBase determines the partitioning.
>> Aha!  Hmmm.  But, as far as I know, how one constructs the key is... the
>> key.  That is, doesn't one typically construct the key based on access
>> patterns?
>> How would that work in the scenario I described in my other email -
>> unknown number of columns and ad-hoc SQL queries?
>> How do you handle the above without having to create all possible
>> combinations of columns (to anticipate any sort of query) and having to
>> insert N rows in the index table for each 1 row in the primary table?
>> Don't you have to do that in order to handle any ad-hoc query one may
>> choose to run?
> That's true - you'd want to selectively add indexes, based on anticipated
> access patterns. It's similar to the RDBMS world in that regard.

Aha!  So that's the problem I was after! :(
We have a solution like this already, but in order to handle any ad-hoc
query it requires this row explosion.  And that "doesn't work" if you allow
any number of attributes and you don't control how the data will be queried.
To cover all cases you have to create all possible combinations of
attributes for the composite keys, which means that for every 1 row inserted
into the primary table you have many rows inserted into the index table.
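To put a rough number on that explosion, here is a small sketch (the attribute names are just the hypothetical ones from the earlier example in this thread): if you ignore column ordering within a key, covering every possible ad-hoc filter over N attributes needs one index per non-empty subset of the attributes, i.e. 2^N - 1 indexes, and every primary-table insert fans out to one row in each of them.

```python
# Hypothetical sketch: count how many composite-key indexes would be
# needed to cover any ad-hoc combination of filter columns.
from itertools import combinations

# Example attributes borrowed from the sales/metrics scenario above.
attributes = ["customerCountry", "customerGender", "environment", "applicationType"]

# Every non-empty subset of attributes is a distinct composite key,
# so full coverage needs 2^N - 1 indexes.
index_keys = [combo
              for r in range(1, len(attributes) + 1)
              for combo in combinations(attributes, r)]

print(len(index_keys))  # 15 indexes for just 4 attributes
```

And that count only grows: 10 attributes would already mean 1,023 index rows per primary-table row, which is the scaling problem being described.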

By "doesn't work" I mean that it's really hard to scale.

Please correct me if you have a feeling I'm wrong or if you see a way to
handle this ad-hoc query situation without the massive key explosion.

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support *
