phoenix-user mailing list archives

From James Taylor <>
Subject Re: Secondary index row explosion due to N key combos to handle ad-hoc queries?
Date Fri, 28 Mar 2014 23:49:49 GMT
On Thu, Mar 27, 2014 at 9:47 PM, Otis Gospodnetic <> wrote:

> Hi,
> On Thu, Mar 27, 2014 at 10:12 PM, James Taylor <>wrote:
>> On Thu, Mar 27, 2014 at 3:02 PM, Otis Gospodnetic <
>>> wrote:
>>> Hi,
>>> I wanted to extract the following in a separate thread:
>>>  I was going to ask about partitioning as a way to handle (querying
>>>> against) large volumes of data.  This is related to my Q above about
>>>> date-based partitioning.  But I'm wondering if one can go further.
>>>>  Partitioning by date, partitioning by tenant, but then also partitioning
>>>> by some other columns, which would be different for each type of data being
>>>> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
>>>> but then also customerCountry, customerGender, etc.  For performance
>>>> metrics data maybe it would be date, tenantID, but then also environment
>>>> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
>>>> metrics vs. my Tomcat performance metrics), and so on.
>>> > Essentially, a secondary index is declaring a partitioning. The
>>> > indexed columns make up the row key which in HBase determines the
>>> > partitioning.
>>> Aha!  Hmmm.  But, as far as I know, how one constructs the key is....
>>> the key.  That is, doesn't one typically construct the key based on access
>>> patterns?
>>> How would that work in the scenario I described in my other email -
>>> unknown number of columns and ad-hoc SQL queries?
>>> How do you handle the above without having to create all possible
>>> combinations of columns (to anticipate any sort of query) and having to
>>> insert N rows in the index table for each 1 row in the primary table?
>>>  Don't you have to do that in order to handle any ad-hoc query one may
>>> choose to run?
>> That's true - you'd want to selectively add indexes, based on anticipated
>> access patterns. It's similar to the RDBMS world in that regard.
> Aha!  So that's the problem I was after! :(
> We have a solution like this already.  But in order to handle any ad-hoc
> query it requires this row explosion.  And that "doesn't work" if you allow
> any number of attributes and if you don't control how data will be queried.
>  To cover all cases you have to create all possible combinations of
> attributes for composite keys and that means that for every 1 row inserted
> in the primary table you have many rows inserted in the index table.
> By "doesn't work" I mean that it's really hard to scale, that it's
> expensive.
> Please correct me if you have a feeling I'm wrong or if you see a way to
> handle this ad-hoc query situation without the massive key explosion.

One potential solution would be to add a secondary index only when the data
size for a particular tenant goes beyond a threshold. There's no
requirement to add an index - queries over non-indexed data are still very
fast until the data sizes get bigger. Not sure what your requirements would
be for response time, but perhaps a threshold of 10-50M rows. One area we're
working on for our next release is tracking table stats: row counts and
cardinalities. We'll use these to guide optimizer decisions, but we could
potentially expose them through some kind of API for use in this scenario
as well.
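A minimal sketch of the threshold idea above (editorial illustration; the row-count lookup stands in for the hypothetical stats API James mentions, and the index name/columns are illustrative):

```python
# Sketch: threshold-gated secondary index creation, per James's suggestion.
# The row count would come from a stats API (hypothetical at the time);
# CREATE INDEX itself is standard Phoenix DDL.
INDEX_THRESHOLD = 10_000_000  # e.g. 10M rows, low end of the 10-50M range

def should_create_index(row_count, threshold=INDEX_THRESHOLD):
    """Full scans stay fast below the threshold; index only beyond it."""
    return row_count >= threshold

def index_ddl(tenant_id, table, columns):
    """Build per-tenant secondary-index DDL (names are illustrative)."""
    cols = ", ".join(columns)
    return f"CREATE INDEX IDX_{tenant_id}_{'_'.join(columns)} ON {table} ({cols})"

# Only this tenant has crossed the threshold, so only it pays the
# index-maintenance write amplification.
if should_create_index(row_count=25_000_000):
    ddl = index_ddl("T42", "SALES", ["CUSTOMERCOUNTRY", "CUSTOMERGENDER"])
    print(ddl)
```

The design point: index-maintenance cost is paid per tenant only once that tenant's data is large enough that full scans stop meeting response-time requirements.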


> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support *
