phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Secondary index row explosion due to N key combos to handle ad-hoc queries?
Date Fri, 28 Mar 2014 23:49:49 GMT
On Thu, Mar 27, 2014 at 9:47 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 10:12 PM, James Taylor <jamestaylor@apache.org> wrote:
>
>> On Thu, Mar 27, 2014 at 3:02 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I wanted to extract the following in a separate thread:
>>>
>>>  I was going to ask about partitioning as a way to handle (querying
>>>> against) large volumes of data.  This is related to my Q above about
>>>> date-based partitioning.  But I'm wondering if one can go further.
>>>>  Partitioning by date, partitioning by tenant, but then also partitioning
>>>> by some other columns, which would be different for each type of data being
>>>> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
>>>> but then also customerCountry, customerGender, etc.  For performance
>>>> metrics data maybe it would be date, tenantID, but then also environment
>>>> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
>>>> metrics vs. my Tomcat performance metrics), and so on.
>>>>
>>>
>>> > Essentially, a secondary index is declaring a partitioning. The
>>> > indexed columns make up the row key which in HBase determines the
>>> > partitioning.
>>>
>>> Aha!  Hmmm.  But, as far as I know, how one constructs the key is....
>>> the key.  That is, doesn't one typically construct the key based on access
>>> patterns?
>>>
>>> How would that work in the scenario I described in my other email -
>>> unknown number of columns and ad-hoc SQL queries?
>>>
>>> How do you handle the above without having to create all possible
>>> combinations of columns (to anticipate any sort of query) and having to
>>> insert N rows in the index table for each 1 row in the primary table?
>>>  Don't you have to do that in order to handle any ad-hoc query one may
>>> choose to run?
>>>
>>
>> That's true - you'd want to selectively add indexes, based on anticipated
>> access patterns. It's similar to the RDBMS world in that regard.
>>
>
> Aha!  So that's the problem I was after! :(
> We have a solution like this already.  But in order to handle any ad-hoc
> query it requires this row explosion.  And that "doesn't work" if you allow
> any number of attributes and if you don't control how data will be queried.
>  To cover all cases you have to create all possible combinations of
> attributes for composite keys and that means that for every 1 row inserted
> in the primary table you have many rows inserted in the index table.
>
> By "doesn't work" I mean that it's really hard to scale, that it's
> expensive.
>
> Please correct me if you have a feeling I'm wrong or if you see a way to
> handle this ad-hoc query situation without the massive key explosion.
>
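The explosion described above can be quantified with a quick back-of-the-envelope sketch (illustrative Python; `index_combinations` is a name introduced here, not a Phoenix API). Covering every possible subset of N filterable attributes with its own composite key needs at least 2^N - 1 indexes, and that is a lower bound, since column order within a row key also matters for leading-column matching.

```python
def index_combinations(n_attrs: int) -> int:
    # One composite index per non-empty subset of attributes.
    # A lower bound: ordering within each key would multiply this further.
    return 2 ** n_attrs - 1

for n in (3, 5, 10):
    print(f"{n} attributes -> at least {index_combinations(n)} indexes")
# 3 attributes -> at least 7 indexes
# 5 attributes -> at least 31 indexes
# 10 attributes -> at least 1023 indexes
```

With even 10 ad-hoc attributes per tenant, each primary-table row fans out into over a thousand index rows, which is exactly the scaling cost Otis is describing.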

One potential solution would be to add a secondary index only when the data
size for a particular tenant goes beyond a threshold. There's no
requirement to add an index - queries over non-indexed data are still very
fast until data sizes grow large. Not sure what your response-time
requirements would be, but perhaps a threshold of 10-50M rows. One area
we're working on for our next release is tracking table stats: row counts
and cardinalities. We'll use these to guide optimizer decisions, but we
could potentially expose them through some kind of API for use in this
scenario as well.

Thanks,
James
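The threshold approach suggested above can be sketched roughly as follows. This is a hypothetical illustration, not code from Phoenix: the table, index, and column names are invented, and only the `CREATE INDEX` DDL itself is standard Phoenix syntax.

```python
# Sketch of the threshold idea: create a secondary index for a tenant
# only once its row count crosses a limit. Helper names are illustrative.
ROW_THRESHOLD = 10_000_000  # lower end of the 10-50M range mentioned above

def should_index(row_count: int, threshold: int = ROW_THRESHOLD) -> bool:
    # Decide whether this tenant's data is large enough to justify an index.
    return row_count >= threshold

def index_ddl(table: str, index_name: str, cols: list[str]) -> str:
    # Build the Phoenix CREATE INDEX statement for the chosen columns.
    return f"CREATE INDEX {index_name} ON {table} ({', '.join(cols)})"

if should_index(25_000_000):
    print(index_ddl("SALES", "IDX_SALES_COUNTRY_GENDER",
                    ["CUSTOMER_COUNTRY", "CUSTOMER_GENDER"]))
# CREATE INDEX IDX_SALES_COUNTRY_GENDER ON SALES (CUSTOMER_COUNTRY, CUSTOMER_GENDER)
```

The row count would come from the table stats James mentions; until they are exposed through an API, a periodic `SELECT COUNT(*)` per tenant would serve the same purpose.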

>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
