phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Secondary index row explosion due to N key combos to handle ad-hoc queries?
Date Sat, 29 Mar 2014 00:08:52 GMT
Yes, that sounds right - the stats would be able to tell you which columns
have high cardinality. For low cardinality columns, our skip scan (
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
already helps quite a bit. Another slightly different approach would be to
develop a secondary indexing recommender that would use the explain plan
(once we add cost information to it) to determine when to add an index.
Another approach, which could potentially be done now, would be to time query
executions (as a kind of poor man's stats) and, when things start to slow
down, have your index recommender dynamically add an index. Haven't really
thought this through, but I've heard of other big data systems doing
something similar.
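
To make the skip scan point concrete, here's a minimal sketch (the METRICS
table and its columns are hypothetical, purely for illustration):

    -- Row key is (TENANT_ID, ENV, METRIC_DATE); ENV is low cardinality.
    CREATE TABLE METRICS (
        TENANT_ID   VARCHAR NOT NULL,
        ENV         VARCHAR NOT NULL,
        METRIC_DATE DATE NOT NULL,
        VAL         DOUBLE
        CONSTRAINT pk PRIMARY KEY (TENANT_ID, ENV, METRIC_DATE));

    -- The IN clause on the low cardinality ENV column lets the skip scan
    -- jump between the matching key ranges rather than scanning (or
    -- needing a secondary index for) everything in between.
    SELECT * FROM METRICS
    WHERE TENANT_ID = 't1'
      AND ENV IN ('prod', 'dev')
      AND METRIC_DATE > TO_DATE('2014-03-01 00:00:00');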



On Fri, Mar 28, 2014 at 4:56 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Ha!
>
> On Fri, Mar 28, 2014 at 7:49 PM, James Taylor <jamestaylor@apache.org> wrote:
>
>>
>>
>>
>> On Thu, Mar 27, 2014 at 9:47 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On Thu, Mar 27, 2014 at 10:12 PM, James Taylor <jamestaylor@apache.org> wrote:
>>>
>>>> On Thu, Mar 27, 2014 at 3:02 PM, Otis Gospodnetic <
>>>> otis.gospodnetic@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I wanted to extract the following in a separate thread:
>>>>>
>>>>>> I was going to ask about partitioning as a way to handle (querying
>>>>>> against) large volumes of data.  This is related to my Q above about
>>>>>> date-based partitioning.  But I'm wondering if one can go further.
>>>>>> Partitioning by date, partitioning by tenant, but then also partitioning
>>>>>> by some other columns, which would be different for each type of data
>>>>>> being inserted. e.g. for sales data maybe the partitions would be date,
>>>>>> tenantID, but then also customerCountry, customerGender, etc.  For
>>>>>> performance metrics data maybe it would be date, tenantID, but then also
>>>>>> environment (prod vs. dev), or applicationType (e.g. my HBase cluster
>>>>>> performance metrics vs. my Tomcat performance metrics), and so on.
>>>>>>
>>>>>
>>>>> > Essentially, a secondary index is declaring a partitioning. The
>>>>> > indexed columns make up the row key which in HBase determines the
>>>>> > partitioning.
>>>>>
>>>>> Aha!  Hmmm.  But, as far as I know, how one constructs the key is...
>>>>> the key.  That is, doesn't one typically construct the key based on
>>>>> access patterns?
>>>>>
>>>>> How would that work in the scenario I described in my other email -
>>>>> unknown number of columns and ad-hoc SQL queries?
>>>>>
>>>>> How do you handle the above without having to create all possible
>>>>> combinations of columns (to anticipate any sort of query) and having to
>>>>> insert N rows in the index table for each 1 row in the primary table?
>>>>> Don't you have to do that in order to handle any ad-hoc query one may
>>>>> choose to run?
>>>>>
>>>>
>>>> That's true - you'd want to selectively add indexes, based on
>>>> anticipated access patterns. It's similar to the RDBMS world in that regard.
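>>>>
>>>> For example (just a sketch - the SALES table and its columns here are
>>>> made up), you'd add one index per anticipated pattern rather than one
>>>> per column combination:
>>>>
>>>>     -- Serves queries filtering on tenant + customer country; the
>>>>     -- indexed columns become the row key of the index table.
>>>>     CREATE INDEX idx_sales_country ON SALES (TENANT_ID, CUSTOMER_COUNTRY)
>>>>         INCLUDE (AMOUNT);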
>>>>
>>>
>>> Aha!  So that's the problem I was after! :(
>>> We have a solution like this already.  But in order to handle any ad-hoc
>>> query it requires this row explosion.  And that "doesn't work" if you allow
>>> any number of attributes and if you don't control how data will be queried.
>>>  To cover all cases you have to create all possible combinations of
>>> attributes for composite keys and that means that for every 1 row inserted
>>> in the primary table you have many rows inserted in the index table.
>>>
>>> By "doesn't work" I mean that it's really hard to scale and that it's
>>> expensive.
>>>
>>> Please correct me if you have a feeling I'm wrong or if you see a way to
>>> handle this ad-hoc query situation without the massive key explosion.
>>>
>>
>> One potential solution would be to only add a secondary index when the
>> data size for a particular tenant goes beyond a threshold. There's no
>> requirement to add an index - queries over non-indexed data are still very
>> fast until the data sizes get bigger. Not sure what your requirements would
>> be for response time, but perhaps a threshold of 10-50M rows. One area we're
>> working on for our next release is tracking table stats: row counts and
>> cardinalities. We'll use these to guide optimizer decisions, but we could
>> potentially expose them through some kind of API for use in this scenario
>> as well.
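>>
>> As a rough sketch of what that could look like today (table and column
>> names hypothetical), you could gate index creation on a manual row count:
>>
>>     -- Poor man's stats: check the tenant's data size first...
>>     SELECT COUNT(*) FROM SALES WHERE TENANT_ID = 't1';
>>
>>     -- ...then add the index only once the count crosses your threshold:
>>     CREATE INDEX IF NOT EXISTS idx_sales_country
>>         ON SALES (TENANT_ID, CUSTOMER_COUNTRY);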
>>
>
> Didn't get to write earlier, but that's exactly what I was going to ask
> about - data stats and whether they are being tracked.  Although I was
> wondering if you could use that to figure out *which* columns to use to
> build the keys.  Maybe some columns have low and some high cardinality and
> maybe that means that for some one should not create extra keys for the
> index?  Is this the right thinking?
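>
> I guess today one could approximate that per candidate column with
> something like this (SALES and CUSTOMER_COUNTRY being hypothetical):
>
>     -- A high distinct count suggests a useful index candidate; low
>     -- cardinality columns are better served by the skip scan.
>     SELECT COUNT(DISTINCT CUSTOMER_COUNTRY)
>     FROM SALES WHERE TENANT_ID = 't1';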
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
