phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Slow query help
Date Fri, 16 Mar 2018 18:23:06 GMT
Thanks everybody for the help.
I'm just curios to understand why the first query didn't complete. The
query is quite complex but the available memory should be more than enough.

MYTABLE has 222,547,674 rows. On Parquet it takes 15 GB, while uncompressed
(in memory during the download of the data) about 600 GB.
There are 144,149,639 rows where SOMEFIELD is not null and it has 37,449,596
distinct values (the table field is an unlimited VARCHAR but all values, if
any, has a max length of 20 chars).
HBase has 3 region servers with 64 GB each.
I've tried to give up to 32 GB of memory but the "sloppy" query wasn't able
to finish.
I thought that the count of distinct values on a single field shouldn't be
that heavy in the end (despite the source table is very big).
Am I wrong?

Any help is welcome.
Flavio

On Fri, Mar 16, 2018 at 7:08 PM, Samarth Jain <samarth.jain@gmail.com>
wrote:

> A less resource intensive approach would be to use approx count distinct -
> https://phoenix.apache.org/language/functions.html#approx_count_distinct
>
> You would still need the secondary index though, as James suggested, if
> you want it to run fast.
>
> On Fri, Mar 16, 2018 at 10:26 AM Flavio Pompermaier <pompermaier@okkam.it>
> wrote:
>
>> Thanks for the tip James. I didn't know that syntax for doing the count
>> on a distinct value!
>> This version is able to end, the first one wasn't able to finish even
>> giving a huge amount of memory to HBase (the cardinality of SOMEFIELD is
>> very big indeed).
>>
>> Thanks a lot,
>> Flavio
>>
>> On Fri, Mar 16, 2018 at 5:45 PM, James Taylor <jamestaylor@apache.org>
>> wrote:
>>
>>> Hi Flavio,
>>> You'll need to add a secondary index to SOMEFIELD (or SOMEFIELD + VALID)
>>> to speed that up. You can write it more simply as SELECT COUNT(DISTINCT
>>> SOMEFIELD) FROM TEST.MYTABLE WHERE VALID AND SOMEFIELD IS NOT NULL.
>>>
>>> Otherwise, you'll end up doing a full table scan (and use a fair amount
>>> of memory depending on the cardinality of SOMEFIELD). The above with a
>>> secondary index would skip to the distinct values instead.
>>>
>>> Thanks,
>>> James
>>>
>>> On Fri, Mar 16, 2018 at 8:30 AM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I'm running a query like this one on my Phoenix 4.13 (on CDH 5.11.2):
>>>>
>>>> SELECT COUNT(*) FROM (
>>>>   SELECT DISTINCT(SOMEFIELD)
>>>>   FROM TEST.MYTABLE
>>>>   WHERE VALID = TRUE AND SOMEFIELD IS NOT NULL
>>>> )
>>>>
>>>> Unfortunately the query timeouts (timeout is 10 min).... Any suggestion
>>>> about how to tune my installation?
>>>>
>>>> Best,
>>>> Flavi
>>>>
>>>
>>

Mime
View raw message