phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ciureanu, Constantin (GfK)" <>
Subject RE: Update statistics made query 2-3x slower
Date Mon, 16 Feb 2015 09:31:06 GMT
Hi Mujtaba,

1.     Update statistics - was executed after loading the data in the table (and after first
select count(*)… of course).

2.     6 region servers x ~50 regions each = 310 regions (too many?)

3.       Here it is:
explain select count(*) from tableX;
|                   PLAN                   |
3 rows selected (0.287 seconds)

Thank you,

From: Mujtaba Chohan []
Sent: Friday, February 13, 2015 7:21 PM
Subject: Re: Update statistics made query 2-3x slower

Hi Constantin,
This in useful info. Just to clarify slowdown that you see after update statistics case, was
the update statistics executed after initial data load or after you upserted it again?

2. How many regions/region servers (RS) did the data end up on?
3. 30/60 seconds for count(*) seems really high. Do you see lots of disk I/O during your count
query? How wide are your rows and how much memory is available on your RS/HBase heap?

3. Can you also send output of explain select count(*) from tablex for this case?


On Fri, Feb 13, 2015 at 12:34 AM, Ciureanu, Constantin (GfK) <<>>
Hello Mujtaba,

Don’t worry – it was just the select count(*) from tableX that was slowed down in a more
than visible way.
I presume all the regular queries do actually benefit from using the STATS.

Some other cases where I saw slowdown for “select count(*) from tableX”:

-       First time after loading 6 M records – the time to obtain the count was ~30 sec

-       After loading the same 6 M records again – the time almost doubled ☹ I imagine
the data is doubled, not yet compacted in HBase

-       After deleting the 6M rows (delete from …. , not truncate) and loading the 6M rows
again – the same double time – same comment as above

-       After update statistics tableX – the time was around 2x the original time (~60 seconds)
– this I couldn’t really explain (perhaps the fleet I use is undersized)

I need to mention that I’m using 4.2.2 but I can’t wait for 4.3 to be released (as it
will fix some issues I have. Eg. one with SKIP SCAN:
[1st part of PK between A and B] or [first part of PK between C and D] or [….] was understood
as a full table scan = painfully slow -> but this worked today after I used a hint in the
SELECT /*+ SKIP_SCAN */ which shouldn’t be mandatory in my opinion).


From: Mujtaba Chohan [<>]
Sent: Thursday, February 12, 2015 9:20 PM

Subject: Re: Update statistics made query 2-3x slower

Constantin - If possible can you please share your schema, approx. row/columns width, number
of region servers in your cluster plus their heap size, HBase/Phoenix version and any default
property overrides so we can identify why stats are slowing things down in your case.


On Thu, Feb 12, 2015 at 12:56 AM, Ciureanu, Constantin (GfK) <<>>
It worked!
Without stats it’s again faster (2-3x times) – but I do understand that all other normal
queries might benefit from the stats.

Thank you Mujtaba for the info,
Thank you Vasudevan for the explanations, I already used HBase and I agree it’s hard to
have a counter for the table rows (especially if the tombstones for deleted rows are still
there – ie. not compacted yet).


From: Mujtaba Chohan [<>]
Sent: Wednesday, February 11, 2015 8:54 PM
Subject: Re: Update statistics made query 2-3x slower

To compare performance without stats, try deleting related rows from SYSTEM.STATS or an easier
way, just truncate SYSTEM.STATS table from HBase shell and restart your region servers.

On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S <<>>
Hi Constantin

Before I could explain on the slowness part let me answer your 2nd question,

Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data that is residing
inside logical entities called regions are spread across different nodes (region servers).
 There is nothing like a table that is in one location where you can keep updating the count
of rows that is getting inserted.

Which means that when you need  count(*) you may have to aggregate the count from every region
distributed across region servers. So in other words a table is not a single entity it is
a collection of regions.

Coming to your slowness in query, the update statistics query allows you to parallelize the
query into logical chunks on a single region.  Suppose there are 100K rows in a region the
statistics collected would allow you to run a query parallely for eg say execute parallely
on 10 equal chunks of 10000 rows within that region.

Have you modified any of the parameters related to statistics like this one ‘phoenix.stats.guidepost.width’.

From: Ciureanu, Constantin (GfK) [<>]
Sent: Wednesday, February 11, 2015 2:51 PM
Subject: Update statistics made query 2-3x slower

Hello all,

1.     Is there a good explanation why updating the statistics:
update statistics tableX;

made this query 2x times slower?   (it was 27 seconds before, now it’s somewhere between
60 – 90 seconds)
select count(*) from tableX;
|                 COUNT(1)                 |
| 5786227                                  |
1 row selected (62.718 seconds)

(If possible ☺ ) how can I “drop” those statistics?

2. Why there is nothing (like a counter / attribute for the table) to obtain the number of
rows in one table fast?

Thank you,

View raw message