phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jamestay...@apache.org>
Subject Re: Update statistics made query 2-3x slower
Date Tue, 03 Mar 2015 01:20:12 GMT
Constantin,
I've filed PHOENIX-1693 for this issue, as we seem to be seeing a similar
phenomena too. It seems to only occur if we've never run a major compaction
on the table, though. Is that the case for you as well?
Thanks,
James

On Mon, Feb 16, 2015 at 8:44 AM, Vasudevan, Ramkrishna S <
ramkrishna.s.vasudevan@intel.com> wrote:

>  Without update statistics – if we run select count(*) what is the PLAN
> that it executes?  One of the RS has got more data I believe.
>
>
>
> Regards
>
> Ram
>
>
>
> *From:* Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com]
> *Sent:* Monday, February 16, 2015 3:17 PM
>
> *To:* user@phoenix.apache.org
> *Subject:* RE: Update statistics made query 2-3x slower
>
>
>
> Hello,
>
>
>
> I forgot to add this info:
>
>
>
> 3. There are just a few requests while running select count(*) – no other
> query was running concurrently.
>
>
>
> *ServerName*
>
> *Request Per Second*
>
> *Read Request Count*
>
> *Write Request Count*
>
> ip-10-2-3-11.lps.stage,60020,1424079136831
> <http://ip-10-2-3-11.lps.stage:60030/>
>
> 3
>
> 0
>
> 0
>
> ip-10-2-3-12.lps.stage,60020,1423492976623
> <http://ip-10-2-3-12.lps.stage:60030/>
>
> 5
>
> 571301
>
> 106
>
> ip-10-2-3-13.lps.stage,60020,1423493014906
> <http://ip-10-2-3-13.lps.stage:60030/>
>
> 1
>
> 18515
>
> 3
>
> ip-10-2-4-11.lps.stage,60020,1423493039310
> <http://ip-10-2-4-11.lps.stage:60030/>
>
> 1
>
> 31514
>
> 5
>
> ip-10-2-4-12.lps.stage,60020,1423493067346
> <http://ip-10-2-4-12.lps.stage:60030/>
>
> 2
>
> 80751
>
> 3
>
> ip-10-2-4-13.lps.stage,60020,1423493620630
> <http://ip-10-2-4-13.lps.stage:60030/>
>
> 3
>
> 13120
>
> 226
>
>
>
> I tried to recreate the statistics but this time it failed L
>
>
>
> update statistics tableX;
>
> Error: ERROR 6000 (TIM01): Operation timed out . Query couldn't be
> completed in the alloted time: 600000 ms (state=TIM01,code=6000)
>
>
>
>
>
> Thank you,
>
>    Constantin
>
>
>
> *From:* Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com
> <Constantin.Ciureanu@gfk.com>]
> *Sent:* Monday, February 16, 2015 10:31 AM
> *To:* user@phoenix.apache.org
> *Subject:* RE: Update statistics made query 2-3x slower
>
>
>
> Hi Mujtaba,
>
>
>
> 1.     Update statistics - was executed after loading the data in the
> table (and after first select count(*)… of course).
>
> 2.     6 region servers x ~50 regions each = 310 regions (too many?)
>
> 3.       Here it is:
>
> explain select count(*) from tableX;
>
> +------------------------------------------+
>
> |                   PLAN                   |
>
> +------------------------------------------+
>
> | CLIENT 24-CHUNK PARALLEL 24-WAY FULL SCAN OVER tableX |
>
> |     SERVER FILTER BY FIRST KEY ONLY      |
>
> |     SERVER AGGREGATE INTO SINGLE ROW     |
>
> +------------------------------------------+
>
> 3 rows selected (0.287 seconds)
>
>
>
> Thank you,
>
>   Constantin
>
>
>
> *From:* Mujtaba Chohan [mailto:mujtaba@apache.org <mujtaba@apache.org>]
> *Sent:* Friday, February 13, 2015 7:21 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Update statistics made query 2-3x slower
>
>
>
> Hi Constantin,
>
> This in useful info. Just to clarify slowdown that you see after update
> statistics case, was the *update statistics* executed after initial data
> load or after you upserted it again?
>
> 2. How many regions/region servers (RS) did the data end up on?
>
> 3. 30/60 seconds for count(*) seems really high. Do you see lots of disk
> I/O during your count query? How wide are your rows and how much memory is
> available on your RS/HBase heap?
>
>
>
> 3. Can you also send output of *explain select count(*) from tablex* for
> this case?
>
>
> Thanks,
>
> Mujtaba
>
>
>
> On Fri, Feb 13, 2015 at 12:34 AM, Ciureanu, Constantin (GfK) <
> Constantin.Ciureanu@gfk.com> wrote:
>
> Hello Mujtaba,
>
>
>
> Don’t worry – it was just the *select count(*) from tableX* that was
> slowed down in a more than visible way.
>
> I presume all the regular queries do actually benefit from using the STATS.
>
>
>
> Some other cases where I saw slowdown for “*select count(*) from tableX*”:
>
> -       First time after loading 6 M records – the time to obtain the
> count was ~30 sec
>
> -       After loading the *same* 6 M records again – the time almost
> doubled L I imagine the data is doubled, not yet compacted in HBase
>
> -       After deleting the 6M rows (delete from …. , not truncate) and
> loading the 6M rows again – the same double time – same comment as above
>
> -       After update statistics tableX – the time was around 2x the
> original time (~60 seconds) – this I couldn’t really explain (perhaps the
> fleet I use is undersized)
>
>
>
> I need to mention that I’m using 4.2.2 but I can’t wait for 4.3 to be
> released (as it will fix some issues I have. Eg. one with SKIP SCAN:
>
> [1st part of PK between A and B] or [first part of PK between C and D] or
> [….] was understood as a full table scan = painfully slow -> but this
> worked today after I used a hint in the SELECT /*+ SKIP_SCAN */ which
> shouldn’t be mandatory in my opinion).
>
>
>
> Regards,
>
>   Constantin
>
>
>
> *From:* Mujtaba Chohan [mailto:mujtaba@apache.org]
> *Sent:* Thursday, February 12, 2015 9:20 PM
>
>
> *To:* user@phoenix.apache.org
> *Subject:* Re: Update statistics made query 2-3x slower
>
>
>
> Constantin - If possible can you please share your schema, approx.
> row/columns width, number of region servers in your cluster plus their heap
> size, HBase/Phoenix version and any default property overrides so we can
> identify why stats are slowing things down in your case.
>
>
>
> Thanks,
>
> Mujtaba
>
>
>
> On Thu, Feb 12, 2015 at 12:56 AM, Ciureanu, Constantin (GfK) <
> Constantin.Ciureanu@gfk.com> wrote:
>
> It worked!
>
> Without stats it’s again faster (2-3x times) – but I do understand that
> all other normal queries might benefit from the stats.
>
>
>
> Thank you Mujtaba for the info,
>
> Thank you Vasudevan for the explanations, I already used HBase and I agree
> it’s hard to have a counter for the table rows (especially if the
> tombstones for deleted rows are still there – ie. not compacted yet).
>
>
>
> Constantin
>
>
>
>
>
>
>
> *From:* Mujtaba Chohan [mailto:mujtaba@apache.org]
> *Sent:* Wednesday, February 11, 2015 8:54 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Update statistics made query 2-3x slower
>
>
>
> To compare performance without stats, try deleting related rows from
> SYSTEM.STATS or an easier way, just truncate SYSTEM.STATS table from HBase
> shell and restart your region servers.
>
> //mujtaba
>
>
>
> On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S <
> ramkrishna.s.vasudevan@intel.com> wrote:
>
> Hi Constantin
>
>
>
> Before I could explain on the slowness part let me answer your 2nd
> question,
>
>
>
> Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data
> that is residing inside logical entities called regions are spread across
> different nodes (region servers).  There is nothing like a table that is in
> one location where you can keep updating the count of rows that is getting
> inserted.
>
>
>
> Which means that when you need  count(*) you may have to aggregate the
> count from every region distributed across region servers. So in other
> words a table is not a single entity it is a collection of regions.
>
>
>
> Coming to your slowness in query, the update statistics query allows you
> to parallelize the query into logical chunks on a single region.  Suppose
> there are 100K rows in a region the statistics collected would allow you to
> run a query parallely for eg say execute parallely on 10 equal chunks of
> 10000 rows within that region.
>
>
>
> Have you modified any of the parameters related to statistics like this
> one ‘phoenix.stats.guidepost.width’.
>
>
>
>
>
> Regards
>
> Ram
>
> *From:* Ciureanu, Constantin (GfK) [mailto:Constantin.Ciureanu@gfk.com]
> *Sent:* Wednesday, February 11, 2015 2:51 PM
> *To:* user@phoenix.apache.org
> *Subject:* Update statistics made query 2-3x slower
>
>
>
> Hello all,
>
>
>
> 1.     Is there a good explanation why updating the statistics:
>
> *update statistics tableX;*
>
>
>
> made this query 2x times slower?   (it was 27 seconds before, now it’s
> somewhere between 60 – 90 seconds)
>
> *select count(*) from tableX;*
>
> +------------------------------------------+
>
> |                 COUNT(1)                 |
>
> +------------------------------------------+
>
> | 5786227                                  |
>
> +------------------------------------------+
>
> 1 row selected (62.718 seconds)
>
>
>
> (If possible J ) how can I “drop” those statistics?
>
>
>
> 2. Why there is nothing (like a counter / attribute for the table) to
> obtain the number of rows in one table fast?
>
>
>
> Thank you,
>
>    Constantin
>
>
>
>
>
>
>

Mime
View raw message