phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject RE: Help tuning for bursts of high traffic?
Date Fri, 04 Dec 2015 20:20:53 GMT
James,

2 quick followups, for whatever they're worth:

1 - There is nothing phoenix-related in /tmp

2 - I added a ton of logging, and played with the properties a bit, and I think I see a pattern:

Watching the logging and the system profiler side-by-side, I see that, periodically - maybe
every 60 or 90 seconds - all of my CPUs (there are 8 on this machine) go from mildly busy
to almost totally pegged.

They USUALLY stay pegged for 5-10 seconds, and then calm down.

However, occasionally, they stay pegged for around a minute. When this happens, I get the
very slow queries. I added logic so that when I get a very slow response (> 1 second),
I pause for 30 seconds.

This 'fixes' everything, in the sense that I'm usually able to get a couple thousand good
queries before the whole pattern repeats.

For reference, there's nothing external that should be causing those CPU spikes, so I'm guessing
that it's maybe java GC (?) or perhaps something that the phoenix client is doing ?

Can you guess at what Phoenix might do periodically that would peg the CPUs - and in such
a way that a query has to wait as much as 2 minutes to execute (I'm guessing from the pattern
that it's not actually the query that is slow, but a very long between when it gets queued
and when it actually gets executed).

Oh and the methods you mentioned aren't in my version of PhoenixRuntime, evidently. I'm on
4.2.2.something.

Thanks for any further feedback you can provide on this. Hopefully the conversation is helpful
to the whole Phoenix community.

From: Riesland, Zack
Sent: Friday, December 04, 2015 1:36 PM
To: user@phoenix.apache.org
Cc: Geoff.haisty@sensus.com
Subject: RE: Help tuning for bursts of high traffic?

Thanks, James

I'll work on gathering more information.

In the meantime, answers to a few of your questions inline below just narrow the scope a bit:

________________________________
From: James Taylor [jamestaylor@apache.org]
Sent: Friday, December 04, 2015 12:21 PM
To: user
Subject: Re: Help tuning for bursts of high traffic?
Zack,
Thanks for reporting this and for the detailed description. Here's a bunch of questions and
some things you can try in addition to what Andrew suggested:
1) Is this reproducible in a test environment (perhaps through Pherf: https://phoenix.apache.org/pherf.html)
so you can experiment more?
-Will check

2) Do you get a sense of whether the bottleneck is on the client or the server? CPU, IO, or
network? How many clients are you running and have you tried increasing this? Do you think
your network is saturated by the data being returned?
-I'm no expert on this. When I look at the HBase dashboard on Ambari, everything looks good.
When I look at the stats on the machine running the java code, it also looks good. Certainly
no bottleneck related to memory or CPU. Network wise, the box is on the same rack as the cluster,
with 10GB switches everywhere, so I'd be surprised if network latency were an issue.

3) From your description, it sounds like you're querying the data as your ingesting. When
it gets slow, have you tried running a major compaction to see if that helps? Perhaps queries
are getting slower because of the number of HFiles that need to be merged.
-Rereading my original email, I see where you get that. But actually, there is nothing being
ingested by HBase during this process. At the end of the process, I generate a CSV file that
is then consumed and altered by Pentaho, then consumed by Hive, and THEN some of the Hive
data is send to HBase/Phoenix. So this is part of the ingest process, but a precursor to the
cluster Ingesting any data.

4) If you bounce your cluster when it gets slow, does this have any impact?
-Can check. What should I expect to happen if I restart HBase-related services while trying
to query Phoenix? Will the query just wait until everything is back up? Will I get strange
exceptions? (Of course I'll go find this out myself)

5) What kinds of queries are running? Aggregation? Joins? Or just plain single table selects?
Any ORDER BY clauses? Are you using secondary indexes, and if so, what kind?
-Very simple query: select x, y, z from my_table where key = 'my_key' and sample_point <=
upper_range and sample_point >= lower_range. x, y, and z are integers.

6) Are you seeing GC pauses on the server during times of slowness (correlate time of slowness
with your server logs)?
-Can look

7) Sounds like your queries are returning a lot of data. On the client-side, Phoenix will
keep phoenix.query.spoolThresholdBytes in memory and then spool to disk as parallel execution
happens. Are you seeing many spool files on the client side your /tmp directory (this is where
Phoenix puts these by default with a name of ResultSpoolerXXX.bin). Try increasing this spool
threshold if that's the case.
-I'll look into this

8) For the data ingest, are you using UPSERT VALUES? How big of batches are you committing?
That's one thing to tune, especially if you're using secondary indexing.
-Again, nothing ingesting

9) Have you tried tuning the level of parallelization that Phoenix is doing for queries? This
is controlled by the server-side phoenix.stats.guidepost.width parameter (assuming you haven't
set the phoenix.stats.guidepost.per.region parameter) and defaults to 300MB. Try increasing
it (you'll need to run a major compaction for this to take effect, and there's 15min lag to
when the client sees it).
-This sounds interesting. I'll have to learn more about it.

10) If you're doing aggregation or join queries, try increasing the phoenix.query.maxGlobalMemorySize
property on the server side. Both hash joins and aggregation are done in memory, up to this
% limit. If the limit is reached, then on the server side, Phoenix will wait phoenix.query.maxGlobalMemoryWaitMs
time for the usage to go below the limit (and then throw an exception if it doesn't). You
can try tuning this wait time down to see if it has an impact.
-Not applicable

11) There a bunch of client-side metrics you can collect (but little documentation yet - keep
your eye on PHOENIX-2486) that might help you diagnose this. See PhoenixRuntime.getGlobalPhoenixClientMetrics(),
PhoenixRuntime.getOverAllReadRequestMetrics(), and other methods with Metrics in the name.
-I'll look into this

12) There's also tracing, which is end-to-end client/server, but it's in a bit on the raw
side still: https://phoenix.apache.org/tracing.htmlhttps://phoenix.apache.org/tracing.html
-OK

There's more information on these tuning parameters here: https://phoenix.apache.org/tuning.html
and you should take a look at Andrew's excellent tuning presentation here: https://phoenix.apache.org/resources.html.

Thanks,
James


On Fri, Dec 4, 2015 at 8:28 AM, Andrew Purtell <apurtell@apache.org<mailto:apurtell@apache.org>>
wrote:
Kumar - I believe you mentioned you are seeing this in a cluster of ~20 regionservers.

Zack - Yours is smaller yet, at 9.

These clusters are small enough to make getting stack dumps through the HBase debug servlet
during periods of unusually slow response possible. Perhaps you can write a script that queries
all of the debug servlets (can use curl) and dumps the received output into per-regionserver
files? Scrape every 10 or so seconds during the observed periods of slowness? Then compress
them and make them available for Phoenix devs up on S3? Consider it a poor man's sampler.
I don't know what we might find, but this could prove very helpful.


On Fri, Dec 4, 2015 at 8:11 AM, Kumar Palaniappan <kpalaniappan@marinsoftware.com<mailto:kpalaniappan@marinsoftware.com>>
wrote:
I'm in the same exact position as Zack described. Appreciate your feedback.

So far we tried the call queue n the handlers, nope. Planned to try off-heap cache.

Kumar Palaniappan<http://about.me/kumar.palaniappan>   <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>

On Dec 4, 2015, at 6:45 AM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks Satish,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
To clarify: I'm not looking up single rows. I'm looking up the history of each widget, which
returns hundreds-to-thousands of results per widget (per query). <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Each query is a range scan, it's just that I'm performing thousands of them.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
From: Satish Iyengar [mailto:satysh@gmail.com]
Sent: Friday, December 04, 2015 9:43 AM
To: user@phoenix.apache.org
Subject: Re: Help tuning for bursts of high traffic?<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Hi Zack,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Did you consider avoiding hitting hbase for every single row by doing that step in an offline
mode? I was thinking if you could have some kind of daily export of hbase table and then use
pig to perform join (co-group perhaps) to do the same. Obviously this would work only when
your hbase table is not maintained by stream based system. Hbase is really good at range scans
and may not be ideal for single row (large number of).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Satish<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
On Fri, Dec 4, 2015 at 9:09 AM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
SHORT EXPLANATION: a much higher percentage of queries to phoenix return exceptionally slow
after querying very heavily for several minutes.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
LONGER EXPLANATION:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
I've been using Pheonix for about a year as a data store for web-based reporting tools and
it works well.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Now, I'm trying to use the data in a different (much more request-intensive) way and encountering
some issues.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The scenario is basically this:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Daily, ingest very large CSV files with data for widgets. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Each input file has hundreds of rows of data for each widget, and tens of thousands of unique
widgets.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
As a first step, I want to de-duplicate this data against my Phoenix-based DB (I can't rely
on just upserting the data for de-dup because it will go through several ETL steps before
being stored into Phoenix/HBase).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
So, per-widget, I perform a query against Phoenix (the table is keyed against the unique widget
ID + sample point). I get all the data for a given widget id, within a certain period of time,
and then I only ingest rows for that widget that are new to me.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
I'm doing this in Java in a single step: I loop through my input file and perform one query
per widget, using the same Connection object to Phoenix.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
THE ISSUE:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
What I'm finding is that for the first several thousand queries, I almost always get a very
fast (less than 10 ms) response (good).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
But after 15-20 thousand queries, the response starts to get MUCH slower. Some queries respond
as expected, but many take as many as 2-3 minutes, pushing the total time to prime the data
structure into the 12-15 hour range, when it would only take 2-3 hours if all the queries
were fast.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The same exact queries, when run manually and not part of this bulk process, return in the
(expected) < 10 ms. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
So it SEEMS like the burst of queries puts Phoenix into some sort of busy state that causes
it to respond far too slowly. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The connection properties I'm setting are: <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phoenix.query.timeoutMs: 90000<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phoenix.query.keepAliveMs: 90000<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phenix.query.threadPoolSize: 256<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Our cluster is 9 (beefy) region servers and the table I'm referencing is 511 regions. We went
through a lot of pain to get the data split extremely well, and I don't think Schema design
is the issue here. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Can anyone help me understand how to make this better? Is there a better approach I could
take? A better set of configuration parameters? Is our cluster just too small for this?<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks!<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>


<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
-- <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Satish Iyengar

"Anyone who has never made a mistake has never tried anything new."
Albert Einstein<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>


<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
-- <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>

Mime
View raw message