phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject RE: Help tuning for bursts of high traffic?
Date Thu, 10 Dec 2015 20:15:32 GMT
Thanks Jain,

The 3800 seconds was just for the executeQUery()

I've seen it as high as 5900
________________________________________
From: Samarth Jain [samarth.jain@gmail.com]
Sent: Thursday, December 10, 2015 2:31 PM
To: user@phoenix.apache.org
Subject: Re: Help tuning for bursts of high traffic?

Thanks for the additional information, Zack. Looking at the numbers it looks like the bottle-neck
is probably not coming from the phoenix thread pool.

For request level metrics:

TASK_QUEUE_WAIT_TIME - represents the length of time (wall clock)  phoenix scans had to wait
in the thread pool's queue before they were picked up for execution. A higher value of this
would mean that you likely need to increase your thread pool size.

TASK_EXECUTION_TIME - represents the time taken by scans to complete.

TASK_END_TO_END_TIME - is roughly task_queue_wait_time + task_execution_time

Did you get 3800ms for stmt.executeQuery() itself or did that time include time spent in retrieving
records via resultSet.next() too?


On Thu, Dec 10, 2015 at 7:38 AM, Riesland, Zack <Zack.Riesland@sensus.com<mailto:Zack.Riesland@sensus.com>>
wrote:
Thanks,

I did some experimenting.

Now, anytime I get a query that lasts longer than 500 ms, I call getRequestReadMetrics() and
print everything.

The output suggests to me that Phoenix thinks that the query is still reasonably fast:

TASK_QUEUE_WAIT_TIME: 0
TASK_EXECUTION_TIME: 80
TASK_END_TO_END_TIME: 80

Should I read this as: “The query did not have to wait at all and took 80 ms to execute”
?

Comparing System.currentTimeMillis() immediately before and after calling stmt.executeQuery()
is about 3,800 ms for this particular example.

The other (non-zero) counters, from getGlobalPhoenixClientMetrics() at this time, are:

QUERY_TIME: 347802
NUM_PARALLEL_SCANS: 5638
SCAN_BYTES: 3074163200<tel:3074163200>
TASK_QUEUE_WAIT_TIME: 12129
TASK_END_TO_END_TIME: 851524
TASK_EXECUTION_TIME: 839347
MUTATION_SQL_COUNTER: 93
SELECT_SQL_COUNTER: 5610 (this is the number of  total queries on this connection)

Can you please help me understand how to interpret all this and possibly understand how to
configure this scenario to make it faster?

Right now, I’m only averaging about 5 queries/second, even though I’m querying by the
primary key.

Before I upgraded, I was getting a lot closer to 100.

Thanks!



From: Samarth Jain [mailto:samarth@apache.org<mailto:samarth@apache.org>]
Sent: Wednesday, December 09, 2015 1:59 PM
To: user@phoenix.apache.org<mailto:user@phoenix.apache.org>
Subject: Re: Help tuning for bursts of high traffic?

Zack,

These stats are collected continuously and at the global client level. So collecting them
only when the query takes more than 1 second won't work. A better alternative for you would
be to report stats at a request level. You could then conditionally report the metrics for
queries that exceed 1 second execution. The metric you want to make that decision on is WALL_CLOCK_TIME_MS.

Request level metric collection isn't enabled by default. You can turn it on/off for a PhoenixConnection
by doing the following:

Properties props = new Properties();
props.put(QueryServices.COLLECT_REQUEST_LEVEL_METRICS, String.valueOf(true));
Connection conn = DriverManager.getConnection(url, props);

Then see PhoenixRuntime.getRequestReadMetrics() on how to get hold of these metrics.

I have a JIRA filed - https://issues.apache.org/jira/browse/PHOENIX-2486 for adding the documentation
for these metrics. Will get to it shortly.






On Wed, Dec 9, 2015 at 10:38 AM, Riesland, Zack <Zack.Riesland@sensus.com<mailto:Zack.Riesland@sensus.com>>
wrote:
Thanks a lot, James.

And now that I’m using 4.6, I have access to some statistics.

I added code to collect stats whenever a query takes more than 1 second.

The first time this happened was after about 51,000 queries, and the output is below, for
whatever it’s worth.

Can you point me to any documentation to describe these values, or to utilize this information?

I have a hunch that SCAN_BYTES and TASK_QUEUE_WAIT_TIME are the interesting numbers here?

PHOENIX STATS:
Batch sizes of mutations | MUTATION_BATCH_SIZE: 0, Number of samples: 0 | 0
Size of mutations in bytes | MUTATION_BYTES: 0, Number of samples: 0 | 0
Time it took to commit mutations | MUTATION_COMMIT_TIME: 0, Number of samples: 0 | 0
Query times | QUERY_TIME: 2424030, Number of samples: 51555 | 2424030
Number of scans that were executed in parallel | NUM_PARALLEL_SCANS: 51769, Number of samples:
51554 | 51769
Number of bytes read by scans | SCAN_BYTES: 28193431552, Number of samples: 220261184 | 28193431552
Size of spool files created in bytes | SPOOL_FILE_SIZE: 0, Number of samples: 0 | 0
Number of bytes allocated by the memory manager | MEMORY_CHUNK_BYTES: 0, Number of samples:
0 | 0
Number of milliseconds threads needed to wait for memory to be allocated through memory manager
| MEMORY_WAIT_TIME: 0, Number of samples: 0 | 0
Time in milliseconds tasks had to wait in the queue of the thread pool executor | TASK_QUEUE_WAIT_TIME:
114349, Number of samples: 2235211 | 114349
Time in milliseconds spent by tasks from creation to completion | TASK_END_TO_END_TIME: 6030797,
Number of samples: 2235211 | 6030797
Time in milliseconds tasks took to execute | TASK_EXECUTION_TIME: 5916067, Number of samples:
2235211 | 5916067
Counter for number of mutation sql statements | MUTATION_SQL_COUNTER: 93, Number of samples:
93 | 93
Counter for number of sql queries | SELECT_SQL_COUNTER: 51555, Number of samples: 51555 |
51555
Counter for number of tasks submitted to the thread pool executor | TASK_EXECUTED_COUNTER:
2235211, Number of samples: 2235211 | 2235211
Counter for number of tasks that were rejected by the thread pool executor | TASK_REJECTED_COUNTER:
0, Number of samples: 0 | 0
Number of times query timed out | QUERY_TIMEOUT_COUNTER: 0, Number of samples: 0 | 0
Number of times query failed | QUERY_FAILED_COUNTER: 0, Number of samples: 0 | 0
Number of spool files created | SPOOL_FILE_COUNTER: 0, Number of samples: 0 | 0

From: James Taylor [mailto:jamestaylor@apache.org<mailto:jamestaylor@apache.org>]
Sent: Tuesday, December 08, 2015 7:10 PM

To: user
Subject: Re: Help tuning for bursts of high traffic?

Thanks for clarifying, Zack. For profiling on your client machine, you can just use VisualVM
as it's built into the JDK (https://dzone.com/articles/best-kept-secret-jdk-visualvm).

    James

On Tue, Dec 8, 2015 at 6:31 AM, Riesland, Zack <Zack.Riesland@sensus.com<mailto:Zack.Riesland@sensus.com>>
wrote:
Thanks for your reply, James.

I apologize if I’m not being very clear.

To clarify, NO DATA is being modified (upserted or deleted) in HBase/Phoenix during this use
case. Only queries during the day time.

Throughout the day, different files are consumed by this java process (performing the queries),
and resulting files are generated.

Then, at night, long after this code is finished running, a series of batch processes take
these files and eventually ingest their data into Hive. THEN, the Hive data is sent to Phoenix
via CsvBulkUpload.

These processes are completely independent and do not run at the same time. Data in Phoenix
is ONLY modified at night.

Finally, yes, the pattern that I’m seeing regarding CPU utilization and queries taking minutes
instead of milliseconds is on a  CLIENT machine – NOT on a data node. My theory is that
the Phoenix client is somehow getting behind or doing garbage collection in a way that causes
the queries to get backed up.

Some comments (replies inline):
- 4.2.2 is a pretty old release at this point - we've up to 4.6.0 as our last stable release.
I'd recommend upgrading.
--> Agreed. I’m going to follow the instructions here and try to upgrade to 4.6.0 today:
https://phoenix.apache.org/upgrading.html

- If the problem is on the client-side, you should be able to profile the Java application
during the period of slowness and get us the details we need to diagnose this.
--> Do you have a recommendation for a tool that I should go learn about for this?

- Phoenix executes a single query using many threads already, so your ~2000 thread calculations
wouldn't be accurate. For a little bit more info on how Phoenix parallelized queries, see
https://phoenix.apache.org/tuning.html#Parallelization. At a minimum, Phoenix will execute
one thread per region involved in your query. You can control the level of parallelization
through the phoenix.stats.guidepost.width config parameter. If you want to minimize the parallelization
that Phoenix does, you can set this config parameter to a very larger value in the server
side hbase-site.xml. The default is 104857600 (or 10MB). If you set it to your MAX_FILESIZE
(the size you allow a region to grow to before it splits - default 20GB), then you're essentially
disabling it. You could also try increasing it somewhere in between to maybe 5 or 10GB.
--> Thank you. This is helpful. I will experiment with this.

- Running 8 simultaneous instances on the same client will overload that machine. Run them
on separate client machines - but start with a single one on one machine.
--> I only have one machine for this (for now), but it is a very beefy machine (8 very
fast CPUs and almost 1 TB of RAM). I need to perform this input/dedup operation for about
500 files each day (one per customer), and each file will require 5,000-100,000 queries, depending
on the size of the customer. Supposing an average of 10K queries/customer, that’s about
5 million queries per day. At 10ms per query, that’s about 14 hours, which I need to improve
considerably, especially as our data grows. Based on some experimenting, I’ve found that
I can get an overall average much closer to 2-3ms / query if I run several instances in parallel.
However, this is when I start to encounter the thread limit issue. So I’ll continue to experiment
with this and I appreciate any feedback or recommendations this community can provide.

Thanks!


From: James Taylor [mailto:jamestaylor@apache.org<mailto:jamestaylor@apache.org>]
Sent: Monday, December 07, 2015 6:34 PM

To: user
Subject: Re: Help tuning for bursts of high traffic?

Thanks for the additional information, Zack. I'd like to confirm I understand how you're using
Phoenix:
- Data is not being upserted at the same time it is being queried. But I'm confused by your
statement of "and THEN some of the Hive data is send to HBase/Phoenix". How is the data being
sent? How much data? Is querying occurring while this data is being ingested?
- Querying happens *after* any data ingest. "So, per-widget, I perform a query against Phoenix".
Are you determining duplicates through these queries and then issuing a Phoenix DELETE of
the duplicates?
- The CPUs you mention being pegged are the CPUs on the *client* machine: "I see that, periodically
– maybe every 60 or 90 seconds – all of my CPUs (there are 8 on this machine) go from
mildly busy to almost totally pegged".

If you could confirm the above, that'd be super helpful. I can't think of any period process
in Phoenix that runs every 60 or 90 seconds.

Some comments:
- 4.2.2 is a pretty old release at this point - we've up to 4.6.0 as our last stable release.
I'd recommend upgrading.
- If the problem is on the client-side, you should be able to profile the Java application
during the period of slowness and get us the details we need to diagnose this.
- Phoenix executes a single query using many threads already, so your ~2000 thread calculations
wouldn't be accurate. For a little bit more info on how Phoenix parallelized queries, see
https://phoenix.apache.org/tuning.html#Parallelization. At a minimum, Phoenix will execute
one thread per region involved in your query. You can control the level of parallelization
through the phoenix.stats.guidepost.width config parameter. If you want to minimize the parallelization
that Phoenix does, you can set this config parameter to a very larger value in the server
side hbase-site.xml. The default is 104857600 (or 10MB). If you set it to your MAX_FILESIZE
(the size you allow a region to grow to before it splits - default 20GB), then you're essentially
disabling it. You could also try increasing it somewhere in between to maybe 5 or 10GB.
- Running 8 simultaneous instances on the same client will overload that machine. Run them
on separate client machines - but start with a single one on one machine.

On Mon, Dec 7, 2015 at 8:32 AM, Riesland, Zack <Zack.Riesland@sensus.com<mailto:Zack.Riesland@sensus.com>>
wrote:
Also, and somewhat related:

I’m trying to running 8 simultaneous instances of this code (on 8 separate input files),
since I have 8 CPUs on the machine.

When I try this, I get java.lang.RuntimeException: java.lang.OutOfMemory: unable to create
a new native thread

My phoenix connection has the “phoenix.query.threadPoolSize” set to “256”, which should
result in 8x256 = ~2,000 threads being spawned by Phoenix.

Is that correct?

I’m running RH Linux 6, with my ulimit set to “unlimited”, so I should be able to handle
thousands of threads.

Any ideas?

From: Andrew Purtell [mailto:andrew.purtell@gmail.com<mailto:andrew.purtell@gmail.com>]
Sent: Friday, December 04, 2015 4:24 PM
To: user@phoenix.apache.org<mailto:user@phoenix.apache.org>
Cc: Haisty, Geoffrey
Subject: Re: Help tuning for bursts of high traffic?

Any chance of stack dumps from the debug servlet? Impossible to get anywhere with 'pegged
the CPU' otherwise. Thanks.

On Dec 4, 2015, at 12:20 PM, Riesland, Zack <Zack.Riesland@sensus.com<mailto:Zack.Riesland@sensus.com>>
wrote:
James,

2 quick followups, for whatever they’re worth:

1 – There is nothing phoenix-related in /tmp

2 – I added a ton of logging, and played with the properties a bit, and I think I see a
pattern:

Watching the logging and the system profiler side-by-side, I see that, periodically – maybe
every 60 or 90 seconds – all of my CPUs (there are 8 on this machine) go from mildly busy
to almost totally pegged.

They USUALLY stay pegged for 5-10 seconds, and then calm down.

However, occasionally, they stay pegged for around a minute. When this happens, I get the
very slow queries. I added logic so that when I get a very slow response (> 1 second),
I pause for 30 seconds.

This ‘fixes’ everything, in the sense that I’m usually able to get a couple thousand
good queries before the whole pattern repeats.

For reference, there’s nothing external that should be causing those CPU spikes, so I’m
guessing that it’s maybe java GC (?) or perhaps something that the phoenix client is doing
?

Can you guess at what Phoenix might do periodically that would peg the CPUs – and in such
a way that a query has to wait as much as 2 minutes to execute (I’m guessing from the pattern
that it’s not actually the query that is slow, but a very long between when it gets queued
and when it actually gets executed).

Oh and the methods you mentioned aren’t in my version of PhoenixRuntime, evidently. I’m
on 4.2.2.something.

Thanks for any further feedback you can provide on this. Hopefully the conversation is helpful
to the whole Phoenix community.

From: Riesland, Zack
Sent: Friday, December 04, 2015 1:36 PM
To: user@phoenix.apache.org<mailto:user@phoenix.apache.org>
Cc: Geoff.haisty@sensus.com<mailto:Geoff.haisty@sensus.com>
Subject: RE: Help tuning for bursts of high traffic?

Thanks, James

I'll work on gathering more information.

In the meantime, answers to a few of your questions inline below just narrow the scope a bit:

________________________________
From: James Taylor [jamestaylor@apache.org<mailto:jamestaylor@apache.org>]
Sent: Friday, December 04, 2015 12:21 PM
To: user
Subject: Re: Help tuning for bursts of high traffic?
Zack,
Thanks for reporting this and for the detailed description. Here's a bunch of questions and
some things you can try in addition to what Andrew suggested:
1) Is this reproducible in a test environment (perhaps through Pherf: https://phoenix.apache.org/pherf.html)
so you can experiment more?
-Will check

2) Do you get a sense of whether the bottleneck is on the client or the server? CPU, IO, or
network? How many clients are you running and have you tried increasing this? Do you think
your network is saturated by the data being returned?
-I'm no expert on this. When I look at the HBase dashboard on Ambari, everything looks good.
When I look at the stats on the machine running the java code, it also looks good. Certainly
no bottleneck related to memory or CPU. Network wise, the box is on the same rack as the cluster,
with 10GB switches everywhere, so I'd be surprised if network latency were an issue.

3) From your description, it sounds like you're querying the data as your ingesting. When
it gets slow, have you tried running a major compaction to see if that helps? Perhaps queries
are getting slower because of the number of HFiles that need to be merged.
-Rereading my original email, I see where you get that. But actually, there is nothing being
ingested by HBase during this process. At the end of the process, I generate a CSV file that
is then consumed and altered by Pentaho, then consumed by Hive, and THEN some of the Hive
data is send to HBase/Phoenix. So this is part of the ingest process, but a precursor to the
cluster Ingesting any data.

4) If you bounce your cluster when it gets slow, does this have any impact?
-Can check. What should I expect to happen if I restart HBase-related services while trying
to query Phoenix? Will the query just wait until everything is back up? Will I get strange
exceptions? (Of course I'll go find this out myself)

5) What kinds of queries are running? Aggregation? Joins? Or just plain single table selects?
Any ORDER BY clauses? Are you using secondary indexes, and if so, what kind?
-Very simple query: select x, y, z from my_table where key = 'my_key' and sample_point <=
upper_range and sample_point >= lower_range. x, y, and z are integers.

6) Are you seeing GC pauses on the server during times of slowness (correlate time of slowness
with your server logs)?
-Can look

7) Sounds like your queries are returning a lot of data. On the client-side, Phoenix will
keep phoenix.query.spoolThresholdBytes in memory and then spool to disk as parallel execution
happens. Are you seeing many spool files on the client side your /tmp directory (this is where
Phoenix puts these by default with a name of ResultSpoolerXXX.bin). Try increasing this spool
threshold if that's the case.
-I'll look into this

8) For the data ingest, are you using UPSERT VALUES? How big of batches are you committing?
That's one thing to tune, especially if you're using secondary indexing.
-Again, nothing ingesting

9) Have you tried tuning the level of parallelization that Phoenix is doing for queries? This
is controlled by the server-side phoenix.stats.guidepost.width parameter (assuming you haven't
set the phoenix.stats.guidepost.per.region parameter) and defaults to 300MB. Try increasing
it (you'll need to run a major compaction for this to take effect, and there's 15min lag to
when the client sees it).
-This sounds interesting. I'll have to learn more about it.

10) If you're doing aggregation or join queries, try increasing the phoenix.query.maxGlobalMemorySize
property on the server side. Both hash joins and aggregation are done in memory, up to this
% limit. If the limit is reached, then on the server side, Phoenix will wait phoenix.query.maxGlobalMemoryWaitMs
time for the usage to go below the limit (and then throw an exception if it doesn't). You
can try tuning this wait time down to see if it has an impact.
-Not applicable

11) There a bunch of client-side metrics you can collect (but little documentation yet - keep
your eye on PHOENIX-2486) that might help you diagnose this. See PhoenixRuntime.getGlobalPhoenixClientMetrics(),
PhoenixRuntime.getOverAllReadRequestMetrics(), and other methods with Metrics in the name.
-I'll look into this

12) There's also tracing, which is end-to-end client/server, but it's in a bit on the raw
side still: https://phoenix.apache.org/tracing.htmlhttps://phoenix.apache.org/tracing.html
-OK

There's more information on these tuning parameters here: https://phoenix.apache.org/tuning.html
and you should take a look at Andrew's excellent tuning presentation here: https://phoenix.apache.org/resources.html.

Thanks,
James


On Fri, Dec 4, 2015 at 8:28 AM, Andrew Purtell <apurtell@apache.org<mailto:apurtell@apache.org>>
wrote:
Kumar - I believe you mentioned you are seeing this in a cluster of ~20 regionservers.

Zack - Yours is smaller yet, at 9.

These clusters are small enough to make getting stack dumps through the HBase debug servlet
during periods of unusually slow response possible. Perhaps you can write a script that queries
all of the debug servlets (can use curl) and dumps the received output into per-regionserver
files? Scrape every 10 or so seconds during the observed periods of slowness? Then compress
them and make them available for Phoenix devs up on S3? Consider it a poor man's sampler.
I don't know what we might find, but this could prove very helpful.


On Fri, Dec 4, 2015 at 8:11 AM, Kumar Palaniappan <kpalaniappan@marinsoftware.com<mailto:kpalaniappan@marinsoftware.com>>
wrote:
I'm in the same exact position as Zack described. Appreciate your feedback.

So far we tried the call queue n the handlers, nope. Planned to try off-heap cache.

Kumar Palaniappan<http://about.me/kumar.palaniappan>   <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>

On Dec 4, 2015, at 6:45 AM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks Satish,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
To clarify: I’m not looking up single rows. I’m looking up the history of each widget,
which returns hundreds-to-thousands of results per widget (per query). <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Each query is a range scan, it’s just that I’m performing thousands of them.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
From: Satish Iyengar [mailto:satysh@gmail.com]
Sent: Friday, December 04, 2015 9:43 AM
To: user@phoenix.apache.org
Subject: Re: Help tuning for bursts of high traffic?<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Hi Zack,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Did you consider avoiding hitting hbase for every single row by doing that step in an offline
mode? I was thinking if you could have some kind of daily export of hbase table and then use
pig to perform join (co-group perhaps) to do the same. Obviously this would work only when
your hbase table is not maintained by stream based system. Hbase is really good at range scans
and may not be ideal for single row (large number of).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks,<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Satish<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
On Fri, Dec 4, 2015 at 9:09 AM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
SHORT EXPLANATION: a much higher percentage of queries to phoenix return exceptionally slow
after querying very heavily for several minutes.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
LONGER EXPLANATION:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
I’ve been using Pheonix for about a year as a data store for web-based reporting tools and
it works well.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Now, I’m trying to use the data in a different (much more request-intensive) way and encountering
some issues.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The scenario is basically this:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Daily, ingest very large CSV files with data for widgets. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Each input file has hundreds of rows of data for each widget, and tens of thousands of unique
widgets.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
As a first step, I want to de-duplicate this data against my Phoenix-based DB (I can’t rely
on just upserting the data for de-dup because it will go through several ETL steps before
being stored into Phoenix/HBase).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
So, per-widget, I perform a query against Phoenix (the table is keyed against the unique widget
ID + sample point). I get all the data for a given widget id, within a certain period of time,
and then I only ingest rows for that widget that are new to me.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
I’m doing this in Java in a single step: I loop through my input file and perform one query
per widget, using the same Connection object to Phoenix.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
THE ISSUE:<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
What I’m finding is that for the first several thousand queries, I almost always get a very
fast (less than 10 ms) response (good).<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
But after 15-20 thousand queries, the response starts to get MUCH slower. Some queries respond
as expected, but many take as many as 2-3 minutes, pushing the total time to prime the data
structure into the 12-15 hour range, when it would only take 2-3 hours if all the queries
were fast.<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The same exact queries, when run manually and not part of this bulk process, return in the
(expected) < 10 ms. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
So it SEEMS like the burst of queries puts Phoenix into some sort of busy state that causes
it to respond far too slowly. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
The connection properties I’m setting are: <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phoenix.query.timeoutMs: 90000<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phoenix.query.keepAliveMs: 90000<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Phenix.query.threadPoolSize: 256<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Our cluster is 9 (beefy) region servers and the table I’m referencing is 511 regions. We
went through a lot of pain to get the data split extremely well, and I don’t think Schema
design is the issue here. <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Can anyone help me understand how to make this better? Is there a better approach I could
take? A better set of configuration parameters? Is our cluster just too small for this?<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Thanks!<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>


<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
-- <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Satish Iyengar

"Anyone who has never made a mistake has never tried anything new."
Albert Einstein<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>


<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
-- <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)<https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>
 <https://twitter.com/intent/follow?original_referer=https://twitter.com/about/resources/buttons&region=follow_link&screen_name=megamda&source=followbutton&variant=2.0>






Mime
View raw message