phoenix-user mailing list archives

From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject RE: Help tuning for bursts of high traffic?
Date Fri, 04 Dec 2015 14:45:36 GMT
Thanks Satish,

To clarify: I’m not looking up single rows. I’m looking up the history of each widget,
which returns hundreds to thousands of results per widget (per query).

Each query is a range scan; it’s just that I’m performing thousands of them.

From: Satish Iyengar [mailto:satysh@gmail.com]
Sent: Friday, December 04, 2015 9:43 AM
To: user@phoenix.apache.org
Subject: Re: Help tuning for bursts of high traffic?

Hi Zack,

Did you consider avoiding hitting HBase for every single row by doing that step in offline
mode? I was wondering whether you could take some kind of daily export of the HBase table and
then use Pig to perform a join (a co-group, perhaps) to do the same thing. Obviously, this
would only work if your HBase table is not maintained by a stream-based system. HBase is
really good at range scans but may not be ideal for a large number of single-row lookups.

Thanks,
Satish

On Fri, Dec 4, 2015 at 9:09 AM, Riesland, Zack <Zack.Riesland@sensus.com> wrote:
SHORT EXPLANATION: after several minutes of very heavy querying, a much higher percentage of
queries to Phoenix become exceptionally slow.

LONGER EXPLANATION:

I’ve been using Phoenix for about a year as a data store for web-based reporting tools, and
it works well.

Now, I’m trying to use the data in a different (much more request-intensive) way and encountering
some issues.

The scenario is basically this:

Daily, we ingest very large CSV files with data for widgets.

Each input file has hundreds of rows of data for each widget, and tens of thousands of unique
widgets.

As a first step, I want to de-duplicate this data against my Phoenix-based DB (I can’t rely
on just upserting the data for de-dup because it will go through several ETL steps before
being stored into Phoenix/HBase).

So, per widget, I perform a query against Phoenix (the table is keyed on the unique widget
ID + sample point). I get all the data for a given widget ID within a certain period of time,
and then I only ingest rows for that widget that are new to me.
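
Concretely, each lookup is shaped something like the query string below (WIDGET_HISTORY,
WIDGET_ID and SAMPLE_TIME are stand-in names, not our real schema):

    // Illustrative sketch only: table and column names are made up.
    // Because the row key is (widget ID, sample point), this is a bounded range
    // scan over a single widget's rows rather than a point lookup.
    String historySql =
        "SELECT * FROM WIDGET_HISTORY "
      + "WHERE WIDGET_ID = ? "
      + "AND SAMPLE_TIME >= ? AND SAMPLE_TIME < ?";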

I’m doing this in Java in a single step: I loop through my input file and perform one query
per widget, using the same Connection object to Phoenix.
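
In sketch form (CSV parsing, error handling, and the new-row bookkeeping omitted; it reuses
the illustrative historySql string above), the loop is roughly:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.List;

    class DedupPassSketch {
        // One Connection and one reused PreparedStatement drive tens of thousands
        // of per-widget range scans in a single pass over the day's input file.
        static void run(Connection conn, String historySql, List<String> widgetIds,
                        Timestamp from, Timestamp to) throws SQLException {
            try (PreparedStatement stmt = conn.prepareStatement(historySql)) {
                for (String widgetId : widgetIds) {
                    stmt.setString(1, widgetId);
                    stmt.setTimestamp(2, from);
                    stmt.setTimestamp(3, to);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            // record the existing (widgetId, sampleTime) pairs so that
                            // only rows new to us get ingested downstream
                        }
                    }
                }
            }
        }
    }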

THE ISSUE:

What I’m finding is that for the first several thousand queries, I almost always get a very
fast (less than 10 ms) response (good).

But after 15-20 thousand queries, the responses start to get MUCH slower. Some queries respond
as expected, but many take as long as 2-3 minutes, pushing the total time to prime the data
structure into the 12-15 hour range, when it would only take 2-3 hours if all the queries
were fast.

The exact same queries, when run manually rather than as part of this bulk process, return in
the expected < 10 ms.

So it SEEMS like the burst of queries puts Phoenix into some sort of busy state that causes
it to respond far too slowly.

The connection properties I’m setting are:

phoenix.query.timeoutMs: 90000
phoenix.query.keepAliveMs: 90000
phoenix.query.threadPoolSize: 256
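
For context, those settings are passed to the Phoenix JDBC driver as connection properties,
roughly like this (the ZooKeeper quorum string is a placeholder for our real one):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    class PhoenixConnectionSketch {
        // Rough sketch of how the properties above are handed to the driver;
        // "zk1,zk2,zk3:2181" stands in for the real ZooKeeper quorum.
        static Connection open() throws SQLException {
            Properties props = new Properties();
            props.setProperty("phoenix.query.timeoutMs", "90000");
            props.setProperty("phoenix.query.keepAliveMs", "90000");
            props.setProperty("phoenix.query.threadPoolSize", "256");
            return DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181", props);
        }
    }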

Our cluster has 9 (beefy) region servers, and the table I’m referencing has 511 regions. We
went through a lot of pain to get the data split extremely well, and I don’t think schema
design is the issue here.

Can anyone help me understand how to make this better? Is there a better approach I could
take? A better set of configuration parameters? Is our cluster just too small for this?


Thanks!

--
Satish Iyengar

"Anyone who has never made a mistake has never tried anything new."
Albert Einstein