phoenix-user mailing list archives

From: Satish Iyengar <sat...@gmail.com>
Subject: Re: Help tuning for bursts of high traffic?
Date: Fri, 04 Dec 2015 14:43:08 GMT
Hi Zack,

Did you consider avoiding hitting HBase for every single row by doing that
step in an offline mode? I was thinking you could have some kind of daily
export of the HBase table, and then use Pig to perform the join (a co-group,
perhaps) to do the same. Obviously this would only work when your HBase
table is not maintained by a stream-based system. HBase is really good at
range scans and may not be ideal for a large number of single-row lookups.
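Something along these lines, say (the file names, column layout, and
delimiters are made up for illustration):

daily_input  = LOAD 'input/widgets.csv' USING PigStorage(',')
               AS (widget_id:chararray, sample_time:long, value:double);
hbase_export = LOAD 'export/widget_table' USING PigStorage('\t')
               AS (widget_id:chararray, sample_time:long, value:double);

-- co-group both relations on the composite key (widget id + sample point)
grouped  = COGROUP daily_input BY (widget_id, sample_time),
                   hbase_export BY (widget_id, sample_time);

-- keep only keys present in the daily file but absent from the export
new_rows = FILTER grouped BY IsEmpty(hbase_export);
deduped  = FOREACH new_rows GENERATE FLATTEN(daily_input);

STORE deduped INTO 'output/deduped' USING PigStorage(',');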

Thanks,
Satish

On Fri, Dec 4, 2015 at 9:09 AM, Riesland, Zack <Zack.Riesland@sensus.com>
wrote:

> SHORT EXPLANATION: a much higher percentage of queries to Phoenix respond
> exceptionally slowly after querying very heavily for several minutes.
>
> LONGER EXPLANATION:
>
> I’ve been using Phoenix for about a year as a data store for web-based
> reporting tools and it works well.
>
> Now, I’m trying to use the data in a different (much more
> request-intensive) way and encountering some issues.
>
> The scenario is basically this:
>
> Daily, we ingest very large CSV files with data for widgets.
>
> Each input file has hundreds of rows of data for each widget, and tens of
> thousands of unique widgets.
>
> As a first step, I want to de-duplicate this data against my Phoenix-based
> DB (I can’t rely on just upserting the data for de-dup because it will go
> through several ETL steps before being stored into Phoenix/HBase).
>
> So, per-widget, I perform a query against Phoenix (the table is keyed
> against the unique widget ID + sample point). I get all the data for a
> given widget id, within a certain period of time, and then I only ingest
> rows for that widget that are new to me.
>
> I’m doing this in Java in a single step: I loop through my input file and
> perform one query per widget, using the same Connection object to Phoenix.
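> Roughly, the loop looks like this (a simplified sketch; the table, column,
> and variable names here are placeholders rather than my real schema):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import java.sql.ResultSet;
> import java.util.List;
>
> public class WidgetDedupLookup {
>     // One lookup per widget, reusing a single connection and statement.
>     static void findExistingRows(List<String> widgetIds,
>                                  long windowStart, long windowEnd) throws Exception {
>         try (Connection conn =
>                  DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
>              PreparedStatement stmt = conn.prepareStatement(
>                  "SELECT SAMPLE_TIME, VALUE FROM WIDGET_DATA "
>                      + "WHERE WIDGET_ID = ? AND SAMPLE_TIME >= ? AND SAMPLE_TIME < ?")) {
>             for (String widgetId : widgetIds) {
>                 stmt.setString(1, widgetId);
>                 stmt.setLong(2, windowStart);
>                 stmt.setLong(3, windowEnd);
>                 try (ResultSet rs = stmt.executeQuery()) {
>                     while (rs.next()) {
>                         // compare against the input file's rows for this
>                         // widget and keep only the ones that are new
>                     }
>                 }
>             }
>         }
>     }
> }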
>
> THE ISSUE:
>
> What I’m finding is that for the first several thousand queries, I almost
> always get a very fast (less than 10 ms) response (good).
>
> But after 15-20 thousand queries, the responses start to get MUCH slower.
> Some queries respond as expected, but many take as long as 2-3 minutes,
> pushing the total time to prime the data structure into the 12-15 hour
> range, when it would only take 2-3 hours if all the queries were fast.
>
> The exact same queries, when run manually and not as part of this bulk
> process, return in the expected < 10 ms.
>
> So it SEEMS like the burst of queries puts Phoenix into some sort of busy
> state that causes it to respond far too slowly.
>
> The connection properties I’m setting are:
>
> phoenix.query.timeoutMs: 90000
> phoenix.query.keepAliveMs: 90000
> phoenix.query.threadPoolSize: 256
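>
> (I pass these to the Phoenix JDBC connection roughly like this; the
> ZooKeeper quorum below is a placeholder:)
>
> // using java.util.Properties and java.sql.DriverManager
> Properties props = new Properties();
> props.setProperty("phoenix.query.timeoutMs", "90000");
> props.setProperty("phoenix.query.keepAliveMs", "90000");
> props.setProperty("phoenix.query.threadPoolSize", "256");
> Connection conn =
>     DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props);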
>
> Our cluster has 9 (beefy) region servers, and the table I’m referencing
> has 511 regions. We went through a lot of pain to get the data split
> extremely well, and I don’t think schema design is the issue here.
>
> Can anyone help me understand how to make this better? Is there a better
> approach I could take? A better set of configuration parameters? Is our
> cluster just too small for this?
>
> Thanks!



-- 
Satish Iyengar

"Anyone who has never made a mistake has never tried anything new."
Albert Einstein
