phoenix-user mailing list archives

From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject Guidance on how many regions to plan for
Date Mon, 18 Jan 2016 17:57:29 GMT
In the past, my struggles with HBase/Phoenix have been related to data ingest.

Each night, we ingest lots of data via CsvBulkUpload.

After a lot of trial and error trying to get our largest table to cooperate, I found a primary
key that distributes well when I specify split points at table creation.

That table now has ~15 billion rows representing about 300GB of data across 513 regions (on
9 region servers).
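
To illustrate the shape of it (the table and column names below are simplified placeholders,
not our real schema, and the split points are made up), the pre-split DDL looks roughly like this:

    -- split points chosen to line up with the distribution of the leading key column
    CREATE TABLE sensor_readings (
        device_id  VARCHAR NOT NULL,
        read_time  DATE    NOT NULL,
        value      DECIMAL,
        CONSTRAINT pk PRIMARY KEY (device_id, read_time)
    ) SPLIT ON ('100000', '200000', '300000');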

Life was good for a while.

Now I have a new use case where I need another, very similar table, but rather than serving
UI-based reports, this table will be queried programmatically and VERY heavily (millions of
queries per day).

I have asked about this in the past, but got derailed to other things, so I'm trying to zoom
out a bit and make sure I approach this problem correctly.

My simplified use case is basically: de-dup input files against Phoenix before passing them
on to the rest of our ingest process. This will result in tens of thousands of queries to
Phoenix per input file.
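
The de-dup check itself is just a point lookup on the primary key for each incoming record,
something like this (same placeholder names as above):

    -- one existence check per incoming record; key columns bound per row
    SELECT 1
    FROM sensor_readings
    WHERE device_id = ? AND read_time = ?;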

I noted in the past that after 5-10K rapid-fire queries, response times degrade dramatically.
And I think we established that this is because one thread is spawned per 20 MB chunk of data
in each region (?).

More generally, it seems that the more regions there are in my table, the more resource-intensive
Phoenix queries become?

Is that correct?

I estimate that my table will contain about 500GB of data by the end of 2016.

The rows are pretty small (6 to 8 small columns). I have 9 region servers, soon to be 12.

The distribution is usually 2,000-5,000 rows per primary key, which is about 0.5-3 MB of data.

Given that information, is there a good rule of thumb for how many regions I should try to
target with my schema/primary key design?

I experimented with salt buckets (presumably letting Phoenix choose how to split everything),
but I keep getting errors when I try to bulk load data into salted tables ("Import job on
table blah failed due to exception: java.io.IOException: Trying to load more than 32 hfiles
to one family of one region").
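
For reference, the salted variant I tried looked roughly like this (placeholder names again,
and the bucket count was an arbitrary guess on my part):

    -- let Phoenix prepend a salt byte and pre-split the table into N buckets
    CREATE TABLE sensor_readings_salted (
        device_id  VARCHAR NOT NULL,
        read_time  DATE    NOT NULL,
        value      DECIMAL,
        CONSTRAINT pk PRIMARY KEY (device_id, read_time)
    ) SALT_BUCKETS = 16;

From what I can tell, the 32-hfile limit in that exception comes from
hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32), which I could presumably
raise, but I'd rather get the region/bucket layout right than just bump the limit.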

Are there HBase configuration tweaks I should focus on? My current memstore size is set to
256 MB.

Thanks for any guidance or tips here.



