phoenix-user mailing list archives

From Ravi Kiran <maghamraviki...@gmail.com>
Subject Re: Guidance on how many regions to plan for
Date Mon, 18 Jan 2016 18:23:59 GMT
Hi Zack,
   The limit of 32 HFiles comes from the configuration property
MAX_FILES_PER_REGION_PER_FAMILY (hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily),
which defaults to 32 in LoadIncrementalHFiles.
  You can try setting it to a larger value in your configuration and see
whether that resolves the error.


https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L116
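
As a sketch of how that override might look (the property key is the one defined
in the source file above; the value 64 is just an example, not a recommendation):

```xml
<!-- hbase-site.xml on the client running the bulk load,
     or pass as -D hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64
     on the job command line -->
<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>64</value>
</property>
```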


Thanks
Ravi

On Mon, Jan 18, 2016 at 9:57 AM, Riesland, Zack <Zack.Riesland@sensus.com>
wrote:

> In the past, my struggles with hbase/phoenix have been related to data
> ingest.
>
>
>
> Each night, we ingest lots of data via CsvBulkUpload.
>
>
>
> After lots of trial and error trying to get our largest table to
> cooperate, I found a primary key that distributes well if I specify the
> split criteria on table creation.
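>
> As a rough sketch of what I mean (table, column, and split-point names here
> are made up for illustration, not our real schema):
>
> ```sql
> -- Pre-split the table at creation time so bulk loads and reads
> -- spread across regions instead of hot-spotting one region.
> CREATE TABLE SENSOR_READINGS (
>     METER_ID  VARCHAR NOT NULL,
>     READ_TIME DATE NOT NULL,
>     VAL       DOUBLE,
>     CONSTRAINT PK PRIMARY KEY (METER_ID, READ_TIME)
> )
> SPLIT ON ('M100', 'M200', 'M300');
> ```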
>
>
>
> That table now has ~15 billion rows representing about 300GB of data
> across 513 regions (on 9 region servers).
>
>
>
> Life was good for a while.
>
>
>
> Now, I have a new use case where I need another table very similar, but
> rather than serving UI-based reports, this table will be queried
> programmatically and VERY heavily (millions of queries per day).
>
>
>
> I have asked about this in the past, but got derailed to other things, so
> I’m trying to zoom out a bit and make sure I approach this problem
> correctly.
>
>
>
> My simplified use case is basically: de-dup input files against Phoenix
> before passing them on to the rest of our ingest process. This will result
> in tens of thousands of queries to Phoenix per input file.
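>
> In Phoenix terms, the de-dup check is a batch of point lookups on the
> primary key — something like this sketch (names hypothetical, matching
> nothing real on our side):
>
> ```sql
> -- Return which incoming keys already exist; anything not returned is new.
> -- Row value constructors on the full PK should plan as point lookups.
> SELECT METER_ID, READ_TIME
> FROM SENSOR_READINGS
> WHERE (METER_ID, READ_TIME) IN (('M100', TO_DATE('2016-01-18')),
>                                 ('M101', TO_DATE('2016-01-18')));
> ```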
>
>
>
> I noted in the past that after 5-10K rapid-fire queries, the response time
> degrades dramatically. And I think we established that this is because one
> thread is spawned per 20 MB chunk of data in each region (?)
>
>
>
> More generally, it seems that the more regions my table has, the more
> resource-intensive Phoenix queries become?
>
>
>
> Is that correct?
>
>
>
> I estimate that my table will contain about 500GB of data by the end of
> 2016.
>
>
>
> The rows are pretty small (like 6 or 8 small columns). I have 9 region
> servers – soon to be 12.
>
>
>
> The distribution is usually 2,000-5,000 rows per primary key, which is
> about 0.5 – 3 MB of data.
>
>
>
> Given that information, is there a good rule of thumb for how many regions
> I should try to target with my schema/primary key design?
>
>
>
> I experimented using salt buckets (presumably letting Phoenix choose how
> to split everything) but I keep getting errors when I try to bulk load data
> into salted tables (“Import job on table blah failed due to exception:
> java.io.IOException: Trying to load more than 32 hfiles to one family of
> one region”).
>
>
>
> Are there HBase configuration tweaks I should focus on? My current
> memstore size is set to 256 MB.
>
>
>
> Thanks for any guidance or tips here.
>
