phoenix-user mailing list archives

From Vamsi Krishna <vamsi.attl...@gmail.com>
Subject Re: how to tune phoenix CsvBulkLoadTool job
Date Wed, 16 Mar 2016 18:15:03 GMT
Thanks Gabriel,
Please find the job counters attached.

Would splitting the table further affect reads?
I assume a simple read would benefit from more splits, since that
increases parallelism.
But how would it affect aggregate queries?

Vamsi Attluri

On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid <gabriel.reid@gmail.com> wrote:

> Hi Vamsi,
>
> The first thing that I notice looking at the info that you've posted
> is that you have 13 nodes and 13 salt buckets (which I assume also
> means that you have 13 regions).
>
> A single region is the unit of parallelism that is used for reducers
> in the CsvBulkLoadTool (or HFile-writing MapReduce job in general), so
> currently you're only getting an average of a single reduce process
> per node on your cluster. Assuming that you have multiple cores in
> each of those nodes, you will probably get a decent improvement in
> performance by further splitting your destination table so that it has
> multiple regions per node (thereby triggering multiple reduce tasks
> per node).
>
> Would you also be able to post the full set of job counters that are
> shown after the job is completed? This would also be helpful in
> pinpointing things that can be (possibly) tuned.
>
> - Gabriel
>
>
> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <vamsi.attluri@gmail.com>
> wrote:
> > Hi,
> >
> > I'm using CsvBulkLoadTool to load a CSV data file into a Phoenix/HBase
> > table.
> >
> > HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> > CSV file size: 97.6 GB
> > No. of records: 1,439,000,238
> > Cluster: 13 node
> > Phoenix table salt-buckets: 13
> > Phoenix table compression: snappy
> > HBase table size after loading: 26.6 GB
> >
> > The job completed in 1 hr, 39 mins, 43 sec.
> > Average Map Time:     5 mins, 25 sec
> > Average Shuffle Time: 47 mins, 46 sec
> > Average Merge Time:   12 mins, 22 sec
> > Average Reduce Time:  32 mins, 9 sec
> >
> > I'm looking for an opportunity to tune this job.
> > Could someone please help me with some pointers on how to tune this job?
> > Please let me know if you need to know any cluster configuration
> > parameters that I'm using.
> >
> > This is only a performance test. My PRODUCTION data file is 7x bigger.
> >
> > Thanks,
> > Vamsi Attluri
> >
> > --
> > Vamsi Attluri
>
-- 
Vamsi Attluri
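
To put rough numbers on the point about regions and reducers, here is a minimal sketch using only the figures quoted in this thread. It assumes one reduce task per region and an even spread of rows across regions, which salting should roughly give. The target of 4 reducers per node is a hypothetical value chosen for illustration, not something recommended in the thread, and the CSV share per reducer is only a proxy for the data a reducer actually handles.

```python
# Figures quoted in the thread.
NODES = 13
SALT_BUCKETS = 13            # assumed to equal the current region count
CSV_SIZE_GB = 97.6
RECORDS = 1_439_000_238


def regions_for(nodes, reducers_per_node):
    """Region count needed to get the given number of reducers per node."""
    return nodes * reducers_per_node


def reducer_load(regions, nodes=NODES, csv_gb=CSV_SIZE_GB, records=RECORDS):
    """Per-reducer load, assuming one reducer per region and evenly spread rows."""
    return {
        "regions": regions,
        "reducers_per_node": regions / nodes,
        "csv_gb_per_reducer": csv_gb / regions,
        "records_per_reducer": records // regions,
    }


current = reducer_load(SALT_BUCKETS)
# Hypothetical target: 4 reducers per node (adjust to the cores available).
proposed = reducer_load(regions_for(NODES, 4))

for label, load in (("current", current), ("proposed", proposed)):
    print(label, load)
```

With the current layout each reducer is responsible for roughly 7.5 GB of the CSV input, which lines up with the long shuffle and reduce times reported above. Spreading the same data over more regions shrinks each reducer's share proportionally.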
