Please find the job counters attached.
Would increasing the splitting affect the reads?
I assume a simple read would benefit from increased splitting, since it increases parallelism.
But, how would it impact the aggregate queries?
The first thing that I notice looking at the info that you've posted
is that you have 13 nodes and 13 salt buckets (which I assume also
means that you have 13 regions).
A single region is the unit of parallelism that is used for reducers
in the CsvBulkLoadTool (or HFile-writing MapReduce job in general), so
currently you're only getting an average of a single reduce process
per node on your cluster. Assuming that you have multiple cores in
each of those nodes, you will probably get a decent improvement in
performance by further splitting your destination table so that it has
multiple regions per node (thereby triggering multiple reduce tasks
per node).
Would you also be able to post the full set of job counters that are
shown after the job is completed? This would also be helpful in
pinpointing things that can be (possibly) tuned.
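To make the region-count reasoning above concrete, here is a rough Python sketch of the arithmetic. It assumes one reduce task per region and an even spread of regions across nodes. The helper names and the per-node targets (2 and 4) are hypothetical, since the number of cores per node has not been posted.

```python
def reducers_per_node(num_regions: int, num_nodes: int) -> float:
    """Average reduce tasks per node, assuming one reducer per region
    and regions spread evenly over the nodes (an assumption)."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return num_regions / num_nodes


def regions_for_target(num_nodes: int, target_per_node: int) -> int:
    """Number of regions needed to average target_per_node reducers
    on each node, under the same assumptions."""
    if num_nodes <= 0 or target_per_node <= 0:
        raise ValueError("arguments must be positive")
    return num_nodes * target_per_node


NODES = 13            # from the original post
CURRENT_REGIONS = 13  # assumed equal to the 13 salt buckets

print("current reducers per node:", reducers_per_node(CURRENT_REGIONS, NODES))

# Hypothetical targets; pick one based on the cores available per node.
for target in (2, 4):
    regions = regions_for_target(NODES, target)
    print(f"target {target} per node: {regions} regions")
```

With 13 regions on 13 nodes this gives an average of one reducer per node, which matches the situation described above. The right target depends on the cores and memory available on each node.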
On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <email@example.com> wrote:
> I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase table.
> HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> CSV file size: 97.6 GB
> No. of records: 1,439,000,238
> Cluster: 13 node
> Phoenix table salt-buckets: 13
> Phoenix table compression: snappy
> HBase table size after loading: 26.6 GB
> The job completed in 1hrs, 39mins, 43sec.
> Average Map Time 5mins, 25sec
> Average Shuffle Time 47mins, 46sec
> Average Merge Time 12mins, 22sec
> Average Reduce Time 32mins, 9sec
> I'm looking for an opportunity to tune this job.
> Could someone please help me with some pointers on how to tune this job?
> Please let me know if you need to know any cluster configuration parameters
> that I'm using.
> This is only a performance test. My PRODUCTION data file is 7x bigger.
> Vamsi Attluri