phoenix-user mailing list archives

From cmbendre <>
Subject Large CSV bulk load stuck
Date Fri, 02 Jun 2017 18:13:53 GMT

I need some help understanding how CsvBulkLoadTool works. I am trying to
load ~200 GB of data (100 files of 2 GB each) from HDFS into Phoenix with
1 master and 4 region servers. Each region server has 32 GB of RAM and
16 cores. Total HDFS disk space is 4 TB.

The table is salted with 16 buckets, so 4 regions per region server.
There are 400 columns and more than 30 local indexes.
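For reference, the table layout I described looks roughly like this (the column names below are placeholders; only the table name, SALT_BUCKETS = 16, and the local indexes match my actual setup):

```sql
-- Hypothetical sketch of the table described above.
-- TABLE_SNAPSHOT and SALT_BUCKETS = 16 are real; the
-- column names are placeholders.
CREATE TABLE TABLE_SNAPSHOT (
    ID VARCHAR PRIMARY KEY,
    COL1 VARCHAR,
    COL2 DECIMAL
    -- ... ~400 columns in total
) SALT_BUCKETS = 16;

-- One of the 30+ local indexes; each additional index adds
-- work during the reduce phase of the bulk load.
CREATE LOCAL INDEX IDX_COL1 ON TABLE_SNAPSHOT (COL1);
```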

Here is the command I am using:
hadoop jar /usr/lib/phoenix/phoenix-client.jar
org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000
--table TABLE_SNAPSHOT --input /user/table/*.csv/

The job proceeds normally but gets stuck in the reduce phase at around 90%.
I also observed that it initially uses the full resources of the cluster,
but near completion it uses far less (about 10 percent of the RAM and cores).

What exactly is happening behind the scenes? How can I tune it to run
faster? I am using HBase + HDFS deployed on YARN on AWS.
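In case it matters, the only tuning I can think of so far is passing standard MapReduce memory properties to the job as generic -D options, something like the sketch below (the memory values are placeholders I picked, not measured recommendations):

```shell
# Same bulk-load invocation, with standard MapReduce memory
# properties passed as generic -D options before the tool
# arguments. The 8192/6g values below are placeholders.
hadoop jar /usr/lib/phoenix/phoenix-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  -Dfs.permissions.umask-mode=000 \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  --table TABLE_SNAPSHOT --input /user/table/*.csv/
```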

Any help is appreciated.

