phoenix-user mailing list archives

From "Behdad Forghani" <beh...@exapackets.com>
Subject RE: Help Tuning CsvBulkImport MapReduce
Date Tue, 01 Sep 2015 01:04:41 GMT
Hi,

 

In my experience, the fastest way to load data is to write HFiles directly; I
have measured a performance gain of 10x. Also, if you have binary data or need
to escape characters, be aware that the HBase bulk loader does not escape
characters. For my use case, I create the HFiles, load them, and then create a
view on the HBase table.
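
As a rough sketch of the "view on the HBase table" step, assuming Phoenix's
JDBC driver and an existing, already-loaded HBase table (the table name EVENTS,
column family "d", qualifier "value", and the ZooKeeper host below are all
hypothetical placeholders, not my actual setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePhoenixView {
    public static void main(String[] args) throws Exception {
        // Phoenix ships a JDBC driver; the URL points at the ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // Map the existing HBase table into Phoenix as a view. Quoted
            // identifiers preserve the exact HBase table, column family and
            // qualifier names.
            stmt.execute(
                "CREATE VIEW IF NOT EXISTS \"EVENTS\" ("
                + " pk VARCHAR PRIMARY KEY,"        // row key
                + " \"d\".\"value\" VARCHAR"        // column family "d", qualifier "value"
                + ")");
        }
    }
}

The view is metadata only, so the rows the HFile load already placed in the
table become queryable without passing through the Phoenix write path again.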

 

Behdad

 

From: Riesland, Zack [mailto:Zack.Riesland@sensus.com] 
Sent: Monday, August 31, 2015 6:20 AM
To: user@phoenix.apache.org
Subject: Help Tuning CsvBulkImport MapReduce

 

I'm looking for some pointers on speeding up CsvBulkImport.

 

Here's an example:

 

I took about 2 billion rows from Hive and exported them to CSV.

 

This ended up in HDFS as 257 files, each about 1 GB.

 

Running the CsvBulkImport tool against this folder results in 1,835 mappers
and then 1 reducer per region on the HBase table.
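
For reference, my (possibly wrong) understanding is that the mapper count is
driven by the number of input splits rather than the number of files, so 257
files of ~1 GB at a 128 MB split size works out to something on the order of
2,000 map tasks, which is in the ballpark of the 1,835 I'm seeing. The knob
I've been looking at is the standard Hadoop split-size setting; whether the
CsvBulkImport tool honors it when passed with -D is part of what I'm unsure
about:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One mapper per input split; splits default to the HDFS block size
        // (often 128 MB), so a ~1 GB file becomes ~8 map tasks. Raising the
        // minimum split size yields fewer, larger splits and fewer mappers,
        // at the cost of per-split data locality.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 512L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-size-sketch");
        // Equivalent programmatic form when you control the Job yourself:
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);

        // With the Phoenix bulk load tool the Job isn't built by hand, so the
        // -D form would be the one to try, e.g.:
        //   hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        //     -Dmapreduce.input.fileinputformat.split.minsize=536870912 --table ... --input ...
    }
}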

 

The whole process takes something like 2 hours, the bulk of which is spent in
the map phase.

 

Any suggestions on how to possibly make this faster?

 

When I create the CSV files, I'm doing a pretty simple SELECT statement in
Hive. The results tend to be mostly sorted.

 

I honestly don't know this space well enough to know whether that's good,
bad, or neutral.

 

Thanks for any feedback!

 

