I realized after sending the email that I should have explained my use case. This does not apply when your data is already in CSV format, but rather when your application has a choice between writing to HBase directly and dumping the records to CSV files for bulk loading. In my case, my application writes protocol traces and network switch logs to HBase. My choices were to write the records to HFiles directly, or to create CSV files and then bulk load them. As you can imagine, the second case involves at least one extra write to and read from the CSV files. Then there is the issue of handling binary data, since TSV/CSV files are designed for text.
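To illustrate the binary-data issue: if a value happens to contain the delimiter or newline bytes, an unescaped TSV row is silently corrupted. A minimal sketch (the payload bytes here are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class TsvBinaryDemo {
    public static void main(String[] args) {
        // A binary payload that happens to contain 0x09 (tab) and 0x0a (newline).
        byte[] payload = {0x01, 0x09, 0x41, 0x0a, 0x42};
        String row = "row1\t" + new String(payload, StandardCharsets.ISO_8859_1);
        // We intended two fields, but a TSV parser splits on every tab,
        // so the embedded 0x09 creates a phantom third column.
        String[] fields = row.split("\t");
        System.out.println(fields.length); // 3, not the 2 fields we wrote
    }
}
```

The embedded newline would similarly break any line-oriented reader, which is why the values have to be escaped before they can round-trip through CSV/TSV at all.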

I found that writing to HFiles took less time than writing to CSV files. In my measurements, it took about 504 seconds just to create the TSV files, versus 378 seconds to create and load the HFiles. In both cases I used 16 parallel threads to write about 45 million records. I attribute the faster HFile path to SNAPPY compression and to the fact that I could write the binary data directly.
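For reference, the direct write path looks roughly like the sketch below (HBase 1.x client API; the output path, column family, and record values are placeholders, not my actual schema). It opens an HFile.Writer with SNAPPY compression and appends KeyValues in sorted row order, so the resulting file can be handed to the bulk loader. It needs the HBase client jars and a reachable filesystem to actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;

public class DirectHFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path: the bulk loader expects one directory per column family.
        Path out = new Path("/tmp/bulk/f/part-0");

        HFileContext ctx = new HFileContextBuilder()
                .withCompression(Compression.Algorithm.SNAPPY)
                .build();
        HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
                .withPath(fs, out)
                .withFileContext(ctx)
                .create();

        byte[] family = "f".getBytes();
        byte[] qualifier = "q".getBytes();
        long ts = System.currentTimeMillis();
        // Rows must be appended in sorted order; binary values go in
        // as-is, with no escaping step.
        for (int i = 0; i < 10; i++) {
            byte[] row = String.format("row%09d", i).getBytes();
            byte[] value = new byte[]{0x00, 0x09, 0x0a, (byte) i}; // raw binary
            writer.append(new KeyValue(row, family, qualifier, ts, value));
        }
        writer.close();
    }
}
```

This skips the intermediate text encode/decode entirely, which is where I believe most of the savings came from.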


On Tue, Sep 1, 2015 at 2:14 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
On Tue, Sep 1, 2015 at 3:04 AM, Behdad Forghani <behdad@exapackets.com> wrote:

> In my experience the fastest way to load data is to write directly to
> HFiles. I have measured a performance gain of 10x. Also, the HBase bulk
> loader does not escape characters, which matters if you have binary data
> or need to escape characters. For my use case, I create HFiles and load
> the HFiles. Then I create a view on the HBase table.

The CSV bulk import tool[1] does write to HFiles in a MapReduce job.
Are you saying that you've gotten 10x better performance than this
tool? If so, it would certainly be interesting to hear how you were
able to get such good performance. Or were you comparing to bulk
loading via PSQL?

1. http://phoenix.apache.org/bulk_dataload.html