phoenix-user mailing list archives

From Behdad Forghani <beh...@exapackets.com>
Subject Re: Help Tuning CsvBulkImport MapReduce
Date Tue, 01 Sep 2015 08:09:11 GMT
Hi,

After sending my email, I realized I should have explained my use case.
This is not about the case where your data is already in CSV format. It
applies when your application has a choice between writing to HBase and
dumping the records to CSV, then bulk loading the resulting CSV files. In
my case, my application writes protocol traces and network switch logs to
HBase. My choices were to write the records to HFiles directly, or to
create CSV files and then bulk load them. As you can imagine, the second
option involves at least one extra write to and read from the CSV files.
Then there is the issue of handling binary data, since TSV/CSV files are
designed for text.
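
To illustrate the binary-data point, here is a minimal Python sketch. It
assumes Base64 as the text encoding, and the helper names (encode_record,
decode_record) and the sample row key are hypothetical. Whether a given
bulk loader decodes Base64 on its side is a separate question that this
sketch does not address.

```python
import base64
import csv
import io
from typing import Tuple


def encode_record(row_key: str, payload: bytes) -> str:
    """Serialize one record as a CSV line, Base64-encoding the binary payload."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([row_key, base64.b64encode(payload).decode("ascii")])
    return buf.getvalue()


def decode_record(line: str) -> Tuple[str, bytes]:
    """Parse a CSV line written by encode_record back into (row_key, payload)."""
    row_key, encoded = next(csv.reader(io.StringIO(line)))
    return row_key, base64.b64decode(encoded)


# Payload containing bytes that are awkward in raw CSV:
# NUL, comma, newline, 0xFF (not valid UTF-8 on its own), double quote.
payload = bytes([0x00, 0x2C, 0x0A, 0xFF, 0x22])

line = encode_record("trace-0001", payload)
key, restored = decode_record(line)

# Base64 emits 4 output characters for every 3 input bytes (with padding),
# so the encoded field is roughly a third larger than the raw payload.
encoded_len = len(base64.b64encode(payload))
```

The round trip preserves the payload, but at the cost of an encode step, a
decode step, and a larger file, none of which apply when the bytes go
straight into an HFile.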

I found that writing to HFiles took less time than writing to CSV files.
In my measurements, it took about 504 seconds just to create the TSV
files, versus 378 seconds to create and load the HFiles. In both cases, I
was using 16 parallel threads to write about 45 million records. I
attribute the faster HFile writes to SNAPPY compression and to the fact
that I could write the binary data directly.
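
For reference, the figures above work out as follows. This is only
arithmetic on the numbers quoted in this message (about 45 million
records, 504 seconds, 378 seconds). The variable names are mine, and the
TSV figure covers file creation only, not the subsequent bulk load.

```python
RECORDS = 45_000_000                 # "about 45 million records"
TSV_CREATE_SECONDS = 504             # creating the TSV files only
HFILE_CREATE_AND_LOAD_SECONDS = 378  # creating and loading the HFiles


def throughput(records: int, seconds: float) -> float:
    """Average records per second over the whole run (all threads combined)."""
    return records / seconds


tsv_rate = throughput(RECORDS, TSV_CREATE_SECONDS)
hfile_rate = throughput(RECORDS, HFILE_CREATE_AND_LOAD_SECONDS)

# Ratio of elapsed times: how much longer the TSV step alone took
# compared with the complete HFile path.
time_ratio = TSV_CREATE_SECONDS / HFILE_CREATE_AND_LOAD_SECONDS

print(f"TSV creation:        {tsv_rate:,.0f} records/s")
print(f"HFile create + load: {hfile_rate:,.0f} records/s")
print(f"Time ratio:          {time_ratio:.2f}x")
```

Because the TSV number excludes the load step, the ratio understates the
end-to-end difference between the two approaches.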

Regards,
Behdad

On Tue, Sep 1, 2015 at 2:14 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> On Tue, Sep 1, 2015 at 3:04 AM, Behdad Forghani <behdad@exapackets.com>
> wrote:
>
> > In my experience the fastest way to load data is to write directly to
> > HFile. I have measured a performance gain of 10x. Also, if you have
> > binary data or need to escape characters, the HBase bulk loader does
> > not escape characters. For my use case, I create HFiles and load the
> > HFiles. Then, I create a view on the HBase table.
>
> The CSV bulk import tool[1] does write to HFiles in a MapReduce job.
> Are you saying that you've gotten 10x better performance than this
> tool? If so, it would certainly be interesting to hear about how you
> were able to get such good performance. Or were you comparing to bulk
> loading via PSQL?
>
> 1. http://phoenix.apache.org/bulk_dataload.html
>
