phoenix-user mailing list archives

From 丁桂涛 <dinggui...@baixing.com>
Subject Re: Help Tuning CsvBulkImport MapReduce
Date Tue, 01 Sep 2015 01:30:14 GMT
BTW, since Hive represents NULL values as \N in its text files, how do you
handle those NULL values when using the CsvBulkImport tool?
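
A possible workaround (only a sketch; the comma delimiter, the file paths, and
the assumption that empty fields are an acceptable stand-in for NULL in the
target schema are all mine, not something the tool documents) is to rewrite the
\N markers to empty fields before running the loader:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical preprocessing step: replace Hive's \N null markers with empty
// CSV fields before handing the files to the Phoenix bulk loader.
public class NullMarkerFilter {
  public static void main(String[] args) throws IOException {
    try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
         BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(",", -1);          // assumed comma delimiter
        for (int i = 0; i < fields.length; i++) {
          if ("\\N".equals(fields[i])) fields[i] = "";  // Hive's textfile NULL marker
        }
        out.write(String.join(",", fields));
        out.newLine();
      }
    }
  }
}

For 257 files of roughly 1 GB each this would more realistically run as a
streaming or MapReduce step rather than a single-process filter, but the
per-line logic is the same.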

On Tue, Sep 1, 2015 at 9:04 AM, Behdad Forghani <behdad@exapackets.com>
wrote:

> Hi,
>
>
>
> In my experience, the fastest way to load data is to write HFiles directly;
> I have measured a performance gain of 10x. Also, if you have binary data or
> characters that need escaping, note that the HBase bulk loader does not
> escape characters. For my use case, I create the HFiles, bulk-load them into
> HBase, and then create a view on the HBase table (a rough sketch of this
> approach appears after the quoted message).
>
>
>
> Behdad
>
>
>
> *From:* Riesland, Zack [mailto:Zack.Riesland@sensus.com]
> *Sent:* Monday, August 31, 2015 6:20 AM
> *To:* user@phoenix.apache.org
> *Subject:* Help Tuning CsvBulkImport MapReduce
>
>
>
> I’m looking for some pointers on speeding up CsvBulkImport.
>
>
>
> Here’s an example:
>
>
>
> I took about 2 billion rows from hive and exported them to CSV.
>
>
>
> HDFS decided to translate this to 257 files, each about 1 GB.
>
>
>
> Running the CsvBulkImport tool against this folder results in 1,835
> mappers and then 1 reducer per region on the HBase table.
>
>
>
> The whole process takes something like 2 hours, the bulk of which is spent
> on mappers.
>
>
>
> Any suggestions on how to possibly make this faster?
>
>
>
> When I create the CSV files, I’m doing a pretty simple select statement
> from Hive. The results tend to be mostly sorted.
>
>
>
> I honestly don’t know this space well enough to know whether that’s good,
> bad, or neutral.
>
>
>
> Thanks for any feedback!
>
>
>

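For reference, a minimal sketch of the direct-HFile approach Behdad describes
might look roughly like the following (the table name MY_TABLE, the column
family "0", the two-column CSV layout, and the HBase 1.x API are assumptions
for illustration, not details from the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileBulkLoad {

  // Mapper: parse one CSV line into a Put keyed by the row key.
  public static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().split(",", -1);
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      // NOTE: this writes raw HBase bytes; a Phoenix view over the table needs
      // column types whose encoding matches (e.g. VARCHAR for UTF-8 strings).
      put.addColumn(Bytes.toBytes("0"), Bytes.toBytes("VAL"), Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "csv-to-hfiles");
    job.setJarByClass(HFileBulkLoad.class);
    job.setMapperClass(CsvToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // CSV input directory
    Path hfileDir = new Path(args[1]);                      // HFile output directory
    FileOutputFormat.setOutputPath(job, hfileDir);

    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      TableName tableName = TableName.valueOf("MY_TABLE");
      Table table = conn.getTable(tableName);
      // Wires in the partitioner, sort reducer and compression to match the
      // target table's regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, conn.getRegionLocator(tableName));
      if (!job.waitForCompletion(true)) System.exit(1);
      // Move the finished HFiles into the live table.
      new LoadIncrementalHFiles(conf).doBulkLoad(
          hfileDir, conn.getAdmin(), table, conn.getRegionLocator(tableName));
    }
  }
}

HFileOutputFormat2.configureIncrementalLoad sets the total-order partitioner
from the table's region boundaries and one reducer per region, which is also
why the CsvBulkImport job described above ends up with one reducer per region
of the target table.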