phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Help Tuning CsvBulkImport MapReduce
Date Mon, 31 Aug 2015 20:38:39 GMT
If the bulk of the time is being spent in the map phase, then tuning
probably won't make a huge difference. However, there are a few things
worth looking at.

You mentioned that HDFS decided to translate the hive export to 257 files
-- do you mean blocks, or are there actually 257 files on HDFS? If they're
really separate files, then it's Hive (and/or MapReduce), not HDFS, that's
responsible for creating them -- but that's probably just a detail and not
all that important.
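If you want to confirm which it is, something like this will show the file
count and how many blocks each file spans (the path here is just a
placeholder):

  hdfs dfs -ls /path/to/csv/export
  hdfs fsck /path/to/csv/export -files -blocks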

How long is each map task taking? If they're only taking 30 seconds or so,
then it would be worth trying to have each task process more data. The
easiest way to do this is to use a bigger block size on HDFS, as each HDFS
block typically results in a single map task. However, you'll want to check
the task times first -- if each map task is already taking 3-5 minutes (or
more), then you won't gain much by increasing the block size.
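For example -- just a sketch, with made-up paths, table name and sizes --
you could re-write the export with a bigger block size before loading it:

  hdfs dfs -D dfs.blocksize=536870912 -put export/*.csv /tmp/export-512m/

or leave the files as they are and force bigger input splits when launching
the tool (the -D options have to come before the tool's own options):

  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      -D mapreduce.input.fileinputformat.split.minsize=536870912 \
      --table MY_TABLE --input /tmp/export-512m/

Either way, the goal is the same: fewer, larger map tasks.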

A second thing to look at is the number of spilled records compared to the
number of map output records -- you can find both in the job counters. If
the number of spilled records in the map phase is two (or more) times the
number of map output records, you'll likely get an increase in performance
by upping the mapreduce.task.io.sort.mb setting (or some of the other sort
settings). However, before getting into this you'll want to confirm that
the spills are actually an issue.
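If the counters do point at excessive spilling, the same kind of -D
override works for the sort settings when launching the job -- again just a
sketch, and the right value depends on how much heap your mappers have:

  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      -D mapreduce.task.io.sort.mb=512 \
      --table MY_TABLE --input /tmp/export-512m/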

As I said above though, if the map phase is taking up most of the time, the
job is most likely CPU-bound on the conversion of CSV data to HBase
KeyValues. This is especially likely if you're dealing with really wide
rows. How many columns are you importing into your table?

- Gabriel

On Mon, Aug 31, 2015 at 3:20 PM Riesland, Zack <Zack.Riesland@sensus.com>
wrote:

> I’m looking for some pointers on speeding up CsvBulkImport.
>
> Here’s an example:
>
> I took about 2 billion rows from hive and exported them to CSV.
>
> HDFS decided to translate this to 257 files, each about 1 GB.
>
> Running the CsvBulkImport tool against this folder results in 1,835
> mappers and then 1 reducer per region on the HBase table.
>
> The whole process takes something like 2 hours, the bulk of which is spent
> on mappers.
>
> Any suggestions on how to possibly make this faster?
>
> When I create the CSV files, I’m doing a pretty simple select statement
> from hive. The results tend to be mostly sorted.
>
> I honestly don’t know this space well enough to know whether that’s good,
> bad, or neutral.
>
> Thanks for any feedback!
>
