phoenix-user mailing list archives

From "Riesland, Zack" <Zack.Riesl...@sensus.com>
Subject RE: Help Tuning CsvBulkImport MapReduce
Date Tue, 01 Sep 2015 09:29:12 GMT
Thanks Gabriel,

That is extremely helpful.

One clarification:

You say I can find information about spills in the job counters. Are you talking about “failed”
map tasks, or is there something else that will help me identify spill scenarios?

From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
Sent: Monday, August 31, 2015 4:39 PM
To: user@phoenix.apache.org
Subject: Re: Help Tuning CsvBulkImport MapReduce

If the bulk of the time is being spent in the map phase, then there probably isn't all that
much that can be done in terms of tuning that will make a huge difference. However, there
may be a few things to look at.

You mentioned that HDFS decided to translate the hive export to 257 files -- do you mean blocks,
or are there actually 257 files on HDFS? If there really are 257 files, it's Hive (and/or MapReduce)
that's responsible for them, but that's probably just a detail and not all that important.

How long is each of the map tasks taking? If they're only taking something like 30 seconds
or so, then it would be worth trying to have each task process more data. This is most easily
accomplished by using a bigger block size on HDFS, as each HDFS block typically results in
a single map task. However, you'll want to first check how long each map task is taking --
if they're each taking 3-5 minutes (or more), then you won't gain much by increasing the block
size.
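
To illustrate the block-size point, here is a minimal Python sketch that estimates how many map tasks a set of input files would produce at different HDFS block sizes. It assumes exactly one map task per block and files of exactly 1 GB each; the function name and the candidate block sizes are hypothetical, and the real split count depends on the input format and actual file sizes, so it will not match the observed mapper count exactly.

```python
def estimate_map_tasks(file_sizes_bytes, block_size_bytes):
    """Estimate the map task count, assuming one map task per HDFS block."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division: a partially filled last block still gets a task.
    return sum((size + block_size_bytes - 1) // block_size_bytes
               for size in file_sizes_bytes)


MB = 1024 * 1024
GB = 1024 * MB

# 257 files of roughly 1 GB each, as described in the original question.
# Treating every file as exactly 1 GB is an assumption made for illustration.
files = [GB] * 257

for block_mb in (128, 256, 512):
    tasks = estimate_map_tasks(files, block_mb * MB)
    print(f"{block_mb} MB blocks -> about {tasks} map tasks")
```

Under these assumptions, doubling the block size roughly halves the number of map tasks, so each task processes about twice as much data.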

A second thing to look at is the number of spilled records compared to the number of map output
records -- you can find both in the job counters. If the number of spilled records in the map phase
is two (or more) times the number of map output records, you'll likely get an increase in
performance by upping the mapreduce.task.io.sort.mb setting (or some other sort settings).
However, before getting into this you'll want to see if the spills are an issue.
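
The spill check can be sketched as a small calculation over two values read from the job counters, "Spilled Records" and "Map output records" for the map phase. This is only a rough illustration: the function names and the sample counter values are hypothetical, and the threshold of 2 is simply the rule of thumb stated above.

```python
def spill_ratio(spilled_records, map_output_records):
    """Return spilled records per map output record for the map phase."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def spills_look_excessive(spilled_records, map_output_records, threshold=2.0):
    """Apply the rule of thumb: a ratio of 2 or more suggests tuning the sort buffer."""
    return spill_ratio(spilled_records, map_output_records) >= threshold


# Hypothetical counter values, copied by hand from a job's counter page.
spilled = 6_000_000_000
output = 2_000_000_000

print(f"spill ratio: {spill_ratio(spilled, output):.2f}")
print("consider raising mapreduce.task.io.sort.mb:",
      spills_look_excessive(spilled, output))
```

A ratio close to 1 means each map output record was written to disk about once, which is the normal case; a ratio of 2 or more means records are being spilled and re-merged repeatedly.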

As I said above though, if the map phase is taking up the most time, it's most likely
CPU-bound on the conversion of CSV data to HBase KeyValues. This is especially likely if you're
dealing with really wide rows. How many columns are you importing into your table?

- Gabriel
On Mon, Aug 31, 2015 at 3:20 PM Riesland, Zack <Zack.Riesland@sensus.com> wrote:
I’m looking for some pointers on speeding up CsvBulkImport.

Here’s an example:

I took about 2 billion rows from hive and exported them to CSV.

HDFS decided to translate this to 257 files, each about 1 GB.

Running the CsvBulkImport tool against this folder results in 1,835 mappers and then 1 reducer
per region on the HBase table.

The whole process takes something like 2 hours, the bulk of which is spent on mappers.

Any suggestions on how to possibly make this faster?

When I create the CSV files, I’m doing a pretty simple select statement from hive. The results
tend to be mostly sorted.

I honestly don’t know this space well enough to know whether that’s good, bad, or neutral.

Thanks for any feedback!
