phoenix-user mailing list archives

From Gabriel Reid <>
Subject Re: Java Out of Memory Errors with CsvBulkLoadTool
Date Fri, 18 Dec 2015 15:44:39 GMT
On Fri, Dec 18, 2015 at 4:31 PM, Riesland, Zack
<> wrote:
> We are able to ingest MUCH larger sets of data (hundreds of GB) using the CSVBulkLoadTool.
> However, we have found it to be a huge memory hog.
> We dug into the source a bit and found that HFileOutputFormat.configureIncrementalLoad(),
> in using TotalOrderPartitioner and KeyValueReducer, ultimately keeps a TreeSet of all the
> key/value pairs before finally writing the HFiles.
> So if the size of your data exceeds the memory allocated on the client calling the MapReduce
> job, it will eventually fail.

I think (or at least hope!) that the situation isn't quite as bad as that.

The HFileOutputFormat.configureIncrementalLoad call will load the
start keys of all regions, and configure those for use by the
TotalOrderPartitioner. That memory use grows with the number of
regions in the output table, not with the size of the data.
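To make the region-count scaling concrete, here is a minimal sketch (plain Java, not the actual Hadoop/HBase class) of how a total-order partitioner can route a row key to a reducer by binary-searching the sorted region start keys. Only the split points are held in memory; the class and key names are illustrative:

```java
import java.util.Arrays;

// Sketch (not Phoenix/HBase source): route a row key to a partition
// using the sorted start keys of the table's regions. Memory held is
// proportional to the number of regions, not the volume of data.
public class StartKeyPartitionerSketch {
    private final String[] splitPoints; // start keys of regions 1..n-1, sorted

    public StartKeyPartitionerSketch(String[] splitPoints) {
        this.splitPoints = splitPoints.clone();
        Arrays.sort(this.splitPoints);
    }

    // Returns the partition (reducer index) for a given row key.
    public int getPartition(String rowKey) {
        int idx = Arrays.binarySearch(splitPoints, rowKey);
        // binarySearch returns (-(insertion point) - 1) on a miss
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        StartKeyPartitionerSketch p =
            new StartKeyPartitionerSketch(new String[] {"g", "n", "t"});
        System.out.println(p.getPartition("a")); // falls before first split point: 0
        System.out.println(p.getPartition("h")); // between "g" and "n": 1
        System.out.println(p.getPartition("z")); // after last split point: 3
    }
}
```

Because every partition covers a contiguous, sorted key range, each reducer's output can be written directly as an HFile for one region.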

The KeyValueSortReducer does indeed use a TreeSet to store KeyValues,
but it is one TreeSet per distinct row key, not one for the whole data
set. The size of each TreeSet grows with the number of columns per
row. Memory usage is typically higher than expected because each
single column value is stored as a KeyValue, which contains the full
row key of the row.
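A rough sketch of that per-row behavior (plain Java standing in for the reducer; the cell encoding here is made up for illustration, real KeyValues carry row key, family, qualifier, timestamp, and value as bytes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Sketch (illustrative, not the Phoenix/HBase class): sort the cells
// of ONE row in a TreeSet before writing them out. Note that every
// cell repeats the full row key, which is why per-row memory use is
// larger than the raw column values alone would suggest.
public class PerRowSortSketch {
    // One entry per column of a single row: "rowKey/family:qualifier=value"
    public static TreeSet<String> sortRowCells(String rowKey, List<String> cells) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String cell : cells) {
            sorted.add(rowKey + "/" + cell); // full row key duplicated per cell
        }
        return sorted; // holds only this row's cells, not the whole data set
    }

    public static void main(String[] args) {
        TreeSet<String> row = sortRowCells("row-001",
            Arrays.asList("f:colB=2", "f:colA=1"));
        System.out.println(row.first()); // cells come out in sorted order
    }
}
```

The set is discarded after each row is written, so heap pressure scales with the widest row, not with the total input size.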

- Gabriel
