phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: CsvBulkLoadTool with ~75GB file
Date Thu, 18 Aug 2016 09:15:39 GMT
Hi Aaron,

I'll answer your questions directly first, but please see the bottom
part of this mail for important additional details.

You can specify the
"hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily" parameter
(referenced from your StackOverflow link) on the command line of your
CsvBulkLoadTool command -- my understanding is that this is a purely
client-side parameter. You would provide it via -D as follows:

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 \
        <other command-line parameters>

The important point in the above example is that config-based
parameters specified with -D are given before the application-level
parameters, and after the class name to be run.

From my read of the HBase code, in this context you can also specify
the "hbase.hregion.max.filesize" parameter in the same way (in this
context it's a client-side parameter).

As far as speeding things up, the main points to consider are:
- ensure that compression is enabled for map-reduce jobs on your
cluster -- particularly map-output (intermediate) compression - see
https://datameer.zendesk.com/hc/en-us/articles/204258750-How-to-Use-Intermediate-and-Final-Output-Compression-MR1-YARN-
for a good overview
- check the ratio of map output records vs spilled records in the
counters on the import job. If the spilled records are higher than map
output records (e.g. twice as high or three times as high), then you
will probably benefit from raising the mapreduce.task.io.sort.mb
setting (see https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml)
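To make the spill-ratio check above concrete, here is a minimal sketch. The counter names ("Map output records", "Spilled Records") are the standard MapReduce job counters, but the sample values below are made up for illustration; read the real ones from your job's counter page:

```python
# Sketch of the spill-ratio heuristic described above.
# The counter values here are made-up examples; read the real ones
# from the import job's counters ("Map output records" / "Spilled Records").
def needs_bigger_sort_buffer(map_output_records, spilled_records, threshold=2.0):
    """Return True if records were spilled to disk more than `threshold`
    times per map output record, suggesting mapreduce.task.io.sort.mb
    should be raised."""
    if map_output_records == 0:
        return False
    return spilled_records / map_output_records >= threshold

# Example: 600M map output records, 1.8B spilled records -> ratio 3.0
print(needs_bigger_sort_buffer(600_037_902, 1_800_113_706))  # True
```

A ratio near 1.0 is normal (one spill per record at the end of each map task); ratios of 2-3x mean the sort buffer is filling up repeatedly and extra disk I/O is being done.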

Now those are the answers to your questions, but I'm curious about why
you're getting more than 32 HFiles in a single column family of a
single region. I assume that this means that you're loading large
amounts of data into a small number of regions. This is probably not a
good thing -- it may impact performance of HBase in general (because
each region has such a large amount of data), and will also have a
very negative impact on the running time of your import job (because
part of the parallelism of the import job is determined by the number
of regions being written to). I don't think you mentioned how many
regions you have on your table that you're importing to, but
increasing the number of regions will likely resolve several problems
for you. Another reason to do this is that HBase will likely start
splitting your regions after this import anyway, due to their size.
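One way to increase the region count is to pre-split the table when creating it (Phoenix supports SPLIT ON (...) and SALT_BUCKETS in CREATE TABLE for this). As a rough sketch of computing evenly spaced split points over a monotonically increasing integer row-key prefix -- the key range and region count below are hypothetical, and you'd derive them from your actual data:

```python
# Sketch: compute evenly spaced split points for pre-splitting a table
# whose leading row-key component is an increasing integer.
# The key range and region count are hypothetical examples.
def split_points(min_key, max_key, num_regions):
    """Return num_regions - 1 boundary keys dividing
    [min_key, max_key] into roughly equal ranges."""
    step = (max_key - min_key) / num_regions
    return [int(min_key + step * i) for i in range(1, num_regions)]

# 64 regions over keys 1..6,000,000 -> 63 boundaries
points = split_points(1, 6_000_000, 64)
print(len(points))  # 63
```

For keys that aren't evenly distributed, sampling the input data for quantiles gives better boundaries than a uniform split.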

- Gabriel


On Thu, Aug 18, 2016 at 3:47 AM, Aaron Molitor
<amolitor@splicemachine.com> wrote:
> Hi all, I'm running the CsvBulkLoadTool trying to pull in some data.  The MapReduce Job
appears to complete, and gives some promising information:
>
>
> ################################################################################
>         Phoenix MapReduce Import
>                 Upserts Done=600037902
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=79657289180
>         File Output Format Counters
>                 Bytes Written=176007436620
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xa24982f
connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181
sessionTimeout=1200000 watcher=hconnection-0xa24982f0x0, quorum=stl-colo-srv073.splicemachine.colo:2181,
baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181.
Will not attempt to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server
stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 0x15696476bf90484, negotiated
timeout = 40000
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for TPCH.LINEITEM
from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
> 16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed connection cannot be
used for bulkload. Creating unmanaged connection.
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x456a0752
connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181
sessionTimeout=1200000 watcher=hconnection-0x456a07520x0, quorum=stl-colo-srv073.splicemachine.colo:2181,
baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181.
Will not attempt to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server
stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 0x15696476bf90485, negotiated
timeout = 40000
> 16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
> ################################################################################
>
> and eventually errors out with this exception.
>
> ################################################################################
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda
first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728
first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0
first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f
first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9
first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34
first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
> 16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load more than 32
hfiles to family 0 of region with start key
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing master
protocol: MasterService
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper
sessionid=0x15696476bf90485
> 16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 closed
> 16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
> Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to
one family of one region
>         at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
>         at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
>         at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
>         at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
>         at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
>         at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> ################################################################################
>
> a count of the table shows 0 rows:
> 0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 0         |
> +-----------+
>
> Some quick googling gives an hbase param that could be tweaked (http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).
>
> Main Questions:
> - Will the CsvBulkLoadTool pick up these params, or will I need to put them in hbase-site.xml?
> - Is there anything else I can tune to make this run quicker? It took 5 hours for it
to fail with the error above.
>
> This is a 9 node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 4.8.0-HBase-1.1
> Ambari default settings except for:
> - HBase RS heap size is set to 24GB
> - hbase.rpc.timeout set to 20 min
> - phoenix.query.timeoutMs set to 60 min
>
> all nodes are Dell R420 with 2xE5-2430 v2 CPUs (24vCPU), 64GB RAM
>
