Hi all, I'm running the CsvBulkLoadTool to pull in some data. The MapReduce job appears
to complete and prints some promising counters:
################################################################################
Phoenix MapReduce Import
Upserts Done=600037902
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=79657289180
File Output Format Counters
Bytes Written=176007436620
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xa24982f
connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181
sessionTimeout=1200000 watcher=hconnection-0xa24982f0x0, quorum=stl-colo-srv073.splicemachine.colo:2181,
baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181.
Will not attempt to authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
sessionid = 0x15696476bf90484, negotiated timeout = 40000
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for TPCH.LINEITEM from
/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed connection cannot be used
for bulkload. Creating unmanaged connection.
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x456a0752
connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181
sessionTimeout=1200000 watcher=hconnection-0x456a07520x0, quorum=stl-colo-srv073.splicemachine.colo:2181,
baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181.
Will not attempt to authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
sessionid = 0x15696476bf90485, negotiated timeout = 40000
16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
################################################################################
but the subsequent HFile load step eventually errors out with this exception:
################################################################################
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda
first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728
first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0
first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f
first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9
first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34
first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load more than 32 hfiles
to family 0 of region with start key
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing master
protocol: MasterService
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper
sessionid=0x15696476bf90485
16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 closed
16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one
family of one region
at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
################################################################################
A count on the table shows 0 rows:
0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
+-----------+
| COUNT(1) |
+-----------+
| 0 |
+-----------+
Some quick googling turned up an HBase parameter that could be tweaked (http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).
Main Questions:
- Will the CsvBulkLoadTool pick up these params, or will I need to put them in hbase-site.xml?
- Is there anything else I can tune to make this run quicker? It took 5 hours for it to fail
with the error above.
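For reference, the property mentioned in that StackOverflow thread is hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32). Since CsvBulkLoadTool goes through ToolRunner, I'd guess a generic -D option gets picked up in the client-side configuration; the invocation below is a sketch, not something I've verified — the jar name, version, and input path are placeholders for whatever your setup uses:

```shell
# Untested sketch: raise the per-region/per-family HFile limit for this run.
# Generic -D options must come before the tool's own arguments.
hadoop jar phoenix-4.8.0-HBase-1.1-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=128 \
    --table TPCH.LINEITEM \
    --input /path/to/lineitem.csv
```

If that doesn't take effect, the same property could presumably be set in the client's hbase-site.xml instead, since the limit is enforced by LoadIncrementalHFiles on the client side.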
This is a 9-node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 4.8.0-HBase-1.1,
with Ambari default settings except for:
- HBase RS heap size is set to 24GB
- hbase.rpc.timeout set to 20 min
- phoenix.query.timeoutMs set to 60 min
All nodes are Dell R420s with 2x E5-2430 v2 CPUs (24 vCPUs) and 64GB RAM.
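For what it's worth, the error itself suggests the reducers wrote more than 32 HFiles into a single region, i.e. the target table has far too few regions for ~600M rows. If TPCH.LINEITEM can be dropped and recreated, pre-splitting it at creation time (e.g. via salting) should spread the bulk-load output across many regions and may also speed up the job. This DDL is only a sketch — the column list and bucket count are placeholders, not the actual schema:

```sql
-- Hypothetical: recreate the table pre-split into 16 salted regions so
-- each region receives far fewer HFiles during the bulk load.
CREATE TABLE TPCH.LINEITEM (
    -- ... same columns and PRIMARY KEY as the existing table ...
) SALT_BUCKETS = 16;
```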