phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: CsvBulkLoadTool with ~75GB file
Date Fri, 19 Aug 2016 21:29:49 GMT
Maybe this will help?
http://phoenix.apache.org/bulk_dataload.html#Permissions_issues_when_uploading_HFiles
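
The permissions problem in the trace below (user "splice" is not the owner of
the staging inode) is the kind of thing that page covers. If I am remembering
it correctly, one workaround is to relax the file-creation umask for the job;
a rough, untested sketch, where the table name and input path are only
placeholders and you should double-check the exact property against the page:

hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -Dfs.permissions.umask-mode=000 \
    --table TPCH.LINEITEM \
    --input /data/lineitem.csv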

bq. I struggle to understand how to use split points in the create
statement.

You can't always use split points - it depends on your schema and the
knowledge you have about the data being loaded. In the example you gave,
you could potentially split on a prefix of the first PK column, the host in
this case.

CREATE TABLE stats.prod_metrics ( host char(50) not null, created_date date not null,
    txn_count bigint CONSTRAINT pk PRIMARY KEY (host, created_date) )
    DATA_BLOCK_ENCODING='NONE', VERSIONS=5, MAX_FILESIZE=2000000 SPLIT ON (?, ?, ?)

For example, you could have SPLIT ON ('e','g','m','v', ...)
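
Putting that together, an untested sketch of the full statement (the option
values and split points are only illustrative; since host is a CHAR column the
split points are plain string literals, not hexadecimal bytes):

CREATE TABLE stats.prod_metrics (
    host char(50) not null,
    created_date date not null,
    txn_count bigint
    CONSTRAINT pk PRIMARY KEY (host, created_date))
    DATA_BLOCK_ENCODING='NONE', VERSIONS=5, MAX_FILESIZE=2000000
    SPLIT ON ('e', 'g', 'm', 'v')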

Thanks,
James

On Fri, Aug 19, 2016 at 12:22 PM, John Leach <jleach4@gmail.com> wrote:

> Aaron,
>
> Looks like a permission issue?
>
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint threw
> java.lang.IllegalStateException: Failed to get FileSystem instance
> java.lang.IllegalStateException: Failed to get FileSystem instance
> at org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.start(
> SecureBulkLoadEndpoint.java:153)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost$
> Environment.startup(CoprocessorHost.java:411)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.
> loadInstance(CoprocessorHost.java:253)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.
> loadSystemCoprocessors(CoprocessorHost.java:156)
> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(
> RegionCoprocessorHost.java:231)
> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:723)
> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:631)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> newHRegion(HRegion.java:5978)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> createDaughterRegionFromSplits(HRegion.java:6397)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.
> stepsBeforePONR(SplitTransactionImpl.java:435)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.
> createDaughters(SplitTransactionImpl.java:278)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(
> SplitTransactionImpl.java:561)
> at org.apache.hadoop.hbase.regionserver.SplitRequest.
> doSplitting(SplitRequest.java:82)
> at org.apache.hadoop.hbase.regionserver.SplitRequest.run(
> SplitRequest.java:154)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission
> denied. user=splice is not the owner of inode=staging
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(
> FSPermissionChecker.java:250)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:227)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:190)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1780)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1764)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkOwner(FSDirectory.java:1733)
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.
> setPermission(FSDirAttrOp.java:63)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> setPermission(FSNamesystem.java:1653)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(
> NameNodeRpcServer.java:793)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.setPermission(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:461)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:73)
> at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:2499)
> at org.apache.hadoop.hdfs.DistributedFileSystem$30.
> doCall(DistributedFileSystem.java:1531)
> at org.apache.hadoop.hdfs.DistributedFileSystem$30.
> doCall(DistributedFileSystem.java:1527)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
> FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(
> DistributedFileSystem.java:1527)
> at org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.start(
> SecureBulkLoadEndpoint.java:140)
> ... 20 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.
> hadoop.security.AccessControlException): Permission denied. user=splice
> is not the owner of inode=staging
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(
> FSPermissionChecker.java:250)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:227)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:190)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1780)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1764)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkOwner(FSDirectory.java:1733)
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.
> setPermission(FSDirAttrOp.java:63)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> setPermission(FSNamesystem.java:1653)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(
> NameNodeRpcServer.java:793)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.setPermission(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:461)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1426)
> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.
> invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy15.setPermission(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslat
> orPB.setPermission(ClientNamenodeProtocolTranslatorPB.java:366)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
> RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
> RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy16.setPermission(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:2497)
> ... 25 more
> 2016-08-19 15:32:41,658 FATAL [regionserver/stl-colo-srv081.
> splicemachine.colo/10.1.1.181:16020-splits-1471620760791]
> regionserver.HRegionServer: ABORTING region server
> stl-colo-srv081.splicemachine.colo,16020,1471533848542: The coprocessor
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint threw
> java.lang.IllegalStateException: Failed to get FileSystem instance
> java.lang.IllegalStateException: Failed to get FileSystem instance
> at org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.start(
> SecureBulkLoadEndpoint.java:153)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost$
> Environment.startup(CoprocessorHost.java:411)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.
> loadInstance(CoprocessorHost.java:253)
> at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.
> loadSystemCoprocessors(CoprocessorHost.java:156)
> at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(
> RegionCoprocessorHost.java:231)
> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:723)
> at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:631)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> newHRegion(HRegion.java:5978)
> at org.apache.hadoop.hbase.regionserver.HRegion.
> createDaughterRegionFromSplits(HRegion.java:6397)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.
> stepsBeforePONR(SplitTransactionImpl.java:435)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.
> createDaughters(SplitTransactionImpl.java:278)
> at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(
> SplitTransactionImpl.java:561)
> at org.apache.hadoop.hbase.regionserver.SplitRequest.
> doSplitting(SplitRequest.java:82)
> at org.apache.hadoop.hbase.regionserver.SplitRequest.run(
> SplitRequest.java:154)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission
> denied. user=splice is not the owner of inode=staging
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(
> FSPermissionChecker.java:250)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:227)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:190)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1780)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1764)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkOwner(FSDirectory.java:1733)
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.
> setPermission(FSDirAttrOp.java:63)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> setPermission(FSNamesystem.java:1653)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(
> NameNodeRpcServer.java:793)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.setPermission(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:461)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:73)
> at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:2499)
> at org.apache.hadoop.hdfs.DistributedFileSystem$30.
> doCall(DistributedFileSystem.java:1531)
> at org.apache.hadoop.hdfs.DistributedFileSystem$30.
> doCall(DistributedFileSystem.java:1527)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
> FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(
> DistributedFileSystem.java:1527)
> at org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint.start(
> SecureBulkLoadEndpoint.java:140)
> ... 20 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.
> hadoop.security.AccessControlException): Permission denied. user=splice
> is not the owner of inode=staging
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(
> FSPermissionChecker.java:250)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:227)
> at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.
> checkPermission(FSPermissionChecker.java:190)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1780)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkPermission(FSDirectory.java:1764)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.
> checkOwner(FSDirectory.java:1733)
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.
> setPermission(FSDirAttrOp.java:63)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> setPermission(FSNamesystem.java:1653)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(
> NameNodeRpcServer.java:793)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.setPermission(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:461)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1709)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1426)
> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.
> invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy15.setPermission(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslat
> orPB.setPermission(ClientNamenodeProtocolTranslatorPB.java:366)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
> RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
> RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy16.setPermission(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:2497)
> ... 25 more
>
> Cheers,
> John
>
>
>
> On Aug 19, 2016, at 12:15 PM, Aaron Molitor <amolitor@splicemachine.com>
> wrote:
>
> Got to the point where data was loaded, but I could not get a count of
> LINEITEM to complete.  I realized that I should run compaction; kicking off
> compaction knocked over two regionservers.
>
> I've attached the logs. Could this be related to the change I made, setting
> hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily to 48?
>
> <hbase-regionserver.srv78.log>
> <hbase-regionserver.srv81.log>
>
> On Aug 19, 2016, at 09:40, John Leach <jleach4@gmail.com> wrote:
>
> Gabriel,
>
> Thanks for the response I appreciate it.
>
> I struggle to understand how to use split points in the create statement.
>
> (1) Creating a table with Split Points:
>
> CREATE TABLE stats.prod_metrics ( host char(50) not null, created_date date not null,
>     txn_count bigint CONSTRAINT pk PRIMARY KEY (host, created_date) )
>     DATA_BLOCK_ENCODING='NONE', VERSIONS=5, MAX_FILESIZE=2000000 SPLIT ON (?, ?, ?)
>
> Do you have to know the data distribution in the import files first?
> Do you pass in hexadecimal bytes in the SPLIT ON clause?  Is there a
> utility to generate them for you based on the primary key?  I think I am
> missing something here.
>
> As for the aggregation query: would large insert-into queries be faster in
> MapReduce than in Phoenix?
>
> Would they be stable in Phoenix at scale?
>
> Regards,
> John
>
>
>
> On Aug 19, 2016, at 9:52 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
> Hi John,
>
> You can actually pre-split a table when creating it, either by
> specifying split points in the CREATE TABLE statement[1] or by using
> salt buckets[2]. In my current use cases I always use salting, but
> take a look at the salting documentation[2] for the pros and cons of
> this.
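>
> As a rough illustration (untested, and the bucket count is arbitrary; it
> should be tuned to your cluster), salting is just a table option at creation
> time:
>
> CREATE TABLE stats.prod_metrics (
>     host char(50) not null,
>     created_date date not null,
>     txn_count bigint
>     CONSTRAINT pk PRIMARY KEY (host, created_date))
>     SALT_BUCKETS = 16
>
> This pre-splits the table into one region per bucket and prefixes each row
> key with a one-byte hash, so writes and scans are spread across the cluster.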
>
> Your approach as outlined is basically in line with how I would
> upload a large amount of data into Phoenix, taking into account that
> the tables would be split into multiple regions at creation time using
> salt buckets.
>
> About doing ETL to create an aggregate table: personally, if I were
> trying to create an aggregate table, I would probably do the
> aggregation outside of Phoenix (using MapReduce or Pig or something
> similar). This is of course assuming that the aggregate table is the
> only thing that is needed (i.e. that the un-aggregated table only
> exists to feed the aggregate table). This isn't to say that Phoenix
> can't be used to create the aggregate table, but I'm pretty sure that
> using HBase as a backing store for large scale ETL/aggregation like
> this just won't be as fast as MapReduce.
>
> However, if your use-case does require using Phoenix for the
> ETL/aggregation step, then your approach sounds correct.
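>
> For reference, the Phoenix side of that ETL step is just an UPSERT SELECT; a
> hedged sketch with a made-up target table and column names:
>
> UPSERT INTO agg_metrics (host, total_txns)
> SELECT host, SUM(txn_count) FROM stats.prod_metrics GROUP BY host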
>
> - Gabriel
>
> 1. http://phoenix.apache.org/language/index.html#create_table
> 2. http://phoenix.apache.org/salted.html
>
> On Fri, Aug 19, 2016 at 3:32 PM, John Leach <jleach4@gmail.com> wrote:
>
> Gabriel,
>
> Do you guys provide pre-split mechanisms (sampling of import/query data,
> splitting policies, etc.) or does the admin have to determine the split
> points?
>
> I guess that begs the question of how you would do a basic ETL operation
> in Phoenix?
>
> How would you do the following, going from 100 GB of data in the import
> table to 50 GB in the aggregate table?
>
> (1) Create import table.
> (2) Import data into that table.
> (3) create an aggregate table.
> (4) Insert data into aggregate table based on an aggregate of the imported
> table.
>
> Here is what I am gathering from the conversation...
>
> (1) Create Import Table.
> (2) Perform a pre-split based on some information about the data or via
> some split mechanism (If not, import via MapReduce does not scale).
> (3) Run a MapReduce job to import the data (other mechanisms do not scale?)
> (4) Compaction/Statistics Operation on Import Table (If not, query will
> not scale?)
> (5) Create Aggregate Table
> (6) Perform a pre-split based on some information about the Aggregate data
> or via some split mechanism. (If not, insert will not scale?).
> (7) Run the insert query?  If the import had to run on MapReduce to scale, I
> suspect the insert uses MapReduce as well, or is there some other mechanism
> that would scale?
> (8) Compaction/Statistics Operation on Aggregate Table
>
> I really appreciate all the support.  We are trying to run a Phoenix TPC-H
> benchmark and are struggling a bit to understand the process.
>
> Regards,
> John Leach
>
> On Aug 19, 2016, at 2:09 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
> Hi Aaron,
>
> How many regions are there in the LINEITEM table? The fact that you
> needed to bump the
> hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily setting up to
> 48 suggests that the amount of data going into a single region of that
> table is probably pretty large.
>
> Along the same line, I believe regions are the initial unit of
> parallelism (in the absence of statistics[1]) when running
> aggregate queries in Phoenix. This means that if you have oversized
> regions, you will have poor parallelism when running aggregate
> queries, which could lead to RPC timeouts.
>
> From what I see in the log info that you provided, your count query
> started at 14:14:06, and errored out at 14:34:15, which appears to be
> in line with the 20 minute HBase RPC timeout. This appears to indicate
> that a scan over a single region is taking more than 20 minutes, which
> again looks to me to be an indicator of an oversized region. If
> possible, I would look into splitting up your LINEITEM table into more
> (smaller) regions, which should improve both import and query
> performance.
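>
> A quick way to sanity-check this (the exact plan text varies by version) is
> to look at the chunk count Phoenix reports in the query plan, e.g.:
>
> EXPLAIN SELECT COUNT(*) FROM TPCH.LINEITEM;
>
> If the plan shows only a handful of chunks for roughly 100GB of data, the
> regions are almost certainly oversized.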
>
> - Gabriel
>
> 1. http://phoenix.apache.org/update_statistics.html
>
> On Thu, Aug 18, 2016 at 5:22 PM, Aaron Molitor
> <amolitor@splicemachine.com> wrote:
>
> Gabriel,
>
> Thanks for the help, it's good to know that those params can be passed
> from the command line and that the order is important.
>
> I am trying to load the 100GB TPC-H data set and ultimately run the TPC-H
> queries.  All of the tables loaded relatively easily except LINEITEM (the
> largest), which required me to increase hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily
> to 48.  After that the file loaded.
>
> This brings me to my next question though.  What settings do I need to
> change in order to count the [LINEITEM] table? At this point I have changed:
> - hbase.rpc.timeout set to 20 min
> - phoenix.query.timeoutMs set to 60 min
>
> I am still getting an error; it appears to be an RPC timeout, and as I have
> mentioned, I have already moved to an uncomfortably high setting.  Are there
> other settings I should be adjusting rather than the RPC timeout?
>
> For reference, here's the full sqlline interaction, including the error:
> ################################################################################
> Latest phoenix error:
> [splice@stl-colo-srv073 ~]$ /opt/phoenix/default/bin/sqlline.py
> $(hostname):2181:/hbase-unsecure
> Setting property: [incremental, false]
> Setting property: [isolation, TRANSACTION_READ_COMMITTED]
> issuing: !connect jdbc:phoenix:stl-colo-srv073.splicemachine.colo:2181:/hbase-unsecure
> none none org.apache.phoenix.jdbc.PhoenixDriver
> Connecting to jdbc:phoenix:stl-colo-srv073.splicemachine.colo:2181:/
> hbase-unsecure
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/phoenix/apache-
> phoenix-4.8.0-HBase-1.1-bin/phoenix-4.8.0-HBase-1.1-
> client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-
> 258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/
> StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> 16/08/18 14:14:06 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/08/18 14:14:08 WARN shortcircuit.DomainSocketFactory: The
> short-circuit local reads feature cannot be used because libhadoop cannot
> be loaded.
> Connected to: Phoenix (version 4.8)
> Driver: PhoenixEmbeddedDriver (version 4.8)
> Autocommit status: true
> Transaction isolation: TRANSACTION_READ_COMMITTED
> Building list of tables and columns for tab-completion (set fastconnect to
> true to skip)...
> 147/147 (100%) Done
> Done
> sqlline version 1.1.9
> 0: jdbc:phoenix:stl-colo-srv073.splicemachine> select count(*) from
> TPCH.LINEITEM;
> Error: org.apache.phoenix.exception.PhoenixIOException: Failed after
> attempts=36, exceptions:
> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713,
> seqNum=17 (state=08000,code=101)
> org.apache.phoenix.exception.PhoenixIOException:
> org.apache.phoenix.exception.PhoenixIOException: Failed after
> attempts=36, exceptions:
> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>
>       at org.apache.phoenix.util.ServerUtil.parseServerException(
> ServerUtil.java:111)
>       at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
> BaseResultIterators.java:774)
>       at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
> BaseResultIterators.java:720)
>       at org.apache.phoenix.iterate.ConcatResultIterator.getIterators(
> ConcatResultIterator.java:50)
>       at org.apache.phoenix.iterate.ConcatResultIterator.currentIterator(
> ConcatResultIterator.java:97)
>       at org.apache.phoenix.iterate.ConcatResultIterator.next(
> ConcatResultIterator.java:117)
>       at org.apache.phoenix.iterate.BaseGroupedAggregatingResultIt
> erator.next(BaseGroupedAggregatingResultIterator.java:64)
>       at org.apache.phoenix.iterate.UngroupedAggregatingResultIter
> ator.next(UngroupedAggregatingResultIterator.java:39)
>       at org.apache.phoenix.jdbc.PhoenixResultSet.next(
> PhoenixResultSet.java:778)
>       at sqlline.BufferedRows.<init>(BufferedRows.java:37)
>       at sqlline.SqlLine.print(SqlLine.java:1649)
>       at sqlline.Commands.execute(Commands.java:833)
>       at sqlline.Commands.sql(Commands.java:732)
>       at sqlline.SqlLine.dispatch(SqlLine.java:807)
>       at sqlline.SqlLine.begin(SqlLine.java:681)
>       at sqlline.SqlLine.start(SqlLine.java:398)
>       at sqlline.SqlLine.main(SqlLine.java:292)
> Caused by: java.util.concurrent.ExecutionException:
> org.apache.phoenix.exception.PhoenixIOException: Failed after
> attempts=36, exceptions:
> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:202)
>       at org.apache.phoenix.iterate.BaseResultIterators.getIterators(
> BaseResultIterators.java:769)
>       ... 15 more
> Caused by: org.apache.phoenix.exception.PhoenixIOException: Failed after
> attempts=36, exceptions:
> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>
>       at org.apache.phoenix.util.ServerUtil.parseServerException(
> ServerUtil.java:111)
>       at org.apache.phoenix.iterate.TableResultIterator.initScanner(
> TableResultIterator.java:174)
>       at org.apache.phoenix.iterate.TableResultIterator.next(
> TableResultIterator.java:124)
>       at org.apache.phoenix.iterate.SpoolingResultIterator.<init>(
> SpoolingResultIterator.java:139)
>       at org.apache.phoenix.iterate.SpoolingResultIterator.<init>(
> SpoolingResultIterator.java:97)
>       at org.apache.phoenix.iterate.SpoolingResultIterator.<init>(
> SpoolingResultIterator.java:69)
>       at org.apache.phoenix.iterate.SpoolingResultIterator$
> SpoolingResultIteratorFactory.newIterator(SpoolingResultIterator.java:92)
>       at org.apache.phoenix.iterate.ParallelIterators$1.call(
> ParallelIterators.java:114)
>       at org.apache.phoenix.iterate.ParallelIterators$1.call(
> ParallelIterators.java:106)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask.run(
> JobManager.java:183)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Failed after attempts=36, exceptions:
> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>
>       at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadRepli
> cas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>       at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:199)
>       at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(
> ScannerCallableWithReplicas.java:59)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.
> callWithoutRetries(RpcRetryingCaller.java:200)
>       at org.apache.hadoop.hbase.client.ClientScanner.call(
> ClientScanner.java:320)
>       at org.apache.hadoop.hbase.client.ClientScanner.
> nextScanner(ClientScanner.java:295)
>       at org.apache.hadoop.hbase.client.ClientScanner.
> initializeScannerInConstruction(ClientScanner.java:160)
>       at org.apache.hadoop.hbase.client.ClientScanner.<init>(
> ClientScanner.java:155)
>       at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
>       at org.apache.phoenix.iterate.TableResultIterator.initScanner(
> TableResultIterator.java:170)
>       ... 12 more
> Caused by: java.net.SocketTimeoutException: callTimeout=60000,
> callDuration=1200310: row '' on table 'TPCH.LINEITEM' at
> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9.,
> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(
> RpcRetryingCaller.java:159)
>       at org.apache.hadoop.hbase.client.ResultBoundedCompletionService
> $QueueingFuture.run(ResultBoundedCompletionService.java:64)
>       ... 3 more
> Caused by: java.io.IOException: Call to stl-colo-srv076.splicemachine.
> colo/10.1.1.176:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27,
> waitTime=1200001, operationTimeout=1200000 expired.
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(
> AbstractRpcClient.java:278)
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(
> RpcClientImpl.java:1239)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(
> AbstractRpcClient.java:217)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient$
> BlockingRpcChannelImplementation.callBlockingMethod(
> AbstractRpcClient.java:318)
>       at org.apache.hadoop.hbase.protobuf.generated.
> ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32831)
>       at org.apache.hadoop.hbase.client.ScannerCallable.
> openScanner(ScannerCallable.java:373)
>       at org.apache.hadoop.hbase.client.ScannerCallable.call(
> ScannerCallable.java:200)
>       at org.apache.hadoop.hbase.client.ScannerCallable.call(
> ScannerCallable.java:62)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.
> callWithoutRetries(RpcRetryingCaller.java:200)
>       at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$
> RetryingRPC.call(ScannerCallableWithReplicas.java:350)
>       at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$
> RetryingRPC.call(ScannerCallableWithReplicas.java:324)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(
> RpcRetryingCaller.java:126)
>       ... 4 more
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27,
> waitTime=1200001, operationTimeout=1200000 expired.
>       at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(
> RpcClientImpl.java:1213)
>       ... 14 more
> 0: jdbc:phoenix:stl-colo-srv073.splicemachine>
> ################################################################################
>
> On Aug 18, 2016, at 02:15, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
> Hi Aaron,
>
> I'll answer your questions directly first, but please see the bottom
> part of this mail for important additional details.
>
> You can specify the
> "hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily" parameter
> (referenced from your StackOverflow link) on the command line of your
> CsvBulkLoadTool command -- my understanding is that this is a purely
> client-side parameter. You would provide it via -D as follows:
>
>  hadoop jar phoenix-<version>-client.jar
> org.apache.phoenix.mapreduce.CsvBulkLoadTool
> -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 <other
> command-line parameters>
>
> The important point in the above example is that config-based
> parameters specified with -D are given before the application-level
> parameters, and after the class name to be run.
>
> From my read of the HBase code, you can also specify
> the "hbase.hregion.max.filesize" parameter in the same way (in this
> context it's a client-side parameter).
>
> As far as speeding things up, the main points to consider are:
> - ensure that compression is enabled for map-reduce jobs on your
> cluster -- particularly map-output (intermediate) compression - see
> https://datameer.zendesk.com/hc/en-us/articles/204258750-
> How-to-Use-Intermediate-and-Final-Output-Compression-MR1-YARN-
> for a good overview
> - check the ratio of map output records vs spilled records in the
> counters on the import job. If the spilled records are higher than map
> output records (e.g. twice as high or three times as high), then you
> will probably benefit from raising the mapreduce.task.io.sort.mb
> setting (see https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-
> client/hadoop-mapreduce-client-core/mapred-default.xml)
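>
> Combining those two points, a hedged example of passing them to the bulk
> load job (standard MR2 property names; the values are only illustrative, and
> Snappy assumes the native libraries are installed -- swap in another codec
> if not):
>
> hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>     -Dmapreduce.map.output.compress=true \
>     -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
>     -Dmapreduce.task.io.sort.mb=512 \
>     <other command-line parameters>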
>
> Now those are the answers to your questions, but I'm curious about why
> you're getting more than 32 HFiles in a single column family of a
> single region. I assume that this means that you're loading large
> amounts of data into a small number of regions. This is probably not a
> good thing -- it may impact performance of HBase in general (because
> each region has such a large amount of data), and will also have a
> very negative impact on the running time of your import job (because
> part of the parallelism of the import job is determined by the number
> of regions being written to). I don't think you mentioned how many
> regions you have on your table that you're importing to, but
> increasing the number of regions will likely resolve several problems
> for you. Another reason to do this is the fact that HBase will likely
> start splitting your regions after this import due to their size.
>
> - Gabriel
>
>
> On Thu, Aug 18, 2016 at 3:47 AM, Aaron Molitor
> <amolitor@splicemachine.com> wrote:
>
> Hi all, I'm running the CsvBulkLoadTool to pull in some data.  The
> MapReduce job appears to complete, and gives some promising information:
>
>
> ################################################################################
>      Phoenix MapReduce Import
>              Upserts Done=600037902
>      Shuffle Errors
>              BAD_ID=0
>              CONNECTION=0
>              IO_ERROR=0
>              WRONG_LENGTH=0
>              WRONG_MAP=0
>              WRONG_REDUCE=0
>      File Input Format Counters
>              Bytes Read=79657289180
>      File Output Format Counters
>              Bytes Written=176007436620
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles
> from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process
> identifier=hconnection-0xa24982f connecting to ZooKeeper
> ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=stl-colo-srv073.splicemachine.colo:2181
> sessionTimeout=1200000 watcher=hconnection-0xa24982f0x0,
> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to
> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not
> attempt to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established
> to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment
> complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
> sessionid = 0x15696476bf90484, negotiated timeout = 40000
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for
> TPCH.LINEITEM from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
> 16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed
> connection cannot be used for bulkload. Creating unmanaged connection.
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process
> identifier=hconnection-0x456a0752 connecting to ZooKeeper
> ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=stl-colo-srv073.splicemachine.colo:2181
> sessionTimeout=1200000 watcher=hconnection-0x456a07520x0,
> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to
> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not
> attempt to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established
> to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment
> complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181,
> sessionid = 0x15696476bf90485, negotiated timeout = 40000
> 16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
> ################################################################################
>
> and eventually errors out with this exception.
>
> ################################################################################
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> 88b40cbbc4c841f99eae906af3b93cda first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04
> last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> de309e5c7b3841a6b4fd299ac8fa8728 first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01
> last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> e7ed8bc150c9494b8c064a022b3609e0 first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04
> last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> c35e01b66d85450c97da9bb21bfc650f first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04
> last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> b5904451d27d42f0bcb4c98a5b14f3e9 first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01
> last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/
> 66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/
> 9d26e9a00e5149cabcb415c6bb429a34 first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04
> last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
> 16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load
> more than 32 hfiles to family 0 of region with start key
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation:
> Closing master protocol: MasterService
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation:
> Closing zookeeper sessionid=0x15696476bf90485
> 16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485
> closed
> 16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
> Exception in thread "main" java.io.IOException: Trying to load more than
> 32 hfiles to one family of one region
>      at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.
> doBulkLoad(LoadIncrementalHFiles.java:420)
>      at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.
> doBulkLoad(LoadIncrementalHFiles.java:314)
>      at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.
> completebulkload(AbstractBulkLoadTool.java:355)
>      at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(
> AbstractBulkLoadTool.java:332)
>      at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(
> AbstractBulkLoadTool.java:270)
>      at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(
> AbstractBulkLoadTool.java:183)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>      at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(
> CsvBulkLoadTool.java:101)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:606)
>      at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>      at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> ################################################################################
>
> A count of the table shows 0 rows:
> 0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 0         |
> +-----------+
>
> Some quick googling gives an HBase param that could be tweaked (
> http://stackoverflow.com/questions/24950393/trying-to-
> load-more-than-32-hfiles-to-one-family-of-one-region).
>
> Main Questions:
> - Will the CsvBulkLoadTool pick up these params, or will I need to put
> them in hbase-site.xml?
> - Is there anything else I can tune to make this run quicker? It took 5
> hours for it to fail with the error above.
>
> This is a 9 node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix
> 4.8.0-HBase-1.1
> Ambari default settings except for:
> - HBase RS heap size is set to 24GB
> - hbase.rpc.timeout set to 20 min
> - phoenix.query.timeoutMs set to 60 min
>
> All nodes are Dell R420s with 2x E5-2430 v2 CPUs (24 vCPUs) and 64GB RAM.
>
>
>
>
>
>
>
>
>
>
