Hi Guys,

I have been testing out Phoenix local indexes and I'm facing an issue after restarting the entire HBase cluster.

Scenario: I'm using Phoenix 4.4 and HBase 1.1.1. My test cluster contains 10 machines, and the main table has 300 pre-split regions, which implies 300 regions on the local index table as well. To configure Phoenix I followed this tutorial.
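
For reference, these are the local-index related properties I added to hbase-site.xml following the tutorial (I'm listing them from memory, so apologies if a class name is slightly off):

On the master:

<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.phoenix.hbase.index.balancer.IndexLoadBalancer</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.phoenix.hbase.index.master.IndexMasterObserver</value>
</property>

On the region servers:

<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.regionserver.LocalIndexMerger</value>
</property>
<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.phoenix.hbase.index.wal.IndexedWALEditCodec</value>
</property>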

When I start a fresh cluster everything is just fine: the local index is created, and I can insert data and query it using the proper index. The problem comes when I perform a full restart of the cluster to update some configurations; at that point I'm not able to bring the cluster back up. I should probably do a proper rolling restart, but it looks like Ambari is not doing one in some situations.
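
Just to give an idea of the usage (the column names below are only illustrative, not my real schema), the index and a typical query look roughly like this:

-- Local index on a (hypothetical) EVENT_TIME column of the main table
CREATE LOCAL INDEX BIDDING_EVENTS_TIME_IDX ON BIDDING_EVENTS (EVENT_TIME);

-- EXPLAIN shows the range scan going over _LOCAL_IDX_BIDDING_EVENTS rather than the data table
EXPLAIN SELECT EVENT_ID, EVENT_TIME
FROM BIDDING_EVENTS
WHERE EVENT_TIME > TO_DATE('2016-01-01 00:00:00');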

Most of the servers are throwing exceptions like:

INFO  [htable-pool7-t1] client.AsyncProcess: #5, table=_LOCAL_IDX_BIDDING_EVENTS, attempt=27/350 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region _LOCAL_IDX_BIDDING_EVENTS,57e4b17e4b17e4ac,1451943466164.253bdee3695b566545329fa3ac86d05e. is not online on ip-10-5-4-24.ec2.internal,16020,1451996088952
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1991)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
 on ip-10-5-4-24.ec2.internal,16020,1451942002174, tracking started null, retrying after=20001ms, replay=1ops
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t1] client.AsyncProcess: #3, waiting for 2  actions to finish
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t2] client.AsyncProcess: #4, waiting for 2  actions to finish
 
It looks like they are getting into a state where some region servers are waiting for regions that are not yet available on other servers.

In the HBase UI I can see servers stuck on messages like these:

Description: Replaying edits from hdfs://.../recovered.edits/0000000000000464197
Status: Running pre-WAL-restore hook in coprocessors (since 48mins, 45sec ago)

Another interesting thing I noticed is that the coprocessor list is empty for the servers that are stuck with 0 regions assigned.

The HBase master goes down after logging a few messages like this:

GeneralBulkAssigner: Failed bulking assigning N regions

I was able to perform full restarts before I started using local indexes and everything worked fine. This is probably a misconfiguration on my side, but I have tried different properties and approaches to restart the cluster and I'm still unable to bring it back up.

My understanding of local indexes in Phoenix (please correct me if I'm wrong) is that they are normal HBase tables and Phoenix places the index regions so they are co-located with the corresponding data table regions. Is that data locality fully maintained when we lose N region servers and/or regions are moved?
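
In case it helps, this is roughly how I've been sanity-checking the co-location after a restart: for each start key I compare which server hosts the data region and which hosts the corresponding local index region. It's only a quick sketch against the plain HBase 1.1 client API, nothing more:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckLocalIndexColocation {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // start key -> hosting server for the data table regions
      Map<String, String> dataRegions = new HashMap<>();
      for (HRegionLocation loc : conn.getRegionLocator(
          TableName.valueOf("BIDDING_EVENTS")).getAllRegionLocations()) {
        dataRegions.put(Bytes.toStringBinary(loc.getRegionInfo().getStartKey()),
            loc.getServerName() == null ? null : loc.getServerName().getHostname());
      }
      // compare against the local index regions with the same start keys
      for (HRegionLocation loc : conn.getRegionLocator(
          TableName.valueOf("_LOCAL_IDX_BIDDING_EVENTS")).getAllRegionLocations()) {
        String startKey = Bytes.toStringBinary(loc.getRegionInfo().getStartKey());
        String dataHost = dataRegions.get(startKey);
        String idxHost = loc.getServerName() == null ? null : loc.getServerName().getHostname();
        if (dataHost == null || !dataHost.equals(idxHost)) {
          System.out.println("Not co-located: startKey=" + startKey
              + " data=" + dataHost + " index=" + idxHost);
        }
      }
    }
  }
}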

Any insights would be very helpful.

Thank you
Cheers
Pedro