Try setting hbase.regionserver.lease.period on the region servers to a higher value like 6000000.
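
For example, in the hbase-site.xml on each region server (the value is in milliseconds, and a restart is needed for it to take effect):

<property>
  <name>hbase.regionserver.lease.period</name>
  <value>6000000</value>
</property>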

Also, PHOENIX-2357 (currently under review) will make this unnecessary. And James was spot on about drop column: once we do the column name indirection (PHOENIX-1598 and PHOENIX-1940), you'd likely still want to put delete markers on each column cell, but you could do it asynchronously.

Thanks,
James

On Wednesday, November 11, 2015, James Heather <james.heather@mendeley.com> wrote:
I don't know what the answer is to your question, but I have hit this before.

It seems that adding a column is a lazy operation that changes only the metadata, so it returns almost immediately; dropping a column is not. In fact, if you add a column and then immediately drop it, the drop still takes ages, presumably because Phoenix has to check each row to see whether there's anything it needs to remove.
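
Presumably the work is roughly equivalent to this (just a sketch with the raw HBase client, not Phoenix's actual code; the table and column names are from your example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DropColumnCost {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] cf = Bytes.toBytes("0");     // Phoenix's default column family
        byte[] col = Bytes.toBytes("YYY");  // the column being dropped
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("XXX"));
             ResultScanner rows = table.getScanner(new Scan().addColumn(cf, col))) {
            // One pass over the whole table, placing a delete marker on every
            // cell of the dropped column. Phoenix does this server-side in a
            // coprocessor and batches the mutations (that's what the
            // "Committing bactch of 1000 mutations" lines in your region
            // server logs are), but it still has to visit every row.
            for (Result row : rows) {
                Delete d = new Delete(row.getRow());
                d.addColumns(cf, col);  // delete marker for all versions
                table.delete(d);
            }
        }
    }
}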

I don't know if it would be possible to implement a lazy drop, so that the data isn't really removed from the row until the row is accessed. Obviously some care would be needed if a column was added with the same name before the previous one had been completely removed.

I suspect that this will be much improved if the Phoenix crew manage to implement the level of indirection that @JT mentioned for column names. The columns in HBase would then have uniquely generated names, and the Phoenix names would map onto those HBase names. Lazy dropping would be easier in that world, because a column could never be accessed after it had been dropped, and any write to the row could also clear out data from columns that no longer exist.
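
In that world the metadata would just hold a mapping from logical names to generated qualifiers, conceptually something like this (invented names, purely to illustrate):

import java.util.HashMap;
import java.util.Map;

public class ColumnIndirectionSketch {
    public static void main(String[] args) {
        // Logical (Phoenix) column name -> generated, never-reused HBase qualifier.
        Map<String, byte[]> qualifierFor = new HashMap<>();
        qualifierFor.put("YYY", new byte[] { 0x0B });  // assigned on ADD COLUMN
        // DROP COLUMN would then just unmap the name; the orphaned cells could
        // be purged lazily (e.g. at compaction) since nothing resolves to them.
        qualifierFor.remove("YYY");
    }
}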

James

On 11/11/15 15:08, Lukáš Lalinský wrote:
When running "ALTER TABLE xxx DROP COLUMN yyy" on a table with about
6M rows (which I considered small enough), it always times out, and I
can't see how to get it to execute successfully even once.

I was getting some internal Phoenix timeouts at first, but after
setting the following properties, the errors changed:

hbase.client.scanner.timeout.period=6000000
phoenix.query.timeoutMs=6000000
hbase.rpc.timeout=6000000
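
(For reference, I'm passing them from the client roughly like this; the ZooKeeper quorum is a placeholder, and the same overrides could go into the hbase-site.xml on the client classpath:)

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class DropWithTimeouts {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("phoenix.query.timeoutMs", "6000000");
        props.setProperty("hbase.client.scanner.timeout.period", "6000000");
        props.setProperty("hbase.rpc.timeout", "6000000");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:phoenix:zk-host:2181", props)) {  // placeholder quorum
            conn.createStatement().execute("ALTER TABLE xxx DROP COLUMN yyy");
        }
    }
}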

Now it fails with errors like this:

Wed Nov 11 13:44:25 UTC 2015,
RpcRetryingCaller{globalStartTime=1447246894248, pause=100,
retries=35}, java.io.IOException: Call to XXX/XXX:16020 failed on
local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call id=1303, waitTime=60001, operationTimeout=60000 expired.
Wed Nov 11 13:45:45 UTC 2015,
RpcRetryingCaller{globalStartTime=1447246894248, pause=100,
retries=35}, java.io.IOException: Call to XXX/XXX:16020 failed on
local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call id=1341, waitTime=60001, operationTimeout=60000 expired.

at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
... 3 more
Caused by: java.io.IOException: Call to XXX/XXX:16020 failed on local
exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=1341, waitTime=60001, operationTimeout=60000 expired.
at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1232)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1200)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:369)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:343)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 4 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=1341, waitTime=60001, operationTimeout=60000 expired.
at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
... 14 more

While it's still running, I see log entries like this on region servers:

2015-11-11 15:00:49,059 WARN
[B.defaultRpcServer.handler=9,queue=0,port=16020]
coprocessor.UngroupedAggregateRegionObserver: Committing bactch of
1000 mutations for MEDIA
2015-11-11 15:00:49,259 WARN
[B.defaultRpcServer.handler=12,queue=0,port=16020]
coprocessor.UngroupedAggregateRegionObserver: Committing bactch of
1000 mutations for MEDIA
[... the same "Committing bactch of 1000 mutations for MEDIA" entry
keeps repeating every few hundred milliseconds, alternating between
handlers 9 and 12 ...]

Any ideas on how to solve this? I'd even be fine with just removing
the column from the Phoenix metadata and keeping the values in HBase,
but I don't see how to do that other than running DROP COLUMN and
waiting for it to time out.

Lukas