flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Soo Kim <philipj...@gmail.com>
Subject Re: Cause and solution for org.apache.flume.EventDeliveryException: NettyAvroRpcClient: Handshake timed out after 20000ms
Date Fri, 05 Oct 2012 08:21:02 GMT
Hari, thanks for the advices.

To verify if HBase is the root cause of problems, I changed server B's
sink configuration from HBaseSink to logger.
If nothing goes wrong for a couple of days with changed configuration,
it is clear that HBase is the root cause as you mentioned.

Also, I'll test with AsyncHBaseSink and see the problem occurs again.

Let me share the test results soon here.

Thanks,
Soo P.J. Kim

On Thu, Oct 4, 2012 at 12:15 PM, Hari Shreedharan
<hshreedharan@cloudera.com> wrote:
>
> Hi Soo,
>
>
> The Handshake timeout is only there in place to work around
> https://issues.apache.org/jira/browse/AVRO-1122. As a result of this
> bug, Avro handshake can deadlock and thus cause the sink to hang. If
> this is happening very rarely it is probably ok, but if it is
> happening consistently, try setting the "connect-timeout" parameter
> for the Avro Sink to a higher value (default is 20000 ms - this param
> is in ms).
>
>
> The second set of HBase Exceptions seems to be because HBase is unable
> to communicate with the servers in question. Can you verify that your
> cluster is healthy and region servers are not dead/machines not
> offline? Also, you should probably use the AsyncHBaseSink which
> provides the same guarantees but gives far better performance.
>
>
> Regards,
>
> Hari
>
>
> On Wed, Oct 3, 2012 at 7:57 PM, Soo Kim <philipjkim@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm using Flume NG (1.2.0), and having some trouble with timeout exception.
> >
> > I have 2 servers. A log file is appended at server A. And server B has HBase.
> > So I installed Flume NG to server A as a source, and to server B as a sink.
> >
> > flume.conf for server A is:
> > src_agent.sources = tail
> > src_agent.channels = memoryChannel
> > src_agent.sinks = avroSink
> >
> > src_agent.sources.tail.type = exec
> > src_agent.sources.tail.command = tail -F /home/www/crash_report/log/crash.log
> > src_agent.sources.tail.channels = memoryChannel
> >
> > src_agent.sinks.avroSink.channel = memoryChannel
> > src_agent.sinks.avroSink.type = avro
> > src_agent.sinks.avroSink.hostname = serverb.hostname.com
> > src_agent.sinks.avroSink.port = 10001
> >
> > src_agent.channels.memoryChannel.type = memory
> > src_agent.channels.memoryChannel.capacity = 10000
> >
> > And flume.conf for server B is:
> > dst_agent.sources = avro
> > dst_agent.channels = memoryChannel
> > dst_agent.sinks = hbaseSink
> >
> > dst_agent.sources.avro.type = avro
> > dst_agent.sources.avro.bind = 0.0.0.0
> > dst_agent.sources.avro.port = 10001
> > dst_agent.sources.avro.channels = memoryChannel
> >
> > dst_agent.sinks.hbaseSink.channel = memoryChannel
> > dst_agent.sinks.hbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
> > dst_agent.sinks.hbaseSink.table = crash_report_flume
> > dst_agent.sinks.hbaseSink.columnFamily = f
> > dst_agent.sinks.hbaseSink.column = foo
> > dst_agent.sinks.hbaseSink.serializer =
> > org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
> > dst_agent.sinks.hbaseSink.serializer.payloadColumn = pcol
> > dst_agent.sinks.hbaseSink.serializer.incrementColumn = icol
> >
> > dst_agent.channels.memoryChannel.type = memory
> > dst_agent.channels.memoryChannel.capacity = 10000
> >
> > Flume startup script for server A is:
> > $FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf --conf-file
> > $FLUME_HOME/conf/flume-hbase-src.conf --name src_agent
> > -Dflume.root.logger=INFO,console 1>>flume.log 2>>flume.log &
> >
> > Script for server B:
> > $FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf --conf-file
> > $FLUME_HOME/conf/flume-hbase-dst.conf --name dst_agent
> > -Dflume.root.logger=INFO,console 1>>flume.log 2>>flume.log &
> >
> > All worked fine after startup. Logs were collected to HBase as
> > expected. But after a couple of hours I found something went wrong.
> >
> > Here's the flume error logs in server A:
> > 2012-10-02 20:28:59,641
> > (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
> > org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)]
> > Unable to deliver event. Exception follows.
> > org.apache.flume.EventDeliveryException: Failed to send events
> >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
> >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >         at java.lang.Thread.run(Thread.java:636)
> > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
> > { host: serverb.hostname.com, port: 10001 }: Failed to send batch
> >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
> >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
> >         ... 3 more
> > Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
> > { host: serverb.hostname.com, port: 10001 }: Handshake timed out after
> > 20000ms
> >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
> >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
> >         ... 4 more
> > Caused by: java.util.concurrent.TimeoutException
> >         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
> >         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
> >         at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
> >         ... 5 more
> >
> > There were lots of timeout exceptions, and the following exception
> > occured sparsely:
> > 2012-10-04 09:44:06,408
> > (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
> > org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)]
> > Unable to deliver event. Exception follows.
> > org.apache.flume.EventDeliveryException: Failed to send events
> >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
> >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >         at java.lang.Thread.run(Thread.java:636)
> > Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host:
> > host: serverb.hostname.com, port: 10001 }: RPC connection error
> >         at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:117)
> >         at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:93)
> >         at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:507)
> >         at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
> >         at org.apache.flume.sink.AvroSink.createConnection(AvroSink.java:182)
> >         at org.apache.flume.sink.AvroSink.verifyConnection(AvroSink.java:222)
> >         at org.apache.flume.sink.AvroSink.process(AvroSink.java:282)
> >         ... 3 more
> > Caused by: java.io.IOException: Error connecting to host:
> > serverb.hostname.com/192.168.1.2:10001
> >         at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:249)
> >         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:198)
> >         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:147)
> >         at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:106)
> >         ... 9 more
> > Caused by: java.nio.channels.ClosedChannelException
> >         at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$1.operationComplete(NioClientSocketPipelineSink.java:157)
> >         at org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:381)
> >         at org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:367)
> >         at org.jboss.netty.channel.DefaultChannelFuture.setSuccess(DefaultChannelFuture.java:316)
> >         at org.jboss.netty.channel.AbstractChannel$ChannelCloseFuture.setClosed(AbstractChannel.java:351)
> >         at org.jboss.netty.channel.AbstractChannel.setClosed(AbstractChannel.java:188)
> >         at org.jboss.netty.channel.socket.nio.NioSocketChannel.setClosed(NioSocketChannel.java:146)
> >         at org.jboss.netty.channel.socket.nio.NioWorker.close(NioWorker.java:592)
> >         at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.close(NioClientSocketPipelineSink.java:415)
> >         at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processConnectTimeout(NioClientSocketPipelineSink.java:379)
> >         at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:299)
> >         at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> >         at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >         ... 1 more
> >
> > I'd like to know what's the root cause for those errors, and how I can
> > solve them.
> >
> > FYI here's error logs in server B:
> > 2012-10-02 23:13:41,479
> > (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
> > org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:221)]
> > Failed to commit transaction.Transaction rolled back.
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> > Failed 100 actions: servers with issues: 192.168.1.3:60020,
> >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1424)
> >         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:624)
> >         at org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:207)
> >         at org.apache.flume.sink.hbase.HBaseSink.process(HBaseSink.java:200)
> >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >         at java.lang.Thread.run(Thread.java:662)
> > 2012-10-02 23:13:41,480
> > (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
> > org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:228)]
> > Failed to commit transaction.Transaction rolled back.
> > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> > Failed 100 actions: servers with issues: 192.168.1.3:60020,
> >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1424)
> >         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:624)
> >         at org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:207)
> >         at org.apache.flume.sink.hbase.HBaseSink.process(HBaseSink.java:200)
> >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >         at java.lang.Thread.run(Thread.java:662)
> > 2012-10-02 23:13:41,480
> > (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR -
> > org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)]
> > Unable to deliver event. Exception follows.
> > org.apache.flume.EventDeliveryException: Failed to commit
> > transaction.Transaction rolled back.
> >         at org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:230)
> >         at org.apache.flume.sink.hbase.HBaseSink.process(HBaseSink.java:200)
> >         at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >         at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >         at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> > Failed 100 actions: servers with issues: 192.168.1.3:60020,
> >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1424)
> >         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:624)
> >         at org.apache.flume.sink.hbase.HBaseSink.putEventsAndCommit(HBaseSink.java:207)
> >         ... 4 more
> >
> > --
> > Thanks,
> > Soo Philip Kim
> > Email : philipjkim@gmail.com
> > Phone : +82-10-7222-6078

Mime
View raw message