flume-user mailing list archives

From Chris Neal <cwn...@gmail.com>
Subject Re: flume.EventDeliveryException: Failed to send events
Date Tue, 16 Apr 2013 20:05:23 GMT
Yeah, I can script something up.  Maybe increasing the timeouts was all it
needed and you'll never see another reply to this thread... ;)  But, if
not, I'll post the thread dump next.

Thanks again for all the suggestions.
Chris


On Tue, Apr 16, 2013 at 2:57 PM, Brock Noland <brock@cloudera.com> wrote:

> Any chance of getting a thread dump (or a few) on both the source and
> destination agent during an incident? :) It'd be a little work, but a
> script could look for "Unable to deliver event. Exception follows." and
> do the thread dumps.
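>
> Something along these lines, as an untested sketch (the log path and the
> jps pattern below are assumptions - adjust for your install):
>
> #!/bin/sh
> # tail the agent log; take a thread dump whenever the failure message appears
> LOG=/var/log/flume-ng/flume.log                    # assumed log location
> PID=$(jps -l | grep -i flume | awk '{print $1}')   # assumes one agent JVM
> tail -F "$LOG" | while read line; do
>   case "$line" in
>     *"Unable to deliver event. Exception follows."*)
>       jstack "$PID" > /tmp/flume-threads-$(date +%s).txt ;;
>   esac
> done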
>
>
> On Tue, Apr 16, 2013 at 2:49 PM, Chris Neal <cwneal@gmail.com> wrote:
>
>> Thanks for all the input guys. :)
>>
>> @Hari:
>> The FileChannel at the AvroSource is on a SAN disk, so I don't think it
>> is the bottleneck in my case.  It is the same disk for both checkpoint and
>> data.  My queue depth remains relatively stable around 2000, which doesn't
>> bother me because of my batch size.
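>>
>> If anyone wants to watch the same numbers without jconsole, here is a rough
>> sketch (assumes the agent was started with Flume's JSON monitoring enabled,
>> e.g. -Dflume.monitoring.type=http -Dflume.monitoring.port=34545; the port is
>> just an example):
>>
>> # poll the channel's depth once a minute; ChannelSize is one of the
>> # channel counter metrics in the JSON the agent serves at /metrics
>> while true; do
>>   curl -s http://localhost:34545/metrics | grep -o '"ChannelSize":"[0-9]*"'
>>   sleep 60
>> done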
>>
>> On the AvroSink side, I have 126 ExecSources spread across 12 JVMs at 1GB
>> heap each.  Each JVM has 4 AvroSinks across 2 separate servers, load
>> balanced and round-robined (2 connections to each).  Is that a small enough
>> number of connections to remove the thread parameter on the AvroSource?
>>
>> @Brock
>> Each of the JVMs on the downstream agents has a 4GB heap.  I watch the
>> MBeans pretty closely via jconsole, and they sit around 3GB used.
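>>
>> In case it's useful to anyone, this is roughly what goes in flume-env.sh to
>> let jconsole attach remotely (the port and the disabled auth/ssl here are
>> illustration only - tighten them on anything exposed):
>>
>> # expose the agent's MBeans over JMX for jconsole
>> export JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote \
>>   -Dcom.sun.management.jmxremote.port=5445 \
>>   -Dcom.sun.management.jmxremote.authenticate=false \
>>   -Dcom.sun.management.jmxremote.ssl=false"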
>>
>> Thanks again for all the help!
>>
>>
>> On Tue, Apr 16, 2013 at 2:26 PM, Brock Noland <brock@cloudera.com> wrote:
>>
>>> Another possibility is that the downstream agent is near capacity from a
>>> memory perspective. What is your heap size for these agents?
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:19 PM, Hari Shreedharan <
>>> hshreedharan@cloudera.com> wrote:
>>>
>>>>  One possibility is that you are hitting the file channel disk too hard
>>>> - do you have just one disk for both checkpoint and data? If so, that might
>>>> be what's slowing it down. Also, you should probably just remove the thread
>>>> limit on AvroSource. It usually does not cause too much havoc unless you
>>>> have a massive number of connections spawning too many threads.
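>>>>
>>>> Concretely, that would just mean dropping (or commenting out) the threads
>>>> line from the source definition you posted, so the source falls back to
>>>> Netty's default worker pool:
>>>>
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
>>>> # hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64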
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
>>>> On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:
>>>>
>>>> Thanks Hari.
>>>>
>>>> I increased both the connect and request timeouts to 40000ms, and I'm
>>>> testing that now.  I am talking on a LAN though, which is part of the
>>>> reason I'm concerned.  Seems like it might not actually be a network issue,
>>>> but perhaps an overloaded AvroSource on the back end?
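>>>>
>>>> For reference, those were just these two sink properties (shown here for
>>>> one sink; the same pair is set on each of mine):
>>>>
>>>> udprodae01.sinks.avro-hadoopjt01-sink.connect-timeout = 40000
>>>> udprodae01.sinks.avro-hadoopjt01-sink.request-timeout = 40000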
>>>>
>>>>
>>>> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <
>>>> hshreedharan@cloudera.com> wrote:
>>>>
>>>>  Looks like you are hitting Avro IPC timeouts - you should probably
>>>> increase them, especially if you are talking over a WAN.
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
>>>> On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
>>>>
>>>> I'm seeing the same thing :)
>>>>
>>>> Mine is all on a local LAN though, so the fact that an RPC call doesn't
>>>> reply within 10000ms or 20000ms is quite odd.  My configuration is for the
>>>> most part the same as Denis': a two-tiered system, with ExecSources running
>>>> tail -F on log files feeding an AvroSink, which sends to an AvroSource that
>>>> loads into HDFS on the back tier.
>>>>
>>>> I, too, see this on the AvroSink.
>>>>
>>>> Either (A):
>>>> [2013-04-15 23:57:14.827]
>>>> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
>>>> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>>>>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
>>>> event. Attempting next sink if available.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>>>         at
>>>> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>>>         at
>>>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>>         at java.lang.Thread.run(Thread.java:619)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>>>         ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after
>>>> 20000ms
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>>         ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>         at
>>>> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>>>>         at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>>>>         ... 5 more
>>>>
>>>> or (B):
>>>> [2013-04-15 19:49:01.135]
>>>> [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN]
>>>> [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] []
>>>>  (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume
>>>> event. Attempting next sink if available.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>>>         at
>>>> org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>>>         at
>>>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>>         at java.lang.Thread.run(Thread.java:619)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>>>         ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>>         ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>>>         ... 6 more
>>>>
>>>> The only thing I see on the AvroSource tier is the disconnect/reconnect
>>>> happening:
>>>>
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-10] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
>>>> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-10] []  (NettyServer.java:channelClosed:209) Connection to
>>>> /138.113.127.4:34481 disconnected.
>>>> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-10-thread-1] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
>>>> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND:
>>>> /138.113.127.72:10000
>>>> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO]
>>>> [pool-11-thread-13] []  (NettyServer.java:handleUpstream:171) [id:
>>>> 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED:
>>>> /138.113.127.4:62442
>>>>
>>>> Is there some way to determine exactly where this bottleneck is?  The
>>>> config options for the AvroSource/AvroSink are quite few, so there is not
>>>> much tuning to do.  Here is what I have:
>>>>
>>>> # avro-hadoopjt01-sink properties
>>>> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
>>>> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
>>>> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
>>>> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
>>>>
>>>> # avro-hadoopjt01-source properties
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
>>>>
>>>> I can try increasing the AvroSink timeout values, but they seem quite
>>>> adequate at the defaults.  Maybe more threads on the AvroSource?
>>>>
>>>> Any advice would be much appreciated!
>>>> Chris
>>>>
>>>>
>>>> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <denis.lowe@gmail.com> wrote:
>>>>
>>>> I noticed that the logs on the destination server were reporting
>>>> dropped connections from the upstream server:
>>>>
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>>>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>>>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171)  - [id:
>>>> 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
>>>> NettyServer$NettyServerAvroHandler.channelClosed:209)  - Connection to
>>>> /SOURCE_HOST:43599 disconnected.
>>>>
>>>> The other thing I observed is that these errors only ever occur from
>>>> our EU servers (connecting to our centralized downstream collectors in US
>>>> West).  We are running Flume in Amazon EC2.
>>>>
>>>> I can see in the log that the connection is restored quite quickly.
>>>>
>>>> I gather that the network latency between Europe and US West is causing
>>>> the connection between the 2 servers to 'appear' lost, thus resulting in
>>>> the above errors?
>>>>
>>>> Are there any recommended config settings to compensate for this?
>>>>
>>>> On 6 February 2013 00:21, Juhani Connolly <
>>>> juhani_connolly@cyberagent.co.jp> wrote:
>>>>
>>>>  Is there anything unusual in the logs for the destination (AvroSource)
>>>> server?
>>>>
>>>> Since the error is happening in the AvroSink, no data is getting lost.
>>>> The failed data will get rolled back: its removal from the local channel is
>>>> cancelled, and the sink will attempt to resend it.
>>>>
>>>>
>>>> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>>>>
>>>> We are running Flume-NG 1.3.1 and have noticed the following ERROR
>>>> occurring periodically (a few times daily):
>>>>
>>>>  We are using the FileChannel, connecting to 2 downstream collector
>>>> agents in 'round_robin' mode via Avro sources/sinks.
>>>>
>>>>  We are using the config described below to deliver 5 different log
>>>> types (to 5 different ports downstream) and have observed the below
>>>> error occurring randomly across all the ports.
>>>>
>>>>  We tried doubling the connect-timeout to 40000 (from the default of
>>>> 20000) with no success.
>>>> The agent appears to recover and keep on processing data.
>>>>
>>>>  My question is:
>>>> Has this data been lost, or will Flume eventually retry until a
>>>> successful delivery has been made?
>>>> Are there any other config changes I can make to prevent or reduce
>>>> this occurring in the future?
>>>>
>>>>  05 Feb 2013 23:12:21,650 ERROR
>>>> [SinkRunner-PollingRunner-DefaultSinkProcessor]
>>>> (org.apache.flume.SinkRunner$PollingRunner.run:160)  - Unable to deliver
>>>> event. Exception follows.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>>>>         at
>>>> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>>>>         at
>>>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: collector1, port: 4003 }: Failed to send batch
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>>         at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>>>>         ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient
>>>> { host: collector2, port: 4003 }: RPC request timed out
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>>         ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>         at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>>>         at
>>>> org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>>>         ... 6 more
>>>>
>>>>  Below is a snapshot of the current config:
>>>>
>>>>  agent.sources.eventdata.command = tail -qn +0 -F
>>>> /var/log/event-logs/live/eventdata.log
>>>>  agent.sources.eventdata.channels = eventdata-avro-channel
>>>> agent.sources.eventdata.batchSize = 10
>>>> agent.sources.eventdata.restart = true
>>>>
>>>>  ## event source interceptor
>>>> agent.sources.eventdata.interceptors.host-interceptor.type =
>>>> org.apache.flume.interceptor.HostInterceptor$Builder
>>>> agent.sources.eventdata.interceptors.host-interceptor.hostHeader =
>>>> source-host
>>>> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>>>>
>>>>  ## eventdata channel
>>>> agent.channels.eventdata-avro-channel.type = file
>>>> agent.channels.eventdata-avro-channel.checkpointDir =
>>>> /mnt/flume-ng/checkpoint/eventdata-avro-channel
>>>> agent.channels.eventdata-avro-channel.dataDirs =
>>>> /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
>>>> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
>>>> agent.channels.eventdata-avro-channel.capacity = 2000000
>>>> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
>>>> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
>>>> agent.channels.eventdata-avro-channel.keep-alive = 20
>>>> agent.channels.eventdata-avro-channel.write-timeout = 20
>>>>
>>>>  ## 2 x downstream click sinks for load balancing and failover
>>>>  agent.sinks.eventdata-avro-sink-1.type = avro
>>>> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
>>>> agent.sinks.eventdata-avro-sink-1.hostname = collector1
>>>> agent.sinks.eventdata-avro-sink-1.port = 4003
>>>> agent.sinks.eventdata-avro-sink-1.batch-size = 100
>>>> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>>>>
>>>>  agent.sinks.eventdata-avro-sink-2.type = avro
>>>> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
>>>> agent.sinks.eventdata-avro-sink-2.hostname = collector2
>>>> agent.sinks.eventdata-avro-sink-2.port = 4003
>>>> agent.sinks.eventdata-avro-sink-2.batch-size = 100
>>>> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>>>>
>>>>  ## load balance config
>>>> agent.sinkgroups = eventdata-avro-sink-group
>>>> agent.sinkgroups.eventdata-avro-sink-group.sinks =
>>>> eventdata-avro-sink-1 eventdata-avro-sink-2
>>>> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
>>>> agent.sinkgroups.eventdata-avro-sink-group.processor.selector =
>>>> round_robin
>>>>
>>>>  Denis.
>>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>>
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
