flume-user mailing list archives

From "Dan Everton" <...@iocaine.org>
Subject Re: Fwd: Possible Bug in AvroEventSource
Date Tue, 26 Jul 2011 01:03:22 GMT

On Mon, 25 Jul 2011 15:42 -0700, "Jonathan Hsieh" <jon@cloudera.com>
wrote:
> Dan,
> 
> Nice catch and I agree with you.  I'll file an issue to clean this up
> once
> we get the new issue tracker up.
> 
> Thanks,
> Jon.

Cool, it would be good to get that fixed so the Avro RPC is at parity with
the Thrift RPC. However, I think I've found the root cause of the
problems I was seeing when sending log messages, and it has nothing to do
with the Flume code.

We're currently testing Flume in a pre-production environment with about
20 Flume nodes. Our setup is fairly straightforward; basically:

srvX.example.com-serverLog-appname serverLogFlow thriftSource(31308) agentDFOSink("log1", 35853, batchCount=100, batchMillis=30000, compression=true)
srvX.example.com-accessLog-appname accessLogFlow tailDir("...", "...", true) agentDFOSink("log1", 35854, batchCount=100, batchMillis=30000, compression=true)

log1.example.com-serverLog-collector serverLogFlow collectorSource(35853) collectorSink("...", "...", 300000)
log1.example.com-accessLog-collector accessLogFlow collectorSource(35854) collectorSink("...", "...", 300000)

The applications use a custom log library (similar to the Log4j Avro
appender in Flume) to write to the local Flume node's thriftSource. This
is a blocking write which waits for the node to respond before letting
the application continue. This works fine and we've tested it up to 3000
events per second per node. The problem is that, as this is a test
environment, some applications are idle and don't write logs, so the
agent's connection to the collector carries no data for many hours. This
is when the problems start.
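For illustration, here's a rough sketch of the blocking-write pattern described above. This is not our actual log library or Flume's client code; the class name, timeouts, and the one-byte ack framing are all made up, but it shows the key point that the application thread blocks until the local node responds (and why a timeout on that wait matters):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical client sketch: a blocking write to the local node's
// source, bounded by connect and read timeouts so an application
// thread can't hang forever waiting for an ack.
public class BoundedLogClient {
    private static final int CONNECT_TIMEOUT_MS = 2000; // made-up value
    private static final int ACK_TIMEOUT_MS = 5000;     // made-up value

    public static void send(String host, int port, byte[] event) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
            // SO_TIMEOUT bounds the blocking read that waits for the ack.
            socket.setSoTimeout(ACK_TIMEOUT_MS);
            OutputStream out = socket.getOutputStream();
            out.write(event);
            out.flush();
            // Block (at most ACK_TIMEOUT_MS) until the node acknowledges.
            int ack = socket.getInputStream().read();
            if (ack == -1) {
                throw new IOException("node closed connection before acking");
            }
        }
    }
}
```

Without the SO_TIMEOUT, a thread stuck on that read is exactly how all the application threads ended up wedged in the scenario below.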

The firewall sees these idle connections and closes them without
informing either the agent or the collector that the connection has been
dropped. Now, when the application does start trying to write a log
message, it blocks waiting for the agent to forward the event to the
collector. Eventually the node's internal queue fills up because it
hasn't been able to send the events, and the application grinds to a
halt because all its threads are busy trying to write to the local
agent. The agent is still heartbeating away and appears healthy to the
master, but no log traffic comes through.

We've seen this issue before with long-lived JDBC connections, and we'll
fix it here the same way: by increasing the session timeout for Flume
traffic in the firewall. I'm not sure what a proper solution would be.
Setting SO_KEEPALIVE on the socket might help, but by default it only
probes the connection every two hours, which may not be often enough to
keep a firewall from considering the connection idle and killable.
Perhaps some sort of keepalive event could be sent over the
agentDFOSink link every so often, but that seems inelegant.
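For reference, enabling SO_KEEPALIVE from Java is a one-liner; the catch, as noted above, is that the probe interval is a kernel setting, not a per-socket one (on Linux it's net.ipv4.tcp_keepalive_time, default 7200 seconds), so it would need lowering system-wide to beat the firewall's idle timer. A minimal sketch:

```java
import java.net.Socket;
import java.net.SocketException;

// Sketch: turning on TCP keepalive for an agent-to-collector socket.
// The per-socket flag only enables probing; the probe interval comes
// from the kernel, e.g. on Linux:
//   sysctl -w net.ipv4.tcp_keepalive_time=600
// (value chosen for illustration; pick something below the firewall's
// idle timeout)
public class KeepAliveExample {
    public static void enableKeepAlive(Socket socket) throws SocketException {
        socket.setKeepAlive(true); // kernel now probes the idle connection
    }
}
```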

Anyway, just thought I'd write this up in case someone else runs into
it.

Cheers,
Dan
