flume-user mailing list archives

From Frank Grimes <frankgrime...@yahoo.com>
Subject Re: Collector node failing with java.net.SocketException: Too many open files
Date Mon, 30 Jan 2012 15:49:26 GMT
I think this bug might be addressed by making use of TCP keepalive on the Thrift server socket.
e.g.

Index: flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java
===================================================================
--- flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(revision 1237721)
+++ flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java	(working copy)
@@ -132,6 +132,7 @@
     }
     try {
       Socket result = serverSocket_.accept();
+      result.setKeepAlive(true);
       TSocket result2 = new TBufferedSocket(result);
       result2.setTimeout(clientTimeout_);
       return result2;

I believe that on Linux this would force dead connections to be closed/cleaned up after 2 hours by default (the kernel's default idle time before keepalive probing begins).
This is likely good enough to prevent the "java.net.SocketException: Too many open files" errors from occurring in our case.
Note that the timing is also configurable, as per http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive.
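
For reference, the Linux tunables behind that document are these sysctls (typical defaults shown; exact values vary by distribution):

    # TCP keepalive tunables (typical Linux defaults)
    net.ipv4.tcp_keepalive_time = 7200    # seconds idle before the first probe (2 hours)
    net.ipv4.tcp_keepalive_intvl = 75     # seconds between unanswered probes
    net.ipv4.tcp_keepalive_probes = 9     # unanswered probes before the kernel drops the connection

(One way to verify keepalive is actually enabled on the collector's sockets would be "netstat -to", which shows a keepalive timer counting down next to each established connection.)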

Shall I open up a JIRA case for this and submit a patch?
Should the keepalive be configurable, or is it desirable to always have the Flume collector protected from these kinds of killed connections?
I can't think of any downsides to always having it on...
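
If configurability turns out to be preferred, here's a rough sketch of one way to wire it up. To be clear, the "flume.thrift.keepalive" system property below is just a placeholder I invented, not an existing Flume option:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch only: gate SO_KEEPALIVE on a JVM system property so it can be
    // disabled without a code change. "flume.thrift.keepalive" is a
    // placeholder name, not an existing Flume configuration key.
    public class KeepAliveAccept {
      private static final boolean KEEP_ALIVE =
          Boolean.parseBoolean(System.getProperty("flume.thrift.keepalive", "true"));

      static Socket accept(ServerSocket serverSocket) throws IOException {
        Socket socket = serverSocket.accept();
        socket.setKeepAlive(KEEP_ALIVE); // ask the kernel to probe idle peers
        return socket;
      }
    }

Defaulting it to on would match the behavior of the patch above while leaving an escape hatch.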

Cheers,

Frank Grimes


On 2012-01-28, at 12:20 PM, Frank Grimes wrote:

> We believe that we've made some progress in identifying the problem.
> 
> It appears that we have a slow socket connection leak on the Collector node due to sparse data coming in on some Thrift RPC sources.
> Turns out we're going through a firewall, and we believe that it is killing those inactive connections.
> 
> The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open and they don't appear to ever time out and close.
> 
> I found the following which seems to describe the problem:
> 
>   http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3C1311642202.14311.2155844361@webmail.messagingengine.com%3E
> 
> However, because other disconnect conditions could presumably trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings.
> 
> Is there a way to configure the Collector node to drop/close these inactive connections?
> i.e. either at the Linux network layer or through Java socket APIs within Flume?
> 
> Thanks,
> 
> Frank Grimes
> 
> 
> On 2012-01-26, at 10:51 AM, Frank Grimes wrote:
> 
>> Hi All,
>> 
>> We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:
>> 
>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>> Caused by: java.net.SocketException: Too many open files
>>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>        ... 2 more
>> 
>> This quickly fills up the disk as the log file grows to multiple gigabytes in size.
>> 
>> After some investigation, it appears that even though each Agent node shows a single open connection to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
>> i.e.
>> "lsof -n | grep PORT" on the Agent node shows 1 established connection
>> However, the Collector node shows hundreds of established connections for that same
port which don't seem to tie up to any connections I can find on the Agent node.
>> 
>> So we're concluding that the Collector node is somehow leaking connections.
>> 
>> Has anyone seen this kind of thing before?
>> 
>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>> Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?
>> 
>> Any hints/tips are most welcome.
>> 
>> Thanks,
>> 
>> Frank Grimes
> 

