flume-user mailing list archives

From Thomas Vachon <vac...@sessionm.com>
Subject Re: Unexplainable Bug in Flume Collectors
Date Fri, 27 Jan 2012 17:56:07 GMT
The answer is that this is not a bug but user error. For anyone running 10 logical nodes (in
addition to the logical node named after the physical node) on a single physical node, make sure
you are not using the default port settings: the HTTP status page sits 11 ports above the start of
the source port range, so the collector source that lands on that port will fail.
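
(A minimal sketch of what explicit, non-default source ports look like in the Flume 0.9.x shell;
the node names, port numbers, and sink paths below are placeholders, not the actual config from
this thread:)

  # Pin each logical collector to an explicit source port, well clear of the node
  # status page, instead of letting the default allocation stack up and collide.
  exec config coreSiteHDFS-collector 'collectorSource(35880)' 'collectorSink("hdfs://namenode/flume/%Y%m%d/", "core-")'
  exec config coreSiteS3-collector 'collectorSource(35881)' 'collectorSink("s3n://my-bucket/flume/%Y%m%d/", "core-")'
  # ...and point the matching agent sink at the same explicit port:
  exec config coreSite-agent 'tail("/var/log/core_site.log")' 'agentDFOSink("collector.example.internal", 35880)'

Relocating the status page itself (flume.node.http.port in flume-site.xml, if I have the property
name right) would also avoid the collision, but explicit source ports keep the layout easier to
reason about.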


On Jan 27, 2012, at 10:56 AM, Thomas Vachon wrote:

> I did a packet capture that I was just able to analyze. I see the TCP handshake and some
> PSH,ACKs, and then the collector sends a RST. This is not a pattern I see in any other flow
> on that box. It is also observed on my 2nd collector. I have checked file limits/open
> descriptors; nothing is above its limit.
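
(For reference, a capture filter along these lines isolates the resets described above; the
interface name and port are assumptions to adjust for your setup:)

  # show only RST segments on the suspect source port (35862 per the log below)
  sudo tcpdump -i eth0 -nn 'tcp port 35862 and (tcp[tcpflags] & tcp-rst != 0)'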
> 
> 
> On Jan 27, 2012, at 10:29 AM, Thomas Vachon wrote:
> 
>> I have 10 logical collectors per collector node: 2 for each log file I monitor (one for
>> an HDFS sink, one for an S3 sink). I recently went from 8 to 10, and the 10th sink is
>> failing 100% of the time. On a node I see:
>> 
>> 2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
>> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
>> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
>> 2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
>> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
>> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 
>> 
>> On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just
>> fine. I see that it opens the s3n sink for the S3 flow, but no data is being ingested. On
>> the node I see "Broken pipe" errors; I suspect that is the problem, but I am unable to find
>> a way to fix it. I confirmed connectivity via telnet to the RPC source port.
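
(With hindsight, a quick check on the collector points straight at the root cause: if the "RPC"
port answers HTTP, it is the node status page rather than a Thrift collectorSource, and Thrift
clients will keep getting reset on it. The hostname and port are taken from the log above, but
treat them as placeholders:)

  # returning the Flume node status HTML means the port is the status page, not a source
  curl -s http://ip-10-194-66-32.ec2.internal:35862/ | head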
>> 
>> I have exhausted every possible measure to fix this. I have unmapped, reconfigured, and
>> remapped every node to rule out a bad node mapping. The master shows no errors, and 9 of
>> the 10 flows are working just as they should.
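
(For anyone following along, the unmap/remap cycle mentioned above in Flume 0.9.x shell terms;
the physical and logical node names are placeholders:)

  # detach the logical node from the physical host, then reattach it
  exec unmap collector01.example.internal coreSiteS3-collector
  exec map collector01.example.internal coreSiteS3-collector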
>> 
>> Does anyone have an idea?
>> 
>> --
>> Thomas Vachon
>> Principal Operations Architect
>> session M
>> vachon@sessionm.com
>> 
>> 
> 

