flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject Re: race conditions and WAL shutdown
Date Wed, 17 Aug 2011 08:56:28 GMT
Thanks I'll take a look and see how I get on. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16/08/2011, at 6:17 AM, Mingjie Lai wrote:

> 
> Saw the problem before. The race condition warning really confused me at the very beginning,
but it turned out to be a configure issue. Please take a look at these threads:
> 
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827
> 
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48
> 
> In short, you need to use collector wrapper for agentE2ESink.
> 
> Thanks,
> Mingjie
> 
> 
> On 08/14/2011 03:18 AM, aaron morton wrote:
>> I'm starting to play with Flume and running into some issues. I've found
>> a couple of tickets that talk about similar things, but wanted to check
>> if it was a problem on my side…
>> 
>> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>> 
>> start a master and bin/flume node_nowatch -n agen and bin/flume
>> node_nowatch -n coll
>> 
>> Then add the following config via the web portal for the master (am
>> using batch / unbatch as a test for aggregating data later)…
>> 
>> agen : text("/my/file") | agentSink("localhost",35853) ;
>> coll : collectorSource(35853) | batch(100) unbatch console ;
>> 
>> On the agen node it goes bang and shuts it's self down…
>> 
>> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
>> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
>> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
>> java.net.ConnectException: Connection refused
>> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
>> state OPENING
>> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
>> localhost:35853 opened
>> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
>> try 1
>> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:40:54,218 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:40:54,219 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 
>> I waited for a while, no retry. So killed both processes and started the
>> coll then the agen process. The lines from the source file were
>> delivered and this is what I see in the agen process…
>> 
>> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:52:43,605 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:52:43,608 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:43,952 INFO WALAckManager:92 - Ack for
>> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
>> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
>> state ACTIVE
>> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
>> making progress
>> 21:52:44,316 INFO WALAckManager:92 - Ack for
>> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
>> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 64093ms
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 64457ms
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60034ms
>> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 
>> I noticed nothing in the logs for coll.
>> 
>> 
>> My questions are:
>> 
>> 1) Am I going about this test the right way ?
>> 
>> 2) For the first case, should the agen process retry ? It sounds a
>> little like this https://issues.apache.org/jira/browse/FLUME-711
>> 
>> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
>> sound ? I'm guessing not because they are INFO.
>> 
>> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
>> ACTIVE but timed out in state OPENING" ?
>> 
>> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
>> checked lsof with the processes still running and saw
>> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
>> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>> 
>> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
>> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
>> https://issues.apache.org/jira/browse/FLUME-637
>> 
>> 7) "INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms" Went on as along as I left the agen process running, there
>> was nothing been displayed on the coll though. Is this also part of
>> FLUME-637 ?
>> 
>> Thanks for any help.
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 


Mime
View raw message