flume-user mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: Agent with Thrift rpcSource closing source after receiving new config from master?
Date Thu, 21 Jul 2011 12:51:48 GMT
Jesse,

Cloudera is packaging a 0.9.4-based release in deb and rpm form.  It has
some extra patches that will be part of a 0.9.5 Apache Flume release.

Jon.

On Thu, Jul 21, 2011 at 5:46 AM, Jesse Shieh <jesse@adku.com> wrote:

> Thanks!  By the way, do you know when 0.9.4 will make it into the ubuntu
> apt repo?
>
>
> On Thu, Jul 21, 2011 at 4:41 AM, Jonathan Hsieh <jon@cloudera.com> wrote:
>
>> [Please subscribe to new flume-user@incubator.apache.org list, bcc
>> flume-user@cloudera.org, cc flume-user@incubator.apache.org]
>>
>> Jesse,
>>
>> There have been a bunch that made it into v0.9.4.
>>
>> FLUME-597, FLUME-595, FLUME-589, FLUME-596 were also part of the patch
>> series related to the two you mentioned.
>>
>> Jon.
>>
>> On Sat, Jul 2, 2011 at 10:22 AM, Jesse Shieh <jesse@adku.com> wrote:
>>
>>> Hi Jon,
>>>
>>> I'm having the same problem and flume never seems to come back (waited
>>> 10 hours).  I found two lifecycle issues, but I don't know flume well
>>> enough to tell if they address the problem =(  Are these the lifecycle
>>> fixes you were referring to?  Are there others?
>>>
>>> https://issues.cloudera.org/browse/FLUME-569
>>> https://issues.cloudera.org/browse/FLUME-593
>>>
>>> Thanks!
>>> Jesse
>>>
>>>
>>>
>>> On Jun 20, 5:29 pm, Jonathan Hsieh <j...@cloudera.com> wrote:
>>> > Chris,
>>> >
>>> > I think this is because the agentSink is trying to flush itself and
>>> > doesn't return until it flushes or a timeout occurs. To guarantee that
>>> > the agent will finish, it shuts down the source side to prevent new data
>>> > from entering.  If the logical node hangs forever, this is a bug which
>>> > may have been addressed by some of the in-review and recently landed
>>> > lifecycle fix patches.
>>> >
>>> > If it eventually makes progress but is too slow, I think a
>>> > solution might be some combination of:
>>> >
>>> > 1) Having the agentSink shut down more abruptly and letting the agent
>>> > sink rely on recovery mechanisms to resend data.
>>> > 2) Making config change granularity finer by allowing users to just
>>> > change sinks (without changing sources), or possibly the ability to
>>> > dynamically add or swap out sinks.
>>> >
>>> > Do these sound reasonable?
>>> >
>>> > Jon.
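
The first option above (give up on a slow flush and rely on recovery to resend) can be illustrated with a minimal Java sketch. Everything here is hypothetical: `BoundedCloseSketch`, `FlushableSink`, and `closeWithTimeout` are illustration-only names, not Flume APIs, and this is not how agentSink is actually implemented.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; none of these names are Flume APIs. */
class BoundedCloseSketch {

    /** Stand-in for a sink whose flush may block for a long time. */
    interface FlushableSink {
        void flush() throws Exception;
    }

    /**
     * Tries to flush within timeoutMillis. Returns true if the flush
     * finished, false if it failed or was abandoned. In the abandoned
     * case the caller would rely on a recovery mechanism to resend data.
     */
    static boolean closeWithTimeout(FlushableSink sink, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Returning null makes this lambda a Callable, so flush() may throw.
        Future<?> flush = worker.submit(() -> {
            sink.flush();
            return null;
        });
        try {
            flush.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            flush.cancel(true); // interrupt the stuck flush
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        } catch (ExecutionException e) {
            return false; // the flush itself failed
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The point of the sketch is only that the close path has a bounded wait, so a reconfiguration cannot hang on a sink that never finishes flushing.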
>>> >
>>> > On Wed, Jun 15, 2011 at 3:34 AM, Christopher Lin <powertothepengu...@gmail.com> wrote:
>>> > > I'm finding that agent nodes with a Thrift rpcSource will, upon
>>> > > receiving a new configuration from the master, close the source server
>>> > > (leading to lots of Thrift TransportExceptions on the part of the
>>> > > sending application) and thereafter become unresponsive to any new
>>> > > configurations.
>>> >
>>> > > I've filed a bug at https://issues.cloudera.org/browse/FLUME-659 and
>>> > > have pasted my description and instructions to reproduce below in case
>>> > > anyone in this group can shed light on the issue.  Thank you!
>>> >
>>> > > You can reproduce this problem by following these steps:
>>> >
>>> > > Set up:
>>> >
>>> > >    Master
>>> > >    Agent: rpcSource(35092) | agent*(...) # agent*Sink and agent*Chain all have this problem
>>> > >    Collector: collectorSource(...) | collectorSink(...)
>>> >
>>> > > Start sending events to the agent using Thrift. Then use the flume
>>> > > shell on master to configure the agent – you can even use the exact
>>> > > same config as the agent had in the first place. Make sure the agent
>>> > > receives this configuration while still being sent events. After the
>>> > > agent receives its configuration, it will close its source server for
>>> > > some reason and thereafter become unresponsive to new configurations.
>>> > > This is the sample output from the agent logs:
>>> >
>>> > > 2011-06-15 07:29:04,086 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35853 closed
>>> > > 2011-06-15 07:29:05,088 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on port 35092...
>>> > > 2011-06-15 07:29:05,088 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has 4 elements ...
>>> >
>>> > > And of course, the fact that the server is closed results in lots of
>>> > > the following types of errors in the application that's sending
>>> > > events:
>>> >
>>> > > Thrift::TransportException: Broken pipe
>>> > > Thrift::TransportException: Could not connect to localhost:35092: Connection refused - connect(2)
>>> >
>>> > > Another variation to reproduce this type of error is to bring the
>>> > > master down, then bring it back up, at which point it will send its
>>> > > configuration to the agent node. Upon receiving the new configuration,
>>> > > the agent closes its source server and becomes unresponsive to new
>>> > > configurations. The following is output from an agent that was
>>> > > configured with two logical nodes, one that was rpcSource(35090) |
>>> > > agentE2EChain(...) and one that was rpcSource(35092) |
>>> > > agentBEChain(...)
>>> >
>>> > > 2011-06-15 05:37:46,731 INFO com.cloudera.flume.agent.ThriftMasterRPC: Connected to master at flume-master:35872
>>> > > 2011-06-15 05:37:51,770 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on port 35090...
>>> > > 2011-06-15 05:37:51,771 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has 0 elements ...
>>> > > 2011-06-15 05:37:51,787 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35853 closed
>>> > > 2011-06-15 05:37:51,868 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on port 35090...
>>> > > 2011-06-15 05:37:51,868 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has 0 elements ...
>>> > > 2011-06-15 05:37:51,868 WARN com.cloudera.flume.handlers.debug.LazyOpenDecorator: Closing a lazy sink that was not logically opened
>>> > > 2011-06-15 05:37:51,868 INFO com.cloudera.flume.agent.LogicalNode: flume-agent: Connector stopped: LazyOpenSource | LazyOpenDecorator
>>> > > 2011-06-15 05:37:51,875 INFO com.cloudera.flume.agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@42143753
>>> > > 2011-06-15 05:37:51,880 INFO com.cloudera.flume.agent.LogicalNode: Connector started: LazyOpenSource | LazyOpenDecorator
>>> > > 2011-06-15 05:37:51,881 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Starting blocking thread pool server on port 35090...
>>> > > 2011-06-15 05:37:52,788 INFO com.cloudera.flume.agent.ThriftEventSource: Closed server on port 35092...
>>> > > 2011-06-15 05:37:52,788 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has 6 elements ...
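
In the excerpt above, port 35090 is closed and later restarted ("Starting blocking thread pool server"), while port 35092 is closed with no later start line. A rough Java sketch for spotting that pattern in an agent log is shown here. It assumes each log entry sits on a single line as the agent would have written it, and it matches only the two message texts that appear in this thread. `ClosedSourceScan` and `portsLeftClosed` are hypothetical names, not part of Flume.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading agent logs; not part of Flume. */
class ClosedSourceScan {
    // Message texts copied from the log excerpts in this thread.
    private static final Pattern CLOSED = Pattern.compile(
        "ThriftEventSource: Closed server on port (\\d+)");
    private static final Pattern STARTED = Pattern.compile(
        "ThriftEventSource: Starting blocking thread pool server on port (\\d+)");

    /** Returns ports that were closed and have no later start line. */
    static Set<Integer> portsLeftClosed(List<String> logLines) {
        Set<Integer> closed = new TreeSet<>();
        for (String line : logLines) {
            Matcher m = CLOSED.matcher(line);
            if (m.find()) {
                closed.add(Integer.parseInt(m.group(1)));
                continue;
            }
            m = STARTED.matcher(line);
            if (m.find()) {
                closed.remove(Integer.parseInt(m.group(1)));
            }
        }
        return closed;
    }
}
```

A port in the returned set only means the log shows a close with no later restart; it does not by itself prove the logical node is hung.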
>>> >
>>> > > I once produced an exception using this master-down/master-up
>>> > > procedure:
>>> >
>>> > > 2011-06-15 04:50:45,543 ERROR com.cloudera.flume.core.connector.DirectDriver: Driving src/sink failed! LazyOpenSource | LazyOpenDecorator because NaiveFileWALDeco not open for append
>>> > > java.lang.IllegalStateException: NaiveFileWALDeco not open for append
>>> > > at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>>> > > at com.cloudera.flume.agent.durability.NaiveFileWALDeco.append(NaiveFileWALDeco.java:133)
>>> > > at com.cloudera.flume.core.CompositeSink.append(CompositeSink.java:61)
>>> > > at com.cloudera.flume.agent.AgentFailChainSink.append(AgentFailChainSink.java:103)
>>> > > at com.cloudera.flume.core.EventSinkDecorator.append(EventSinkDecorator.java:60)
>>> > > at com.cloudera.flume.handlers.debug.LazyOpenDecorator.append(LazyOpenDecorator.java:75)
>>> > > at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:93)
>>> > > 2011-06-15 04:50:45,544 INFO com.cloudera.flume.agent.LogicalNode: Connector flume-node exited with error NaiveFileWALDeco not open for append
>>> > > 2011-06-15 04:50:46,544 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on port 35090...
>>> > > 2011-06-15 04:50:46,545 INFO com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has 6 elements ...
>>> > > 2011-06-15 04:50:50,443 INFO com.cloudera.flume.agent.AgentFailChainSink: Setting e2e failover chain to { ackedWriteAhead => { stubbornAppend => { insistentOpen => failChain(" %s ","tsink(\"collector1\",35853)","tsink(\"collector2\",35853)") } } }
>>> > > 2011-06-15 04:50:50,443 INFO com.cloudera.flume.agent.AgentFailChainSink: Setting failover chain to { ackedWriteAhead => { stubbornAppend => { insistentOpen => failChain(" %s ","tsink(\"collector2\",35853)","tsink(\"collector2\",35853)") } } }
>>> >
>>> > --
>>> > // Jonathan Hsieh (shay)
>>> > // Software Engineer, Cloudera
>>> > // j...@cloudera.com
>>>
>>
>>
>>
>> --
>> // Jonathan Hsieh (shay)
>> // Software Engineer, Cloudera
>> // jon@cloudera.com
>>
>>
>>
>
>
> --
> Jesse Shieh | Co-Founder | Adku | www.adku.com | c: 213-537-7379
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
