flume-user mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: Agent with Thrift rpcSource closing source after receiving new config from master?
Date Thu, 21 Jul 2011 11:41:50 GMT
[Please subscribe to new flume-user@incubator.apache.org list, bcc
flume-user@cloudera.org, cc flume-user@incubator.apache.org]

Jesse,

A number of lifecycle fixes made it into v0.9.4.

FLUME-597, FLUME-595, FLUME-589, FLUME-596 were also part of the patch
series related to the two you mentioned.

Jon.

On Sat, Jul 2, 2011 at 10:22 AM, Jesse Shieh <jesse@adku.com> wrote:

> Hi Jon,
>
> I'm having the same problem and flume never seems to come back (waited
> 10 hours).  I found two lifecycle issues, but I don't know flume well
> enough to tell if they address the problem =(  Are these the lifecycle
> fixes you were referring to?  Are there others?
>
> https://issues.cloudera.org/browse/FLUME-569
> https://issues.cloudera.org/browse/FLUME-593
>
> Thanks!
> Jesse
>
>
>
> On Jun 20, 5:29 pm, Jonathan Hsieh <j...@cloudera.com> wrote:
> > Chris,
> >
> > I think this is because the agentSink is trying to flush itself and doesn't
> > return until it flushes or a timeout occurs. To guarantee that the agent
> > will finish, it shuts down the source side to prevent new data from
> > entering.  If the logical node hangs forever, this is a bug which may have
> > been addressed by some of the in-review and recently landed lifecycle fix
> > patches.
> >
> > If it eventually makes progress but is too slow, I think a
> > solution might be some combination of:
> >
> > 1) Having the agentSink shut down more abruptly and letting it rely on
> > recovery mechanisms to resend data.
> > 2) Making config change granularity finer by allowing users to change
> > just the sinks (without changing sources), or possibly adding the ability
> > to dynamically add or swap out sinks.
> >
> > Do these sound reasonable?
> >
> > Jon.
> >
> > On Wed, Jun 15, 2011 at 3:34 AM, Christopher Lin <powertothepengu...@gmail.com> wrote:
> > > I'm finding that agent nodes with a Thrift rpcSource will, upon
> > > receiving a new configuration from the master, close the source server
> > > (leading to lots of Thrift TransportExceptions on the part of the
> > > sending application) and thereafter become unresponsive to any new
> > > configurations.
> >
> > > I've filed a bug at https://issues.cloudera.org/browse/FLUME-659 and
> > > have pasted my description and instructions to reproduce below in case
> > > anyone in this group can shed light on the issue.  Thank you!
> >
> > > You can reproduce this problem by following these steps:
> >
> > > Set up:
> >
> > >    Master
> > >    Agent: rpcSource(35092) | agent*(...) # agent*Sink and agent*Chain all have this problem
> > >    Collector: collectorSource(...) | collectorSink(...)
> >
> > > Start sending events to the agent using Thrift. Then use the flume
> > > shell on master to configure the agent – you can even use the exact
> > > same config as the agent had in the first place. Make sure the agent
> > > receives this configuration while still being sent events. After the
> > > agent receives its configuration, it will close its source server for
> > > some reason and thereafter become unresponsive to new configurations.
> > > This is the sample output from the agent logs:
> >
> > > 2011-06-15 07:29:04,086 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on
> > > port 35853 closed
> > > 2011-06-15 07:29:05,088 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on
> > > port 35092...
> > > 2011-06-15 07:29:05,088 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has
> > > 4 elements ...
> >
> > > And of course, the fact that the server is closed results in lots of
> > > the following types of errors in the application that's sending
> > > events:
> >
> > > Thrift::TransportException: Broken pipe
> > > Thrift::TransportException: Could not connect to localhost:35092:
> > > Connection refused - connect(2)
> >
> > > Another variation to reproduce this type of error is to bring the
> > > master down, then bring it back up, at which point it will send its
> > > configuration to the agent node. Upon receiving the new configuration,
> > > the agent closes its source server and becomes unresponsive to new
> > > configurations. The following is output from an agent that was
> > > configured with two logical nodes, one that was rpcSource(35090) |
> > > agentE2EChain(...) and one that was rpcSource(35092) |
> > > agentBEChain(...)
> >
> > > 2011-06-15 05:37:46,731 INFO com.cloudera.flume.agent.ThriftMasterRPC:
> > > Connected to master at flume-master:35872
> > > 2011-06-15 05:37:51,770 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on
> > > port 35090...
> > > 2011-06-15 05:37:51,771 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has
> > > 0 elements ...
> > > 2011-06-15 05:37:51,787 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on
> > > port 35853 closed
> > > 2011-06-15 05:37:51,868 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on
> > > port 35090...
> > > 2011-06-15 05:37:51,868 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has
> > > 0 elements ...
> > > 2011-06-15 05:37:51,868 WARN
> > > com.cloudera.flume.handlers.debug.LazyOpenDecorator: Closing a lazy
> > > sink that was not logically opened
> > > 2011-06-15 05:37:51,868 INFO com.cloudera.flume.agent.LogicalNode:
> > > flume-agent: Connector stopped: LazyOpenSource | LazyOpenDecorator
> > > 2011-06-15 05:37:51,875 INFO com.cloudera.flume.agent.LogicalNode:
> > > Node config successfully set to
> > > com.cloudera.flume.conf.FlumeConfigData@42143753
> > > 2011-06-15 05:37:51,880 INFO com.cloudera.flume.agent.LogicalNode:
> > > Connector started: LazyOpenSource | LazyOpenDecorator
> > > 2011-06-15 05:37:51,881 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Starting
> > > blocking thread pool server on port 35090...
> > > 2011-06-15 05:37:52,788 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on
> > > port 35092...
> > > 2011-06-15 05:37:52,788 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has
> > > 6 elements ...
> >
> > > I once produced an exception using this master-down/master-up
> > > procedure:
> >
> > > 2011-06-15 04:50:45,543 ERROR
> > > com.cloudera.flume.core.connector.DirectDriver: Driving src/sink
> > > failed! LazyOpenSource | LazyOpenDecorator because NaiveFileWALDeco
> > > not open for append
> > > java.lang.IllegalStateException: NaiveFileWALDeco not open for append
> > >     at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
> > >     at com.cloudera.flume.agent.durability.NaiveFileWALDeco.append(NaiveFileWALDeco.java:133)
> > >     at com.cloudera.flume.core.CompositeSink.append(CompositeSink.java:61)
> > >     at com.cloudera.flume.agent.AgentFailChainSink.append(AgentFailChainSink.java:103)
> > >     at com.cloudera.flume.core.EventSinkDecorator.append(EventSinkDecorator.java:60)
> > >     at com.cloudera.flume.handlers.debug.LazyOpenDecorator.append(LazyOpenDecorator.java:75)
> > >     at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:93)
> > > 2011-06-15 04:50:45,544 INFO com.cloudera.flume.agent.LogicalNode:
> > > Connector flume-node exited with error NaiveFileWALDeco not open for
> > > append
> > > 2011-06-15 04:50:46,544 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Closed server on
> > > port 35090...
> > > 2011-06-15 04:50:46,545 INFO
> > > com.cloudera.flume.handlers.thrift.ThriftEventSource: Queue still has
> > > 6 elements ...
> > > 2011-06-15 04:50:50,443 INFO
> > > com.cloudera.flume.agent.AgentFailChainSink: Setting e2e failover
> > > chain to { ackedWriteAhead => { stubbornAppend => { insistentOpen =>
> > > failChain(" %s ","tsink(\"collector1\",35853)","tsink(\"collector2\",
> > > 35853)") } } }
> > > 2011-06-15 04:50:50,443 INFO
> > > com.cloudera.flume.agent.AgentFailChainSink: Setting failover chain to
> > > { ackedWriteAhead => { stubbornAppend => { insistentOpen =>
> > > failChain(" %s ","tsink(\"collector2\",35853)","tsink(\"collector2\",
> > > 35853)") } } }
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // j...@cloudera.com
>
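
For anyone following the quoted thread above, here is a rough sketch of the shutdown ordering it describes: stop the source side first, then flush what is queued until either the queue is empty or a timeout passes, and leave the rest to recovery. This is not Flume code. BoundedDrain, offer and close are hypothetical names made up for illustration, and the real agentSink lifecycle may well differ.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical illustration only; none of these names are Flume classes. */
public class BoundedDrain {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private volatile boolean accepting = true;

    /** Source side: accepts an event unless shutdown has already begun. */
    public boolean offer(String event) {
        return accepting && queue.offer(event);
    }

    /**
     * Stops intake, then hands queued events to the sink until the queue is
     * empty or the deadline passes. The deadline is only checked between
     * events, so a sink call that blocks forever would still hang here.
     *
     * @return the number of events left undelivered, which a recovery
     *         mechanism (assumed to exist elsewhere) would have to resend
     */
    public int close(Consumer<String> sink, long timeoutMillis) {
        accepting = false; // shut the source side first so no new data enters
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        String event;
        while (deadline - System.nanoTime() > 0 && (event = queue.poll()) != null) {
            sink.accept(event);
        }
        return queue.size();
    }
}
```

A shorter timeout corresponds to the "shut down more abruptly" option: close returns sooner, and more events are left for recovery to resend.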



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
