flume-user mailing list archives

From Jay Stricks <...@wapolabs.com>
Subject Re: Distributed Deployment Questions
Date Sat, 03 Mar 2012 21:56:29 GMT
Just wanted to say that I added a randomized 0-200 second sleep in my
flume-daemon shell script, which I use to start the Flume service on the
agents when new servers are launched.  I expected it to help with the
master crashing, but it has happened again since then, though much less
frequently.
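
A minimal sketch of that kind of startup delay, for illustration only. The
helper name jitter_seconds is hypothetical, and reading from /dev/urandom is
an assumption chosen so that servers launched in the same second do not all
pick the same delay; it is not the actual flume-daemon script.

```shell
#!/bin/sh
# Sketch: pick a random delay so that agents launched together do not
# all submit their configuration to the Flume master at the same time.

# Print a random integer in the range [0, $1].
# Assumes /dev/urandom and od are available on the host.
jitter_seconds() {
  # Read 2 random bytes as one unsigned integer (0-65535).
  n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  # Reduce to the requested range; the slight modulo bias is harmless here.
  echo $(( n % ($1 + 1) ))
}
```

In a startup script this would be called as sleep "$(jitter_seconds 200)"
just before the existing command that starts the Flume node.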

On Sat, Mar 3, 2012 at 4:36 PM, Jay Stricks <jay@wapolabs.com> wrote:

> Really, really appreciate the help, Alex.
>
> 1.  Max open files for root is at 65000 for all five collectors, but I'm
> not sure what you want me to check with respect to network latency.  I
> actually don't have any partitions marked as swap on these machines, as far
> as I can tell with a 'swapon -s' command. So I'll look into making that
> possible. We do have 7.5 GB of RAM, but I'm not providing any UOPTS like Xms
> or Xmx when I start Flume on the collectors.
>
> The error that I indicated in the first post was occurring for only one of
> the three flows going through the collectors. The translated configs for
> that flow are:
>
> *Collector*
> rpcSource( 35853 )
> collectorSink( "s3n://flume-data/events/month=%Y-%m/dt=%Y-%m-%d/hr=%k",
> "ngn-app-events-" )
>
> *Agent Type 1* (8 servers)
> Source: tail( "/var/log/httpd/access_log", "true" )
> Source: syslogUdp( 5141 )
> Source: syslogUdp( 5140 )
>
> *Agent Type 2* (40 servers)
> Source: tail( "/var/log/httpd/access_log", "true" )
> Source: syslogUdp( 5140 )
>
> Sinks for all of these are autoE2EChain, each with a different value()
> decorator. Would it help to spread these over different flows?
>
> 3. Do I tune the max open connections setting in flume-site.xml? I assume
> I should change maxClientCnxns, right? I wonder if globalOutstandingLimit
> would also help. (Found these at
> http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html#sc_minimumConfiguration
> ).
>
> Thanks again for the advice. I'm sure working through this is helping
> other people too!
>
> Jay
>
>
>
>
> On Sat, Mar 3, 2012 at 2:36 AM, alo alt <wget.null@googlemail.com> wrote:
>
>> Hey Jay,
>>
>> 1. Please check max open files, network latency, and swap. An example of
>> your sinks or flows would also be useful.
>>
>> 2. Here it could be that S3 nodes fall behind and you're hitting
>> different servers on S3
>>
>> 3. Flume master uses ZooKeeper, where you can tune the max open
>> connections. In fact, when you use that feature, deploy one agent, sleep
>> 3, then the next, and so on. That's one of the reasons Flume has the 3 sec
>> sleep timer in the sysinit restart scripts.
>>
>> best,
>>  Alex
>>
>> --
>> Alexander Lorenz
>> http://mapredit.blogspot.com
>>
>> On Mar 2, 2012, at 11:45 PM, Jay Stricks wrote:
>>
>> > Hey folks,
>> >
>> > I'm looking for some advice on a couple of issues I'm having. My setup
>> is Flume v.094--cdh3u2, single master, six collectors (three flows, all
>> autoCollectorSource), ~80 agents (three flows, autoE2E).
>> >
>> > 1. I have begun to have collectors fail with "ERROR
>> connector.DirectDriver: Exiting driver logicalNode <node_name> in error
>> state ThriftEventSource | Collector because null", which looks very similar
>> to the issue addressed in FLUME-757 (
>> https://issues.apache.org/jira/browse/FLUME-757).  Any update/advice on
>> how to address this? Is it an issue of limiting the size of the files being
>> transmitted to the collectors, or the frequency of transmission? This never
>> happened on 093, and it's a little concerning to see after upgrading.
>> >
>> > 2. I'm constantly getting "WARN httpclient.RestS3Service: Response
>> '/events%2Fmonth%3D2012-03%2Fdt%3D2012-03-01%2Fhr%3D16%2Fngn-app-events-20120302-154220414-0500.80204348000864.00000026.tmp'
>> - Unexpected response code 404, expected 200", even though the data is
>> being written to S3. I know this has been brought up before, but is there
>> any advice on when to determine if it's a valid error?
>> >
>> > 3. My agents are on machines that are launched and terminated somewhat
>> frequently due to maintenance, etc.  I have the user data scripts set up so
>> that each agent server, upon being launched, starts a Flume shell, connects
>> to the master, and executes its own configuration commands.  Often, my
>> master will fail when too many agent configurations are being submitted.
>> The number of threads grows exponentially at these times, and then the master fails.
>> I'm curious if anyone else experiences this over-concurrency problem, or
>> how you would recommend avoiding it. Any ideas for how to have the master
>> 'notice' a new agent and execute its configuration itself, which seems like
>> it would be an effective rate limiter, so to speak?
>> >
>> > Thanks a ton for the help!
>> >
>> > Jay S.
>> >
>> >
>>
>>
>
