flume-user mailing list archives

From Paul Chavez <pcha...@verticalsearchworks.com>
Subject RE: logs jams in flume collector
Date Tue, 05 Nov 2013 04:59:21 GMT
You can also add another HDFS sink on the collector. Make sure to give it a different file
prefix and bind it to the same channel as the existing sink. You won't need a sink group for
this as both sinks will pull from the same channel. I don't need 2 HDFS writers on my collectors
under normal use but it helps when there's been a backlog for some reason.
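
A minimal sketch of that setup, in case it helps. The agent name (collector), channel name (ch1), sink names (hdfs1, hdfs2), path and prefixes are placeholders I made up for illustration, not taken from the attached configuration; the channel definition and the other existing settings are left out.

```properties
# Hypothetical names: agent "collector", channel "ch1", sinks "hdfs1" and "hdfs2".
collector.sinks = hdfs1 hdfs2

# Existing sink.
collector.sinks.hdfs1.type = hdfs
collector.sinks.hdfs1.channel = ch1
collector.sinks.hdfs1.hdfs.path = /flume/events
collector.sinks.hdfs1.hdfs.filePrefix = events-a

# Second sink: bound to the same channel, with a different file prefix
# so the two writers never try to create the same file name.
collector.sinks.hdfs2.type = hdfs
collector.sinks.hdfs2.channel = ch1
collector.sinks.hdfs2.hdfs.path = /flume/events
collector.sinks.hdfs2.hdfs.filePrefix = events-b
```

No sink group is declared here; each sink runs independently and takes its own batches from ch1.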

From: Shangan Chen [mailto:chenshangan521@gmail.com]
Sent: Monday, November 04, 2013 8:48 PM
To: user@flume.apache.org
Subject: Re: logs jams in flume collector

Our deployment has two parts (flume-agent and flume-collector): a large number of flume-agents
collect logs and send them to several flume-collectors. There is no problem with the
flume-agents, as they can send logs as fast as they are generated. But when the collectors receive the
logs, the events always get stuck in the channel because the hdfs-sink cannot write fast enough. So the
problem we face now is how to increase the write speed to HDFS. Our flume-collector configuration
is attached. Thanks

Several things we've tried:
    increasing the number of flume-collectors
    increasing the channel size and transaction size
    increasing the HDFS batch size

On Tue, Nov 5, 2013 at 6:27 AM, Paul Chavez <pchavez@verticalsearchworks.com> wrote:
What do you mean by 'log jam'? Do you mean events are stuck in the channel and all processing
stops, or just that events are moving slower than you'd like?

If it's just going slowly I would start by graphing channel sizes, and event put/take rates
for your sinks. This will show you which sink might need to be sped up, either by having multiple
sinks drain the same channel, tweaking batch sizes or moving any filechannels to dedicated

If it's events getting stuck in the channel due to missing headers or corrupt data, I would
use interceptors to ensure the necessary headers are applied. For instance, I use a couple
of 'category' headers to route events in downstream agents and on the initial source have a
static interceptor that puts in the proper header with the value 'missing' if the header doesn't
exist from the app. Then I can ensure delivery and also have a bucket in HDFS that I can monitor
to ensure no events are getting lost.
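
Roughly, the interceptor part could look like the sketch below. The agent, source and sink names (agent, src1, collector, hdfs1), the header name (category) and the HDFS path are illustrative assumptions, not my actual configuration.

```properties
# Hypothetical names: agent "agent", source "src1", header "category".
agent.sources.src1.interceptors = cat
agent.sources.src1.interceptors.cat.type = static
agent.sources.src1.interceptors.cat.key = category
agent.sources.src1.interceptors.cat.value = missing
# Keep the header when the app already set it; only fill it in when absent.
agent.sources.src1.interceptors.cat.preserveExisting = true

# On a downstream agent (hypothetically "collector"), the header can select
# the HDFS bucket, so events without a category land under /flume/missing.
collector.sinks.hdfs1.hdfs.path = /flume/%{category}
```

With something like this, a growing /flume/missing directory is the signal that an app is sending events without the header.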

As for your nightly processing, if you use Oozie to trigger workflows you can set dataset
dependencies to prevent things from running until the data is ready. I have hourly workflows
that run this way: they don't trigger until the current partition exists, and then they process
the previous partition.

Good luck,
Paul Chavez

From: chenchun [mailto:chenchun.feed@gmail.com]
Sent: Monday, November 04, 2013 3:35 AM
To: user@flume.apache.org
Subject: logs jams in flume collector

Hi, we are using Flume to transfer logs to HDFS. We find that lots of logs jam in the Flume collector.
If the generated logs can't be written to HDFS by midnight, our daily report will not be
calculated in time. Any suggestions for identifying the bottlenecks in writing to HDFS?


have a good day!
