What do you mean by ‘log jam’? Do you mean events are stuck in the channel and all processing stops, or just that events are moving slower than you’d like?
If it’s just going slowly, I would start by graphing channel sizes and event put/take rates for your sinks. This will show you which sink needs to be sped up, whether by having multiple sinks drain the same channel, tweaking batch sizes, or moving any file channels to dedicated disks.
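As a rough sketch of what to graph, assuming the agent was started with Flume's HTTP JSON monitoring enabled (-Dflume.monitoring.type=http plus a port of your choosing): the /metrics endpoint returns counters per component, and the difference between two snapshots gives you put and take rates per channel. The host, port and function names below are placeholders of mine, not anything from your setup.

```python
import json
import urllib.request
from typing import Dict, Mapping


def fetch_metrics(url: str) -> dict:
    """Fetch one metrics snapshot. The URL is an assumption, for example
    http://collector-host:34545/metrics when HTTP monitoring is enabled."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def channel_rates(before: Mapping, after: Mapping,
                  seconds: float) -> Dict[str, Dict[str, float]]:
    """Compute per-channel put/take rates (events per second) between two
    snapshots taken `seconds` apart. Flume reports counters as strings."""
    rates = {}
    for name, now in after.items():
        # Only channel components, and only those present in both snapshots.
        if not name.startswith("CHANNEL.") or name not in before:
            continue
        prev = before[name]
        put = (int(now["EventPutSuccessCount"])
               - int(prev["EventPutSuccessCount"])) / seconds
        take = (int(now["EventTakeSuccessCount"])
                - int(prev["EventTakeSuccessCount"])) / seconds
        rates[name] = {
            "put_per_s": put,
            "take_per_s": take,
            # Positive means the channel is filling faster than it drains.
            "backlog_growth_per_s": put - take,
            "size": int(now["ChannelSize"]),
        }
    return rates
```

A channel whose backlog_growth_per_s stays positive is the one whose sink can't keep up, and that is where extra sinks or larger batches are worth trying first.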
If it’s events getting stuck in the channel due to missing headers or corrupt data, I would use interceptors to ensure the necessary headers are applied. For instance, I use a couple of ‘category’ headers to route events in downstream agents, and on the initial source I have a static interceptor that sets the header to the value ‘missing’ if the app didn’t supply it. That way I can ensure delivery and also have a bucket in HDFS that I can monitor to make sure no events are getting lost.
As for your nightly processing, if you use Oozie to trigger workflows you can set dataset dependencies to prevent things from running until the data is ready. I have hourly workflows that run this way: they don’t trigger until the current partition exists, and then they process the previous partition.
From: chenchun [mailto:firstname.lastname@example.org]
Sent: Monday, November 04, 2013 3:35 AM
Subject: logs jams in flume collector
Hi, we are using Flume to transfer logs to HDFS. We see a lot of log jams in the Flume collector. If the generated logs can't be written to HDFS by midnight, our daily report will not be calculated in time. Any suggestions for identifying the bottlenecks in writing to HDFS?