flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: File Channel Best Practice
Date Wed, 18 Dec 2013 19:23:33 GMT
Hi Devin,

Please find my response below.

On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:

> So, if I understand your position on sizing the source properly, you are
> saying that the "fsync" operation is the costly part - it locks the device
> it is flushing to until the operation completes, and takes some time, so if
> you are committing small batches to the channel frequently, you are
> monopolizing the device frequently

Correct; when using the file channel, small batches mean that most of the
time is spent actually performing fsyncs.

> , but if you set the batch size at the source large enough,

The language here is troublesome because "source" is overloaded. The term
"source" could refer to the flume source or to the "source of events" in a
tiered architecture. Additionally, some flume sources cannot control batch
size (Avro source, HTTP source, syslog) and some have a batch size plus a
configured timeout (exec source), which still results in small batches most
of the time.

When using the file channel, the upstream "source" should send large
batches of events. This might be the source connected directly to the file
channel, or, in a tiered architecture, say n application servers each
running a local agent which uses a memory channel and forwards events to a
"collector" tier which uses the file channel. In either case the upstream
"sources" should use a large batch size.
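To make the collector tier concrete, here is a minimal sketch of such an
agent (agent name, ports, paths, and sizes are all illustrative
assumptions, not recommendations):

```properties
# Hypothetical collector-tier agent: Avro source -> file channel -> HDFS sink.
collector.sources = avroSrc
collector.channels = fileCh
collector.sinks = hdfsSink

collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4141
collector.sources.avroSrc.channels = fileCh

# The file channel fsyncs its data directories on each commit, so the
# disks named here (not HDFS disks) are where batch size matters.
collector.channels.fileCh.type = file
collector.channels.fileCh.checkpointDir = /flume/checkpoint
collector.channels.fileCh.dataDirs = /flume/data

collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = fileCh
collector.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
collector.sinks.hdfsSink.hdfs.batchSize = 10000
```

Note that the batch size seen by the file channel is set upstream, at the
sinks that feed this agent, not in the Avro source itself.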

> you will "take" from the source less frequently, with more data committed
> in every operation.

The concept here is correct - larger batch sizes result in a larger number
of I/Os per fsync, thus increasing the throughput of the system.
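A back-of-the-envelope sketch of why this matters, assuming (purely for
illustration) a fixed per-fsync cost and ignoring the write time itself:

```python
# Hypothetical illustration: if each fsync costs a fixed ~10 ms no matter
# how many events were written before it, throughput scales linearly with
# batch size. Real fsync costs vary by disk and workload.
FSYNC_MS = 10  # assumed per-fsync cost in milliseconds

def events_per_second(batch_size, fsync_ms=FSYNC_MS):
    # One fsync per committed batch, so batches per second = 1000 / fsync_ms.
    return batch_size * (1000 / fsync_ms)

print(events_per_second(1))     # 100.0  events/s with single-event batches
print(events_per_second(1000))  # 100000.0 events/s with 1000-event batches
```

The exact numbers are made up; the point is the linear relationship between
batch size and throughput when fsync dominates.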

> Reading goes much faster, and HDFS will manage disk scheduling through
> RecordWriter in the HDFS sink, so those are not as problematic - is that
> accurate?

Just to level set for anyone reading this, File Channel doesn't use HDFS,
HDFS is not aware of File Channel, and the disks we are referring to are
disks used by the File Channel not HDFS.

> So, if you are using a syslog source, that doesn't really offer a batch
> size parameter, would you set up a tiered flow with an Avro hop in the
> middle to aggregate log streams?

Yes, that is a common and recommended configuration. Large setups will have
a local agent using memory channel, a first tier using memory channel and
then a second tier using file channel.

> Something like syslog source>--memory channel-->Avro sink > Avro source
> (large batch) >--file channel-->HDFS sink(s) for example?

Avro Source doesn't have a batch size parameter; here you need to set a
large batch size at the Avro Sink layer.
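For example, a local agent along the lines you describe might look like
this minimal sketch (hostnames, ports, and sizes are illustrative
assumptions):

```properties
# Hypothetical local agent: syslog source -> memory channel -> Avro sink.
# The large batch-size on the Avro *sink* is what produces large commits
# at the downstream file channel.
local.sources = syslogSrc
local.channels = memCh
local.sinks = avroSink

local.sources.syslogSrc.type = syslogtcp
local.sources.syslogSrc.host = 0.0.0.0
local.sources.syslogSrc.port = 5140
local.sources.syslogSrc.channels = memCh

# The channel's transactionCapacity must be at least the sink batch size.
local.channels.memCh.type = memory
local.channels.memCh.capacity = 100000
local.channels.memCh.transactionCapacity = 10000

local.sinks.avroSink.type = avro
local.sinks.avroSink.channel = memCh
local.sinks.avroSink.hostname = collector.example.com
local.sinks.avroSink.port = 4141
local.sinks.avroSink.batch-size = 10000
```

The memory channel at this tier absorbs the small syslog events, and the
Avro sink drains them in large batches toward the collector tier.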

> I appreciate the help you've given on this topic. It's also good to know
> that the best practices are going into the doc, that will push everything
> forward. I've read the Packt publishing book on Flume but it didn't get
> into as much detail as I would like. The Cloudera blogs have been really
> helpful too.
> Thanks so much!

No problem!  Thank you for using our software!

