flume-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: File Channel Best Practice
Date Wed, 18 Dec 2013 20:02:13 GMT
Yes, excellent - I was a little muddy on some of the finer points, and I am
glad you clarified for the sake of other mailing list users - I forgot I
have the whole context in my head, but other readers might not.

Thanks again!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <brock@cloudera.com> wrote:

> Hi Devin,
>
> Please find my response below.
>
> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:
>
>>
>> So, if I understand your position on sizing the source properly, you are
>> saying that the "fsync" operation is the costly part - it locks the device
>> it is flushing to until the operation completes, and takes some time, so if
>> you are committing small batches to the channel frequently, you are
>> monopolizing the device frequently
>>
>
> Correct, when using file channel, small batches spend most of the time
> actually performing fsyncs.
>
>
>> , but if you set the batch size at the source large enough,
>>
>
> The language here is troublesome because "source" is overloaded. The term
> "source" could refer to the flume source or the "source of events" for a
> tiered architecture. Additionally some flume sources cannot control batch
> size (avro source, http source, syslog) and some have a batch size + a
> configured timeout (exec source) that still results in small batches most
> of the time.
>
> When using file channel the upstream "source" should send large batches of
> events. This might be the source connected directly to the file channel or
> in a tiered architecture with say n application servers each running a
> local agent which uses memory channel and then forwards events to a
> "collector" tier which uses file channel. In either case the upstream
> "sources" should use a large batch size.
>
>
>> you will "take" from the source less frequently, with more data committed
>> in every operation.
>>
>
> The concept here is correct - larger batch sizes result in a larger number
> of I/Os per fsync, thus increasing the throughput of the system.
>
>> Reading goes much faster, and HDFS will manage disk scheduling through
>> RecordWriter in the HDFS sink, so those are not as problematic - is that
>> accurate?
>>
>
> Just to level set for anyone reading this, File Channel doesn't use HDFS,
> HDFS is not aware of File Channel, and the disks we are referring to are
> disks used by the File Channel not HDFS.
>
>
>> So, if you are using a syslog source, that doesn't really offer a batch
>> size parameter, would you set up a tiered flow with an Avro hop in the
>> middle to aggregate log streams?
>>
>
> Yes, that is a common and recommended configuration. Large setups will
> have a local agent using memory channel, a first tier using memory channel
> and then a second tier using file channel.
>
>
>> Something like syslog source --> memory channel --> Avro sink --> Avro
>> source (large batch) --> file channel --> HDFS sink(s), for example?
>>
>
> Avro Source doesn't have a batch size parameter, so here you need to set a
> large batch size at the Avro Sink layer.
>
>
>> I appreciate the help you've given on this topic. It's also good to know
>> that the best practices are going into the doc, that will push everything
>> forward. I've read the Packt publishing book on Flume but it didn't get
>> into as much detail as I would like. The Cloudera blogs have been really
>> helpful too.
>>
>> Thanks so much!
>>
>
> No problem!  Thank you for using our software!
>
> Brock
>
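
For anyone finding this thread later, the flow discussed above (syslog
source, memory channel, Avro sink, then Avro source, file channel, HDFS
sink) might be configured roughly as follows. This is only a sketch under
stated assumptions: the agent names (edge, collector), component names,
hosts, ports, paths, and the batch size of 1000 are illustrative
placeholders, not values recommended anywhere in this thread. The point it
illustrates is that the batch size is set on the Avro sink of the upstream
agent, since the Avro source has no such parameter.

```properties
# Hypothetical names, hosts, ports, paths, and sizes throughout; adjust to your setup.

# --- Edge agent: syslog source -> memory channel -> Avro sink ---
edge.sources = syslogSrc
edge.channels = memCh
edge.sinks = avroSink

edge.sources.syslogSrc.type = syslogtcp
edge.sources.syslogSrc.host = 0.0.0.0
edge.sources.syslogSrc.port = 5140
edge.sources.syslogSrc.channels = memCh

edge.channels.memCh.type = memory
edge.channels.memCh.capacity = 100000
# Must be at least as large as the Avro sink batch size below.
edge.channels.memCh.transactionCapacity = 1000

edge.sinks.avroSink.type = avro
edge.sinks.avroSink.hostname = collector.example.com
edge.sinks.avroSink.port = 4545
# The large batch is configured here, on the sink feeding the file channel tier.
edge.sinks.avroSink.batch-size = 1000
edge.sinks.avroSink.channel = memCh

# --- Collector agent: Avro source -> file channel -> HDFS sink ---
collector.sources = avroSrc
collector.channels = fileCh
collector.sinks = hdfsSink

collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = fileCh

collector.channels.fileCh.type = file
collector.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
collector.channels.fileCh.dataDirs = /var/lib/flume/data
# Must be at least as large as the incoming and outgoing batch sizes.
collector.channels.fileCh.transactionCapacity = 1000

collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode.example.com/flume/syslog
collector.sinks.hdfsSink.hdfs.batchSize = 1000
collector.sinks.hdfsSink.channel = fileCh
```

Both agents are shown in one file for brevity; in practice each agent would
normally read its own configuration on its own host. The right batch size
depends on event size and disk behaviour, so treat 1000 as a starting point
to measure against rather than a recommendation.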
