flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: File Channel Best Practice
Date Wed, 18 Dec 2013 20:10:01 GMT
No problem. I am glad you started this email discussion and as I said
earlier, thank you for using our software! :)


On Wed, Dec 18, 2013 at 2:02 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:

> Yes, excellent - I was a little muddy on some of the finer points, and I
> am glad you clarified for the sake of other mailing list users - I forgot I
> have the whole context in my head, but other readers might not.
>
> Thanks again!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <brock@cloudera.com> wrote:
>
>> Hi Devin,
>>
>> Please find my response below.
>>
>> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <dsuiter@rdx.com> wrote:
>>
>>>
>>> So, if I understand your position on sizing the source properly, you are
>>> saying that the "fsync" operation is the costly part - it locks the device
>>> it is flushing to until the operation completes, and takes some time, so if
>>> you are committing small batches to the channel frequently, you are
>>> monopolizing the device frequently
>>>
>>
>> Correct, when using file channel, small batches spend most of the time
>> actually performing fsyncs.
>>
>>
>>> , but if you set the batch size at the source large enough,
>>>
>>
>> The language here is troublesome because "source" is overloaded. The term
>> "source" could refer to the flume source or the "source of events" for a
>> tiered architecture. Additionally some flume sources cannot control batch
>> size (avro source, http source, syslog) and some have a batch size + a
>> configured timeout (exec source) that still results in small batches most
>> of the time.
>>
>> When using file channel, the upstream "source" should send large batches
>> of events. This might be the source connected directly to the file channel
>> or, in a tiered architecture, say n application servers each running a
>> local agent which uses memory channel and then forwards events to a
>> "collector" tier which uses file channel. In either case the upstream
>> "sources" should use a large batch size.
>>
>>
>>> you will "take" from the source less frequently, with more data
>>> committed in every operation.
>>>
>>
>> The concept here is correct - larger batch sizes result in a larger number
>> of I/Os per fsync, thus increasing the throughput of the system.
>>
>>> Reading goes much faster, and HDFS will manage disk scheduling through
>>> RecordWriter in the HDFS sink, so those are not as problematic - is that
>>> accurate?
>>>
>>
>> Just to level set for anyone reading this: File Channel doesn't use HDFS,
>> HDFS is not aware of File Channel, and the disks we are referring to are
>> the disks used by File Channel, not HDFS.
>>
>>
>>> So, if you are using a syslog source, that doesn't really offer a batch
>>> size parameter, would you set up a tiered flow with an Avro hop in the
>>> middle to aggregate log streams?
>>>
>>
>> Yes, that is a common and recommended configuration. Large setups will
>> have a local agent using memory channel, a first tier using memory channel
>> and then a second tier using file channel.
>>
>>
>>> Something like syslog source>--memory channel-->Avro sink > Avro source
>>> (large batch) >--file channel-->HDFS sink(s) for example?
>>>
>>
>> Avro Source doesn't have a batch size parameter. Here you need to set a
>> large batch at the Avro Sink layer.
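>>
>> A minimal sketch of such a flow, to make the point concrete. The agent
>> names ("edge", "collector"), component names, hostnames, ports,
>> directories and batch values are all placeholders for illustration, not
>> tuned recommendations. The batch is set with batch-size on the Avro
>> Sink, and the channel it drains needs a transactionCapacity at least
>> that large.
>>
>> ```properties
>> # Agent "edge" (hypothetical name): syslog source -> memory channel -> Avro sink
>> edge.sources = syslogIn
>> edge.channels = mem
>> edge.sinks = toCollector
>>
>> edge.sources.syslogIn.type = syslogtcp
>> edge.sources.syslogIn.host = 0.0.0.0
>> edge.sources.syslogIn.port = 5140
>> edge.sources.syslogIn.channels = mem
>>
>> edge.channels.mem.type = memory
>> edge.channels.mem.capacity = 100000
>> # Must be >= the Avro sink batch-size below
>> edge.channels.mem.transactionCapacity = 1000
>>
>> edge.sinks.toCollector.type = avro
>> edge.sinks.toCollector.hostname = collector.example.com
>> edge.sinks.toCollector.port = 4545
>> # The large batch is configured here, not on the Avro source
>> edge.sinks.toCollector.batch-size = 1000
>> edge.sinks.toCollector.channel = mem
>>
>> # Agent "collector" (hypothetical name): Avro source -> file channel -> HDFS sink
>> collector.sources = avroIn
>> collector.channels = fc
>> collector.sinks = toHdfs
>>
>> collector.sources.avroIn.type = avro
>> collector.sources.avroIn.bind = 0.0.0.0
>> collector.sources.avroIn.port = 4545
>> collector.sources.avroIn.channels = fc
>>
>> collector.channels.fc.type = file
>> collector.channels.fc.checkpointDir = /flume/checkpoint
>> collector.channels.fc.dataDirs = /flume/data
>>
>> collector.sinks.toHdfs.type = hdfs
>> collector.sinks.toHdfs.hdfs.path = hdfs://namenode.example.com/flume/events
>> collector.sinks.toHdfs.hdfs.batchSize = 1000
>> collector.sinks.toHdfs.channel = fc
>> ```
>>
>> Each Avro Sink batch arrives at the collector as one put transaction on
>> the file channel, so a larger batch-size means fewer fsyncs per event.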
>>
>>
>>> I appreciate the help you've given on this topic. It's also good to know
>>> that the best practices are going into the doc, that will push everything
>>> forward. I've read the Packt publishing book on Flume but it didn't get
>>> into as much detail as I would like. The Cloudera blogs have been really
>>> helpful too.
>>>
>>> Thanks so much!
>>>
>>
>> No problem!  Thank you for using our software!
>>
>> Brock
>>
>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
