flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: File Channel Best Practice
Date Wed, 18 Dec 2013 17:51:16 GMT
FYI I am trying to capture some of the best practices in the Flume doc
itself:

https://issues.apache.org/jira/browse/FLUME-2277
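
To make the batch-size point from my earlier message (quoted below) concrete, here is a minimal Python sketch. It is not Flume's code. The function name `write_batched` and the event format are made up for illustration, and it only assumes that each committed batch costs one fsync. It counts how many fsyncs are needed to persist the same events at different batch sizes.

```python
import os
import tempfile


def write_batched(path, events, batch_size):
    """Append events to path, one fsync per batch; return the fsync count.

    Hypothetical helper for illustration only, not part of Flume.
    """
    fsyncs = 0
    with open(path, "ab") as f:
        for start in range(0, len(events), batch_size):
            for event in events[start:start + batch_size]:
                f.write(event + b"\n")
            f.flush()             # move Python's buffer into the OS
            os.fsync(f.fileno())  # ask the OS to persist the data to disk
            fsyncs += 1           # one "commit" per batch
    return fsyncs


if __name__ == "__main__":
    events = [b"event-%d" % i for i in range(1000)]
    with tempfile.TemporaryDirectory() as d:
        for batch_size in (1, 100, 1000):
            path = os.path.join(d, "log-%d" % batch_size)
            print(batch_size, write_batched(path, events, batch_size))
```

For 1000 events, a batch size of 1 needs 1000 fsyncs, 100 needs 10, and 1000 needs 1. The data written is the same in each case, so the fixed cost of each fsync is spread over more events as the batch grows.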


On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <brock@cloudera.com> wrote:

> Hi,
>
> I'd also add that the biggest issue I see with the file channel is batch
> size at the source. Long story short, the file channel was written to
> guarantee no data loss. To do that, when a transaction is committed we
> need to perform an "fsync" on the disk the transaction was written to.
> fsyncs are very expensive, so to obtain good performance the source must
> write a large batch of data per transaction. Here is some more
> information on this topic:
>
> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>
> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>
> Brock
>
>
> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <iainwrig@gmail.com> wrote:
>
>> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
>> this, as it seems like a good use case.
>>
>> Something like:
>>
>> pool
>>   sdaN - ZIL (enterprise-grade SSD with capacitor/battery for persisting
>>          buffers in the event of sudden power loss)
>>   mirror
>>     sda1
>>     sda2
>>   mirror
>>     sda3
>>     sda4
>>
>> There's probably further tuning that can be done within ZFS as well, but I
>> believe the ZIL will allow for immediate responses to Flume's
>> checkpoint/data fsyncs while the "actual data" is flushed asynchronously
>> to the spindles.
>>
>> Haven't tried this and YMMV. Some good reading available here:
>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>
>> Cheers
>>
>>
>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <dsuiter@rdx.com> wrote:
>>
>>> Hi,
>>>
>>> There has been a lot of discussion about file channel speed today, and I
>>> have had a dilemma I was hoping for some feedback on, since the topic is
>>> hot.
>>>
>>> Regarding this:
>>> "Hi,
>>>
>>> 1) You are only using a single disk for file channel and it looks like a
>>> single disk for both checkpoint and data directories therefore throughput
>>> is going to be extremely slow."
>>>
>>> In a practical sense, how do you meet the file channel's need for
>>> multiple disks for best R/W speed, yet still have network visibility to
>>> the data sources and the Hadoop cluster at the same time?
>>>
>>> It seems like for a production file channel implementation, the best
>>> solution is to give Flume a dedicated server somewhere near the edge with
>>> a JBOD pile properly mounted and partitioned. But that adds to
>>> implementation cost.
>>>
>>> The alternative seems to be to run Flume on a physical Cloudera Manager
>>> SCM server that has some extra disks, or to run Flume agents alongside
>>> datanode processes on worker nodes. Neither seems like a good idea,
>>> especially piggybacking on worker nodes, and file channel > HDFS will
>>> compound the issue...
>>>
>>> I know the namenode should definitely not be involved.
>>>
>>> I suppose you could virtualize a few servers on a properly networked
>>> host with a fast SAN/NAS connection and get by OK, but that will merge
>>> your parallelization at some point...
>>>
>>> Any ideas on the subject?
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>



-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
