flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: File Channel Best Practice
Date Tue, 17 Dec 2013 18:17:24 GMT
Hi,

I'd also add that the biggest issue I see with the file channel is batch
size at the source. Long story short, the file channel was written to
guarantee no data loss. To do that, when a transaction is committed we need
to perform an "fsync" on the disk the transaction was written to. fsyncs
are very expensive, so to obtain good performance the source must have
written a large batch of data per transaction. Here is some more
information on this topic:

http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
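
To get a rough feel for the batching effect outside Flume, here is a
minimal Python sketch. It is an illustration only, not Flume code: the
function name write_events, the event sizes and the temporary file are
all made up for the example. It writes the same events to a file and
calls fsync once per batch, so a larger batch means fewer fsyncs for the
same amount of data. Timings will vary a lot with the disk and
filesystem.

```python
import os
import tempfile
import time


def write_events(path, events, batch_size):
    """Write events to path, calling fsync once per batch.

    Returns (fsync_count, elapsed_seconds). This mimics only the
    commit-per-batch pattern; it is not how the file channel is
    actually implemented.
    """
    fsyncs = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        for i in range(0, len(events), batch_size):
            for event in events[i:i + batch_size]:
                f.write(event)
            f.flush()             # hand Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to persist it to disk
            fsyncs += 1
    return fsyncs, time.perf_counter() - start


if __name__ == "__main__":
    # 200 hypothetical events of about 100 bytes each.
    events = [b"x" * 100 + b"\n" for _ in range(200)]
    with tempfile.TemporaryDirectory() as d:
        for batch_size in (1, 20, 200):
            n, secs = write_events(os.path.join(d, "log"), events, batch_size)
            print(f"batch_size={batch_size}: {n} fsyncs, {secs:.4f}s")
```

On spinning disks the batch_size=1 run would be expected to be the
slowest by a wide margin, since every event pays for its own fsync.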

Brock


On Tue, Dec 17, 2013 at 11:50 AM, iain wright <iainwrig@gmail.com> wrote:

> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
> this, as it seems like a good use case.
>
> something like:
>
> pool
>   sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
> buffers in event of sudden power loss)
>   mirror
>     sda1
>     sda2
>   mirror
>     sda3
>     sda4
>
> There's probably further tuning that can be done within ZFS as well, but I
> believe the ZIL will allow immediate responses to Flume's
> checkpoint/data fsyncs while the "actual data" is flushed asynchronously
> to the spindles.
>
> Haven't tried this and YMMV. Some good reading available here:
> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>
> Cheers
>
>
> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <dsuiter@rdx.com> wrote:
>
>> Hi,
>>
>> There has been a lot of discussion about file channel speed today, and I
>> have had a dilemma I was hoping for some feedback on, since the topic is
>> hot.
>>
>>  Regarding this:
>> "Hi,
>>
>> 1) You are only using a single disk for file channel and it looks like a
>> single disk for both checkpoint and data directories therefore throughput
>> is going to be extremely slow."
>>
>> In a practical sense, how do you meet the file channel's requirement for
>> a range of disks for best R/W speed, yet still have network visibility
>> to the data sources and the Hadoop cluster at the same time?
>>
>> It seems like, for a production file channel implementation, the best
>> solution is to give Flume a dedicated server somewhere near the edge with a
>> JBOD pile properly mounted and partitioned. But that adds to implementation
>> cost.
>>
>> The alternative seems to be to run Flume on a physical Cloudera Manager
>> SCM server that has some extra disks, or to run Flume agents alongside
>> datanode processes on worker nodes, but neither seems like a good idea,
>> especially piggybacking on worker nodes, and file channel > HDFS will
>> compound the issue...
>>
>> I know the namenode should definitely not be involved.
>>
>> I suppose you could virtualize a few servers on a properly networked host
>> with a fast SAN/NAS connection and get by OK, but that will merge your
>> parallelization at some point...
>>
>> Any ideas on the subject?
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
