flume-user mailing list archives

From iain wright <iainw...@gmail.com>
Subject Re: File Channel Best Practice
Date Wed, 18 Dec 2013 18:23:22 GMT
Hi Brock,

Just curious here and please forgive my ignorance :)

When batching at the source for a file channel, does the source poll on a
combination of elapsed time and event count?

For instance, with a 1k batch size, do 1000 new events have to be available
before anything is loaded into the channel? Or, if say 5 seconds pass and
only 250 new events are available in the source, will it grab those on some
time-based interval?
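
To make the question concrete, here is a minimal sketch of the "count or
time, whichever comes first" behaviour I am asking about. It is not Flume's
API: take_batch, poll and idle_sleep_s are hypothetical names used only for
illustration, and I am not claiming any Flume source works this way.

```python
import time
from typing import Callable, List, Optional


def take_batch(
    poll: Callable[[], Optional[str]],
    batch_size: int,
    timeout_s: float,
    idle_sleep_s: float = 0.01,
) -> List[str]:
    """Collect events until the batch is full or the timeout expires.

    `poll` is a hypothetical callable that returns the next available
    event, or None when nothing is available right now.
    """
    deadline = time.monotonic() + timeout_s
    batch: List[str] = []
    while len(batch) < batch_size and time.monotonic() < deadline:
        event = poll()
        if event is None:
            # Nothing available yet; wait briefly instead of spinning.
            time.sleep(idle_sleep_s)
            continue
        batch.append(event)
    # May be shorter than batch_size (or empty) if the timeout fired first.
    return batch
```

With batch_size=1000 and timeout_s=5, a source holding only 250 events
would hand over those 250 once the 5 seconds elapse, rather than waiting
for the remaining 750.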

Thank you,

Iain Wright
Cell: (562) 852-5916

On Wed, Dec 18, 2013 at 9:51 AM, Brock Noland <brock@cloudera.com> wrote:

> FYI I am trying to capture some of the best practices in the Flume doc
> itself:
> https://issues.apache.org/jira/browse/FLUME-2277
> On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <brock@cloudera.com> wrote:
>> Hi,
>> I'd also add that the biggest issue I see with the file channel is batch
>> size at the source. Long story short, the file channel was written to
>> guarantee no data loss. To do that, when a transaction is committed we
>> need to perform an "fsync" on the disk the transaction was written to.
>> fsyncs are very expensive, so to obtain good performance the source must
>> write a large batch of data per transaction. Here is some more
>> information on this topic:
>> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>> Brock
>> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <iainwrig@gmail.com> wrote:
>>> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
>>> this, as it seems like a good use case.
>>> Something like:
>>> pool
>>>   sdaN - ZIL (enterprise-grade SSD with capacitor/battery for persisting
>>> buffers in the event of sudden power loss)
>>>   mirror
>>>     sda1
>>>     sda2
>>>   mirror
>>>     sda3
>>>     sda4
>>> There's probably further tuning that can be done within ZFS as well, but
>>> I believe the ZIL will allow immediate responses to Flume's
>>> checkpoint/data fsyncs while the "actual data" is flushed asynchronously
>>> to the spindles.
>>> I haven't tried this, and YMMV. Some good reading is available here:
>>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>> Cheers
>>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <dsuiter@rdx.com> wrote:
>>>> Hi,
>>>> There has been a lot of discussion about file channel speed today, and
>>>> I have a dilemma I was hoping to get some feedback on, since the topic
>>>> is hot.
>>>> Regarding this:
>>>> "Hi,
>>>> 1) You are only using a single disk for file channel and it looks like
>>>> a single disk for both checkpoint and data directories therefore throughput
>>>> is going to be extremely slow."
>>>> In a practical sense, how do you satisfy the file channel's need for
>>>> multiple disks for best R/W speed while still having network visibility
>>>> to both the data sources and the Hadoop cluster?
>>>> For a production file channel implementation, the best solution seems
>>>> to be giving Flume a dedicated server somewhere near the edge with a
>>>> JBOD pile properly mounted and partitioned, but that adds to
>>>> implementation cost.
>>>> The alternatives seem to be running Flume on a physical Cloudera
>>>> Manager SCM server that has some extra disks, or running Flume agents
>>>> alongside datanode processes on worker nodes. Neither seems like a good
>>>> idea, especially piggybacking on worker nodes, and file channel > HDFS
>>>> will compound the issue...
>>>> I know the namenode should definitely not be involved.
>>>> I suppose you could virtualize a few servers on a properly networked
>>>> host with a fast SAN/NAS connection and get by OK, but that will merge
>>>> your parallelization at some point...
>>>> Any ideas on the subject?
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
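
The fsync cost described in the quoted message can be illustrated with a
small sketch. This is not the file channel's implementation: the function
name write_with_commits and the one-event-per-line format are assumptions
made for illustration only. It shows only that one fsync per "commit"
means larger batches need far fewer fsyncs for the same number of events.

```python
import os
from typing import List


def write_with_commits(path: str, events: List[bytes], batch_size: int) -> int:
    """Append events to `path`, calling fsync once per batch.

    Returns the number of fsync calls made, so the cost of different
    batch sizes can be compared.
    """
    fsyncs = 0
    with open(path, "ab") as f:
        for start in range(0, len(events), batch_size):
            for event in events[start:start + batch_size]:
                f.write(event + b"\n")
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to persist the data to disk
            fsyncs += 1           # one "commit" per batch
    return fsyncs
```

Under these assumptions, writing 1000 events with batch_size=1 issues 1000
fsyncs, while batch_size=1000 issues a single one for the same data.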
