flume-user mailing list archives

From iain wright <iainw...@gmail.com>
Subject Re: File Channel Best Practice
Date Tue, 17 Dec 2013 17:50:23 GMT
I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
this, as it seems like a good use case.

something like:

pool
  sdaN - ZIL (enterprise grade ssd with capacitor/battery for persisting
buffers in event of sudden power loss)
  mirror
    sda1
    sda2
  mirror
    sda3
    sda4

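A rough sketch of how a layout like that might be created with zpool; the pool
name and all device names below are placeholders, so substitute your actual
disks (and note the log device should be that enterprise-grade SSD):

```shell
# Create a pool of two mirrored spindle vdevs plus a dedicated SSD
# log device (SLOG). All device names here are placeholders.
zpool create flumepool \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf \
  log /dev/sdb

# Leave sync at the default so synchronous writes (Flume's fsyncs)
# are acknowledged once they hit the SSD log, not the spindles.
zfs set sync=standard flumepool

# Verify the layout: the "logs" section should list the SSD.
zpool status flumepool
```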
there's probably further tuning that can be done within ZFS as well, but I
believe the ZIL will allow for immediate responses to Flume's
checkpoint/data fsyncs while the "actual data" is flushed asynchronously
to the spindles.
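For reference, pointing the file channel at that pool is just the usual
checkpointDir/dataDirs properties; the agent/channel names and the mount
point below are made up for illustration:

```
# Hypothetical agent/channel names; /flumepool is wherever the pool mounts.
agent1.channels.fc1.type = file
agent1.channels.fc1.checkpointDir = /flumepool/flume/checkpoint
agent1.channels.fc1.dataDirs = /flumepool/flume/data
```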

Haven't tried this and YMMV. Some good reading available here:
https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/

Cheers


On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <dsuiter@rdx.com> wrote:

> Hi,
>
> There has been a lot of discussion about file channel speed today, and I
> have had a dilemma I was hoping for some feedback on, since the topic is
> hot.
>
>  Regarding this:
> "Hi,
>
> 1) You are only using a single disk for file channel and it looks like a
> single disk for both checkpoint and data directories therefore throughput
> is going to be extremely slow."
>
> How do you solve in a practical sense the requirement for file channel to
> have a range of disks for best R/W speed, yet still have network visibility
> to source data sources and the Hadoop cluster at the same time?
>
> It seems like for production file channel implementation, the best
> solution is to give Flume a dedicated server somewhere near the edge with a
> JBOD pile properly mounted and partitioned. But that adds to implementation
> cost.
>
> The alternative seems to be to run Flume on a physical Cloudera Manager
> SCM server that has some extra disks, or run Flume agents concurrent with
> datanode processes on worker nodes, but those don't seem good to do,
> especially piggybacking on worker nodes, and file channel > HDFS will
> compound the issue...
>
> I know the namenode should definitely not be involved.
>
> I suppose you could virtualize a few servers on a properly networked host
> with a fast SAN/NAS connection and get by OK, but that will merge your
> parallelization at some point...
>
> Any ideas on the subject?
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
