flume-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject File Channel Best Practice
Date Tue, 17 Dec 2013 16:30:16 GMT
Hi,

There has been a lot of discussion about file channel speed today, and since the
topic is hot, I have a dilemma I was hoping to get some feedback on.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a
single disk for both checkpoint and data directories therefore throughput
is going to be extremely slow."

In practical terms, how do you satisfy the file channel's requirement for a range
of disks to get the best R/W speed, while still keeping network visibility to
both the data sources and the Hadoop cluster?

It seems like, for a production file channel implementation, the best solution
is to give Flume a dedicated server somewhere near the edge with a JBOD pile
properly mounted and partitioned. But that adds to implementation cost.
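
For the sake of discussion, a minimal file channel configuration along those
lines might look something like the sketch below (the agent/channel names and
mount points are just hypothetical examples), with the checkpoint directory on
its own disk and the data directories spread across separate JBOD mounts:

  agent.channels.fc1.type = file
  # hypothetical mounts: checkpoint on one disk, data spread over the JBOD disks
  agent.channels.fc1.checkpointDir = /mnt/disk1/flume/checkpoint
  agent.channels.fc1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data,/mnt/disk4/flume/data

That only helps, of course, if those paths really are separate physical spindles,
which is exactly the hardware question I am asking about.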

The alternative seems to be to run Flume on a physical Cloudera Manager (SCM)
server that has some extra disks, or to run Flume agents concurrently with the
datanode processes on worker nodes. Neither of those seems like a good idea,
especially piggybacking on the worker nodes, where a file channel feeding HDFS
will only compound the disk contention...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host with a
fast SAN/NAS connection and get by OK, but the shared storage will collapse your
disk parallelization at some point...

Any ideas on the subject?

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
