Hi,

There has been a lot of discussion about file channel speed today, and since the topic is hot, I have a dilemma I was hoping to get some feedback on.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a single disk for both checkpoint and data directories therefore throughput is going to be extremely slow."
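
If I am reading that right, the split is expressed in the channel config, with the checkpoint directory and the data directories pointed at different spindles. A minimal sketch of just the channel section, assuming an agent named "agent" and placeholder mount points:

  agent.channels.fc1.type = file
  # checkpoint directory on its own disk
  agent.channels.fc1.checkpointDir = /mnt/disk1/flume/checkpoint
  # data directories on separate disks, comma-separated
  agent.channels.fc1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data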

In practical terms, how do you satisfy the file channel's requirement for a range of disks for best read/write speed, while still keeping network visibility to both the upstream data sources and the Hadoop cluster at the same time?

It seems like for a production file channel deployment, the best solution is to give Flume a dedicated server somewhere near the edge with a JBOD pile properly mounted and partitioned. But that adds to implementation cost.
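
Just to sketch the kind of agent I have in mind for a box like that (the names, port, and paths are hypothetical, and the Avro source / HDFS sink are only examples):

  agent.sources = src1
  agent.channels = fc1
  agent.sinks = snk1

  # network-facing source listening for upstream events
  agent.sources.src1.type = avro
  agent.sources.src1.bind = 0.0.0.0
  agent.sources.src1.port = 4141
  agent.sources.src1.channels = fc1

  # file channel fanned out across the JBOD mounts
  agent.channels.fc1.type = file
  agent.channels.fc1.checkpointDir = /data/1/flume/checkpoint
  agent.channels.fc1.dataDirs = /data/2/flume/data,/data/3/flume/data,/data/4/flume/data

  # HDFS sink draining the channel into the cluster
  agent.sinks.snk1.type = hdfs
  agent.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/events
  agent.sinks.snk1.channel = fc1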

The alternative seems to be to run Flume on a physical Cloudera Manager (SCM) server that has some extra disks, or to run Flume agents alongside datanode processes on the worker nodes. Neither seems like a good idea, especially piggybacking on worker nodes, since a file channel feeding an HDFS sink would put the channel's disk writes and the datanode's block writes in contention for the same spindles...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host with a fast SAN/NAS connection and get by OK, but the shared host and storage link will merge your parallel I/O back into a single pipe at some point...

Any ideas on the subject?

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556
www.rdx.com