flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Chavez <pcha...@verticalsearchworks.com>
Subject RE: File Channel Best Practice
Date Tue, 17 Dec 2013 17:43:45 GMT
We co-locate our flume agents on our data nodes in order to have access to many 'spindles'
for the file channels. We have a small cluster (10 nodes) so these are also our task tracker
nodes and we haven't seen any huge performance issues.

For reference, our typical event ingestion rate is between 2k and 5k events per second under
'normal' production load. I recently had to backfill a couple of weeks of web logs though,
and took the opportunity to examine max throughput rates and how heavy MR load affected things.
During that 'test' we stabilized at about 12K events per second written to HDFS, that was
a single agent using 2 HDFS sinks taking from one file channel. As far as I could tell my
bottleneck was in the Avro hop between my collector and writer agents, not in the HDFS sinks.
When we had all MR slots used by large batch jobs for extended amounts of time the event throughput
degraded to about 3500 events/sec.

I know these are just anecdotal data points, but wanted to share my experience with flume
agents located on the actual data/task nodes themselves. I have done very little optimization
aside from separating the file channel data/log directories onto separate drives.

-Paul Chavez

From: Devin Suiter RDX [mailto:dsuiter@rdx.com]
Sent: Tuesday, December 17, 2013 8:30 AM
To: user@flume.apache.org
Subject: File Channel Best Practice

Hi,

There has been a lot of discussion about file channel speed today, and I have had a dilemma
I was hoping for some feedback on, since the topic is hot.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a single disk for both
checkpoint and data directories therefore throughput is going to be extremely slow."

How do you solve in a practical sense the requirement for file channel to have a range of
disks for best R/W speed, yet still have network visibility to source data sources and the
Hadoop cluster at the same time?

It seems like for production file channel implementation, the best solution is to give Flume
a dedicated server somewhere near the edge with a JBOD pile properly mounted and partitioned.
But that adds to implementation cost.

The alternative seems to be to run Flume on a  physical Cloudera Manager SCM server that has
some extra disks, or run Flume agents concurrent with datanode processes on worker nodes,
but those don't seem good to do, especially piggybacking on worker nodes, and file channel
> HDFS will compound the issue...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host and a fast SANS/NAS
connection and get by ok, but that will merge your parallelization at some point...

Any ideas on the subject?

Devin Suiter
Jr. Data Solutions Software Engineer
[cid:~WRD000.jpg]
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com<http://www.rdx.com/>

Mime
View raw message