flume-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: File Channel Best Practice
Date Tue, 17 Dec 2013 17:49:15 GMT
Thanks Paul, that's good to know.

My cluster is sort of a combination of test and production, so we don't
tinker with the real cluster config often. My dev pseudo-cluster on a VM
doesn't really cater well to file-channel testing, and my source is only
making 1.1 events/minute anyway, so this has been tough for me to examine
closely.

I appreciate the time you took to share.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 17, 2013 at 12:43 PM, Paul Chavez <
pchavez@verticalsearchworks.com> wrote:

> We co-locate our flume agents on our data nodes in order to have access to
> many ‘spindles’ for the file channels. We have a small cluster (10 nodes)
> so these are also our task tracker nodes and we haven’t seen any huge
> performance issues.
>
>
>
> For reference, our typical event ingestion rate is between 2k and 5k
> events per second under ‘normal’ production load. I recently had to
> backfill a couple of weeks of web logs though, and took the opportunity to
> examine max throughput rates and how heavy MR load affected things. During
> that ‘test’ we stabilized at about 12k events per second written to HDFS;
> that was a single agent using 2 HDFS sinks taking from one file channel. As
> far as I could tell my bottleneck was in the Avro hop between my collector
> and writer agents, not in the HDFS sinks. When we had all MR slots used by
> large batch jobs for extended amounts of time the event throughput degraded
> to about 3500 events/sec.
>
>
>
> I know these are just anecdotal data points, but wanted to share my
> experience with flume agents located on the actual data/task nodes
> themselves. I have done very little optimization aside from separating the
> file channel data/log directories onto separate drives.
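>
> For illustration, a file channel laid out across separate drives and
> drained by two HDFS sinks might be configured roughly as follows. This is
> only a sketch: the agent name, component names, mount points and HDFS path
> are hypothetical placeholders, not the settings from the setup described
> above, and the source definition is omitted.
>
> ```properties
> # Hypothetical agent "writer": one file channel, two HDFS sinks
> writer.channels = fc1
> writer.sinks = hdfs1 hdfs2
>
> # Checkpoint on one drive, data logs spread over two other drives
> # (all mount points below are assumed, not from the original setup)
> writer.channels.fc1.type = file
> writer.channels.fc1.checkpointDir = /mnt/disk1/flume/checkpoint
> writer.channels.fc1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data
>
> # Both sinks take from the same channel; distinct file prefixes keep
> # their output files from colliding in the shared HDFS directory
> writer.sinks.hdfs1.type = hdfs
> writer.sinks.hdfs1.channel = fc1
> writer.sinks.hdfs1.hdfs.path = /flume/events
> writer.sinks.hdfs1.hdfs.filePrefix = writer-1
>
> writer.sinks.hdfs2.type = hdfs
> writer.sinks.hdfs2.channel = fc1
> writer.sinks.hdfs2.hdfs.path = /flume/events
> writer.sinks.hdfs2.hdfs.filePrefix = writer-2
> ```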
>
>
>
> -Paul Chavez
>
>
>
> *From:* Devin Suiter RDX [mailto:dsuiter@rdx.com]
> *Sent:* Tuesday, December 17, 2013 8:30 AM
> *To:* user@flume.apache.org
> *Subject:* File Channel Best Practice
>
>
>
> Hi,
>
>
>
> There has been a lot of discussion about file channel speed today, and I
> have had a dilemma I was hoping for some feedback on, since the topic is
> hot.
>
>
>
> Regarding this:
>
> "Hi,
>
>
>
> 1) You are only using a single disk for file channel and it looks like a
> single disk for both checkpoint and data directories therefore throughput
> is going to be extremely slow."
>
>
>
> How do you solve, in a practical sense, the requirement for file channel to
> have a range of disks for best R/W speed, yet still have network visibility
> to the data sources and the Hadoop cluster at the same time?
>
>
>
> It seems like for production file channel implementation, the best
> solution is to give Flume a dedicated server somewhere near the edge with a
> JBOD pile properly mounted and partitioned. But that adds to implementation
> cost.
>
>
>
> The alternative seems to be to run Flume on a physical Cloudera Manager
> SCM server that has some extra disks, or to run Flume agents alongside
> datanode processes on worker nodes, but neither seems like a good idea,
> especially piggybacking on worker nodes, and file channel > HDFS will
> compound the issue...
>
>
>
> I know the namenode should definitely not be involved.
>
>
>
> I suppose you could virtualize a few servers on a properly networked host
> and a fast SAN/NAS connection and get by OK, but that will merge your
> parallelization at some point...
>
>
>
> Any ideas on the subject?
>
>
> *Devin Suiter*
>
> Jr. Data Solutions Software Engineer
>
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
