flume-user mailing list archives

From Chris Schneider <ch...@christopher-schneider.com>
Subject Re: Fastest way to get data into flume?
Date Thu, 27 Mar 2014 19:29:53 GMT
Thanks for all the great replies.

My specific situation is a bit more complex than I let on initially.

Flume running multiple agents will absolutely be able to scale to the size
we need for production.  But since our system is time-based, waiting for
real-world measurements to arrive, we have a simulation layer that generates
convincing fake data to push in for development & demos (i.e., creating
events at 1000x accelerated time, so we can see the effects of our changes
without waiting weeks).

So we have a VM (Vagrant + VirtualBox) running HDFS & Flume on our laptops
as we're doing development.  I suppose the memory channel is fine in this
case, since it's all test data, but I need maximum single-agent speed to
support the higher time accelerations I want.

Unfortunately, our production system demands horizontal scaling (Flume is
great at that), while our dev environment would be best served by vertical
scaling (not really Flume's goal, from what I can tell).

Are there any tricks / tweaks that can get single-agent speeds up?  What's
the fastest (maybe not 100% safe?) source type? Can we minimize the cost of
ACKing messages in the source?
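For context, the sort of dev-only config I have in mind looks roughly like
this - an avro source feeding a memory channel with large capacities, trading
durability for speed (agent/component names are placeholders, and the
capacity numbers are guesses I'd expect to tune):

```properties
# Dev-only sketch: memory channel trades durability for single-agent speed.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Large capacities so bursts from the accelerated simulator don't block the source.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 10000
```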

On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <mkeane@conversantmedia.com> wrote:

> I tried to do a proof of concept with the Netcat source on 1.3.0 or 1.3.1
> and it failed miserably - I was able to make a change to improve its
> performance, arguably a bug fix (I think it was the socket acknowledgement
> it was expecting), but the Netcat source was still my bottleneck.
> Have you read the blogs on performance tuning?  I'm not sure where you are
> in your flume implementation, but I found them helpful:
> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 &
> https://blogs.apache.org/flume/entry/apache_flume_filechannel
> Since you need persistent storage, I believe your only option is still the
> file channel.  To get the performance you need, you'll need dedicated disks
> for the queue and write-ahead log - I had good luck with a solid state
> drive.  With a single disk drive, performance was awful.
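As a sketch, assuming 1.3.x-era property names, dedicating disks amounts to
pointing the channel's checkpointDir and dataDirs at separate mount points
(the paths here are placeholders):

```properties
# File channel with the checkpoint and data directories on separate physical disks,
# so checkpoint writes don't contend with write-ahead log appends.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /disk1/flume/checkpoint
a1.channels.c1.dataDirs = /disk2/flume/data,/disk3/flume/data
a1.channels.c1.capacity = 10000000
a1.channels.c1.transactionCapacity = 10000
```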
> To get the throughput I wanted with compression I had one source tied to 6
> file channels with compression on each channel.  Perhaps there is a better
> way but that is how I got it working.
> We also configured forced write-back on the CentOS boxes serving as flume
> agents.  That was an optimization our IT Operations team made that helped
> throughput.  That is a skill I don't have, but I believe it does put you at
> risk of data loss if the server fails, because it does more caching before
> flushing to disk.
> We are currently fluming between 40 and 50 billion log lines per day
> (10-12TB of data) from 14 servers in the "collector tier", sinking the data
> to 8 servers in the "storage tier" that write to HDFS (MapR's
> implementation) without a problem.  We had no problem running with half the
> servers; however, we configured failover and paired up the servers for that
> purpose.  Which, by the way, works flawlessly - we're able to pull one
> server out for maintenance and add it back in with no problem.
> Here are some high level points to our implementation.
> 1.  Instead of the netcat source I made use of the Embedded Agent.  When I
> created an event to flume (EventBuilder.withBody(payload, hdrs)) I put a
> configurable number of log lines in the payload, usually 200 lines of log
> data.  Ultimately I went away from text data altogether and serialized
> 200 Avro "log objects" as an Avro data file byte array, and that was my
> payload.
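As a language-neutral illustration of the batching idea (not the actual Avro
serialization Mike describes), packing many log lines into one length-prefixed
event payload looks roughly like this; pack_lines/unpack_lines are made-up
names for the sketch:

```python
import struct

def pack_lines(lines):
    """Pack a batch of log lines into one length-prefixed byte payload,
    suitable for use as a single Flume event body."""
    out = bytearray()
    for line in lines:
        data = line.encode("utf-8")
        out += struct.pack(">I", len(data))  # 4-byte big-endian length prefix
        out += data
    return bytes(out)

def unpack_lines(payload):
    """Recover the original lines from a packed payload (the sink side)."""
    lines, i = [], 0
    while i < len(payload):
        (n,) = struct.unpack_from(">I", payload, i)
        i += 4
        lines.append(payload[i:i + n].decode("utf-8"))
        i += n
    return lines

# One event now carries 200 lines instead of one, amortizing per-event overhead.
batch = ["log line %d" % k for k in range(200)]
payload = pack_lines(batch)
assert unpack_lines(payload) == batch
```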
> 2.  Keep your batch size large.  I set mine to 50 events - at 200 lines
> each, that's 10,000 log lines (or objects) in a single batch.
> 3.  You will get duplicates so be prepared to either customize flume to
> prevent duplicates (our solution) or write map reduce jobs to remove
> duplicates.
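A minimal sketch of the post-hoc dedup idea - hash each event body and drop
repeats.  (This is illustrative only; Mike's actual solution customized Flume
itself, and in practice you'd more likely key on a unique event-ID header than
on the full body.)

```python
import hashlib

def dedupe(events):
    """Drop events whose body has already been seen, turning Flume's
    at-least-once delivery into effectively-once for this batch."""
    seen = set()
    unique = []
    for body in events:
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique

assert dedupe(["a", "b", "a"]) == ["a", "b"]
```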
> Regards,
> Mike
> ________________________________
> From: Andrew Ehrlich [andrew@aehrlich.com]
> Sent: Thursday, March 27, 2014 1:07 PM
> To: user@flume.apache.org
> Subject: Re: Fastest way to get data into flume?
> What about having more than one flume agent?
> You could have two agents that read the small messages and sink to HDFS,
> or two agents that read the messages, serialize them, and send them to a
> third agent which sinks them into HDFS.
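A minimal sketch of that tiered layout - each collector agent's avro sink
forwards to a downstream aggregator agent's avro source (hostnames, ports,
and agent names here are placeholders):

```properties
# On each collector agent: avro sink forwarding batches to the aggregator.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = aggregator.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.batch-size = 1000
a1.sinks.k1.channel = c1

# On the aggregator agent: avro source receiving from the collectors,
# feeding a channel whose sink writes to HDFS.
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
```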
> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <
> chris@christopher-schneider.com> wrote:
> I have a fair bit of data continually being created in the form of
> smallish messages (a few hundred bytes), which needs to enter flume, and
> eventually sink into HDFS.
> I need to be sure that the data lands in persistent storage and won't be
> lost, but otherwise throughput isn't important. It just needs to be fast
> enough to not back up.
> I'm running into a bottleneck in the initial ingestion of data.
> I've tried the netcat source and the thrift source, but both capped
> out at a thousand or so records per second.
> Batching up the thrift API items into sets of 10 and using appendBatch is
> a pretty large speedup, but still not enough.
> Here's a gist of my ruby test script, and some example runs, and my config.
> https://gist.github.com/cschneid/9792305
> 1.  Are there any obvious performance changes I can do to speed up
> ingestion?
> 2. How fast can flume reasonably go? Should I switch my source to be
> something else that's faster? What?
> 3. Is there a better tool for this kind of task? (rapid, safe ingestion of
> small messages)
> Thanks!
> Chris
