flume-user mailing list archives

From Jeff Lord <jl...@cloudera.com>
Subject Re: Fastest way to get data into flume?
Date Thu, 27 Mar 2014 19:34:13 GMT
Increase your batch sizes
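
A couple of places where batch size shows up, as a rough sketch (the agent
and component names below are placeholders, not from your config):

  # HDFS sink: events per flush to HDFS (the default is only 100)
  a1.sinks.k1.hdfs.batchSize = 10000
  # the channel transaction must be able to hold your largest batch
  a1.channels.c1.transactionCapacity = 10000

On the client side the RPC batch matters too: sets of 10 with appendBatch
is still small - try hundreds or thousands of events per call.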


On Thu, Mar 27, 2014 at 12:29 PM, Chris Schneider <
chris@christopher-schneider.com> wrote:

> Thanks for all the great replies.
>
> My specific situation is a bit more complex than I let on initially.
>
> Flume running multiple agents will absolutely be able to scale to the size
> we need for production.  But since our system is time-based, waiting for
> real-world measurements to arrive, we have a simulation layer generating
> convincing fake data to push in for development & demos (i.e., creating
> events at 1000x accelerated time, so we can see the effects of our changes
> without waiting weeks).
>
> So we have a VM (Vagrant + VirtualBox) running HDFS & Flume on our laptops
> as we're doing development.  I suppose the memory channel is fine in this
> case, since it's all test data, but maximum single-agent speed is needed to
> support the higher time accelerations I want.
>
> Unfortunately, our production system demands a horizontally scaling system
> (flume is great), and our dev environment would be best with a vertically
> scaling system (not as much flume's goal, from what I can tell).
>
> Are there any tricks / tweaks that can get single-agent speeds up?  What's
> the fastest (maybe not 100% safe?) source type? Can we minimize the cost of
> ACKing messages in the source?
>
>
> On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <mkeane@conversantmedia.com> wrote:
>
>> I tried to do a proof of concept with the netcat source on 1.3.0 or 1.3.1
>> and it failed miserably.  I was able to make a change to improve its
>> performance, arguably a bug fix (I think it was the socket acknowledgement
>> it was expecting), but the netcat source was still my bottleneck.
>>
>> Have you read the blogs on performance tuning?  I'm not sure where you are
>> in your Flume implementation, but I found them helpful:
>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 and
>> https://blogs.apache.org/flume/entry/apache_flume_filechannel
>>
>> Since you need persistent storage, I believe your only option is still the
>> file channel.  To get the performance you need, you'll want dedicated disks
>> for the queue and the write-ahead log - I had good luck with a solid-state
>> drive.  With a single disk drive, performance was awful.
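>>
>> For illustration, a minimal sketch of that layout (the paths and
>> capacities below are made-up values, not ours):
>>
>>   agent.channels.fc1.type = file
>>   # put the checkpoint and data directories on separate dedicated disks
>>   agent.channels.fc1.checkpointDir = /disk1/flume/checkpoint
>>   agent.channels.fc1.dataDirs = /disk2/flume/data
>>   agent.channels.fc1.capacity = 1000000
>>   agent.channels.fc1.transactionCapacity = 10000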
>>
>> To get the throughput I wanted with compression, I had one source tied to
>> six file channels, with compression on each channel.  Perhaps there is a
>> better way, but that is how I got it working.
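>>
>> For anyone reproducing this with stock Flume: one way to tie a single
>> source to several channels is a multiplexing channel selector keyed on a
>> header.  The "shard" header and three-channel layout below are
>> illustrative assumptions, not our exact setup (and note that in stock
>> Flume, compression is an option on the Avro sink/source hop, e.g.
>> compression-type = deflate, rather than on the channel itself):
>>
>>   agent.sources.src1.channels = fc1 fc2 fc3
>>   agent.sources.src1.selector.type = multiplexing
>>   agent.sources.src1.selector.header = shard
>>   agent.sources.src1.selector.mapping.0 = fc1
>>   agent.sources.src1.selector.mapping.1 = fc2
>>   agent.sources.src1.selector.mapping.2 = fc3
>>   agent.sources.src1.selector.default = fc1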
>>
>> We also configured forced write-back on the CentOS boxes serving as Flume
>> agents.  That was an optimization our IT operations team made that helped
>> throughput.  It's not a skill I have, but I believe it does put you at
>> risk of data loss if the server fails, because it does more caching before
>> flushing to disk.
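>>
>> If that refers to the kernel's dirty-page writeback tuning (an assumption
>> on my part - it may instead have been a RAID controller cache setting),
>> the usual knobs look something like this, with illustrative values:
>>
>>   # /etc/sysctl.conf - let more dirty pages accumulate before flushing,
>>   # trading crash safety for throughput
>>   vm.dirty_background_ratio = 20
>>   vm.dirty_ratio = 60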
>>
>> We are currently fluming between 40 and 50 billion log lines per day
>> (10-12TB of data) from 14 "collector tier" servers sinking the data to 8
>> servers in the "storage tier" that write to HDFS (MapR's implementation)
>> without problem.  We could have handled it with half the servers, but we
>> configured failover and paired up the servers for that purpose.  Which, by
>> the way, works flawlessly - we are able to pull one server out for
>> maintenance and add it back in with no problem.
>>
>> Here are some high-level points about our implementation.
>>
>> 1.  Instead of the netcat source, I made use of the embedded agent.  When
>> I created an event for Flume (EventBuilder.withBody(payload, hdrs)), I put
>> a configurable number of log lines in the payload, usually 200 lines of
>> log data.  Ultimately I went away from text data altogether and serialized
>> 200 Avro "log objects" as an Avro data file byte array, and that was my
>> payload.  (See the sketch after this list.)
>>
>> 2.  Keep your batch size large.  I set mine to 50 - so 10,000 log lines
>> (or objects) in a single batch.
>>
>> 3.  You will get duplicates, so be prepared either to customize Flume to
>> prevent duplicates (our solution) or to write MapReduce jobs to remove
>> them.
>>
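>> To make points 1 and 2 concrete, here is a minimal, hedged sketch of the
>> pattern in Java using Flume's EmbeddedAgent.  The sink host/port, header
>> names, and counts are illustrative assumptions, and it packs plain text
>> rather than Avro to keep it short:
>>
>>   import java.nio.charset.StandardCharsets;
>>   import java.util.ArrayList;
>>   import java.util.HashMap;
>>   import java.util.List;
>>   import java.util.Map;
>>
>>   import org.apache.flume.Event;
>>   import org.apache.flume.agent.embedded.EmbeddedAgent;
>>   import org.apache.flume.event.EventBuilder;
>>
>>   public class EmbeddedIngest {
>>     public static void main(String[] args) throws Exception {
>>       Map<String, String> conf = new HashMap<String, String>();
>>       conf.put("channel.type", "memory");
>>       conf.put("channel.capacity", "100000");
>>       conf.put("sinks", "sink1");
>>       conf.put("sink1.type", "avro");
>>       conf.put("sink1.hostname", "collector01");  // placeholder host
>>       conf.put("sink1.port", "4141");
>>       conf.put("processor.type", "default");
>>
>>       EmbeddedAgent agent = new EmbeddedAgent("ingest");
>>       agent.configure(conf);
>>       agent.start();
>>
>>       // Point 1: pack ~200 log lines into each event body.
>>       // Point 2: send 50 such events per putAll() call, so one
>>       // channel transaction covers ~10,000 lines.
>>       List<Event> batch = new ArrayList<Event>(50);
>>       for (int e = 0; e < 50; e++) {
>>         StringBuilder payload = new StringBuilder();
>>         for (int line = 0; line < 200; line++) {
>>           payload.append("log line ").append(line).append('\n');
>>         }
>>         Map<String, String> hdrs = new HashMap<String, String>();
>>         hdrs.put("lineCount", "200");  // header name is an assumption
>>         batch.add(EventBuilder.withBody(
>>             payload.toString().getBytes(StandardCharsets.UTF_8), hdrs));
>>       }
>>       agent.putAll(batch);
>>
>>       agent.stop();
>>     }
>>   }
>>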
>>
>> Regards,
>>
>> Mike
>>
>> ________________________________
>> From: Andrew Ehrlich [andrew@aehrlich.com]
>> Sent: Thursday, March 27, 2014 1:07 PM
>> To: user@flume.apache.org
>> Subject: Re: Fastest way to get data into flume?
>>
>> What about having more than one flume agent?
>>
>> You could have two agents that read the small messages and sink to HDFS,
>> or two agents that read the messages, serialize them, and send them to a
>> third agent which sinks them into HDFS.
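>>
>> A rough sketch of that second, tiered layout (the names, hosts, ports,
>> and paths below are all made up for illustration):
>>
>>   # tier-1 agent: reads messages, forwards over Avro RPC
>>   agent1.sources = src1
>>   agent1.channels = ch1
>>   agent1.sinks = fwd1
>>   agent1.sources.src1.type = thrift
>>   agent1.sources.src1.bind = 0.0.0.0
>>   agent1.sources.src1.port = 4435
>>   agent1.sources.src1.channels = ch1
>>   agent1.channels.ch1.type = file
>>   agent1.sinks.fwd1.type = avro
>>   agent1.sinks.fwd1.hostname = collector01
>>   agent1.sinks.fwd1.port = 4141
>>   agent1.sinks.fwd1.channel = ch1
>>
>>   # tier-2 agent: receives from tier 1, writes to HDFS
>>   agent2.sources = av1
>>   agent2.channels = ch1
>>   agent2.sinks = hdfs1
>>   agent2.sources.av1.type = avro
>>   agent2.sources.av1.bind = 0.0.0.0
>>   agent2.sources.av1.port = 4141
>>   agent2.sources.av1.channels = ch1
>>   agent2.channels.ch1.type = file
>>   agent2.sinks.hdfs1.type = hdfs
>>   agent2.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
>>   agent2.sinks.hdfs1.channel = ch1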
>>
>>
>> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <
>> chris@christopher-schneider.com>
>> wrote:
>> I have a fair bit of data continually being created in the form of
>> smallish messages (a few hundred bytes), which needs to enter Flume and
>> eventually sink into HDFS.
>>
>> I need to be sure that the data lands in persistent storage and won't be
>> lost, but otherwise throughput isn't important. It just needs to be fast
>> enough to not back up.
>>
>> I'm running into a bottleneck in the initial ingestion of data.
>>
>> I've tried the netcat source and the thrift source, but both have capped
>> out at a thousand or so records per second.
>>
>> Batching up the thrift API items into sets of 10 and using appendBatch
>> gives a pretty large speedup, but it's still not enough.
>>
>> Here's a gist of my Ruby test script, some example runs, and my config:
>>
>> https://gist.github.com/cschneid/9792305
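>>
>> For comparison, the equivalent batched append from Java via Flume's RPC
>> client would look roughly like this (a sketch - the host, port, and
>> counts are placeholders):
>>
>>   import java.nio.charset.StandardCharsets;
>>   import java.util.ArrayList;
>>   import java.util.List;
>>
>>   import org.apache.flume.Event;
>>   import org.apache.flume.api.RpcClient;
>>   import org.apache.flume.api.RpcClientFactory;
>>   import org.apache.flume.event.EventBuilder;
>>
>>   public class ThriftBatchTest {
>>     public static void main(String[] args) throws Exception {
>>       RpcClient client = RpcClientFactory.getThriftInstance("localhost", 4435);
>>       try {
>>         List<Event> batch = new ArrayList<Event>();
>>         for (int i = 0; i < 1000; i++) {
>>           batch.add(EventBuilder.withBody(
>>               ("message " + i).getBytes(StandardCharsets.UTF_8)));
>>         }
>>         // One round trip and one channel transaction for the whole batch.
>>         client.appendBatch(batch);
>>       } finally {
>>         client.close();
>>       }
>>     }
>>   }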
>>
>>
>> 1. Are there any obvious performance changes I can make to speed up
>> ingestion?
>> 2. How fast can Flume reasonably go? Should I switch my source to
>> something else that's faster? If so, what?
>> 3. Is there a better tool for this kind of task (rapid, safe ingestion of
>> small messages)?
>>
>> Thanks!
>> Chris
>>
>>
>>
>>
>
