flume-user mailing list archives

From Jimmy <jimmyj...@gmail.com>
Subject Re: Fastest way to get data into flume?
Date Thu, 27 Mar 2014 19:40:06 GMT
I know I'm derailing a bit, but scaling Flume and HDFS in a single VM is ...
well, I guess I understand why, but is it a good approach to try to squeeze
every bit out of a virtual machine sitting on your laptop, especially for
Hadoop/Flume?

Could you stand up a small cluster, e.g. in AWS, if you really want to do
high-volume perf testing? That should be a very simple task with Whirr or CM or ...


On Thu, Mar 27, 2014 at 12:34 PM, Jeff Lord <jlord@cloudera.com> wrote:

> Increase your batch sizes
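>
> (A rough illustration only - in a Flume properties file the sink-side batch
> size is the usual knob; the agent and component names below are made up.)
>
>     # hypothetical agent "a1" writing to HDFS; hdfs.batchSize defaults to 100
>     a1.sinks = k1
>     a1.sinks.k1.type = hdfs
>     a1.sinks.k1.channel = c1
>     a1.sinks.k1.hdfs.path = /flume/events
>     a1.sinks.k1.hdfs.batchSize = 10000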
>
>
> On Thu, Mar 27, 2014 at 12:29 PM, Chris Schneider <
> chris@christopher-schneider.com> wrote:
>
>> Thanks for all the great replies.
>>
>> My specific situation is a bit more complex than I let on initially.
>>
>> Flume running multiple agents will absolutely be able to scale to the
>> size we need for production.  But since our system is time-based, waiting
>> for real-world measurements to arrive, we have a simulation layer that
>> generates convincing data to push in for development & demos (i.e., it
>> creates events at 1000x accelerated time, so we can see the effects of our
>> changes without waiting weeks).
>>
>> So we have a VM (Vagrant + Virtualbox) running HDFS & Flume on our
>> laptops as we're doing development.  I suppose memory channel is fine in
>> this case, since it's all test data, but maximum single-agent speed is
>> needed to support the higher time accelerations I want.
>>
>> Unfortunately, our production system demands a horizontally scaling system
>> (Flume is great at that), and our dev environment would be best served by a
>> vertically scaling system (not really Flume's goal, from what I can tell).
>>
>> Are there any tricks / tweaks that can push single-agent speed up?
>> What's the fastest (maybe not 100% safe?) source type? Can we minimize the
>> cost of ACKing messages in the source?
>>
>>
>> On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <mkeane@conversantmedia.com> wrote:
>>
>>> I tried to do a proof of concept with the Netcat source on 1.3.0 or 1.3.1
>>> and it failed miserably.  I was able to make a change to improve its
>>> performance, arguably a bug fix (I think it was expecting a socket
>>> acknowledgement), but the Netcat source was still my bottleneck.
>>>
>>> Have you read the blogs on performance tuning? I'm not sure where you
>>> are in your Flume implementation, but I found them helpful:
>>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 &
>>> https://blogs.apache.org/flume/entry/apache_flume_filechannel
>>>
>>> Since you need persistent storage, I believe your only option is still
>>> the file channel.  To get the performance you need, you'll need dedicated
>>> disks for the queue and the write-ahead log - I had good luck with a solid
>>> state drive.  With a single disk drive, performance was awful.
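>>>
>>> (For reference, a minimal file channel sketch along those lines - the
>>> paths and capacities below are placeholders; checkpointDir and dataDirs
>>> are the standard properties.)
>>>
>>>     # hypothetical agent "a1"; keep checkpoint and data dirs on separate disks
>>>     a1.channels = c1
>>>     a1.channels.c1.type = file
>>>     a1.channels.c1.checkpointDir = /mnt/ssd/flume/checkpoint
>>>     a1.channels.c1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data
>>>     a1.channels.c1.capacity = 1000000
>>>     a1.channels.c1.transactionCapacity = 10000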
>>>
>>> To get the throughput I wanted with compression I had one source tied to
>>> 6 file channels with compression on each channel.  Perhaps there is a
>>> better way but that is how I got it working.
>>>
>>> We also configured forced write-back on the CentOS boxes serving as Flume
>>> agents.  That was an optimization our IT Operations team made that helped
>>> throughput.  It's not a skill I have, but I believe it does put you at risk
>>> of data loss if the server fails, because it does more caching before
>>> flushing to disk.
>>>
>>> We are currently fluming between 40 and 50 billion log lines per day
>>> (10-12 TB of data) from 14 servers in the "collector tier", sinking the data
>>> to 8 servers in the "storage tier" that write to HDFS (MapR's
>>> implementation) without problem.  We had no problem with half the servers;
>>> however, we configured failover and paired up the servers for this purpose,
>>> which by the way works flawlessly - we're able to pull one server out for
>>> maintenance and add it back in with no problem.
>>>
>>> Here are some high level points to our implementation.
>>>
>>> 1.  Instead of the netcat source I made use of the Embedded Agent - when I
>>> created an event for Flume (EventBuilder.withBody(payload, hdrs)) I put a
>>> configurable number of log lines in the payload, usually 200 lines of log
>>> data.  Ultimately I went away from text data altogether and serialized
>>> 200 avro "log objects" as an avro data file byte array, and that was my
>>> payload.  (A rough sketch of this approach follows after point 3.)
>>>
>>> 2.  Keep your batch size large.  I set mine to 50 - so 10,000 log lines
>>> (or objects) in a single batch.
>>>
>>> 3.  You will get duplicates, so be prepared to either customize Flume to
>>> prevent duplicates (our solution) or write MapReduce jobs to remove
>>> duplicates.
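>>>
>>> (A rough, hedged sketch of point 1 above - EmbeddedAgent, EventBuilder and
>>> putAll are the actual Flume APIs, but the configuration values, host name
>>> and batch size here are placeholders.)
>>>
>>> import java.util.ArrayList;
>>> import java.util.HashMap;
>>> import java.util.List;
>>> import java.util.Map;
>>>
>>> import org.apache.flume.Event;
>>> import org.apache.flume.agent.embedded.EmbeddedAgent;
>>> import org.apache.flume.event.EventBuilder;
>>>
>>> public class EmbeddedAgentSketch {
>>>     public static void main(String[] args) throws Exception {
>>>         // Minimal embedded-agent config: memory channel + one avro sink.
>>>         Map<String, String> conf = new HashMap<String, String>();
>>>         conf.put("channel.type", "memory");
>>>         conf.put("channel.capacity", "100000");
>>>         conf.put("sinks", "sink1");
>>>         conf.put("sink1.type", "avro");
>>>         conf.put("sink1.hostname", "collector-host");  // placeholder
>>>         conf.put("sink1.port", "4141");                // placeholder
>>>         conf.put("processor.type", "default");
>>>
>>>         EmbeddedAgent agent = new EmbeddedAgent("demo-agent");
>>>         agent.configure(conf);
>>>         agent.start();
>>>
>>>         // Pack ~200 log lines into one event body, as described above.
>>>         StringBuilder payload = new StringBuilder();
>>>         for (int i = 0; i < 200; i++) {
>>>             payload.append("log line ").append(i).append('\n');
>>>         }
>>>         Map<String, String> hdrs = new HashMap<String, String>();
>>>         List<Event> batch = new ArrayList<Event>();
>>>         batch.add(EventBuilder.withBody(payload.toString().getBytes(), hdrs));
>>>         agent.putAll(batch);
>>>
>>>         agent.stop();
>>>     }
>>> }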
>>>
>>>
>>> Regards,
>>>
>>> Mike
>>>
>>> ________________________________
>>> From: Andrew Ehrlich [andrew@aehrlich.com]
>>> Sent: Thursday, March 27, 2014 1:07 PM
>>> To: user@flume.apache.org
>>> Subject: Re: Fastest way to get data into flume?
>>>
>>> What about having more than one flume agent?
>>>
>>> You could have two agents that read the small messages and sink to HDFS,
>>> or two agents that read the messages, serialize them, and send them to a
>>> third agent which sinks them into HDFS.
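>>>
>>> (Purely illustrative - a bare-bones sketch of that second layout in Flume
>>> properties form, with invented agent names, hosts and ports: a reader agent
>>> forwarding over Avro RPC to a second agent that writes to HDFS.)
>>>
>>>     # hypothetical "reader" agent forwarding over Avro RPC
>>>     reader.sinks.fwd.type = avro
>>>     reader.sinks.fwd.hostname = collector-host
>>>     reader.sinks.fwd.port = 4141
>>>     reader.sinks.fwd.channel = c1
>>>
>>>     # hypothetical "collector" agent receiving Avro and writing to HDFS
>>>     collector.sources.in.type = avro
>>>     collector.sources.in.bind = 0.0.0.0
>>>     collector.sources.in.port = 4141
>>>     collector.sources.in.channels = c1
>>>     collector.sinks.out.type = hdfs
>>>     collector.sinks.out.hdfs.path = /flume/events
>>>     collector.sinks.out.channel = c1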
>>>
>>>
>>> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <
>>> chris@christopher-schneider.com> wrote:
>>> I have a fair bit of data continually being created in the form of
>>> smallish messages (a few hundred bytes), which needs to enter flume, and
>>> eventually sink into HDFS.
>>>
>>> I need to be sure that the data lands in persistent storage and won't be
>>> lost, but otherwise throughput isn't important. It just needs to be fast
>>> enough to not back up.
>>>
>>> I'm running into a bottleneck in the initial ingestion of data.
>>>
>>> I've tried the netcat source and the thrift source, but both have capped
>>> out at a thousand or so records per second.
>>>
>>> Batching up the thrift api items into sets of 10 and using appendBatch
>>> is a pretty large speedup, but still not enough.
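>>>
>>> (For comparison - a minimal, hedged sketch of the same batching idea with
>>> Flume's Java RPC client; the host, port and batch size are placeholders.)
>>>
>>> import java.nio.charset.StandardCharsets;
>>> import java.util.ArrayList;
>>> import java.util.List;
>>>
>>> import org.apache.flume.Event;
>>> import org.apache.flume.api.RpcClient;
>>> import org.apache.flume.api.RpcClientFactory;
>>> import org.apache.flume.event.EventBuilder;
>>>
>>> public class ThriftBatchSender {
>>>     public static void main(String[] args) throws Exception {
>>>         // Thrift RPC client pointed at a ThriftSource (placeholder host/port).
>>>         RpcClient client = RpcClientFactory.getThriftInstance("localhost", 4353);
>>>         try {
>>>             List<Event> batch = new ArrayList<Event>();
>>>             for (int i = 0; i < 1000; i++) {
>>>                 byte[] body = ("message " + i).getBytes(StandardCharsets.UTF_8);
>>>                 batch.add(EventBuilder.withBody(body));
>>>                 if (batch.size() == 100) {   // send in batches of 100
>>>                     client.appendBatch(batch);
>>>                     batch.clear();
>>>                 }
>>>             }
>>>             if (!batch.isEmpty()) {
>>>                 client.appendBatch(batch);
>>>             }
>>>         } finally {
>>>             client.close();
>>>         }
>>>     }
>>> }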
>>>
>>> Here's a gist of my Ruby test script, some example runs, and my
>>> config.
>>>
>>> https://gist.github.com/cschneid/9792305
>>>
>>>
>>> 1. Are there any obvious performance changes I can make to speed up
>>> ingestion?
>>> 2. How fast can Flume reasonably go? Should I switch my source to
>>> something else that's faster? If so, what?
>>> 3. Is there a better tool for this kind of task (rapid, safe ingestion of
>>> small messages)?
>>>
>>> Thanks!
>>> Chris
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
