flume-user mailing list archives

From Andrew Otto <o...@wikimedia.org>
Subject Re: Need for UDP / Multicast Source
Date Thu, 17 Jan 2013 16:26:34 GMT
Ok, I'm still struggling with this a bit.  Here's what I've currently got going.

In order to make it easier to check what I am and am not receiving, I've narrowed the logs
that I store in HDFS down to those originating from a single host (cp1044.wikimedia.org).
 Each host generates contiguous sequence numbers for each log line.  I can use the sequence
number to make sure I'm not missing lines from a host.
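
As a rough sketch of that check (the `find_gaps` helper is hypothetical and not part of Flume; it assumes the sequence numbers have already been parsed into integers):

```python
def find_gaps(seqs):
    """Return inclusive (first_missing, last_missing) ranges absent from seqs."""
    ordered = sorted(set(seqs))  # de-duplicate and order the sequence numbers
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:  # a jump larger than 1 means lines are missing
            gaps.append((prev + 1, cur - 1))
    return gaps


# Made-up sequence numbers, for illustration only
print(find_gaps([100, 101, 102, 105, 106, 110]))  # [(103, 104), (107, 109)]
```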

On another nearby node, I started a process to store all of the log lines originating from
cp1044.  I then started the Flume agent and waited 3 minutes for it to roll files 3
times.  I currently have 4 HDFS sinks going, so this created a total of 12 files.  I got the
files out of HDFS, and then sorted on their sequence numbers to get the first and last sequence
number in this set of files.

I took those two border sequence numbers and extracted all of the log lines generated by cp1044
on the nearby host (not using Flume).  I should be able to compare the number of lines here
with the number of lines in the 12 files I extracted from HDFS and Flume.  If they are the
same, then Flume and UDPSource are working!
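
The comparison could be sketched roughly like this (a hypothetical `compare_counts` helper, assuming both sets of sequence numbers are available as integers; the real extraction from HDFS and from the reference capture is not shown):

```python
def compare_counts(flume_seqs, reference_seqs):
    """Count reference lines that fall between Flume's first and last sequence number."""
    first, last = min(flume_seqs), max(flume_seqs)  # the two border sequence numbers
    expected = sum(1 for s in reference_seqs if first <= s <= last)
    return len(flume_seqs), expected


# Made-up data: Flume kept 3 lines, the reference capture has 6 in the same range
stored, expected = compare_counts([3, 5, 8], range(1, 11))
print(stored, expected)  # 3 6
```

If the two numbers match, nothing was dropped inside that window.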

Flume saved 19451 events to HDFS, and the number of raw events recorded outside of Flume and
HDFS was 78176.  I'm up to about 25% of the data!  Better but still not good enough. :(

This was for about 3 minutes of data, so for a single host, this shouldn't be more than 500
events per second.  I must be doing something really wrong on the Flume tweaky side of things,
eh?  Any more ideas?
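
For reference, the arithmetic behind those figures, assuming the window was exactly 180 seconds (the message only says "about 3 minutes"):

```python
flume_events = 19451      # events Flume wrote to HDFS
reference_events = 78176  # events captured outside of Flume
window_seconds = 180      # assumption: "about 3 minutes" taken as exactly 180 s

capture_ratio = flume_events / reference_events
source_rate = reference_events / window_seconds

print(f"captured: {capture_ratio:.1%}")            # roughly a quarter of the data
print(f"source rate: {source_rate:.0f} events/s")  # under the 500 events/s estimate
```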


P.S.  YOU GUYS ARE SO HELPFUL.  Thanks so much for everything thus far.

On Jan 17, 2013, at 10:34 AM, Andrew Otto <otto@wikimedia.org> wrote:

>> with UDP there's no guarantee that the data will reach destination.
> True, but I'm experimenting with using Flume as a replacement for a system that is already
> in place.  I actually got the numbers I listed below by grabbing data directly off of the
> UDP stream and saving them to a file on local disk.  It's possible that UDP data is getting
> lost in the network somewhere, but if that were the case I wouldn't know about it.  I am comparing
> Flume's performance to a single process writing to a local disk.
