flume-user mailing list archives

From Andrew Otto <o...@wikimedia.org>
Subject Re: Need for UDP / Multicast Source
Date Wed, 16 Jan 2013 21:22:39 GMT
OK, I'm trying my new UDPSource with Wikimedia's webrequest log stream, which is available
to me via UDP multicast.  Everything appears to be working great, except that I seem to be
missing a lot of data.

Our webrequest log stream consists of about 100,000 events per second, which amounts to around
50 MB per second.

I understand that this is probably too much for a single node to handle, but I should be able
either to see most of the data written to HDFS, or at least to see errors about channels being
filled to capacity.  HDFS files are set to roll every 60 seconds.  Each of my files is only
about 4.2 MB, which works out to roughly 72 KB per second.  That's only 0.14% of the data I'm
expecting to consume.
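
For context, the core of my UDPSource is just a multicast receive loop that wraps each
datagram in a Flume event, with no syslog header parsing.  Here's a simplified, self-contained
sketch of that loop (the group address and port are placeholders, and the real source extends
AbstractSource and hands events to its ChannelProcessor rather than running a main method):

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.util.Arrays;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class UdpMulticastReceiveLoop {
    public static void main(String[] args) throws Exception {
        // Placeholder group/port; the real values come from flume.conf.
        InetAddress group = InetAddress.getByName("233.58.59.1");
        MulticastSocket socket = new MulticastSocket(8420);
        socket.joinGroup(group);

        byte[] buf = new byte[1 << 16]; // large enough for any UDP datagram
        while (true) {
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet); // blocks until a datagram arrives

            // One datagram becomes one Flume event; copy only the bytes received.
            Event event = EventBuilder.withBody(
                Arrays.copyOf(packet.getData(), packet.getLength()));
            // In the real source: getChannelProcessor().processEvent(event);
        }
    }
}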

Where did the rest of it go?  If Flume is dropping it, why doesn't it tell me!?

Here's my flume.conf:

https://gist.github.com/4551001
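
In case the gist doesn't load, the config is roughly the following shape.  Treat it as a
from-memory sketch rather than the exact file; the agent name, source class name, multicast
address, and HDFS path are placeholders:

agent1.sources  = udp-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# The new UDP/multicast source from FLUME-1838 (class name is a placeholder).
agent1.sources.udp-source.type = org.apache.flume.source.UDPSource
agent1.sources.udp-source.host = 233.58.59.1
agent1.sources.udp-source.port = 8420
agent1.sources.udp-source.channels = mem-channel

agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 100000
agent1.channels.mem-channel.transactionCapacity = 1000

# Roll HDFS files purely on a 60-second interval (size/count triggers disabled).
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-channel
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/webrequest
agent1.sinks.hdfs-sink.hdfs.rollInterval = 60
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream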


Thanks!




On Jan 15, 2013, at 2:31 PM, Andrew Otto <otto@wikimedia.org> wrote:

> I just submitted the patch on https://issues.apache.org/jira/browse/FLUME-1838.
> 
> Would love some reviews, thanks!
> -Andrew
> 
> 
> On Jan 14, 2013, at 1:01 PM, Andrew Otto <otto@wikimedia.org> wrote:
> 
>> Thanks guys!  I've opened up a JIRA here:
>> 
>> https://issues.apache.org/jira/browse/FLUME-1838
>> 
>> 
>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz <wget.null@gmail.com> wrote:
>> 
>>> Hey Andrew,
>>> 
>>> For your reference, we have a lot of developer information in our wiki:
>>> 
>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>> 
>>> cheers,
>>> Alex
>>> 
>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan <hshreedharan@cloudera.com> wrote:
>>> 
>>>> Hi Andrew, 
>>>> 
>>>> Really happy to hear the Wikimedia Foundation is considering Flume.  I am fairly sure
>>>> that if you find such a source useful, there will definitely be others who find it
>>>> useful too.  I'd recommend filing a JIRA and starting a discussion, and then submitting
>>>> the patch.  We would be happy to review and commit it.
>>>> 
>>>> 
>>>> Thanks,
>>>> Hari
>>>> 
>>>> -- 
>>>> Hari Shreedharan
>>>> 
>>>> 
>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're investigating using
>>>>> Flume for our web request log HDFS imports.  We've previously been using Kafka, but
>>>>> have had to change our short-term architecture plans in order to get data into HDFS
>>>>> reliably and regularly in the near term.
>>>>> 
>>>>> Our current web request logs are available for consumption over a multicast UDP
>>>>> stream.  I could hack something together to try to pipe this into Flume using the
>>>>> existing sources (SyslogUDPSource, or maybe some combination of socat + NetcatSource),
>>>>> but I'd rather reduce the number of moving parts.  I'd like to consume directly from
>>>>> the multicast UDP stream as a Flume source.
>>>>> 
>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly just stripping out
>>>>> the syslog event header extraction and adding in the multicast datagram connection
>>>>> code.  I plan on cleaning this up and making it a generic raw UDP source, with
>>>>> multicast as a configuration option.
>>>>> 
>>>>> My question to you guys is: is this something the Flume community would find useful?
>>>>> If so, should I open up a JIRA to track this?  I've got a fork of the Flume git repo
>>>>> over on GitHub and will be doing my work there.  I'd love to share it upstream if it
>>>>> would be useful.
>>>>> 
>>>>> Thanks!
>>>>> -Andrew Otto
>>>>> Systems Engineer
>>>>> Wikimedia Foundation
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Alexander Alten-Lorenz
>>> http://mapredit.blogspot.com
>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>> 
>> 
> 

