flume-user mailing list archives

From Hari Shreedharan <hshreedharan@cloudera.com>
Subject Re: Need for UDP / Multicast Source
Date Wed, 16 Jan 2013 23:09:52 GMT
No, each sink will not consume the same data. Once data is taken from a channel and committed, only the sink that took it will see it. When a sink calls take, no other sink can access that data (though it is still in the channel) unless the transaction is rolled back (or, in the case of the FileChannel, the channel is restarted due to an agent restart or reconfig). If you have a sink processor, only one of the n sinks in the group is active at any one time (there is essentially one thread running the n sinks, polling whichever sink the sink processor selects). Without a sink processor, each sink gets its own sink runner thread.
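As a minimal sketch of the behavior described above (agent, channel, and sink names are placeholders, not taken from the thread): two sinks pointed at the same channel with no sink group each get their own runner thread and drain the channel in parallel, while putting them in a sink group with a processor leaves only one of them active at a time.

    # Two sinks draining the same channel; with no sink group defined,
    # each sink gets its own sink runner thread.
    agent.sinks = k1 k2
    agent.sinks.k1.channel = c1
    agent.sinks.k2.channel = c1

    # Alternative: grouping the sinks under a sink processor, so only
    # one of them is polled at any given time.
    agent.sinkgroups = g1
    agent.sinkgroups.g1.sinks = k1 k2
    agent.sinkgroups.g1.processor.type = load_balance

For draining one channel in parallel, the first form (no sink group) is the one that adds throughput.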
 


Hari 

-- 
Hari Shreedharan


On Wednesday, January 16, 2013 at 3:03 PM, Andrew Otto wrote:

> Ok, thanks. Quick Q: Won't each sink consume the same data? Do I need to set up the load balancing sink processor to keep that from happening?
> 
> 
> On Jan 16, 2013, at 5:47 PM, Hari Shreedharan <hshreedharan@cloudera.com> wrote:
> > Also, can you try adding more HDFS sinks reading from the same channel? I'd recommend using different file prefixes or paths for each sink to avoid collisions. Since each sink really has just one thread driving it, adding multiple sinks might help. Also, keep an eye on the memory channel's size and see if it is filling up (there will be ChannelExceptions in the logs if it is).
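A sketch of the multi-sink layout Hari suggests, assuming made-up agent, channel, and path names; the HDFS sink properties (hdfs.path, hdfs.filePrefix) are standard, but the values are illustrative only:

    # Two HDFS sinks reading from the same channel in parallel,
    # with different file prefixes so their output files do not collide.
    agent.sinks = hdfs1 hdfs2

    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.channel = c1
    agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/webrequest
    agent.sinks.hdfs1.hdfs.filePrefix = webrequest-1

    agent.sinks.hdfs2.type = hdfs
    agent.sinks.hdfs2.channel = c1
    agent.sinks.hdfs2.hdfs.path = hdfs://namenode/flume/webrequest
    agent.sinks.hdfs2.hdfs.filePrefix = webrequest-2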
> > 
> > 
> > Hari 
> > 
> > -- 
> > Hari Shreedharan
> > 
> > 
> > On Wednesday, January 16, 2013 at 2:34 PM, Brock Noland wrote:
> > 
> > > Good to hear! Take five or six thread dumps of it and then send them our way.
> > > 
> > > > On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <otto@wikimedia.org> wrote:
> > > > Cool, thanks for the advice! That's a great blog post.
> > > > 
> > > > I've changed my ways (for now at least). I've got lots of disks to use once the memory channel is working, and this node has tooooons of memory (192G).
> > > > 
> > > > Here's my new flume.conf:
> > > > https://gist.github.com/4551513
> > > > 
> > > > This is doing better, for sure. Note that I took out the timestamp regex_extractor just in case that was impacting performance. I'm using the regular ol' timestamp interceptor now.
> > > > 
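For reference, the plain timestamp interceptor Andrew switched to needs only a type declaration (the agent and source names below are placeholders):

    agent.sources.udp1.interceptors = ts
    agent.sources.udp1.interceptors.ts.type = timestamp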
> > > > I'm still not doing so great though. I'm getting about 300 Mb per minute in my HDFS files. I should be getting about 3G per minute. That's better than before though. I've got 10% of the data this time, rather than 0.14% :)
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Jan 16, 2013, at 4:36 PM, Brock Noland <brock@cloudera.com> wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I would use the memory channel for now as opposed to the file channel. For the file channel to keep up with that, you'd need multiple disks. Also, your checkpoint period is super low, which will cause lots of checkpoints and slow things down.
> > > > > 
> > > > > However, I think the biggest issue is probably batch size. With that much data you are likely going to want a large batch size for all components involved, something like a low multiple of 1000. There is a good article on this:
> > > > > https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
> > > > > 
> > > > > To re-cap, I would:
> > > > > 
> > > > > Use the memory channel for now. Once you prove things work, you can work on tuning the file channel (it will need larger batch sizes and multiple disks).
> > > > > 
> > > > > Increase the batch size for your source and sink.
> > > > > 
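A rough sketch of the kind of settings Brock is talking about; the numbers are illustrative, not tuned recommendations, and the component names are placeholders:

    # Memory channel sized comfortably above the batch sizes in use.
    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 100000
    agent.channels.c1.transactionCapacity = 10000

    # HDFS sink writing a few thousand events per transaction.
    agent.sinks.hdfs1.hdfs.batchSize = 5000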
> > > > > On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <otto@wikimedia.org> wrote:
> > > > > > Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream. This is available to me via UDP multicast. Everything seems to be working great, except that I seem to be missing a lot of data.
> > > > > > 
> > > > > > Our webrequest log stream consists of about 100000 events per second, which amounts to around 50 Mb per second.
> > > > > > 
> > > > > > I understand that this is probably too much for a single node to handle, but I should be able to either see most of the data written to HDFS, or at least see errors about channels being filled to capacity. HDFS files are set to roll every 60 seconds. Each of my files is only about 4.2MB, which is only 72 Kb per second. That's only 0.14% of the data I'm expecting to consume.
> > > > > > 
> > > > > > Where did the rest of it go? If Flume is dropping it, why doesn't it tell me!?
> > > > > > 
> > > > > > Here's my flume.conf:
> > > > > > 
> > > > > > https://gist.github.com/4551001
> > > > > > 
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Jan 15, 2013, at 2:31 PM, Andrew Otto <otto@wikimedia.org> wrote:
> > > > > > 
> > > > > > > I just submitted the patch on https://issues.apache.org/jira/browse/FLUME-1838.
> > > > > > > 
> > > > > > > Would love some reviews, thanks!
> > > > > > > -Andrew
> > > > > > > 
> > > > > > > 
> > > > > > > On Jan 14, 2013, at 1:01 PM, Andrew Otto <otto@wikimedia.org> wrote:
> > > > > > > 
> > > > > > > > Thanks guys! I've opened up a JIRA here:
> > > > > > > > 
> > > > > > > > https://issues.apache.org/jira/browse/FLUME-1838
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz <wget.null@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > > Hey Andrew,
> > > > > > > > > 
> > > > > > > > > For your reference, we have a lot of developer information in our wiki:
> > > > > > > > > 
> > > > > > > > > https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
> > > > > > > > > https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
> > > > > > > > > 
> > > > > > > > > cheers,
> > > > > > > > > Alex
> > > > > > > > > 
> > > > > > > > > On Jan 14, 2013, at 6:37 PM, Hari Shreedharan <hshreedharan@cloudera.com> wrote:
> > > > > > > > > 
> > > > > > > > > > Hi Andrew,
> > > > > > > > > > 
> > > > > > > > > > Really happy to hear the Wikimedia Foundation is considering Flume. I am fairly sure that if you find such a source useful, there would definitely be others who find it useful too. I'd recommend filing a JIRA and starting a discussion, and then submitting the patch. We would be happy to review and commit it.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Hari
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Hari Shreedharan
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
> > > > > > > > > > 
> > > > > > > > > > > Hi all,
> > > > > > > > > > > 
> > > > > > > > > > > I'm a Systems Engineer at the Wikimedia Foundation, and we're investigating using Flume for our web request log HDFS imports. We've previously been using Kafka, but have had to change short-term architecture plans in order to get data into HDFS reliably and regularly soon.
> > > > > > > > > > > 
> > > > > > > > > > > Our current web request logs are available for consumption over a multicast UDP stream. I could hack something together to try and pipe this into Flume using the existing sources (SyslogUDPSource, or maybe some combination of socat + NetcatSource), but I'd rather reduce the number of moving parts. I'd like to consume directly from the multicast UDP stream as a Flume source.
> > > > > > > > > > > 
> > > > > > > > > > > I coded up a proof of concept based on the SyslogUDPSource, mainly just stripping out the syslog event header extraction and adding in multicast Datagram connection code. I plan on cleaning this up and making this a generic raw UDP source, with multicast being a configuration option.
> > > > > > > > > > > 
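For comparison, the existing syslog UDP source Andrew mentions as a possible stopgap is configured roughly as below; the second block is a purely hypothetical sketch of how the proposed generic UDP source might be configured, since at this point it exists only as Andrew's proof of concept (the class name and the multicast property are invented for illustration):

    # Existing syslog UDP source (unicast), one of the stopgap options mentioned.
    agent.sources.syslog.type = syslogudp
    agent.sources.syslog.host = 0.0.0.0
    agent.sources.syslog.port = 5140
    agent.sources.syslog.channels = c1

    # Hypothetical configuration for the proposed raw UDP source with
    # multicast support; class name and property names are invented.
    agent.sources.udp1.type = org.example.flume.source.UDPSource
    agent.sources.udp1.host = 239.192.0.1
    agent.sources.udp1.port = 5140
    agent.sources.udp1.multicast = true
    agent.sources.udp1.channels = c1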
> > > > > > > > > > > My question to you guys is, is this something the Flume community would find useful? If so, should I open up a JIRA to track this? I've got a fork of the Flume git repo over on GitHub and will be doing my work there. I'd love to share it upstream if it would be useful.
> > > > > > > > > > > 
> > > > > > > > > > > Thanks!
> > > > > > > > > > > -Andrew Otto
> > > > > > > > > > > Systems Engineer
> > > > > > > > > > > Wikimedia Foundation
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > Alexander Alten-Lorenz
> > > > > > > > > http://mapredit.blogspot.com
> > > > > > > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > > -- 
> > > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
> > > 
> > 
> > 
> 

