flume-user mailing list archives

From Mike Percy <mpe...@apache.org>
Subject Re: New blog post on Flume performance tuning
Date Fri, 11 Jan 2013 22:27:06 GMT
Hi Simon,
There is no good way that I am aware of for Flume to dedup messages. This
is because there is no abstraction for doing pairwise comparison of events,
and, as you scale up, maintaining some kind of hash table of processed
events generally becomes prohibitive or makes it not worth the effort at
the streaming layer.

The most straightforward way to dedup Flume events is to tag them with some
kind of unique ID at event creation time. Then you can dedup with a
MapReduce job (in the case of writing to HDFS) or by making your operations
idempotent (in the case, for example, of writing keys to HBase).
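As a rough illustration of the two approaches above, here is a minimal Python sketch. It does not use any Flume, MapReduce, or HBase API: the `event_id` header name, the dict-based event layout, and the helper functions are all hypothetical, and a plain dict stands in for a keyed store such as HBase. Note that the ID has to be identical on every copy of an event, so it must be assigned once, before the event is duplicated.

```python
import uuid

def create_event(body):
    """Build an event and tag it with a unique ID at creation time.
    The "event_id" header name is an assumption made for this sketch."""
    return {"headers": {"event_id": str(uuid.uuid4())}, "body": body}

def dedup_batch(events):
    """Batch-style dedup, roughly what a MapReduce job would do:
    group by ID and keep the first event seen for each ID."""
    first_seen = {}
    for event in events:
        first_seen.setdefault(event["headers"]["event_id"], event)
    return list(first_seen.values())

def idempotent_put(store, event):
    """Idempotent write: the event ID is the key, so writing the same
    event twice leaves the store in the same state as writing it once."""
    store[event["headers"]["event_id"]] = event["body"]

# Two distinct events; the first one is delivered twice.
e1 = create_event("link down on port 3")
e2 = create_event("link up on port 3")
delivered = [e1, e2, e1]

unique = dedup_batch(delivered)   # batch dedup after the fact

store = {}                        # stand-in for a keyed store
for event in delivered:
    idempotent_put(store, event)  # duplicates overwrite themselves
```

Either way, the streaming layer does not need to remember which events it has already seen; the work moves to a later batch job or to the keyed store.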

Regards,
Mike



On Fri, Jan 11, 2013 at 12:59 PM, Xu (Simon) Chen <xchenum@gmail.com> wrote:

> Great post, Mike!
>
> One question if you can either address via mailing list or future posts...
>
> I am curious about how to remove duplicated messages in this flow. For
> example, when I set up a switch/router to send syslog messages, I'd
> like to send them to two syslog collectors or two flume agents. In this case,
> the switch/router is just a dumb device, not knowing how to fail-over
> or load-balance. As a result, we have two copies of the same message
> going into flume.
>
> I have seen people describing doing hbase operations to remove
> duplicates, but I am wondering if we can do anything in the flume
> infrastructure.
>
> Thanks.
> -Simon
>
> On Fri, Jan 11, 2013 at 3:48 PM, Mohammad Tariq <dontariq@gmail.com>
> wrote:
> > +1
> >
> > Thank you so much Mike, for all the good work.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> >
> >
> > On Sat, Jan 12, 2013 at 2:15 AM, Mike Percy <mpercy@apache.org> wrote:
> >>
> >> Thanks Brock! I've been working on this, off and on, for a while. :)
> >>
> >>
> >> On Fri, Jan 11, 2013 at 12:18 PM, Brock Noland <brock@cloudera.com> wrote:
> >>>
> >>> Nice post!
> >>>
> >>> On Fri, Jan 11, 2013 at 12:13 PM, Mike Percy <mpercy@apache.org> wrote:
> >>> > Hi folks,
> >>> > I just posted to the Apache blog on how to do performance tuning with Flume.
> >>> > I plan on following it up with a post about using the Flume monitoring
> >>> > capabilities while tuning. Feedback is welcome.
> >>> >
> >>> > https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
> >>> >
> >>> > Regards,
> >>> > Mike
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Apache MRUnit - Unit testing MapReduce -
> >>> http://incubator.apache.org/mrunit/
> >>
> >>
> >
>
