flume-user mailing list archives

From Majid Alfifi <majid.alf...@gmail.com>
Subject Re: About duplicate events and how to deal with them in Flume with interceptors.
Date Fri, 07 Aug 2015 11:59:28 GMT
Right, the CircularFifoQueue example won't work for your case.
What's the ultimate destination for the Flume events?


> On Aug 7, 2015, at 2:33 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> Thanks for the answer. I was talking more about possible failures of a Flume agent.
There's a tiny possibility of getting duplicates even when the source is not producing them.
It's true that they should be a really small percentage of the data size, but if the agent
crashes you could get duplicates when you start the agent again.
> I guess that you need a third player if you want to handle this case of duplicates, and
it's not possible to use a CircularFifoQueue in the same JVM as Flume; that's why I thought
about Redis or something similar. Ideally, that system should be independent of Flume and
have HA.
> 2015-08-07 13:20 GMT+02:00 Majid Alfifi <majid.alfifi@gmail.com>:
>> It's not clear if you are referring to duplicates that result from the source or
duplicates that result from Flume itself trying to maintain the at-least-once delivery of
>> I had a case where the source was producing duplicates but the network bandwidth
was almost fully utilized by the regular de-duplicated stream, so we couldn't afford to have
duplicates travel all the way to the final destination (HDFS in our case). We ultimately just
used a CircularFifoQueue in a Flume interceptor. It was a good fit because in our case all
duplicates arrived within a window of about 30 seconds. We were receiving about 600 events per
second, so a CircularFifoQueue of size 18,000 (600 events/s x 30 s), for example, was an easy
way to remove duplicates, but at the expense of having a single Flume agent do the
de-duplication (SPOF).
>> However, we still saw duplicates at the final destination that resulted from Flume's
architecture or from occasional duplicates that arrived more than 30 seconds apart from the
source, but they were a very small percentage of the data size. We had a MapReduce job that
removed those remaining duplicates in HDFS.
>> -Majid
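
[Editor's sketch] The bounded-window de-duplication described above could look roughly like the following. This is a minimal sketch and not the code used in the thread: it stands in for CircularFifoQueue with a plain `ArrayDeque` plus a `HashSet` so that it needs only the JDK, and the names `DedupWindow`, `firstSeen`, and `sizeFor` are hypothetical. Wiring it into a Flume interceptor is left out.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Remembers the most recent `capacity` keys and reports whether a key is new. */
final class DedupWindow {
    private final int capacity;
    private final ArrayDeque<String> order = new ArrayDeque<>(); // insertion order, oldest first
    private final Set<String> seen = new HashSet<>();            // fast membership check

    DedupWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Window size for a given rate and duplicate horizon, e.g. 600 events/s * 30 s. */
    static int sizeFor(int eventsPerSecond, int windowSeconds) {
        return Math.multiplyExact(eventsPerSecond, windowSeconds);
    }

    /** Returns true if the key is not in the window (keep the event), false for a duplicate. */
    synchronized boolean firstSeen(String key) {
        if (!seen.add(key)) {
            return false; // already inside the window: treat as a duplicate
        }
        order.addLast(key);
        if (order.size() > capacity) {
            seen.remove(order.removeFirst()); // evict the oldest key
        }
        return true;
    }
}
```

With `new DedupWindow(DedupWindow.sizeFor(600, 30))` the window holds 18,000 keys. The state lives in one JVM and is lost when it restarts, which is the limitation raised in the replies above.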
>> > On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I would like to remove duplicates in Flume with interceptors.
>> > The idea is to calculate an MD5 or similar hash for the event and store it in Redis
or another database. I just want to check the performance loss and which is the best
solution for dealing with it.
>> >
>> > As I understand it, the maximum number of events that could be duplicates depends
on the batchSize. So you only need to store that number of keys in your database. I don't
know if Redis has a feature like capped collections in Mongo.
>> >
>> > Has anyone done something similar and knows the performance loss? Which
would be the best place to store the keys for really fast access? Mongo, Redis, ...?
I think that HBase or Cassandra could be worse, since Redis or similar could be on the
same host as Flume, so you don't lose time on the network.
>> > Any other solution to deal with duplicates in realtime?
>> >
>> >
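
[Editor's sketch] The "MD5 of the event" key proposed in the original question could be computed as below. This is only an illustration under assumptions: it hashes the raw event body bytes (whether headers should be included is not stated in the thread), and the names `EventKeys` and `md5Hex` are hypothetical. Storing the key in Redis or another store is not shown.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Derives a fixed-length de-duplication key from an event body. */
final class EventKeys {
    private EventKeys() {}

    /** Returns the MD5 digest of the body as 32 lowercase hex characters. */
    static String md5Hex(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

MD5 is used here only as a fingerprint, not for security. Whatever store holds the keys needs some bound on its size; expiring keys after a time-to-live is one commonly used way to get that in a key-value store.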
