flume-user mailing list archives

From Majid Alfifi <majid.alf...@gmail.com>
Subject Re: About duplicate events and how to deal with them in Flume with interceptors.
Date Fri, 07 Aug 2015 11:20:30 GMT
It's not clear if you are referring to duplicates that result from the source or duplicates
that result from Flume itself trying to maintain the at-least-once delivery of events.

I had a case where the source was producing duplicates, but the network bandwidth was almost
fully utilized by the regular de-duplicated stream, so we couldn't afford to have duplicates
travel all the way to the final destination (HDFS in our case). We ultimately just used a
CircularFifoQueue in a Flume interceptor. It was a good fit because in our case all duplicates
arrived within a window of about 30 seconds. We were receiving about 600 events per second, so a
CircularFifoQueue of size 18,000 (600 events/s x 30 s) was an easy way to remove duplicates, at
the expense of having a single Flume agent do the de-duplication (a SPOF).
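
As a rough illustration of the bounded-window idea described above (a sketch under assumptions, not the code used in this deployment): the class below remembers the digests of the most recent N distinct event bodies and reports whether a new body has been seen inside that window. The class name DedupWindow and the method isDuplicate are hypothetical. It uses an ArrayDeque plus a HashSet from the JDK as a stand-in for commons-collections' CircularFifoQueue so that it has no external dependencies, and it hashes the event body with MD5 purely as a compact key. The Flume wiring is omitted.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayDeque;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers digests of the last `capacity` distinct event bodies. */
class DedupWindow {
    private final int capacity;
    // Insertion order of remembered digests; the oldest is evicted first (FIFO).
    private final ArrayDeque<String> order = new ArrayDeque<>();
    // Same digests as `order`, kept in a set for O(1) membership checks.
    private final Set<String> seen = new HashSet<>();

    DedupWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /**
     * Returns true if an identical body is still inside the window.
     * Otherwise records the body (evicting the oldest entry if full) and returns false.
     */
    synchronized boolean isDuplicate(byte[] body) {
        String key = digest(body);
        if (seen.contains(key)) {
            return true;
        }
        if (order.size() == capacity) {
            seen.remove(order.removeFirst());
        }
        order.addLast(key);
        seen.add(key);
        return false;
    }

    // MD5 is used only as a compact fingerprint here, not for security.
    private static String digest(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return Base64.getEncoder().encodeToString(md.digest(body));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With the numbers above, the window would be created as new DedupWindow(18_000). In a Flume interceptor, intercept(Event) could call isDuplicate(event.getBody()) and return null for duplicates, which is how an interceptor drops an event. The window lives in one agent's memory, so it is lost on restart and is not shared between agents.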

However, we still saw duplicates at the final destination, caused either by Flume's architecture
(its at-least-once delivery) or by occasional duplicates from the source that arrived more than
30 seconds apart, but they were a very small percentage of the data. We had a MapReduce job that
removed those remaining duplicates in HDFS.


> On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> Hi,
> I would like to remove duplicates in Flume with interceptors.
> The idea is to calculate an MD5 or similar hash for each event and store it in Redis or another
> database. I just want to check the performance cost and which is the best solution for
> dealing with it.
> As I understand it, the maximum number of events that could be duplicates depends on the
> batchSize, so you only need to store that number of keys in your database. I don't know if
> Redis has a feature like capped collections in Mongo.
> Has anyone done something similar and knows the performance cost? What would be
> the best place to store the keys for really fast access? Mongo, Redis, ...? I think
> that HBase or Cassandra could be worse, since Redis or similar could be on the same host
> as Flume, so you don't lose time on the network.
> Any other solution to deal with duplicates in real time?
