flume-user mailing list archives

From Jagadish Bihani <jagadish.bih...@pubmatic.com>
Subject Efficient way of handling of duplicates/Deduplication strategy
Date Mon, 24 Dec 2012 07:49:09 GMT

I was thinking of two possible approaches for this:

Approach 1. Deduplication at the destination, using a spooling directory
source - file channel - HDFS sink combination:
-- After the HDFS sink has written to the HDFS directory, we can run a
map-reduce job that deduplicates the events based on some unique key and
dumps them into another HDFS directory. The problem with this approach is
that deduplication depends on the duration of the failure.
For example: batch A is written to HDFS, but the source agent is killed
before the complete file is read and renamed, and the agent restarts only
after a long time (say 8 hours). Batch A is then duplicated 8 hours later,
and the destination deduplication has no way of knowing that. If it runs,
say, hourly, it will happily move the initially received batch A to the
main HDFS directory, and the final result will contain duplicates.
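The reduce-side logic of the dedup job above can be sketched as follows (a minimal illustration outside MapReduce; the `(event_id, payload)` tuple shape and `key_fn` are assumptions, not Flume APIs):

```python
def deduplicate(events, key_fn):
    """Keep only the first event seen for each unique key."""
    seen = set()
    out = []
    for event in events:
        k = key_fn(event)
        if k not in seen:
            seen.add(k)
            out.append(event)
    return out

# Batch A is delivered once, then replayed 8 hours later after the
# agent restart; keying on the unique event id collapses the replay.
batch_a = [("id-1", "payload-1"), ("id-2", "payload-2")]
replayed = list(batch_a)
merged = deduplicate(batch_a + replayed, key_fn=lambda e: e[0])
```

The catch described above remains: this only works if both copies of batch A land in the same dedup window, which an hourly job does not guarantee.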

Can we somehow use Flume headers or any other approach to solve the 
above problem scenario ?
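On the header question: one option would be tagging each event with a unique ID at the source so the downstream dedup job has a stable key to work with. If your Flume build ships the morphline UUID interceptor, a config sketch could look like this (agent, source, and header names here are placeholders, and the interceptor's availability depends on your Flume version):

```properties
agent1.sources.src1.interceptors = uuid
agent1.sources.src1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent1.sources.src1.interceptors.uuid.headerName = eventId
agent1.sources.src1.interceptors.uuid.preserveExisting = false
```

This addresses the keying problem but not the windowing problem: the dedup job still has to see both copies of a replayed batch to drop one.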

Approach 2. Using an external RPC client - avro source - file channel -
HDFS sink:
-- The external RPC client uses the SpoolingFileReader API and also
maintains an offset for each file, updated after every batch is
successfully written. Whenever the agent fails and restarts, the file's
offset is read back from the directory and the file is read from that
offset rather than from the beginning.
Thus this approach duplicates at most one batch per failure.
(There are scenarios where multiple failures lead to duplicating more
than one batch at a time, but for now I am assuming that is negligible.)
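The offset bookkeeping described above can be sketched like this (a minimal, self-contained illustration; the file layout, function names, and the sidecar `.offset` file are assumptions — a real client would read via SpoolingFileReader and send over Flume's Avro RPC client API):

```python
import os
import tempfile

def read_next_batch(data_path, offset_path, batch_size=3):
    """Resume from the last committed offset and read up to one batch."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    with open(data_path) as f:
        f.seek(offset)
        lines = [f.readline() for _ in range(batch_size)]
        batch = [line for line in lines if line]
        new_offset = f.tell()
    return batch, new_offset

def commit_offset(offset_path, offset):
    # Commit only after the RPC send succeeded; a crash between send and
    # commit re-sends at most this one batch on restart.
    with open(offset_path, "w") as f:
        f.write(str(offset))

# Demo: one batch is sent and committed, then the agent "restarts" and
# resumes from the persisted offset instead of the beginning of the file.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "events.log")
offset_path = os.path.join(workdir, "events.log.offset")
with open(data_path, "w") as f:
    f.write("".join("event-%d\n" % i for i in range(5)))

batch1, off1 = read_next_batch(data_path, offset_path, batch_size=3)
commit_offset(offset_path, off1)  # send succeeded, persist progress
batch2, off2 = read_next_batch(data_path, offset_path, batch_size=3)
```

If the crash happens after the send but before `commit_offset`, the restarted client re-reads from the stale offset and duplicates exactly that one in-flight batch, which matches the "at most one batch per failure" bound above.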

Does this sound like a good approach?

