flume-user mailing list archives

From Sandeep Khurana <skhurana...@gmail.com>
Subject Flume duplicates
Date Wed, 05 Mar 2014 06:12:23 GMT
We have Flume ingestion servers in production that receive data from a Scribe source. The servers sit behind a load balancer. We observe a large number of duplicates (7-8 times the original events) when we:
a) take a Flume server out of the load balancer,
b) wait for the channel size to drop to zero, i.e. wait for all buffered data to be
flushed out,
c) change some configuration in Flume (e.g. one time we changed the batch size), and
d) put the server back into the load balancer.
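The wait in step (b) can be automated instead of eyeballed, assuming the agent was started with JSON monitoring enabled (-Dflume.monitoring.type=http -Dflume.monitoring.port=34545); the port, URL, and channel name "CHANNEL.c1" below are illustrative placeholders, not values from this setup:

```python
import json
import time
import urllib.request

METRICS_URL = "http://localhost:34545/metrics"  # hypothetical monitoring port


def channel_drained(metrics, channel="CHANNEL.c1"):
    """True when the named channel reports no buffered events.

    `metrics` is the parsed JSON from Flume's HTTP metrics endpoint;
    Flume reports counter values as strings, hence the int() cast.
    """
    return int(metrics[channel]["ChannelSize"]) == 0


def wait_for_drain(url=METRICS_URL, channel="CHANNEL.c1",
                   timeout=600, poll=5):
    """Poll until the channel is empty; True on drain, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(url) as resp:
            if channel_drained(json.load(resp), channel):
                return True  # safe to reconfigure / restart the agent
        time.sleep(poll)     # still draining; poll again
    return False
```

Polling ChannelSize rather than guessing a wait time avoids restarting the agent while events are still buffered, which is one common way a file or memory channel ends up replaying data.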

As soon as the Flume server is put back into the load balancer we see a sudden
surge of data being processed, and these are duplicate records (events).
The questions are:

a) Why do we see 7-9 times the original number of events when we add this server
back into the load balancer?
b) What is the best way to make such changes on Flume production boxes so that
we don't see this many duplicates?

We can live with a few hundred or a couple of thousand duplicates. But if
instead of 150,000 events we get 900,000 (mostly duplicates), our downstream
workflows start having problems.
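Since Flume guarantees at-least-once (not exactly-once) delivery, some duplication on retry is expected, and a common mitigation is to deduplicate downstream on a content hash before the workflows run. A minimal sketch of that idea, where the bounded-window size and SHA-1 keying are illustrative assumptions rather than anything from this setup:

```python
import hashlib
from collections import OrderedDict


class Deduper:
    """Drop events whose body hash was seen in the last `capacity` events.

    A bounded, LRU-style window keeps memory constant; an event that
    repeats outside the window will still get through, so this reduces
    rather than eliminates duplicates.
    """

    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.seen = OrderedDict()

    def accept(self, body: bytes) -> bool:
        key = hashlib.sha1(body).hexdigest()
        if key in self.seen:
            self.seen.move_to_end(key)         # refresh recency
            return False                       # duplicate: drop
        self.seen[key] = None
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)      # evict oldest entry
        return True                            # first sighting: keep


# d = Deduper()
# [d.accept(b) for b in [b"e1", b"e2", b"e1"]]  -> [True, True, False]
```

Hashing the full event body assumes duplicates are byte-identical replays; if events carry a unique ID in a header, keying on that ID instead is cheaper and more robust.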
