We have Flume ingestion servers in a production environment that receive data from a Scribe source. These servers sit behind a load balancer. We observe a large number of duplicates (7-8 times the original event count) when we:
a) take a Flume server out of the load balancer,
b) wait for the channel to drain to zero, i.e. for all buffered data to be flushed out,
c) change some configuration in Flume (e.g. one time we changed the batch size), and
d) put the server back into the load balancer.
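For context, step (c) refers to a change like the following. This is only a sketch of a typical Scribe-to-HDFS agent, not our actual config; the agent, channel, and sink names are made up:

```properties
# Hypothetical agent layout (names agent1, scribe-src, file-ch, hdfs-sink
# are placeholders). Only the batchSize line is the kind of thing we change.
agent1.sources = scribe-src
agent1.channels = file-ch
agent1.sinks = hdfs-sink

agent1.sources.scribe-src.type = org.apache.flume.source.scribe.ScribeSource
agent1.sources.scribe-src.port = 1463
agent1.sources.scribe-src.channels = file-ch

agent1.channels.file-ch.type = file

agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = file-ch
# Example of the batch-size change mentioned in (c)
agent1.sinks.hdfs-sink.hdfs.batchSize = 1000
```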
As soon as the Flume server is put back into the load balancer, we see a sudden surge of data being processed. These are duplicate records (events). The questions are:
a) Why do we see 7-9 times the original events when we add this server back into the load balancer?
b) What is the best way to handle this kind of change on Flume production boxes so that we don't see this many duplicates?
A few hundred or a couple of thousand duplicates we can live with. But if, instead of 150,000 events, we get 900,000 events (mostly duplicates), then our workflows start having problems.
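For completeness, this is how we verify step (b). Flume exposes channel metrics as JSON when started with `-Dflume.monitoring.type=http`; a small sketch of the check we run against that payload (the channel name `file-ch` is hypothetical, and the sample payload below is illustrative, not captured from our boxes):

```python
import json

def channel_drained(metrics_json: str, channel: str = "file-ch") -> bool:
    """Return True once the named channel reports a 0% fill level.

    `metrics_json` is the JSON body returned by Flume's HTTP monitoring
    endpoint; channel stats live under the "CHANNEL.<name>" key.
    """
    metrics = json.loads(metrics_json)
    fill = float(metrics[f"CHANNEL.{channel}"]["ChannelFillPercentage"])
    return fill == 0.0

# Illustrative payload in the shape the /metrics endpoint returns
# (Flume reports metric values as strings).
sample = json.dumps({
    "CHANNEL.file-ch": {
        "Type": "CHANNEL",
        "ChannelSize": "0",
        "ChannelFillPercentage": "0.0",
        "ChannelCapacity": "100000",
    }
})
print(channel_drained(sample))  # True once the channel is empty
```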