flume-user mailing list archives

From Guillermo Ortiz <konstt2...@gmail.com>
Subject Re: Deal with duplicates in Flume with a crash.
Date Wed, 03 Dec 2014 22:46:40 GMT
That's interesting. Do you have the RegionServers on different nodes
than your Flume agents? Because that could be a lot of traffic.
If you want to check for duplicates on each log, the number of
checks/puts is always the same. What's the point of putting several logs
in the same event?
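
For illustration, here is a minimal Python sketch of the check-and-put flow described in the quoted message below, and of how batching changes the number of store operations. It assumes the UUID is attached per Flume event (as the UUID interceptor does), so the check happens once per event, not once per log line. The set stands in for the HBase table and the list for HDFS; `DedupSink`, `make_events`, and the counters are hypothetical names, not part of Flume or HBase.

```python
import uuid


class DedupSink:
    """Toy stand-in for a sink that checks a UUID store before writing."""

    def __init__(self):
        self.seen = set()     # stands in for the HBase table of seen UUIDs
        self.written = []     # stands in for the data written to HDFS
        self.duplicates = 0   # duplicate-event metric
        self.store_ops = 0    # number of check/put round trips to the store

    def process(self, event_id, logs):
        """Write the event's logs unless its UUID was already seen."""
        self.store_ops += 1               # the check
        if event_id in self.seen:
            self.duplicates += 1          # count it and drop the event
            return False
        self.written.extend(logs)         # write the event first ...
        self.seen.add(event_id)           # ... then record its UUID
        self.store_ops += 1               # the put
        return True


def make_events(logs, batch_size):
    """Group log lines into events of batch_size, one UUID per event."""
    return [
        (str(uuid.uuid4()), logs[i:i + batch_size])
        for i in range(0, len(logs), batch_size)
    ]


if __name__ == "__main__":
    logs = [f"log-{i}" for i in range(100)]
    for batch_size in (1, 10):
        sink = DedupSink()
        for event_id, batch in make_events(logs, batch_size):
            sink.process(event_id, batch)
        print(batch_size, sink.store_ops)
```

Under that assumption, one log per event costs one check and one put per log, while ten logs per event cost one check and one put per ten logs. The trade-off is that duplicates are then detected per event, not per individual log.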

2014-12-03 23:35 GMT+01:00 Mike Keane <mkeane@conversantmedia.com>:
> We effectively mitigated this problem by using the UUID interceptor and customizing the
> HDFS Sink to do a check and put of the UUID to HBase. In the customized sink we check HBase
> to see if we have seen the UUID before. If we have, it is a duplicate: we log a new duplicate
> metric with the existing sink metrics and throw the event away. If we have not seen the UUID
> before, we write the Event to HDFS and do a put of the UUID to HBase.
> Because of our volume, to minimize the number of check/puts to HBase we put multiple logs
> in a single FlumeEvent.
> -Mike
> ________________________________________
> From: Guillermo Ortiz [konstt2000@gmail.com]
> Sent: Wednesday, December 03, 2014 4:15 PM
> To: user@flume.apache.org
> Subject: Re: Deal with duplicates in Flume with a crash.
> I didn't know anything about a Hive Sink, I'll check the JIRA about it, thanks.
> The pipeline is Flume-Kafka-SparkStreaming-XXX
> So I guess I should deal with it in SparkStreaming, right? I guess
> that it would be easy to do with a UUID interceptor, or is there
> an easier way?
> 2014-12-03 22:56 GMT+01:00 Roshan Naik <roshan@hortonworks.com>:
>> Use the UUID interceptor at the source closest to data origination. It
>> will help identify duplicate events after they are delivered.
>> If it satisfies your use case, the upcoming Hive Sink will mitigate the
>> problem a little bit (since it uses transactions to write to destination).
>> -roshan
>> On Wed, Dec 3, 2014 at 8:44 AM, Joey Echeverria <joey@cloudera.com> wrote:
>>> There's nothing built into Flume to deal with duplicates; it only
>>> provides at-least-once delivery semantics.
>>> You'll have to handle it in your data processing applications or add
>>> an ETL step to deal with duplicates before making data available for
>>> other queries.
>>> -Joey
>>> On Wed, Dec 3, 2014 at 5:46 AM, Guillermo Ortiz <konstt2000@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I would like to know if there's an easy way to deal with data
>>> > duplication when an agent crashes and resends the same data again.
>>> >
>>> > Is there any mechanism to deal with it in Flume,
>>> --
>>> Joey Echeverria
