flume-user mailing list archives

From Gwen Shapira <gshap...@cloudera.com>
Subject Re: De-duping events during ingestion
Date Sat, 18 Apr 2015 00:42:05 GMT
You can (and we did). Just note that an HBase lookup will add at least
5 ms of latency per event.

On Fri, Apr 17, 2015 at 5:39 PM, Buntu Dev <buntudev@gmail.com> wrote:
> Thanks Hari. Couldn't one use some sort of lookup (maybe HBase) from an
> interceptor to check whether a given combination of query params
> (user+page+action key) was already seen in the past 5 minutes, and skip the
> current event if so?
>
>
>
> On Fri, Apr 17, 2015 at 1:56 PM, Hari Shreedharan
> <hshreedharan@cloudera.com> wrote:
>>
>> That would have to be done outside Flume, perhaps using something like
>> Spark Streaming, or Storm.
>>
>> Thanks,
>> Hari
>>
>>
>> On Fri, Apr 17, 2015 at 12:15 AM, Buntu Dev <buntudev@gmail.com> wrote:
>>>
>>> Are there any known strategies for handling duplicate events during
>>> ingestion? I use Flume to ingest Apache logs and parse each request with
>>> Morphlines, and there are some duplicate requests that differ only in
>>> certain query params. I would like to handle these once I have parsed and
>>> split the query params into tokens in Morphlines. How does one look up
>>> previous events in the stream (say, in a 5-minute window) and de-dupe
>>> before writing to the sink?
>>>
>>> Thanks!
>>
>>
>
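
For illustration only, here is a minimal sketch of the "seen in the past
5 minutes" lookup discussed in this thread, kept in memory instead of in
HBase. The class WindowDeduper, its method names, and the "user|page|action"
key format are hypothetical and are not part of Flume or Morphlines. The
sketch assumes timestamps arrive in non-decreasing order and that a single
agent sees all events for a given key.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Flume class: remembers keys accepted within a
// sliding time window and reports whether a new key is a duplicate.
class WindowDeduper {
    private final long windowMillis;

    // key -> time the key was accepted. Entries are only ever appended, so
    // iteration order equals acceptance order (oldest first), provided the
    // caller passes non-decreasing timestamps.
    private final LinkedHashMap<String, Long> accepted = new LinkedHashMap<>();

    WindowDeduper(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the key was not accepted within the window (keep the
    // event), false if it is a duplicate (skip the event).
    synchronized boolean firstSeen(String key, long nowMillis) {
        evictExpired(nowMillis);
        if (accepted.containsKey(key)) {
            return false; // still inside the window: duplicate
        }
        accepted.put(key, nowMillis);
        return true;
    }

    // Drops entries whose age has reached the window length, oldest first.
    private void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = accepted.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() < windowMillis) {
                break; // everything after this entry is newer
            }
            it.remove();
        }
    }

    synchronized int size() {
        return accepted.size();
    }
}
```

A custom interceptor could build the key from the parsed query params, call
firstSeen(key, eventTimestamp), and drop the event when it returns false.
The trade-off against the HBase approach is that this state is local to one
agent and is lost on restart, while it avoids the per-event lookup latency
mentioned above.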
