flume-user mailing list archives

From Gwen Shapira <gshap...@cloudera.com>
Subject Re: [Transaction] About KafkaSource and HDFSEventSink Transaction Guarantee
Date Tue, 14 Apr 2015 16:50:04 GMT
Flume is an at-least-once system. This means we will never lose data, but
you may get duplicate events on errors.
In the cases you pointed out, where the events were written but we still
return BACKOFF, you will get duplicate events in the channel or in HDFS.
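
One way to make those duplicates detectable downstream is to stamp every
event with an ID in a custom interceptor. Here is a rough sketch (the class
name, header keys, and ID scheme are assumptions for illustration, not stock
Flume). Note that a random UUID only covers sink-side replays: a batch
replayed from Kafka is re-read into fresh events, so prefer a deterministic
key such as topic/partition/offset if your source version exposes those
headers:

    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Stamps each event with an "eventId" header so a downstream job can
    // drop the duplicates produced by BACKOFF retries.
    public class EventIdInterceptor implements Interceptor {

      private static final String ID_HEADER = "eventId"; // arbitrary name

      @Override
      public void initialize() {}

      @Override
      public Event intercept(Event event) {
        Map<String, String> h = event.getHeaders();
        if (!h.containsKey(ID_HEADER)) {
          // Header names below are assumptions; check what your source
          // version actually sets on each event.
          String topic = h.get("topic");
          String partition = h.get("partition");
          String offset = h.get("offset");
          if (topic != null && partition != null && offset != null) {
            // Deterministic: a replayed Kafka batch regenerates the same IDs.
            h.put(ID_HEADER, topic + "-" + partition + "-" + offset);
          } else {
            // Random fallback: catches sink-side replays (events re-taken
            // from the channel keep their headers) but not source-side ones.
            h.put(ID_HEADER, UUID.randomUUID().toString());
          }
        }
        return event;
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
          intercept(e);
        }
        return events;
      }

      @Override
      public void close() {}

      public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new EventIdInterceptor(); }

        @Override
        public void configure(Context context) {}
      }
    }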

You probably want to write a small script to de-duplicate the data in
HDFS, like we do in this example:
https://github.com/hadooparchitecturebook/clickstream-tutorial/blob/master/03_processing/01_dedup/pig/dedup.pig
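
If Pig isn't handy, the same idea as a toy Java program. It assumes the
event ID is the first tab-separated field of each line (adjust for your
layout) and holds the seen-ID set in memory, so for real HDFS volumes you
would do this in Pig/Hive/MapReduce as in the link above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    // Keeps the first line seen for each event ID, drops the rest.
    public class Dedup {
      public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(
                 Paths.get(args[0]), StandardCharsets.UTF_8);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                 Paths.get(args[1]), StandardCharsets.UTF_8))) {
          String line;
          while ((line = in.readLine()) != null) {
            String id = line.split("\t", 2)[0]; // ID = first field
            if (seen.add(id)) {                 // add() is false on duplicates
              out.println(line);
            }
          }
        }
      }
    }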

Gwen

On Tue, Apr 14, 2015 at 9:17 AM, Tao Li <litao.buptsse@gmail.com> wrote:
> Hi all:
>
> I have a question about transactions. For example, the KafkaSource code
> looks like this:
>
> try {
>     getChannelProcessor().processEventBatch(eventList); // write batch to the channel
>     consumer.commitOffsets();                           // then commit offsets to Kafka
>     return Status.READY;
> } catch (Exception e) {
>     return Status.BACKOFF;
> }
>
> If processEventBatch() succeeds but commitOffsets() fails, the source will
> return BACKOFF, but the eventList has already been written to the channel.
>
> ----------------------------------
>
> Similarly, the HDFSEventSink code looks like this:
>
> try {
>     bucketWriter.append(event);   // stage the event
>     bucketWriter.flush();         // data reaches HDFS here
>     transaction.commit();
>     return Status.READY;
> } catch (Exception e) {
>     transaction.rollback();
>     return Status.BACKOFF;
> }
>
> If bucketWriter.flush() succeeds but transaction.commit() fails, the sink
> will call transaction.rollback() and return BACKOFF. But the event has
> already been flushed to HDFS.
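
(A side note on the HDFSEventSink case above: transaction.rollback() cannot
un-flush bytes that already reached HDFS, so the usual mitigation is to make
the write itself idempotent, for example by deriving the file name from a
stable property of the batch so that a replay recreates the same path
instead of a new one. A rough sketch against Hadoop's FileSystem API; the
offset-based naming is an assumption, and the stock sink manages its own
file naming, so this is the pattern rather than a drop-in patch:)

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Writes a batch to a deterministically named file: a replayed
    // transaction hits the exists() check or rewrites the same path,
    // so no duplicate file appears.
    public class IdempotentBatchWriter {
      public static void writeBatch(FileSystem fs, String dir,
                                    long firstOffset, List<String> records)
          throws IOException {
        Path finalPath = new Path(dir, "batch-" + firstOffset + ".log");
        Path tmpPath = new Path(dir, "batch-" + firstOffset + ".log.tmp");

        if (fs.exists(finalPath)) {
          return; // batch already committed by an earlier attempt
        }
        // Temp file plus rename keeps readers from seeing half a batch.
        try (OutputStream out = fs.create(tmpPath, true /* overwrite */)) {
          for (String record : records) {
            out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }
        fs.rename(tmpPath, finalPath);
      }
    }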
