flume-user mailing list archives

From Bijoy Deb <bijoy.comput...@gmail.com>
Subject Flume issue: Copying the same source file multiple times with different timestamps in case of HDFS IO error
Date Thu, 06 Aug 2015 09:08:28 GMT
Hi,

I have a Flume process that transfers multiple files (around 10 files of
400GB each) per day from a specific source directory to an HDFS sink. I am
facing an issue when an HDFS IO error occurs while Flume is copying the
files from source to sink: Flume writes the same file to the sink twice,
with two different timestamps, resulting in duplicate data in my downstream
processes, which is not what I want.
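
For reference, a rough sketch of the kind of agent configuration involved
(using the standard spooldir source / HDFS sink property names; our actual
setup uses a custom spooling reader, com.flume.spool.zip.SpoolingFileReader,
so the agent and component names below are illustrative only):

# agent with one spooling directory source, one memory channel, one HDFS sink
agent.sources = spoolSrc
agent.channels = memCh
agent.sinks = hdfsSink

# spooling directory source watching the landing directory
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /test/part1
agent.sources.spoolSrc.channels = memCh

agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000
agent.channels.memCh.transactionCapacity = 1000

# HDFS sink writing gzip-compressed files into the staging area
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memCh
agent.sinks.hdfsSink.hdfs.path = hdfs:///staging/test
agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent.sinks.hdfsSink.hdfs.codeC = gzip
# matches the "Callable timed out after 18000 ms" seen in the agent logs below
agent.sinks.hdfsSink.hdfs.callTimeout = 18000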

Can anyone kindly let me know if this is a known issue with Flume, and if
so, whether there is any workaround for it?

Relevant details:
Flume version: 1.3.1
1. Source/spool directory file location: /test/part1/2015072110_layer2_1.gz

2. HDFS sink/destination: hdfs:///staging/test/

3. Files dumped in sink by Flume:
/staging/test/2015072110_layer2_1.1437634754144.gz
/staging/test/2015072110_layer2_1.1437634754145.gz

4. Flume agent logs:

(SpoolingFileReader.java:170)] File is
processed....************************/test/part1/2015072110_layer2_1.gz
2015-07-23 02:59:09,392 (pool-14-thread-1) [INFO -
com.flume.spool.zip.SpoolingFileReader.retireCurrentFile(SpoolingFileReader.java:270)]
Preparing to move file /test/part1/2015072110_layer2_1.gz to
/test/part1/2015072110_layer2_1.gz.COMPLETED
2015-07-23 02:59:09,395 (pool-14-thread-1) [INFO -
com.flume.spool.zip.SpoolingFileReader.readEvents(SpoolingFileReader.java:176)]
flag was set as true
2015-07-23 02:59:14,808 (hdfs-c1s1-call-runner-8) [INFO -
org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:208)]
Creating /staging/test/2015072110_layer2_1.1437634754144.gz.tmp
2015-07-23 02:59:32,144 (SinkRunner-PollingRunner-DefaultSinkProcessor)
[WARN -
org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:456)]
HDFS IO error
java.io.IOException: Callable timed out after 18000 ms
    at
org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:352)
    at
org.apache.flume.sink.hdfs.HDFSEventSink.append(HDFSEventSink.java:727)
    at
org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:430)
    at
org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:853)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:212)
    at
org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:345)
    ... 5 more
2015-07-23 02:59:37,269 (hdfs-c1s1-call-runner-9) [INFO -
org.apache.flume.sink.hdfs.BucketWriter.renameBucket(BucketWriter.java:427)]
Renaming /staging/test/2015072110_layer2_1.1437634754144.gz.tmp to
/staging/test/2015072110_layer2_1.1437634754144.gz
2015-07-23 02:59:38,513 (hdfs-c1s1-call-runner-9) [INFO -
org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:208)]
Creating /staging/test/2015072110_layer2_1.1437634754145.gz.tmp
2015-07-23 02:59:56,333 (hdfs-c1s1-roll-timer-0) [INFO -
org.apache.flume.sink.hdfs.BucketWriter$5.call(BucketWriter.java:322)]
Closing idle bucketWriter /staging/test/2015072110_layer2_1
2015-07-23 02:59:56,340 (hdfs-c1s1-roll-timer-0) [INFO -
org.apache.flume.sink.hdfs.BucketWriter.renameBucket(BucketWriter.java:427)]
Renaming /staging/test/2015072110_layer2_1.1437634754145.gz.tmp to
/staging/test/2015072110_layer2_1.1437634754145.gz

Thanks,
Bijoy
