flume-user mailing list archives

From Jagadish Bihani <jagadish.bih...@pubmatic.com>
Subject Flume bz2 issue while processing by a map reduce job
Date Fri, 26 Oct 2012 11:00:57 GMT
Hi

I have a very peculiar scenario.

1. My HDFS sink creates a bz2 file (a sketch of the relevant sink settings follows this list). The file itself is perfectly fine: I can decompress and read it, and it has 0.2 million records.
2. I then give that file to a map-reduce job (Hadoop 1.0.3), and surprisingly it reads only the first 100 records.
3. I then decompress the same file on the local file system, compress it again with the Linux bzip2 command, and copy it back to HDFS.
4. Now I run the map-reduce job again, and this time it correctly processes all the records.

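For reference, an HDFS sink writing bzip2-compressed text ends up configured roughly like the sketch below (agent/sink/channel names, the path, and the roll/batch values are placeholders, not my exact setup; the relevant settings are hdfs.fileType and hdfs.codeC):

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/
agent1.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent1.sinks.hdfsSink.hdfs.codeC = bzip2
agent1.sinks.hdfsSink.hdfs.writeFormat = Text
agent1.sinks.hdfsSink.hdfs.batchSize = 1000
agent1.sinks.hdfsSink.hdfs.rollInterval = 300
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0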
I think the Flume agent writes compressed data to the HDFS file in batches, so the file ends up containing multiple concatenated bzip2 streams, and somehow the bzip2 codec used by Hadoop reads only the first of them.
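One way to check whether Hadoop's codec really stops at the end of the first bzip2 stream is to read the file back through that codec and count the records, roughly as below (the class name and path argument are just placeholders; compare the printed count against "bzip2 -dc file.bz2 | wc -l" on the same file):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CountBz2Records {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // the bz2 file written by the HDFS sink
        FileSystem fs = path.getFileSystem(conf);

        // Pick the codec by file extension (.bz2 -> BZip2Codec) and wrap the
        // raw stream with it, roughly what a map-reduce text input does with
        // a compressed file.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(path))));

        long records = 0;
        while (reader.readLine() != null) {
            records++;
        }
        reader.close();

        // If this prints ~100 while the full file holds 0.2 million records,
        // the codec is only decompressing the first stream in the file.
        System.out.println("records read through codec: " + records);
    }
}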

This means that bz2 files generated by Flume can't be processed directly by a map-reduce job. Is there any solution to this?

Any inputs about other compression formats?

P.S.
Versions:

Flume 1.2.0 (Raw version; downloaded from 
http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
Hadoop 1.0.3

Regards,
Jagadish
