About the bz2 suggestion - we have a ton of downstream jobs that assume gzip-compressed files, so it is better to stick with gzip.

Plan B for us is to have an Oozie step that gzip-compresses the logs before proceeding with the downstream Hadoop jobs - but that looks like a hack to me!!
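For what it's worth, the recompression step itself is simple; here is a rough local-filesystem sketch of the hack (hypothetical file names, not the actual Oozie action):

```shell
# Simulate a gz file that gzip complains about, then rewrite it cleanly.
printf 'event-1\nevent-2\n' | gzip -c > part.gz
printf 'junk' >> part.gz                  # fake "trailing garbage"

# zcat still emits the recoverable events (the warning goes to stderr) ...
zcat part.gz > events.txt 2>/dev/null || true

# ... and recompressing yields a clean single-member gzip file:
gzip -c < events.txt > part.clean.gz
zcat part.clean.gz | wc -l                # two lines, no trailing-garbage warning
```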

Sagar

On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l

gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
100

This should be about 50,000 events for the 5-minute window!!

Sagar

On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <brock@cloudera.com> wrote:
Hi,

Can you try:  zcat file > output

I think what is occurring is that, because of the flush, the output file is
actually several concatenated gz files.
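If that is the case, the behaviour is easy to reproduce locally. A small Python sketch (not the Flume code, just an illustration of concatenated gzip members and of a decoder that stops after the first one):

```python
import gzip
import io
import zlib

# A sink that re-opens its compressor on every flush effectively writes
# several complete gzip "members" back to back into one file.
member1 = gzip.compress(b"event-1\nevent-2\n")
member2 = gzip.compress(b"event-3\n")
blob = member1 + member2  # what the file on HDFS would look like

# GzipFile keeps reading past the first member, so nothing is lost:
with gzip.GzipFile(fileobj=io.BytesIO(blob)) as f:
    data = f.read()  # all three events

# A decoder that stops after one member treats the rest as "trailing
# garbage" - this is the failure mode that loses events:
d = zlib.decompressobj(wbits=31)  # wbits=31: gzip wrapper, single member
first = d.decompress(blob)        # only events 1 and 2
# member2 sits unread in d.unused_data
```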

Brock

On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
> Yeah, I had tried the text write format in vain before, but nevertheless I
> gave it a try again!! Below is the latest file - still the same thing.
>
> hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
> Mon Jan 14 23:02:07 UTC 2013
>
> hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls
> /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> Found 1 items
> -rw-r--r--   3 hadoop supergroup    4798117 2013-01-14 22:55
> /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>
> hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget
> /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> .
> hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip
> collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>
> gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression
> OK, trailing garbage ignored
>
> Interestingly enough, the gzip page says it is a harmless warning -
> http://www.gzip.org/#faq8
>
> However, I'm losing events on decompression, so I cannot afford to ignore
> this warning. The gzip page's example involves magnetic tape; the analogous
> thing here may be an HDFS block boundary, since the file is stored in HDFS
> before I pull it out to the local filesystem.
>
> Sagar
>
>
>
>
> On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson <cwoodson.dev@gmail.com>
> wrote:
>>
>> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>> collector102.sinks.sink2.hdfs.writeFormat = TEXT
>
>
>



--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/