flume-user mailing list archives

From Sagar Mehta <sagarme...@gmail.com>
Subject Re: Question about gzip compression when using Flume Ng
Date Tue, 15 Jan 2013 01:52:29 GMT
As for S3, that is in fact our current architecture :) [the EMR computations
were 2 years back - we now run them locally in our clusters]. We want to move
away from it: our Hadoop computations happen in our own clusters, so we end up
pulling data from S3 every hour, which we would prefer to keep local - and
besides, we have had connectivity issues with S3.

As for upgrading Hadoop, yes, that is on the near-term roadmap, but as I said,
since this is a reasonably sized production cluster [400+ nodes in all], the
change won't happen overnight.

I also tried changing hdfs.fileType to SequenceFile, but then it complained
about needing some native Hadoop code.
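
For reference, the compression-related sink settings in play here look roughly
like this [a sketch only, using the agent/sink names from my config that is
quoted further down the thread - not the full sink definition]:

collector102.sinks.sink1.hdfs.fileType = CompressedStream
collector102.sinks.sink1.hdfs.codeC = gzip
collector102.sinks.sink1.hdfs.writeFormat = Text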

Sagar

On Mon, Jan 14, 2013 at 5:17 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:

> The issue appears to be in Hadoop's gzip compression, as Flume uses Hadoop's
> libraries to do it; and since you're using older libraries, the gzip support
> isn't as good. A possible problem is that the version of gzip implemented by
> Hadoop doesn't support concatenated files (I know there's an issue with
> concatenated bzip2 files and various versions of Hadoop). As such, bz2
> probably also won't work.
>
> A possible workaround is to create a gzip serializer and then write to HDFS
> in binary form. I think you will also need to create a new writeFormat, as
> I'm not quite sure how the SequenceFile one works; but if the gzip bytes are
> computed on the client side then you won't have to deal with whatever
> implementation of gzip your Hadoop uses.
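
A rough sketch of how such a custom serializer might be wired into the sink
config, assuming a hypothetical GzipEventSerializer written against Flume's
EventSerializer/Builder mechanism that wraps its output stream in a
GZIPOutputStream - the class and package name below are made up for
illustration only:

# DataStream writes the serializer's raw output, with no sequence-file wrapping
collector102.sinks.sink1.hdfs.fileType = DataStream
# hypothetical serializer; the value is the fully-qualified name of its Builder
collector102.sinks.sink1.serializer = com.example.GzipEventSerializer$Builder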
>
> Or you upgrade hadoop. I have no idea which is easier. (Or you move your
> data to S3 and your map-reduce to EMR ;)
>
> - Connor
>
>
> On Mon, Jan 14, 2013 at 5:03 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>
>> Hmm - good point!! Even in the best case, assuming this works, moving both
>> production clusters that depend on it [400+ nodes] to a newer Hadoop version
>> will need some thorough testing and won't be immediate.
>>
>> I would have loved for the gzip compression part to work more or less out of
>> the box, but for now the most likely option seems to be an Oozie step to
>> pre-compress before the downstream jobs take over.
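
A pre-compression pass along those lines could be as simple as something like
the following - only a sketch, assuming the sink is switched to write plain
uncompressed text; the path and filename are just examples:

hadoop fs -cat /ngpipes-raw-logs/2013-01-14/2200/events.1358204141600 \
  | gzip \
  | hadoop fs -put - /ngpipes-raw-logs/2013-01-14/2200/events.1358204141600.gz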
>>
>> I'm still open to suggestions/insights from this group, which has been
>> super-prompt so far :)
>>
>> Sagar
>>
>>
>> On Mon, Jan 14, 2013 at 4:54 PM, Brock Noland <brock@cloudera.com> wrote:
>>
>>> Hi,
>>>
>>> That's just the file channel. The HDFSEventSink will need a heck of a
>>> lot more than just those two jars. To override the version of Hadoop it
>>> will find from the hadoop command, you probably want to set HADOOP_HOME
>>> in flume-env.sh to your custom install.
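
That would be something along these lines in conf/flume-env.sh - the path is
only an example; it should point at whichever Hadoop install you want Flume to
pick up:

# conf/flume-env.sh
export HADOOP_HOME=/opt/hadoop-0.20.2-cdh3u5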
>>>
>>> Also, the client and server should be the same version.
>>>
>>> Brock
>>>
>>> On Mon, Jan 14, 2013 at 4:43 PM, Sagar Mehta <sagarmehta@gmail.com>
>>> wrote:
>>> > ok so I dropped the new hadoop-core jar into /opt/flume/lib [I got some
>>> > errors about the guava dependencies so I put in that jar too]
>>> >
>>> > smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e "guava"
>>> > -rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
>>> > -rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50 hadoop-core-0.20.2-cdh3u5.jar
>>> >
>>> > Now I don't even see the file being created in hdfs, and the flume log is
>>> > happily talking about housekeeping for some file channel checkpoints,
>>> > updating pointers et al.
>>> >
>>> > Below is the tail of the flume log:
>>> >
>>> > hadoop@collector102:/data/flume_log$ tail -10 flume.log
>>> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID: 1358209947324
>>> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-34
>>> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID: 1358209947323
>>> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-34
>>> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947324
>>> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947323
>>> > 2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-35
>>> > 2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-35
>>> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947323
>>> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947324
>>> >
>>> > Sagar
>>> >
>>> >
>>> > On Mon, Jan 14, 2013 at 3:38 PM, Brock Noland <brock@cloudera.com> wrote:
>>> >>
>>> >> Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
>>> >> I would upgrade to CDH3u5 or CDH 4.1.2.
>>> >>
>>> >> On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>>> >> > About the bz2 suggestion, we have a ton of downstream jobs that assume
>>> >> > gzip-compressed files - so it is better to stick to gzip.
>>> >> >
>>> >> > The plan B for us is to have an Oozie step to gzip compress the logs
>>> >> > before proceeding with downstream Hadoop jobs - but that looks like a
>>> >> > hack to me!!
>>> >> >
>>> >> > Sagar
>>> >> >
>>> >> >
>>> >> > On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <sagarmehta@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>>> >> >>
>>> >> >> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
>>> >> >> 100
>>> >> >>
>>> >> >> This should be about 50,000 events for the 5 min window!!
>>> >> >>
>>> >> >> Sagar
>>> >> >>
>>> >> >> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <brock@cloudera.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> Can you try:  zcat file > output
>>> >> >>>
>>> >> >>> I think what is occurring is that, because of the flush, the output
>>> >> >>> file is actually several concatenated gz files.
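
A quick local illustration of what that means - the gzip format allows a file
to hold several compressed members back to back, and zcat decompresses them
all in sequence (a generic sketch, nothing Flume-specific):

$ echo first  | gzip >  multi.gz
$ echo second | gzip >> multi.gz
$ zcat multi.gz
first
second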
>>> >> >>>
>>> >> >>> Brock
>>> >> >>>
>>> >> >>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>>> >> >>> > Yeah I have tried the text write format in vain before, but
>>> >> >>> > nevertheless gave it a try again!! Below is the latest file - still
>>> >> >>> > the same thing.
>>> >> >>> >
>>> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>>> >> >>> > Mon Jan 14 23:02:07 UTC 2013
>>> >> >>> >
>>> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >> >>> > Found 1 items
>>> >> >>> > -rw-r--r--   3 hadoop supergroup    4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >> >>> >
>>> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
>>> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >> >>> >
>>> >> >>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored
>>> >> >>> >
>>> >> >>> > Interestingly enough, the gzip page says this is a harmless
>>> >> >>> > warning - http://www.gzip.org/#faq8
>>> >> >>> >
>>> >> >>> > However, I'm losing events on decompression, so I cannot afford to
>>> >> >>> > ignore this warning. The gzip page gives an example about magnetic
>>> >> >>> > tape - there is an analogy to an hdfs block here, since the file is
>>> >> >>> > initially stored in hdfs before I pull it out onto the local
>>> >> >>> > filesystem.
>>> >> >>> >
>>> >> >>> > Sagar
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson
>>> >> >>> > <cwoodson.dev@gmail.com>
>>> >> >>> > wrote:
>>> >> >>> >>
>>> >> >>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>>> >> >>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> Apache MRUnit - Unit testing MapReduce -
>>> >> >>> http://incubator.apache.org/mrunit/
>>> >> >>
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Apache MRUnit - Unit testing MapReduce -
>>> >> http://incubator.apache.org/mrunit/
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce -
>>> http://incubator.apache.org/mrunit/
>>>
>>
>>
>
