flume-user mailing list archives

From "Cormier, Christopher" <christopher.corm...@teamaol.com>
Subject Flume/HDFS Encoding
Date Fri, 14 Dec 2012 20:48:26 GMT
Hello All,
I'm also a new user to Flume and was hoping someone could point me in the right direction,
or tell me what silly little piece of the puzzle I'm missing.  I apologize if this has been
covered before, but after searching for a few days I couldn't find anything that helped.  Also,
if there's a better-suited group for this post, just let me know.

I have Flume configured to read from a log4j log file using a tail (exec) source and send the
data into an HDFS sink.  All of the plumbing seems to work fine: I'm able to query the data
using a quick MapReduce job and verify that the entries are in fact getting into Hadoop.  What's
interesting (annoying) is the additional characters being added around each record.
Running hadoop dfs -cat somefile I get something like this (where [Data_From_The_Log_Here]
is properly formatted and looks valid from what I can tell):

SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] õpµ^R÷ï³¬Õ     ;*j
    7[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ                                
  [Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ﳬÕ
                                 Î[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ﳬÕ
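Incidentally, those leading bytes look like a Hadoop SequenceFile header rather than transport corruption: the magic bytes SEQ, a version byte, a length prefix (the "!" is 0x21 = 33, the length of org.apache.hadoop.io.LongWritable), then the key/value class names. A minimal sketch, assuming a local copy of the file and plain Python with no Hadoop dependency, that checks for that magic:

```python
def looks_like_sequencefile(data: bytes) -> bool:
    """Heuristic: Hadoop SequenceFiles start with the 3-byte magic
    b'SEQ' followed by a one-byte format version; the serialized
    key/value class names (LongWritable/Text above) come right after."""
    return len(data) >= 4 and data[:3] == b"SEQ"

# First bytes roughly as they appear in the -cat output above:
print(looks_like_sequencefile(b"SEQ\x06!org.apache.hadoop.io.LongWritable"))  # True
print(looks_like_sequencefile(b"2012-12-14 20:48:26 some plain log line"))    # False
```

(If that is what these files are, hadoop dfs -text should print the deserialized records where -cat dumps the raw bytes.)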

Here's the flume config:

requestToHDFS.channels = MemoryChannel
requestToHDFS.sinks = HDFS
requestToHDFS.sources = Tail

requestToHDFS.sources.Tail.channels = MemoryChannel
requestToHDFS.sources.Tail.interceptors = ts
requestToHDFS.sources.Tail.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
requestToHDFS.sources.Tail.type = exec
requestToHDFS.sources.Tail.command = tail -F /path/to/someLogFile.log

requestToHDFS.sinks.HDFS.channel = MemoryChannel
requestToHDFS.sinks.HDFS.type = hdfs
requestToHDFS.sinks.HDFS.hdfs.path = hdfs://somehadoopserver:9000/logs/%Y/%m/%d/%H

requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
# also tried...
#requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile

requestToHDFS.sinks.HDFS.hdfs.batchSize = 10
requestToHDFS.sinks.HDFS.hdfs.rollSize = 0
requestToHDFS.sinks.HDFS.hdfs.rollCount = 10000
requestToHDFS.sinks.HDFS.hdfs.rollInterval = 600

requestToHDFS.channels.MemoryChannel.type = memory
requestToHDFS.channels.MemoryChannel.capacity = 10000
requestToHDFS.channels.MemoryChannel.transactionCapacity = 100
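Side note on the file-format property above: the Flume 1.x HDFS sink documents that key as hdfs.fileType (one word, no second dot), and its default is SequenceFile, so a key the sink doesn't recognize would be silently ignored and the output would fall back to SequenceFile framing like the sample above. The documented spelling, for comparison (worth double-checking against the Flume 1.2 user guide):

```
requestToHDFS.sinks.HDFS.hdfs.fileType = DataStream
```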

I'm able to work around the issue by doing some parsing in a MapReduce job to isolate the
log entries I want, but it seems like I'm missing something.  The additional characters on
each line appear to carry data that Flume uses when sending events across the wire.  Is
there a way to eliminate this before a record is written to HDFS?  Or is this just the way
records are stored in HDFS, and I need to account for the additional characters when querying
the data?  Ideally the entries in Hadoop would look something like this:

[Data_From_The_Log_Here]
[Data_From_The_Log_Here]
[Data_From_The_Log_Here]

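For what it's worth, a crude approximation of the isolation step my MapReduce job does is to keep runs of printable bytes and drop the binary framing between them. This is only an illustrative sketch (the extract_payload helper is hypothetical, not a Flume or Hadoop API), and it will also pick up the header's class names, so it is no substitute for reading the file as a proper SequenceFile:

```python
import re

def extract_payload(raw: bytes, min_len: int = 8) -> list[str]:
    """Hypothetical helper: keep runs of at least min_len printable
    ASCII bytes and discard the binary record framing around them.
    This also matches the header's class names, so it is only a
    rough stand-in for a real SequenceFile reader."""
    return [m.decode("ascii")
            for m in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, raw)]

raw = b"\xff\xff\xff\xff[Data_From_The_Log_Here]\x07\x00\xb5[More_Log_Data_Here]\xff"
print(extract_payload(raw))
# -> ['[Data_From_The_Log_Here]', '[More_Log_Data_Here]']
```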
Versions are as follows:
Flume 1.2.0
Subversion https://svn.apache.org/repos/asf/flume/tags/flume-1.2.0-rc1 -r 1360090
Hadoop 1.1.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108

Thanks in advance!

