Hello All,

I’m a new user of Flume and was hoping someone could point me in the right direction, or tell me what silly little piece of the puzzle I’m missing.  I apologize if this has been covered, but after searching for a few days I couldn’t find anything that helped.  Also, if there’s a better-suited group to post this to, just let me know.


I have Flume configured to read from a log4j log file using a tail source and send the data into an HDFS sink.  All of the plumbing seems to work fine - I’m able to query the data using a quick MapReduce job and verify that the entries are in fact getting into Hadoop.  What’s interesting (annoying) is that additional characters are being added to each record.  Running hadoop dfs -cat somefile, I get something like this (where [Data_From_The_Log_Here] is properly formatted and looks valid as far as I can tell):


SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] p^Rﳬ     ;*j     7[Data_From_The_Log_Here]Y] p^Rﳬ                                                                                                                                                                                                                      ;*j

  [Data_From_The_Log_Here]Y] p^Rﳬ


                                                                                                                                                                                                                     [Data_From_The_Log_Here]Y] p^Rﳬ




Here’s the flume config:


requestToHDFS.channels = MemoryChannel
requestToHDFS.sinks = HDFS
requestToHDFS.sources = Tail

requestToHDFS.sources.Tail.channels = MemoryChannel
requestToHDFS.sources.Tail.interceptors = ts
requestToHDFS.sources.Tail.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
requestToHDFS.sources.Tail.type = exec
requestToHDFS.sources.Tail.command = tail -F /path/to/someLogFile.log

requestToHDFS.sinks.HDFS.channel = MemoryChannel
requestToHDFS.sinks.HDFS.type = hdfs
requestToHDFS.sinks.HDFS.hdfs.path = hdfs://somehadoopserver:9000/logs/%Y/%m/%d/%H
requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
# also tried...
#requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile
requestToHDFS.sinks.HDFS.hdfs.batchSize = 10
requestToHDFS.sinks.HDFS.hdfs.rollSize = 0
requestToHDFS.sinks.HDFS.hdfs.rollCount = 10000
requestToHDFS.sinks.HDFS.hdfs.rollInterval = 600

requestToHDFS.channels.MemoryChannel.type = memory
requestToHDFS.channels.MemoryChannel.capacity = 10000
requestToHDFS.channels.transactionCapacity = 100
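
For reference, this is roughly how I’m launching the agent (the conf directory and config file name below are placeholders for my actual paths):

```shell
# Start the agent named in the config above; paths are illustrative
flume-ng agent \
  --name requestToHDFS \
  --conf /path/to/flume/conf \
  --conf-file /path/to/requestToHDFS.conf \
  -Dflume.root.logger=INFO,console
```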


I’m able to get around the issue by doing some parsing in a MapReduce job to isolate the log entries I want, but it seems like I’m missing something.  The additional characters/encoding on each line appear to be metadata that Flume uses for sending events across the wire.  Is there a way to eliminate this before a record is written to HDFS?  Or is this just the way records are stored in HDFS, so I need to account for the additional characters when querying the data?  Ideally each entry in Hadoop would be just the raw log line, with nothing prepended or appended.
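In case it helps clarify what I mean by "parsing", here is a simplified, standalone sketch of the cleanup my MapReduce job does (no Hadoop types, and the timestamp pattern is only an assumption about my log4j layout - adjust it to yours):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans raw text (which may contain the extra framing bytes shown above)
// and keeps only the spans that look like log4j entries. Assumes each
// entry starts with an ISO-style timestamp and runs until the next
// control character -- a simplification of my actual job.
public class LogEntryExtractor {

    private static final Pattern ENTRY = Pattern.compile(
        "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}[^\\n\\x00-\\x08\\x0B-\\x1F]*");

    public static List<String> extract(String raw) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(raw);
        while (m.find()) {
            entries.add(m.group().trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Fake entries surrounded by junk bytes, mimicking the cat output
        String raw = "SEQ\u0006junk\u00002013-01-02 10:00:01 INFO request A"
                   + "\u0000\u0001junk2013-01-02 10:00:02 WARN request B\u0000";
        for (String entry : extract(raw)) {
            System.out.println(entry);
        }
    }
}
```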
Versions are as follows:

Flume 1.2.0

Subversion https://svn.apache.org/repos/asf/flume/tags/flume-1.2.0-rc1 -r 1360090

Hadoop 1.1.1

Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108


Thanks in advance!