flume-user mailing list archives

From Josh West <...@one.com>
Subject Syslog Infrastructure with Flume
Date Fri, 26 Oct 2012 14:05:08 GMT
Hey folks,

I've been experimenting with Flume for a few weeks now, trying to 
determine an approach to designing a reliable, highly available, 
scalable system to store logs from various sources, including syslog.  
Ideally, this system will meet the following requirements:

 1. Logs from syslog across all servers make their way into HDFS.
 2. Logs are stored in HDFS in a manner that is available for
    post-processing:
      * Example:  Hive partitions - with the HDFS Flume Sink, hdfs.path
        can be templated to build partitioned directories (see the
        sketch after this list).
      * Example:  Custom MapReduce jobs...
 3. Logs are stored in HDFS in a manner that is available for "reading"
    by sysadmins:
      * During troubleshooting/firefighting, it is quite helpful to be
        able to log in to a central logging system and tail -f / grep logs.
      * We need to be able to see the logs "live".
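
As a rough illustration of requirement #2, the kind of hdfs.path template I
have in mind looks something like this (the agent/sink names and the
namenode URI are just placeholders):

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/syslog/year=%Y/month=%m/day=%d
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream

With a directory layout like that, Hive can pick up year/month/day as
partition columns, and MapReduce jobs can scope themselves to a date range.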

Some folks may be wondering why we are choosing Flume for syslog instead
of something like Graylog2 or Logstash.  The answer is that we will
be using Flume + Hadoop for the transport and processing of other types 
of data in addition to syslog.  For example, webserver access logs for 
post processing and statistical analysis.  So, we would like to make the 
most use of the Hadoop cluster, keeping all logs of all types in one 
redundant/scalable solution. Additionally, by keeping both syslog and 
webserver access logs in Hadoop/HDFS, we can begin to correlate events.

I've run into some snags while attempting to implement Flume in a manner 
that satisfies the requirements listed at the top of this message:

 1. Logs to HDFS:
      * I can indeed use the Flume HDFS Sink to reliably write logs into
        HDFS.
      * I needed to write a custom serializer to add the hostname and
        timestamp fields back to the syslog messages (see the sink
        configuration sketch after this list).
      * See: https://issues.apache.org/jira/browse/FLUME-1666
 2. Logs to HDFS in a manner available for
    reading/firefighting/troubleshooting by sysadmins:
      * Flume HDFS Sink uses the BucketWriter for recording flume events
        to HDFS.
      * Creates data files named FlumeData (or a custom prefix) followed
        by "." and the Unix timestamp of when the file was created.
          o This is somewhat necessary... Since you could have multiple
            Flume writers writing to the same HDFS, and a file cannot be
            opened by more than one writer, each writer should write to
            its own file.
      * Latest file, currently being written to, is suffixed with ".tmp".
      * This approach is not very sysadmin-friendly....
          o You have to find the latest file (i.e. the .tmp file) and run
            hadoop fs -tail -f /path/to/file.tmp
          o Hadoop's fs -tail -f command first prints the entire file's
            contents, then begins tailing.
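
To make the file-naming discussion above concrete, this is roughly how the
sink side looks; the serializer class name is purely hypothetical, standing
in for whatever comes out of FLUME-1666:

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem-channel
    agent.sinks.hdfs-sink.hdfs.filePrefix = FlumeData
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    # Hypothetical serializer builder that adds hostname/timestamp back:
    agent.sinks.hdfs-sink.serializer = com.example.flume.SyslogTextSerializer$Builder

Even with all of that, the files still come out as <prefix>.<unix timestamp>,
with the in-progress file carrying the ".tmp" suffix.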

So the sum of it all is that Flume is awesome for getting syslog (and other)
data into HDFS for post-processing, but not the best at getting it into
HDFS in a format suited to sysadmin troubleshooting/firefighting.  In an
ideal world, I would have syslog data coming into Flume via one transport
(i.e. the SyslogTcp Source or SyslogUDP Source) and written into HDFS in a
manner that is both post-processable and sysadmin-friendly, but it looks
like this isn't going to happen.

I've thus investigated some alternative approaches to meet the 
requirements.  One of these approaches is to have all of my servers send 
their syslog messages to a central box running rsyslog.  Then, rsyslog 
would perform one of the following actions:

 1. Write logs to HDFS directly using 'omhdfs' module, in a format that
    is both post-processable and sysadmin-friendly :-)
 2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which
    has HDFS mounted as a filesystem.
 3. Write logs to a local filesystem and also replicate logs into a
    Flume agent configured with a syslog source and HDFS sink (see the
    sketch below).
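
For option #3, the Flume agent receiving the replicated stream would look
something along these lines (a minimal sketch; the names, port, and
capacity are made up):

    agent.sources = syslog-in
    agent.channels = mem-channel
    agent.sinks = hdfs-sink

    agent.sources.syslog-in.type = syslogtcp
    agent.sources.syslog-in.host = 0.0.0.0
    agent.sources.syslog-in.port = 5140
    agent.sources.syslog-in.channels = mem-channel

    agent.channels.mem-channel.type = memory
    agent.channels.mem-channel.capacity = 100000

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem-channel
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/syslog/%Y-%m-%d

rsyslog on the central box would keep its local copy and forward a duplicate
of every message to port 5140 on this agent.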

Option #1 sounds great.  But unfortunately the 'omhdfs' module for
rsyslog isn't working very well.  I've gotten it to log in to Hadoop/HDFS,
but it has issues creating and appending to files.  Additionally, templating
is somewhat suspect (i.e. making directories like /syslog/someserver/somefacility).

Option #2 sounds reasonable, but either the HDFS FUSE module doesn't 
support append mode (yet) or rsyslog is trying to create/open the files 
in a manner not compliant with HDFS.  No surprise, as we all know HDFS 
can be somewhat "special" at times ;-)  It doesn't really matter
anyway... Trying to "tail -f" a file on an HDFS FUSE mount is rather
useless.  The data only reaches the tail command once a full block
(64MB, or whatever block size you use) has been written to the file.
One would only be able to use "hadoop fs -tail -f /path/to/log", which
has its own issues, mentioned previously.

Option #3 would definitely work.  However, now I'm storing my logs
twice: once on a local filesystem and again in HDFS.  It works, but
it's not ideal, as it wastes space.  And as you've probably noticed
from this email so far, I'd prefer the *ideal* solution :-)

*Note*:  Astute flumers would probably look at option #3 and recommend 
making use of the RollingFileSink in addition to the HDFSSink.  
Unfortunately, the RollingFileSink doesn't support templated/dynamic
directory creation the way the HDFSSink does with its hdfs.path setting.

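For comparison, a RollingFileSink (file_roll) configuration only takes a
flat target directory, which is exactly the limitation (names are again
placeholders):

    agent.sinks.file-sink.type = file_roll
    agent.sinks.file-sink.channel = mem-channel
    agent.sinks.file-sink.sink.directory = /var/log/flume
    agent.sinks.file-sink.sink.rollInterval = 3600

There is no equivalent of the HDFS Sink's %Y/%m/%d or %{host} escapes for
sink.directory, so everything ends up in a single directory.
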
So what exactly am I asking here?  Well, I'd like to know first how 
others are doing this.  A hybrid of rsyslog and Flume?  All and only 
Flume?  With custom serializers/interceptors/sinks?  Or perhaps... how 
would you recommend I handle this?

Thanks for any and all thoughts you can provide.

Josh West
Lead Systems Administrator
One.com, jsw@one.com
