flume-user mailing list archives

From Josh West <...@one.com>
Subject Re: Syslog Infrastructure with Flume
Date Tue, 30 Oct 2012 09:42:47 GMT
Hi Ron,

Yep -- looks like we'll be storing logs twice, for the time being. But 
we're *so* close to not having to!

On 10/26/2012 11:06 PM, Ron Thielen wrote:
> I am exactly where you are with this, except that I have not had time 
> to write a serializer to address the Hostname/Timestamp issue. 
> Questions about using Flume in this manner recur on a regular basis, 
> so it seems to be a common use case.
> Sorry I cannot offer a solution since I am in your shoes at the 
> moment, unfortunately looking at storing logs twice.
> Ron Thielen
> Ronald J Thielen
> *From:* Josh West [mailto:jsw@one.com]
> *Sent:* Friday, October 26, 2012 9:05 AM
> *To:* user@flume.apache.org
> *Subject:* Syslog Infrastructure with Flume
> Hey folks,
> I've been experimenting with Flume for a few weeks now, trying to 
> determine an approach to designing a reliable, highly available, 
> scalable system to store logs from various sources, including syslog.  
> Ideally, this system will meet the following requirements:
>  1. Logs from syslog across all servers make their way into HDFS.
>  2. Logs are stored in HDFS in a manner that is available for
>     post-processing:
>       * Example:  HIVE partitions - with HDFS Flume Sink, can set
>         hdfs.path to
>         hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
>       * Example:  Custom map reduce jobs...
>  3. Logs are stored in HDFS in a manner that is available for
>     "reading" by sysadmins:
>       * During troubleshooting/firefighting, it is quite helpful to be
>         able to log in to a central logging system and tail -f / grep logs.
>       * We need to be able to see the logs "live".
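
For anyone unfamiliar with the %{...} escapes in hdfs.path above, here is a minimal Python sketch of how such a template could expand from event headers. It is only an illustration and not Flume's own code: the helper name expand_path, the example header values, and the choice to turn a missing header into an empty string are all assumptions.

```python
import re

# Matches %{name} header escapes in a path template.
HEADER_ESCAPE = re.compile(r"%\{([^}]+)\}")


def expand_path(template, headers):
    """Replace each %{name} with the matching event header value.

    Hypothetical helper for illustration only. A missing header becomes
    an empty string here, which is an assumption of this sketch.
    """
    return HEADER_ESCAPE.sub(lambda m: headers.get(m.group(1), ""), template)


template = "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}"
event_headers = {"host": "web01", "Facility": "3"}  # made-up example values
print(expand_path(template, event_headers))
```

Each distinct host/facility pair ends up in its own directory, which is what makes the layout usable as Hive partitions.
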
> Some folks may be wondering why are we choosing Flume for syslog, 
> instead of something like Graylog2 or Logstash?  The answer is we will 
> be using Flume + Hadoop for the transport and processing of other 
> types of data in addition to syslog. For example, webserver access 
> logs for post processing and statistical analysis.  So, we would like 
> to make the most use of the Hadoop cluster, keeping all logs of all 
> types in one redundant/scalable solution.  Additionally, by keeping 
> both syslog and webserver access logs in Hadoop/HDFS, we can begin to 
> correlate events.
> I've run into some snags while attempting to implement Flume in a 
> manner that satisfies the requirements listed at the top of this message:
>  1. Logs to HDFS:
>       * I can indeed use the Flume HDFS Sink to reliably write logs
>         into HDFS.
>       * Needed to write custom serializer to add Hostname and
>         Timestamp fields back to syslog messages.
>       * See: https://issues.apache.org/jira/browse/FLUME-1666
>  2. Logs to HDFS in manner available for
>     reading/firefighting/troubleshooting by sysadmins:
>       * Flume HDFS Sink uses the BucketWriter for recording flume
>         events to HDFS.
>       * Creates data files like:
>         /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
>       * Each file name has the form FlumeData (or a custom prefix),
>         followed by ".", followed by the Unix timestamp (in
>         milliseconds) of when the file was created.
>           o This is somewhat necessary: several Flume writers may
>             write to the same HDFS, and a file cannot be opened by
>             more than one writer, so each writer should write to
>             its own file.
>       * Latest file, currently being written to, is suffixed with ".tmp".
>       * This approach is not very sysadmin-friendly....
>           o You have to find the latest (i.e. the .tmp files) and
>             hadoop fs -tail -f /path/to/file.tmp
>           o Hadoop's fs -tail -f command first prints the entire
>             file's contents, then begins tailing.
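
To make the "find the latest .tmp file" step concrete, here is a rough Python sketch that picks the in-progress file out of a directory listing. It assumes the naming scheme described above (prefix, ".", creation timestamp, optional ".tmp"). The helper name latest_in_progress is hypothetical, and apart from the first entry the file names in the listing are invented for the example.

```python
import re

# <prefix>.<millis> with an optional .tmp suffix, as described above.
NAME = re.compile(r"^(?P<prefix>.+)\.(?P<millis>\d+)(?P<tmp>\.tmp)?$")


def latest_in_progress(names):
    """Return the newest file name ending in .tmp, or None if there is none."""
    candidates = []
    for name in names:
        m = NAME.match(name)
        if m and m.group("tmp"):
            # Sort key is the creation timestamp embedded in the name.
            candidates.append((int(m.group("millis")), name))
    return max(candidates)[1] if candidates else None


listing = [
    "FlumeData.1350997160213",
    "FlumeData.1350997460871.tmp",  # invented example name
    "FlumeData.1350997290450",      # invented example name
]
print(latest_in_progress(listing))
```

The result is the name one would hand to hadoop fs -tail -f. It does nothing about that command printing the whole file first.
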
> So the sum of it all is Flume is awesome for getting syslog (and 
> other) data into HDFS for post processing, but not the best at getting 
> it into HDFS in a sysadmin troubleshooting/firefighting format.  In an 
> ideal world, I have syslog data coming into Flume via one transport 
> (i.e. SyslogTcp Source or SyslogUDP Source) and being written into 
> HDFS in a manner that is both post-processable and sysadmin-friendly, 
> but it looks like this isn't going to happen.
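
On the "sysadmin-friendly format" point: the custom serializer mentioned earlier essentially puts the hostname and timestamp back in front of each message body. A minimal Python sketch of that logic follows. A real Flume serializer would be written in Java, so this is only an illustration. The function name format_line is hypothetical, and it assumes the event carries a "timestamp" header in milliseconds since the epoch and a "host" header with the sender's hostname.

```python
from datetime import datetime, timezone


def format_line(headers, body):
    """Rebuild a syslog-style line: '<timestamp> <host> <message>'.

    Assumes headers["timestamp"] is milliseconds since the epoch and
    headers["host"] is the sender's hostname (assumptions of this sketch).
    """
    millis = int(headers["timestamp"])
    stamp = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return "{} {} {}".format(
        stamp.strftime("%Y-%m-%dT%H:%M:%SZ"), headers["host"], body
    )


# Made-up example event, not real log data.
print(format_line({"timestamp": "1350997160213", "host": "web01"},
                  "sshd[123]: session opened"))
```

With those two fields restored, each line is self-describing again, so grep on a file pulled out of HDFS behaves much like grep on a classic syslog file.
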
> I've thus investigated some alternative approaches to meet the 
> requirements.  One of these approaches is to have all of my servers 
> send their syslog messages to a central box running rsyslog.  Then, 
> rsyslog would perform one of the following actions:
>  1. Write logs to HDFS directly using 'omhdfs' module, in a format
>     that is both post-processable and sysadmin-friendly :-)
>  2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which
>     has HDFS mounted as a filesystem.
>  3. Write logs to a local filesystem and also replicate logs into a
>     flume agent, configured with a SyslogSource and HDFS sink.
> Option #1 sounds great.  But unfortunately the 'omhdfs' module for 
> rsyslog isn't working very well.  I've gotten it to log in to 
> Hadoop/HDFS but it has issues creating/appending files.  Additionally, 
> templating is somewhat suspect (i.e. making directories 
> /syslog/someserver/somefacility dynamically).
> Option #2 sounds reasonable, but either the HDFS FUSE module doesn't 
> support append mode (yet) or rsyslog is trying to create/open the 
> files in a manner not compliant with HDFS.  No surprise, as we all 
> know HDFS can be somewhat "special" at times ;-)  It's actually no 
> matter anyways... Trying to "tail -f" a file mounted via HDFS FUSE is 
> rather useless.  The data only reaches the tail command once a full 
> block (64MB, or whatever block size you use) has been written to the 
> file.  One would only be able to use "hadoop fs -tail -f 
> /path/to/log", which has its own issues mentioned previously.
> Option #3 would definitely work.  However, now I'm storing my logs 
> twice.  Once on some local filesystem and another time in HDFS.  It 
> works, but it's not ideal as it's a waste of space. And you've probably 
> noticed from this email so far, I'd prefer the *ideal* solution :-)
> *Note*:  Astute flumers would probably look at option #3 and recommend 
> making use of the RollingFileSink in addition to the HDFSSink.  
> Unfortunately, the RollingFileSink doesn't support templated/dynamic 
> directory creation like the HDFSSink with its hdfs.path setting of 
> "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
> So what exactly am I asking here?  Well, I'd like to know first how 
> others are doing this.  A hybrid of rsyslog and Flume?  All and only 
> Flume?  With custom serializers/interceptors/sinks?  Or perhaps... how 
> would you recommend I handle this?
> Thanks for any and all thoughts you can provide.
> -- 
> Josh West
> Lead Systems Administrator
> One.com, jsw@one.com

Josh West
Lead Systems Administrator
One.com, jsw@one.com
