flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: Syslog Infrastructure with Flume
Date Wed, 31 Oct 2012 19:22:23 GMT
Roshan,

I am not a Hive/HCatalog pro, but I am wondering: why an HCatalog sink rather than a Hive sink?
Hive is definitely very popular, and a Hive Sink would be well appreciated if we could get one written.
Since HCatalog is supposed to be compatible with the Hive metastore (right?), why not just implement
a Hive sink and make it available to the larger community? I'd definitely like to see a Hive Sink,
and I would prioritize that first; then, if explicitly required, add HCatalog support to it - that way
it is useful to people who use Hive as well. In fact, there already is a Hive Sink jira here:
https://issues.apache.org/jira/browse/FLUME-1008.

I am a +1 for a Hive Sink, so please take a look.


Thanks,
Hari


-- 
Hari Shreedharan


On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:

> I am in the process of investigating the possibility of creating an HCatalog sink for
> Flume which should be able to handle such use cases. For your use case it could be thought
> of as a Hive sink. The goal is basically as follows: it would allow multiple Flume agents to
> pump logs into Hive tables. That would make the data query-able without additional manual
> steps. Data would get added periodically in the form of new partitions to Hive. You would not
> have to deal with temporary files or manual addition of data into Hive.
> 
> -roshan
> 
> 
> 
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <ralph.goers@dslextreme.com> wrote:
> > Since you ask...
> > 
> > In our environment our primary concern is audit logs - we have to audit banking
> > transactions as well as changes administrators make. We had a legacy system that needed to
> > be integrated, with records in a form different from what we want stored. We also need
> > to allow administrators to view events as close to real time as possible. Plus we have to
> > aggregate data across two data centers. Although we are currently not including web server access
> > logs, we plan to integrate them over time. We also have requirements from our security
> > team to pass events to ArcSight for their use.
> > 
> > 1. We have a "log extractor" that receives legacy events as they occur and converts
them into our new format and passes them to Flume. All new applications use the Log4j 2 Flume
Appender to get data to Flume. 
> > 2. Flume passes the data to ArcSight for our security team's use.
> > 3. We wrote a Flume to Cassandra Sink.
> > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > 5. Since we are using DataStax Enterprise version of Cassandra we have also set
up "Analytic" nodes that run Hadoop on top of Cassandra. This allows the data to be accessed
via normal Hadoop tools for data analytics.
> > 6. We have written our own reporting UI component in our Administrative Platform
> > to allow administrators to view activities in real time or to schedule background data collection
> > so users can post-process the data on their own.
> > 
> > We do not have anything to allow an admin to "tail" the log but it wouldn't be hard
at all to write an application to accept Flume events via Avro and display the last "n" events
as they arrive. 
> > 
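> > (For what it's worth, here is a bare-bones sketch of that idea - class name, port, and output
> > format are all made up - reusing Flume's generated Avro protocol and Avro's Netty server, roughly
> > the same pieces Flume's own AvroSource is built on. Point an Avro sink at it and it prints event
> > bodies as they arrive:)
> > 
> >   import java.net.InetSocketAddress;
> >   import java.util.List;
> > 
> >   import org.apache.avro.ipc.NettyServer;
> >   import org.apache.avro.ipc.Server;
> >   import org.apache.avro.ipc.specific.SpecificResponder;
> >   import org.apache.flume.source.avro.AvroFlumeEvent;
> >   import org.apache.flume.source.avro.AvroSourceProtocol;
> >   import org.apache.flume.source.avro.Status;
> > 
> >   // Hypothetical "live tail" receiver for Flume Avro events.
> >   public class LiveTail implements AvroSourceProtocol {
> > 
> >     @Override
> >     public Status append(AvroFlumeEvent event) {
> >       // Copy the body out of the ByteBuffer and print it.
> >       byte[] body = new byte[event.getBody().remaining()];
> >       event.getBody().get(body);
> >       System.out.println(new String(body));
> >       return Status.OK;
> >     }
> > 
> >     @Override
> >     public Status appendBatch(List<AvroFlumeEvent> events) {
> >       for (AvroFlumeEvent e : events) {
> >         append(e);
> >       }
> >       return Status.OK;
> >     }
> > 
> >     public static void main(String[] args) {
> >       // Listen on an arbitrary port (4545 here) for connections from an Avro sink.
> >       Server server = new NettyServer(
> >           new SpecificResponder(AvroSourceProtocol.class, new LiveTail()),
> >           new InetSocketAddress(4545));
> >       server.start();
> >     }
> >   }
> > 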
> > One thing I should point out: we format our events in accordance with RFC 5424 and
> > store that in the Flume event body. We then store all the individual pieces of audit event
> > data in Flume header fields. The RFC 5424 message is what we send to ArcSight. The event
> > fields and the compressed body are all stored in individual columns in Cassandra.
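> > 
> > (For readers who haven't looked at RFC 5424: a body formatted that way looks roughly like the
> > line below - every value here is invented, just to show the shape of priority, timestamp, host,
> > app name, structured data, and message:)
> > 
> >   <86>1 2012-10-31T19:22:23.003Z app01.example.com audit-svc 2216 AUDIT [audit@18060 user="jdoe" action="wireTransfer" result="approved"] Wire transfer approved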
> > 
> > Ralph
> > 
> > 
> > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > I am exactly where you are with this, except that I have not had time to write
> > > a serializer to address the Hostname/Timestamp issue. Questions about the use of Flume
> > > in this manner seem to recur on a regular basis, so it seems to be a common use case.
> > >  
> > > Sorry I cannot offer a solution since I am in your shoes at the moment, unfortunately
looking at storing logs twice.
> > >  
> > > Ron Thielen
> > >  
> > > 
> > >  
> > > From: Josh West [mailto:jsw@one.com] 
> > > Sent: Friday, October 26, 2012 9:05 AM
> > > To: user@flume.apache.org
> > > Subject: Syslog Infrastructure with Flume 
> > >  
> > > Hey folks,
> > > 
> > > I've been experimenting with Flume for a few weeks now, trying to determine
an approach to designing a reliable, highly available, scalable system to store logs from
various sources, including syslog.  Ideally, this system will meet the following requirements:

> > > - Logs from syslog across all servers make their way into HDFS.
> > > - Logs are stored in HDFS in a manner that is available for post-processing:
> > >   - Example: Hive partitions - with the HDFS Flume Sink, hdfs.path can be set to hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility} (see the sample agent configuration after this list).
> > >   - Example: Custom map reduce jobs...
> > > - Logs are stored in HDFS in a manner that is available for "reading" by sysadmins:
> > >   - During troubleshooting/firefighting, it is quite helpful to be able to log in
> > >     to a central logging system and tail -f / grep logs.
> > >   - We need to be able to see the logs "live".
> > > 
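> > > As a concrete reference, a minimal agent configuration along the lines of the hdfs.path
> > > example above might look something like the following. (Agent/component names, ports, and
> > > roll settings are made up; the %{host} and %{Facility} escapes assume the headers populated
> > > by the syslog source, as in the example above.)
> > > 
> > >   # Hypothetical agent: syslog in, templated HDFS paths out.
> > >   agent.sources = syslog
> > >   agent.channels = mem
> > >   agent.sinks = hdfs
> > > 
> > >   agent.sources.syslog.type = syslogtcp
> > >   agent.sources.syslog.host = 0.0.0.0
> > >   agent.sources.syslog.port = 5140
> > >   agent.sources.syslog.channels = mem
> > > 
> > >   agent.channels.mem.type = memory
> > >   agent.channels.mem.capacity = 10000
> > > 
> > >   agent.sinks.hdfs.type = hdfs
> > >   agent.sinks.hdfs.channel = mem
> > >   agent.sinks.hdfs.hdfs.path = hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
> > >   # Write plain text rather than SequenceFiles so the files stay greppable.
> > >   agent.sinks.hdfs.hdfs.fileType = DataStream
> > >   agent.sinks.hdfs.hdfs.rollInterval = 300
> > > 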
> > > Some folks may be wondering why we are choosing Flume for syslog instead of
> > > something like Graylog2 or Logstash. The answer is that we will be using Flume + Hadoop for the
> > > transport and processing of other types of data in addition to syslog - for example, webserver
> > > access logs for post-processing and statistical analysis. So, we would like to make the most
> > > use of the Hadoop cluster, keeping all logs of all types in one redundant/scalable solution.
> > > Additionally, by keeping both syslog and webserver access logs in Hadoop/HDFS, we can begin
> > > to correlate events.
> > > 
> > > 
> > > I've run into some snags while attempting to implement Flume in a manner that
> > > satisfies the requirements listed at the top of this message:
> > > 
> > > - Logs to HDFS:
> > >   - I can indeed use the Flume HDFS Sink to reliably write logs into HDFS.
> > >   - I needed to write a custom serializer to add the Hostname and Timestamp fields back
> > >     to syslog messages (a rough sketch follows below).
> > >   - See: https://issues.apache.org/jira/browse/FLUME-1666
> > > 
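> > > For the curious, here is a rough sketch of the kind of serializer meant here. This is not
> > > the actual FLUME-1666 patch; the class/package name, the "host" and "timestamp" header names,
> > > and the output layout are all assumptions.
> > > 
> > >   import java.io.IOException;
> > >   import java.io.OutputStream;
> > >   import java.util.Map;
> > > 
> > >   import org.apache.flume.Context;
> > >   import org.apache.flume.Event;
> > >   import org.apache.flume.serialization.EventSerializer;
> > > 
> > >   // Writes "<timestamp> <host> <body>" per event so the on-disk file reads like syslog again.
> > >   public class SyslogHeaderSerializer implements EventSerializer {
> > > 
> > >     private final OutputStream out;
> > > 
> > >     private SyslogHeaderSerializer(Context context, OutputStream out) {
> > >       this.out = out;
> > >     }
> > > 
> > >     @Override public void afterCreate() throws IOException { }
> > >     @Override public void afterReopen() throws IOException { }
> > > 
> > >     @Override
> > >     public void write(Event event) throws IOException {
> > >       Map<String, String> headers = event.getHeaders();
> > >       // Re-attach the fields the syslog source pulled out into headers.
> > >       String prefix = headers.get("timestamp") + " " + headers.get("host") + " ";
> > >       out.write(prefix.getBytes("UTF-8"));
> > >       out.write(event.getBody());
> > >       out.write('\n');
> > >     }
> > > 
> > >     @Override public void flush() throws IOException { out.flush(); }
> > >     @Override public void beforeClose() throws IOException { }
> > >     @Override public boolean supportsReopen() { return false; }
> > > 
> > >     // Wired up via the sink config, e.g. serializer = com.example.SyslogHeaderSerializer$Builder
> > >     public static class Builder implements EventSerializer.Builder {
> > >       @Override
> > >       public EventSerializer build(Context context, OutputStream out) {
> > >         return new SyslogHeaderSerializer(context, out);
> > >       }
> > >     }
> > >   }
> > > 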
> > > - Logs to HDFS in a manner available for reading/firefighting/troubleshooting by sysadmins:
> > >   - The Flume HDFS Sink uses the BucketWriter for recording Flume events to HDFS.
> > >   - It creates data files like: /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
> > >   - Each file name is the FlumeData prefix (or a custom prefix) followed by "." followed
> > >     by the unix timestamp of when the file was created.
> > >   - This is somewhat necessary: since you can have multiple Flume writers writing to the
> > >     same HDFS, and the files cannot be opened by more than one writer, each writer should
> > >     write to its own file.
> > >   - The latest file, currently being written to, is suffixed with ".tmp".
> > >   - This approach is not very sysadmin-friendly:
> > >     - You have to find the latest files (i.e. the .tmp files) and hadoop fs -tail -f /path/to/file.tmp
> > >       (see the example below).
> > >     - Hadoop's fs -tail -f command first prints the entire file's contents, then
> > >       begins tailing.
> > > 
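> > > In practice that workflow ends up being something like the following (the server/facility
> > > values are just placeholders matching the layout above):
> > > 
> > >   hadoop fs -ls /flume/syslog/server=web01/facility=auth | grep '\.tmp$'
> > >   hadoop fs -tail -f /flume/syslog/server=web01/facility=auth/FlumeData.1350997160213.tmp
> > > 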
> > > 
> > > So the sum of it all is: Flume is awesome for getting syslog (and other) data
> > > into HDFS for post-processing, but not the best at getting it into HDFS in a sysadmin
> > > troubleshooting/firefighting format. In an ideal world, I would have syslog data coming into
> > > Flume via one transport (i.e. the SyslogTcp Source or SyslogUDP Source) and being written into
> > > HDFS in a manner that is both post-processable and sysadmin-friendly, but it looks like this
> > > isn't going to happen.
> > > 
> > > 
> > > I've thus investigated some alternative approaches to meet the requirements.
> > > One of these approaches is to have all of my servers send their syslog messages to a central
> > > box running rsyslog. Then, rsyslog would perform one of the following actions:
> > > 
> > > 1. Write logs to HDFS directly using the 'omhdfs' module, in a format that is both
> > >    post-processable and sysadmin-friendly :-)
> > > 2. Write logs to HDFS directly using the 'hadoop-fuse-dfs' utility, which has HDFS
> > >    mounted as a filesystem.
> > > 3. Write logs to a local filesystem and also replicate logs into a Flume agent,
> > >    configured with a Syslog source and HDFS sink.
> > > 
> > > 
> > > Option #1 sounds great. But unfortunately the 'omhdfs' module for rsyslog
> > > isn't working very well. I've gotten it to log in to Hadoop/HDFS, but it has issues
> > > creating/appending files. Additionally, templating is somewhat suspect (i.e. making
> > > directories like /syslog/someserver/somefacility dynamically).
> > > 
> > > 
> > > Option #2 sounds reasonable, but either the HDFS FUSE module doesn't support
> > > append mode (yet) or rsyslog is trying to create/open the files in a manner not compliant
> > > with HDFS. No surprise, as we all know HDFS can be somewhat "special" at times ;-)  It
> > > doesn't really matter anyway: trying to "tail -f" a file mounted via HDFS FUSE is rather
> > > useless. The data is only fed to the tail command once a full block (64MB, or whatever size
> > > you use) has been written to the file. One would only be able to use "hadoop fs
> > > -tail -f /path/to/log", which has its own issues mentioned previously.
> > > 
> > > 
> > > Option #3 would definitely work. However, now I'm storing my logs twice:
> > > once on some local filesystem and again in HDFS. It works, but it's not ideal as it's
> > > a waste of space. And as you've probably noticed from this email so far, I'd prefer the ideal
> > > solution :-)
> > > 
> > > 
> > > Note:  Astute flumers would probably look at option #3 and recommend making
use of the RollingFileSink in addition to the HDFSSink.  Unfortunately, the RollingFileSink
doesn't support templated/dynamic directory creation like the HDFSSink with its hdfs.path
setting of "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
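> > > 
> > > For completeness, the setup that note alludes to - one syslog source replicated into two
> > > channels, one drained by the HDFS sink and one by a file_roll (rolling file) sink - would look
> > > roughly like the sketch below (names made up; note that sink.directory is a static path, which
> > > is exactly the templating limitation described above):
> > > 
> > >   agent.sources = syslog
> > >   agent.channels = hdfsChan fileChan
> > >   agent.sinks = hdfsSink fileSink
> > > 
> > >   agent.sources.syslog.type = syslogtcp
> > >   agent.sources.syslog.port = 5140
> > >   agent.sources.syslog.channels = hdfsChan fileChan
> > >   # Replicating is the default selector; shown here for clarity.
> > >   agent.sources.syslog.selector.type = replicating
> > > 
> > >   agent.channels.hdfsChan.type = memory
> > >   agent.channels.fileChan.type = memory
> > > 
> > >   agent.sinks.hdfsSink.type = hdfs
> > >   agent.sinks.hdfsSink.channel = hdfsChan
> > >   agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
> > > 
> > >   agent.sinks.fileSink.type = file_roll
> > >   agent.sinks.fileSink.channel = fileChan
> > >   # One flat directory - no %{host}/%{Facility} templating here.
> > >   agent.sinks.fileSink.sink.directory = /var/log/flume
> > >   agent.sinks.fileSink.sink.rollInterval = 3600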
> > > 
> > > 
> > > So what exactly am I asking here?  Well, I'd like to know first how others
are doing this.  A hybrid of rsyslog and Flume?  All and only Flume?  With custom serializers/interceptors/sinks?
 Or perhaps... how would you recommend I handle this?
> > > 
> > > 
> > > Thanks for any and all thoughts you can provide.
> > > 
> > > 
> > >  
> > > 
> > > -- 
> > > Josh West
> > > Lead Systems Administrator
> > > One.com, jsw@one.com
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 

