flume-user mailing list archives

From Israel Ekpo <isr...@aicer.org>
Subject Re: Archive Task Logs (Stdout, Stderr, Syslogs) and Job Tracker logs of a Hadoop Cluster for later analysis
Date Mon, 08 Apr 2013 17:41:04 GMT
Christian,

From your comments, it seems Flume will be the right tool for the task.

The SpoolingDirectorySource would be a great choice for your task, since the
log data has already been generated.

However, the Spooling Directory Source requires that the files be immutable.

This means that once a file is created or dropped in the spooling directory,
it cannot be modified.

Consequently, you will not be able to simply use the current log directory,
where the log files are continuously being appended to.

I would recommend that you set aside a separate spooling directory for Flume,
and then set up a cron job or scheduled task that periodically drops the logs
into that directory after traversing the symlinks and recursively processing
the log directories.
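
For illustration, here is a rough sketch of such a cron entry. The paths, the
hourly schedule, and the 60-minute age threshold are all assumptions, not
anything from your setup; a real setup would also need some way to avoid
re-copying files it has already shipped (for example, by moving rotated logs
instead of copying them):

    # Hypothetical crontab entry: once an hour, copy log files that have not
    # been modified for at least 60 minutes (so they are likely closed) into
    # the Flume spooling directory. -L makes find follow symlinks; flattened
    # names may collide across task directories, so adjust to your layout.
    0 * * * * find -L /var/log/hadoop-0.20-mapreduce/userlogs -type f -mmin +60 -exec cp {} /var/spool/flume/logs/ \;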

The SpoolingDirectorySource currently does not recursively traverse the
spooled folders.

It assumes that all the files you plan to consume are in the root folder
you are spooling.
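
As a starting point, a minimal Spooling Directory Source definition looks
something like this (the agent name, component names, and spool path are
placeholders for the sketch):

    # Hypothetical agent 'agent1' reading immutable files from a spool dir.
    agent1.sources = spool-src
    agent1.sources.spool-src.type = spooldir
    agent1.sources.spool-src.spoolDir = /var/spool/flume/logs
    agent1.sources.spool-src.channels = file-ch
    # Fully ingested files are renamed with this suffix so they are not re-read.
    agent1.sources.spool-src.fileSuffix = .COMPLETED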

Use the FileChannel as your channel, as it is more reliable: events are
persisted to disk and survive an agent restart.
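
A matching FileChannel definition might look like this (the checkpoint and
data directories are placeholders and must be writable by the Flume agent):

    # Events are staged durably on local disk between the source and the sink.
    agent1.channels = file-ch
    agent1.channels.file-ch.type = file
    agent1.channels.file-ch.checkpointDir = /var/lib/flume/checkpoint
    agent1.channels.file-ch.dataDirs = /var/lib/flume/data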

Depending on the type of analysis you want to conduct, the
ElasticSearchSink might be a good choice for your sink.
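
If you go that route, wiring the ElasticSearchSink into the same agent would
look roughly like this (the host, cluster name, and index settings are all
assumptions for the sketch; check the user guide for the options your Flume
version supports):

    # Ship events from the file channel into an Elasticsearch cluster.
    agent1.sinks = es-sink
    agent1.sinks.es-sink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
    agent1.sinks.es-sink.hostNames = es-host-1:9300
    agent1.sinks.es-sink.clusterName = elasticsearch
    agent1.sinks.es-sink.indexName = hadoop-logs
    agent1.sinks.es-sink.channel = file-ch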

Feel free to review the user guide for other options for Sinks.

http://flume.apache.org/FlumeUserGuide.html

You can also set up your own custom sink if you have other centralized
datastores in mind.

Spend some time to go through the user guide and developer guide so that
you can get a better understanding of the architecture and use cases.

http://flume.apache.org/FlumeUserGuide.html

http://flume.apache.org/FlumeDeveloperGuide.html


On 8 April 2013 10:33, Christian Schneider <cschneiderpublic@gmail.com> wrote:

> Hi,
> I need to collect log data from our Cluster.
>
> For this I think I need to copy the Contents of:
> * JobTracker: /var/log/hadoop-0.20-mapreduce/history/
> * TaskTracker: /var/log/hadoop-0.20-mapreduce/userlogs/
>
> It should also follow symlinks and copy recursively.
>
> Is flume the right tool to do this?
>
> E.g. with the "Spooling Directory Source"?
>
> Best Regards,
> Christian.
>
