Take a look at logtail2. It keeps a bookmark in an offset file so that it can resume where it left off on the previous run.

It's available in the Debian repository, in the logcheck package.
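
To make the bookmark idea concrete, here is a minimal Python sketch of the same technique. It is not logtail2's actual code: the function name read_new_lines and the "inode offset" bookmark format are made up for illustration. It also assumes the log is UTF-8 and is only ever appended to, truncated, or replaced, and it does not try to recover lines left behind in a rotated file.

```python
import os


def read_new_lines(log_path, offset_path):
    """Return complete lines appended to log_path since the last call.

    The bookmark file (offset_path) holds "<inode> <byte offset>".
    Both the name and the format are illustrative assumptions.
    """
    # Load the saved inode and byte offset, if a bookmark exists.
    inode, offset = None, 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            parts = f.read().split()
        if len(parts) == 2:
            inode, offset = int(parts[0]), int(parts[1])

    st = os.stat(log_path)
    # Start from the beginning if the file was replaced (different inode)
    # or truncated (now shorter than the saved offset).
    if inode != st.st_ino or st.st_size < offset:
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only complete lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()

    # Persist the new bookmark so the next run resumes from here.
    with open(offset_path, "w") as f:
        f.write(f"{st.st_ino} {offset + end}\n")
    return lines
```

Each call returns only what was added since the previous call, so running it periodically and forwarding the result avoids re-sending old lines. If the process dies after reading but before the bookmark is written, the next run will re-read those lines, so duplicates are still possible on failure.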


On Sat, Jul 28, 2012 at 10:18 AM, Brock Noland <brock@cloudera.com> wrote:

Yes, if you use tail, you will eventually both lose data and get duplicates. It's better to send the events to Flume from the application generating them. Flume has a Java client that can do this, as well as a log4j appender.


On Fri, Jul 27, 2012 at 11:20 PM, Jagadish Bihani <jagadish.bihani@pubmatic.com> wrote:

In Flume-ng, is there any way to use exec (tail -F) as the source and get
only the new lines that are being added to the log file?
(That is, there is a growing log file and we want to transfer all of its logs using Flume
without duplicating any of them.)

I understand that if something fails we will have duplicates, because tail doesn't maintain state.
But we are not considering failover as of now.

So I think "tail -F" is useful only in scenarios where the sink or an intermediate
agent can remove duplicates. Is that correct?

But as tail looks like quite a popular source in Flume, I thought I might be missing something.

Presently, using "tail -F <file>" as the source to read from the log file leads to
scenarios like these:

1. If the file has not changed for a while, tail still checks it every
second and then prints the same lines again (how many depends on the -n option).
2. Even when the file grows, tail gives us little control over which lines we get.

