flume-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: how spooling directory source identifies the complete file
Date Wed, 23 Jul 2014 05:29:50 GMT
This is specified in Flume's User Guide:

"Unlike the Exec source, this source is reliable and will not miss data,
even if Flume is restarted or killed. In exchange for this reliability,
only immutable, uniquely-named files must be dropped into the spooling
directory. Flume tries to detect these problem conditions and will fail
loudly if they are violated:

   1. If a file is written to after being placed into the spooling
   directory, Flume will print an error to its log file and stop processing.
   2. If a file name is reused at a later time, Flume will print an error
   to its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier
(such as a timestamp) to log file names when they are moved into the
spooling directory."
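
One way to satisfy the "immutable, uniquely-named" requirement is to do the slow copy somewhere Flume is not watching and then rename the finished file into the spooling directory. The following is only a minimal sketch under stated assumptions, not something taken from the Flume docs: the function name publish_to_spool and the staging directory are hypothetical, and the staging directory is assumed to be on the same filesystem as the spooling directory so that the rename is atomic on POSIX systems.

```python
import os
import shutil
import time
import uuid


def publish_to_spool(src_path, staging_dir, spool_dir):
    """Copy src_path into staging_dir, then rename it into spool_dir.

    Hypothetical helper: staging_dir must NOT be watched by Flume and is
    assumed to live on the same filesystem as spool_dir.
    """
    base = os.path.basename(src_path)
    # Timestamp plus a short random suffix so a file name is never reused.
    unique_name = "{}.{}.{}".format(
        base, int(time.time() * 1000), uuid.uuid4().hex[:8]
    )
    staged = os.path.join(staging_dir, unique_name)
    final = os.path.join(spool_dir, unique_name)

    # The slow part (e.g. a 1 GB copy) happens outside the spooling directory.
    shutil.copyfile(src_path, staged)

    # os.rename is atomic on POSIX within one filesystem, so the file shows up
    # in spool_dir only once it is complete. Across filesystems it raises
    # OSError instead of silently falling back to a copy.
    os.rename(staged, final)
    return final
```

With this pattern the file appears in the spooling directory in a single step, already complete, and is never written to afterwards.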

On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <saran0081986@gmail.com> wrote:

> Hi Jeff,
> Thanks for your comments. What I am really looking for is this: suppose we
> are copying a 1 GB file into the spool directory and the copy is still in
> progress. How does Flume recognize that the complete file has been copied
> into the spool directory and is ready for processing?
> How does Flume make sure it doesn't start processing a partially copied file?
> On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jlord@cloudera.com> wrote:
>> I believe the way this works is that flume creates a meta directory to
>> track which file is being read.
>> In the event of a restart of the agent the entire file will be re-read
>> which will create some duplicate events.
>> https://github.com/apache/flume/blob/flume-1.5/flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java#L474
>> On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <saran0081986@gmail.com> wrote:
>>> Hi,
>>> I am planning to use the spooling directory source to move log files to an HDFS sink.
>>> I would like to know how Flume identifies whether a file being moved into the
>>> spool directory is complete or partial, with the move still in progress.
>>> Suppose a file is large and we have started moving it to the spooling
>>> directory: how does Flume identify whether the complete file has been
>>> transferred or the transfer is still in progress?
>>> Please help me out here.
>>> Thanks,
>>> saravana


Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
