flume-user mailing list archives

From "Needham, Guy" <Guy.Need...@virginmedia.co.uk>
Subject RE: how spooling directory source identifies the complete file
Date Wed, 23 Jul 2014 07:46:09 GMT
Hi Saravana,

Flume checks the file's size and last-modified time when it starts reading the file and
again when it has finished reading it. If the two sets of values differ, Flume will fail
noisily. This means that you must move a fully written file into the directory, or it will
not be ingested into your workflow. If you're running on a Unix system, don't use cp to drop
the file into the directory: cp writes the file incrementally, whereas mv places it in one
go (provided the source and destination are on the same filesystem, so that mv is a plain
rename).
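
A minimal sketch of the "write elsewhere, then move" approach described above, in Python. It is an illustration under stated assumptions, not anything Flume ships: the function name `deliver_to_spool` and the sibling `staging` directory are hypothetical, and the sketch assumes the staging directory sits on the same filesystem as the spool directory so that the rename is atomic.

```python
import os
import tempfile


def deliver_to_spool(data: bytes, spool_dir: str, final_name: str) -> str:
    """Write data outside the spool directory, then rename it into place."""
    spool_dir = os.path.abspath(spool_dir)
    # Hypothetical layout: stage in a sibling directory so both paths are
    # (normally) on one filesystem; a rename is only atomic in that case.
    staging_dir = os.path.join(os.path.dirname(spool_dir), "staging")
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(spool_dir, exist_ok=True)

    # All incremental writes happen here, where Flume is not watching.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    final_path = os.path.join(spool_dir, final_name)
    # Refuse to reuse a name that is already present in the spool directory.
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    # The complete file appears in the spool directory in a single step.
    os.rename(tmp_path, final_path)
    return final_path
```

With this pattern the file in the spool directory never changes size or modification time after it appears, which is the condition the check above relies on.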

Guy Needham | Data Discovery
Virgin Media | Enterprise Data, Design & Management

From: SaravanaKumar TR [mailto:saran0081986@gmail.com]
Sent: 23 July 2014 06:38
To: user@flume.apache.org
Subject: Re: how spooling directory source identifies the complete file

Thanks Ashish, I have already referred to this info.

But I couldn't find any explanation in the Flume User Guide of how Flume differentiates between
a file whose copy is still in progress and a fully copied file.

On Wed, Jul 23, 2014 at 10:59 AM, Ashish <paliwalashish@gmail.com> wrote:
This is specified in Flume's User Guide:

"Unlike the Exec source, this source is reliable and will not miss data, even if Flume is
restarted or killed. In exchange for this reliability, only immutable, uniquely-named files
must be dropped into the spooling directory. Flume tries to detect these problem conditions
and will fail loudly if they are violated:

  1.  If a file is written to after being placed into the spooling directory, Flume will print
an error to its log file and stop processing.
  2.  If a file name is reused at a later time, Flume will print an error to its log file
and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp)
to log file names when they are moved into the spooling directory."
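
As an illustration of the naming advice quoted above, here is a small Python sketch that builds a unique file name from a UTC timestamp plus a short random suffix. The helper name `unique_spool_name` and the exact name layout are assumptions made for this example; the User Guide only suggests adding a unique identifier such as a timestamp.

```python
import os
import time
import uuid


def unique_spool_name(original_name: str) -> str:
    """Return a name like 'app.20140723T074609.1a2b3c4d.log' (hypothetical layout)."""
    stem, ext = os.path.splitext(os.path.basename(original_name))
    # Timestamp to the second, in UTC, so names sort by delivery time.
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    # Random suffix in case two files are delivered within the same second.
    suffix = uuid.uuid4().hex[:8]
    return f"{stem}.{stamp}.{suffix}{ext}"
```

The name would be computed once, at the moment the finished file is moved into the spooling directory, so a rotated log such as app.log is never delivered twice under the same name.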

On Wed, Jul 23, 2014 at 10:17 AM, SaravanaKumar TR <saran0081986@gmail.com> wrote:
Hi Jeff,

Thanks for your comments. But what I am really looking for is this: suppose we are copying a
1 GB file to the spool directory and the copy is still in progress. How does Flume recognize
that the complete file has been copied into the spool directory and is ready for processing?

How does Flume make sure it doesn't start processing the partially copied file?

On Tue, Jul 22, 2014 at 11:15 PM, Jeff Lord <jlord@cloudera.com> wrote:
I believe the way this works is that Flume creates a meta directory to track which file is
being read.
In the event of a restart of the agent, the entire file will be re-read, which will create
some duplicate events.


On Tue, Jul 22, 2014 at 6:15 AM, SaravanaKumar TR <saran0081986@gmail.com> wrote:

I am planning to use the spooling directory source to move log files into an HDFS sink.

I would like to know how Flume identifies whether a file we are moving to the spool directory
is complete, or is partial with its move still in progress.

Suppose a file is large and we have started moving it to the spool directory: how does Flume
identify whether the complete file has been transferred or the transfer is still in progress?

Please help me out here.


