flume-user mailing list archives

From Paul Chavez <pcha...@verticalsearchworks.com>
Subject How to exclude .tmp files?
Date Thu, 27 Dec 2012 18:01:36 GMT
This is kind of a generic HDFS question, but it does relate to flume, so hopefully someone
can provide feedback.

I have a flume configuration that sinks to HDFS using timestamp headers. I would like to set up
a post-processor using Oozie to pull the data as it lands in HDFS into Hive, doing some cleaning
and compression along the way.

However, I am running into an issue: if I inadvertently read a .tmp file, the flume agent
that is writing to it stops sinking with an HDFS error.

The flume docs state "The file in use will have the name mangled to include ".tmp" at the
end. Once the file is closed, this extension is removed. This allows excluding partially complete
files in the directory." but I cannot figure out how to exclude files based on extension via
either Pig or Hive.
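
To make the intent concrete, the filter I have in mind can be sketched in a few lines of
Python. This is only an illustration of the naming convention quoted above, not a Pig or
Hive feature; the function name and the example paths are hypothetical.

```python
from posixpath import basename


def completed_files(paths, in_use_suffix=".tmp"):
    """Return only the paths whose file name does not end with the in-use suffix.

    Assumption: the writer marks open files with a trailing ".tmp", as the
    flume docs quoted above describe.
    """
    return [p for p in paths if not basename(p).endswith(in_use_suffix)]


# Hypothetical directory listing: one closed file, one still being written.
listing = [
    "/flume/events/2012-12-27/FlumeData.1356631200000",
    "/flume/events/2012-12-27/FlumeData.1356631260000.tmp",
]

for path in completed_files(listing):
    print(path)
```

What I am missing is the equivalent of this check at the point where Pig or Hive picks
its input files.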

In general I should not need to exclude anything, since I can reasonably assume the directory
is no longer being written to by the time the job runs. But if flume is delayed, or my initial
app agent is late starting the data flow, the directory could still be receiving writes when
the Oozie coordinator materializes a job.

It seems like this should be easy, but I'm not having any luck searching for a solution. Any
insight or advice is appreciated,

Thank you,
Paul Chavez
