flume-user mailing list archives

From "Pritchard, Charles X. -ND" <Charles.X.Pritchard....@disney.com>
Subject Re: flume and hadoop append
Date Wed, 09 Apr 2014 19:54:20 GMT

On Apr 9, 2014, at 8:06 AM, Brock Noland <brock@cloudera.com> wrote:

Hi Charles,

> Exploring the idea of using “append” instead of creating new files with
> HDFS every few minutes.
...
it's possible the client would write a partial line without a newline. Then the client on
restart would append to that existing line. The subsequent line would be correctly formatted.

Is this an issue with the Hadoop architecture, or an issue with the way Flume calls (or does not
call) some kind of fsync/sync interface?
Hadoop has append, but there's no merge; it would be wonderful to just write data and then
atomically call “merge this”. Never a corrupt file!

A partially appended record would have the unfortunate consequence of causing fastidious
MR jobs to throw errors on occasion.
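
If I'm reading the HDFS API right, the sync primitive in question is hflush()/hsync() on
FSDataOutputStream. A rough sketch of an appender, just to make the edge case concrete (the
path and record contents here are made up, not what Flume actually does):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AppendSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/flume/events/current.log");  // hypothetical path

          // Append one record to an existing file and flush it out to the datanodes.
          try (FSDataOutputStream out = fs.append(file)) {
              out.writeBytes("one complete record\n");
              // hflush() makes the bytes visible to readers, but if the writer dies
              // after emitting only part of a line, the next append() picks up right
              // after that partial line -- the half-record edge case described above.
              out.hflush();
          }
      }
  }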


On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <cshannon108@gmail.com> wrote:
Not sure what you are trying to do, but the HDFS sink appends. It's just that you have to
determine what your roll-over strategy will be. Instead of rolling every few minutes, you can
set hdfs.rollInterval=0 (which disables it) and set hdfs.rollSize to however large you want your
files to grow before you roll over to a new file. You can also use hdfs.rollCount to roll over
after a certain number of records. I use rollSize for my roll-over strategy.

Sounds like a good strategy. Do you also access those HDFS files while they're still being
written to, that is, do you hit the edge case that Brock brought up?
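
For anyone following along, my understanding is that a size-based roll-over like Christopher
describes would look roughly like this in the agent config (the agent/channel/sink names below
are placeholders, not from this thread):

  a1.sinks.k1.type = hdfs
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
  a1.sinks.k1.hdfs.fileType = DataStream
  # Disable time- and count-based rolling; roll only on size.
  a1.sinks.k1.hdfs.rollInterval = 0
  a1.sinks.k1.hdfs.rollCount = 0
  # Roll to a new file once the current one reaches ~128 MB (value is in bytes).
  a1.sinks.k1.hdfs.rollSize = 134217728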


-Charles
