flume-user mailing list archives

From Connor Woodson <cwoodson....@gmail.com>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Thu, 17 Jan 2013 20:29:34 GMT
The rollInterval will still cause the last 01-17 file to be closed
eventually. The way the HDFS sink works with the different files is that
each unique path is handled by its own BucketWriter object. The sink can
hold as many objects as specified by hdfs.maxOpenFiles (default: 5000),
and bucketwriters are only removed when you create the 5001st writer
(5001st unique path). However, generally once a writer is closed it is
never used again (all of your 01-17 writers will never be used again). To
avoid keeping them in the sink's internal list of writers, the idleTimeout
is a specified number of seconds in which no data is received by the
BucketWriter. After this time, the writer will try to close itself and will
then tell the sink to remove it, thus freeing up everything used by the
writer.

So the idleTimeout is just a setting to help limit memory usage by the HDFS
sink. The ideal value for it is longer than the maximum time between events
(capped at the rollInterval) - if you know you'll receive a constant stream
of events you might just set it to a minute or something. Or if you are
fine with having multiple files open per hour, you can set it to a lower
number; maybe just over the average time between events. In my own
testing, I set it >= rollInterval for the cases when no events are received
in a given hour (I'd rather keep the object alive for an extra hour than
create files every 30 minutes or something).
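
As a rough sketch of the above, assuming the agent and sink names (a1, k1)
from your message, something like the following would close idle writers
shortly after the roll interval has passed. The values 60 and 90 are only
illustrative, not recommendations; pick them from your own event rate.

```properties
# Illustrative values only; a1 and k1 are the names used in the question.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
# Roll (close) the current file 60 seconds after it was opened.
a1.sinks.k1.hdfs.rollInterval = 60
# Close and drop a writer that has received no events for 90 seconds.
# Chosen here to be >= rollInterval; 0 (the default) disables idle closing.
a1.sinks.k1.hdfs.idleTimeout = 90
# Upper bound on the writers the sink keeps around (default shown).
a1.sinks.k1.hdfs.maxOpenFiles = 5000
```

With a setup like this, the 01-17 writer should be dropped from the sink
about 90 seconds after its last event instead of lingering until 5000 newer
paths push it out.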

Hope that was helpful,

- Connor

On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <
bhaskarvk@gmail.com> wrote:

> Say if I have
> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
> hdfs.rollInterval=60
> Now, if there is a file
> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
> This file is not ready to be rolled over yet, i.e. 60 seconds are not
> up and now it's past 12 midnight, i.e. new day
> And events start to be written to
> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
> will the file 2013-01-17 never be rolled over, unless I have something
> like hdfs.idleTimeout=60  ?
> If so, how do Flume sinks keep track of files they need to roll over
> after idleTimeout ?
> In short, what's the exact use of the idleTimeout parameter ?
