flume-user mailing list archives

From Juhani Connolly <juhani_conno...@cyberagent.co.jp>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Fri, 18 Jan 2013 02:08:49 GMT
It's also useful if you want files to be closed promptly and renamed 
from their .tmp names.

We use it with a setting of about 30 seconds (we have a constant 
stream of data) and hourly bucketing.

There is also the issue that files closed by rollInterval are never 
removed from the internal linked list, so it actually causes a small 
memory leak (which can get big in the long term if you have a lot of 
files and hourly renames). I believe this is what is causing the OOM 
Mohit is getting in FLUME-1850.

So I personally would recommend using it (with a setting that will close 
files before rollInterval does).
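
As a rough sketch of that kind of setup (not our exact configuration): the
agent, sink, and channel names (a1, k1, c1), the path, and the rollInterval
value are assumptions for illustration. Only the 30-second idleTimeout and
the hourly bucketing come from what I described above.

```properties
# Hypothetical agent (a1), sink (k1), and channel (c1) names.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1

# Hourly bucketing: one directory per hour. The escape sequences need a
# "timestamp" header on each event (for example from a timestamp interceptor).
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H

# Assumed value: time-based roll once an hour, kept only as a fallback.
a1.sinks.k1.hdfs.rollInterval = 3600

# Assumed: disable size- and count-based rolling so only time closes files.
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

# Close and rename a file once it has received no writes for 30 seconds.
# With a constant stream of data this should fire shortly after the hour
# changes, well before rollInterval would close the file.
a1.sinks.k1.hdfs.idleTimeout = 30
```

With a steady stream, the previous hour's file should be closed by the idle
timer about 30 seconds into the new hour. If your traffic has gaps longer
than the timeout, expect more than one file per bucket.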

On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
> Ah I see. Again something useful to have in the flume user guide.
> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>> The rollInterval will still cause the last 01-17 file to be closed
>> eventually. The way the HDFS sink works with the different files is that each
>> unique path is handled by a different BucketWriter object. The sink can
>> hold as many objects as specified by hdfs.maxOpenFiles (default: 5000),
>> and BucketWriters are only removed when you create the 5001st writer (5001st
>> unique path). However, generally once a writer is closed it is never used
>> again (all of your 01-17 writers will never be used again). To avoid keeping
>> them in the sink's internal list of writers, the idleTimeout specifies a
>> number of seconds during which no data is received by the BucketWriter. After
>> this time, the writer will try to close itself and will then tell the sink
>> to remove it, thus freeing up everything used by the BucketWriter.
>> So the idleTimeout is just a setting to help limit memory usage by the hdfs
>> sink. The ideal time for it is longer than the maximum time between events
>> (capped at the rollInterval) - if you know you'll receive a constant stream
>> of events you might just set it to a minute or something. Or if you are fine
>> with having multiple files open per hour, you can set it to a lower number;
>> maybe just over the average time between events. In my own testing, I
>> set it >= rollInterval to cover the case where no events are received in a
>> given hour (I'd rather keep the object alive for an extra hour than create
>> files every 30 minutes or something).
>> Hope that was helpful,
>> - Connor
>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>> <bhaskarvk@gmail.com> wrote:
>>> Say I have
>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>>> hdfs.rollInterval=60
>>> Now, if there is a file
>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>> up and now it's past 12 midnight, i.e. new day
>>> And events start to be written to
>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>> will the 2013-01-17 file never be rolled over, unless I have something
>>> like hdfs.idleTimeout=60?
>>> If so, how do Flume sinks keep track of files they need to roll over
>>> after idleTimeout?
>>> In short, what's the exact use of the idleTimeout parameter?
