flume-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Fri, 18 Jan 2013 02:17:18 GMT
I have been using it and it's a great feature to have.

One question I have, though: what happens when Flume dies unexpectedly?
Does it leave .tmp files behind, and how do we clean those up and close them?
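(Editor's note: the HDFS sink writes to a file with a .tmp suffix and renames it only on a clean close, so a crash can indeed leave .tmp files behind. Below is a minimal recovery sketch, not anything shipped with Flume: it walks a directory tree and strips the .tmp suffix, mimicking the rename a clean close would have done. A local directory stands in for the HDFS path here; against real HDFS you would do the equivalent with `hdfs dfs -mv` or the Hadoop FileSystem API.)

```python
import os

def recover_tmp_files(root, suffix=".tmp"):
    """Rename leftover <name>.tmp files to <name>, mimicking the rename
    the HDFS sink performs when it closes a file cleanly.
    Returns the list of recovered file paths."""
    recovered = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, name[: -len(suffix)])
                os.rename(src, dst)
                recovered.append(dst)
    return recovered
```

Only run something like this after confirming the agent is actually down, since renaming a file a live sink is still writing to would break that sink.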

On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> It's also useful if you want files to get promptly closed and renamed from
> the .tmp or whatever.
> We use it with a setting of around 30 seconds (we have a constant stream
> of data) and hourly bucketing.
> There is also the issue that files closed by rollInterval are never
> removed from the internal LinkedList, so it actually causes a small memory
> leak (which can get big in the long term if you have a lot of files and
> hourly renames). I believe this is what is causing the OOM Mohit is
> getting in FLUME-1850.
> So I personally would recommend using it (with a setting that will close
> files before rollInterval does).
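(Editor's note: a setup like the one Juhani describes, 30-second idle timeout with hourly bucketing, might look roughly like the following. This is an illustrative sketch; the agent/sink names `a1`/`k1` and the exact values are not taken from any actual config in this thread.)

```
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
# roll at most once an hour...
a1.sinks.k1.hdfs.rollInterval = 3600
# ...but close any bucket that has seen no events for 30 seconds,
# so files are renamed from .tmp promptly and the writer is freed
a1.sinks.k1.hdfs.idleTimeout = 30
```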
> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>> Ah I see. Again something useful to have in the flume user guide.
>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com>
>> wrote:
>>> the rollInterval will still cause the last 01-17 file to be closed
>>> eventually. The way the HDFS sink works with the different files is that
>>> each unique path is handled by a different BucketWriter object. The sink
>>> can hold as many of these objects as specified by hdfs.maxOpenFiles
>>> (default: 5000), and BucketWriters are only removed when you create the
>>> 5001st writer (5001st unique path). However, generally once a writer is
>>> closed it is never used again (all of your 01-17 writers will never be
>>> used again). To avoid keeping them in the sink's internal list of
>>> writers, the idleTimeout is a specified number of seconds during which
>>> no data is received by the BucketWriter. After this time, the writer
>>> will try to close itself and will then tell the sink to remove it, thus
>>> freeing up everything used by the BucketWriter.
>>> So the idleTimeout is just a setting to help limit memory usage by the
>>> HDFS sink. The ideal time for it is longer than the maximum time between
>>> events (capped at the rollInterval) - if you know you'll receive a
>>> constant stream of events, you might just set it to a minute or so. Or,
>>> if you are fine with having multiple files open per hour, you can set it
>>> to a lower number, maybe just over the average time between events. For
>>> me, in testing, I set it >= rollInterval for the cases when no events
>>> are received in a given hour (I'd rather keep the object alive for an
>>> extra hour than create files every 30 minutes or something).
>>> Hope that was helpful,
>>> - Connor
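(Editor's note: the bookkeeping Connor describes can be sketched roughly as below. This is a simplified toy model, not Flume's actual code; the class and method names are made up. It shows the two eviction paths he mentions: the max-open-files cap, and the idle timeout that lets a writer remove itself from the sink's map.)

```python
import time

class IdleTrackingSink:
    """Toy model of the HDFS sink's writer map: one writer per unique
    path, evicted either by the max-open cap or by an idle timeout."""

    def __init__(self, max_open_files=5000, idle_timeout=60):
        self.max_open_files = max_open_files
        self.idle_timeout = idle_timeout
        self.writers = {}  # path -> timestamp of last write

    def write(self, path, now=None):
        now = time.time() if now is None else now
        if path not in self.writers and len(self.writers) >= self.max_open_files:
            # cap reached: evict the least recently used writer
            oldest = min(self.writers, key=self.writers.get)
            self.close(oldest)
        self.writers[path] = now

    def reap_idle(self, now=None):
        """Close and forget writers that have been idle too long; without
        this, closed-but-remembered writers accumulate in the map."""
        now = time.time() if now is None else now
        for path in [p for p, t in self.writers.items()
                     if now - t >= self.idle_timeout]:
            self.close(path)

    def close(self, path):
        # a real BucketWriter would flush and rename the .tmp file here
        del self.writers[path]
```

Without the `reap_idle` step (i.e. with no idleTimeout configured), every hourly bucket ever written stays in the map until the cap forces it out, which matches the slow memory growth described earlier in the thread.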
>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>> <bhaskarvk@gmail.com> wrote:
>>>> Say if I have
>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>> hdfs.rollInterval=60
>>>> Now, suppose there is a file
>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>> that is not ready to be rolled over yet, i.e. 60 seconds are not up,
>>>> and it's now past 12 midnight, i.e. a new day,
>>>> and events start to be written to
>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>> Will the 2013-01-17 file never be rolled over, unless I have something
>>>> like hdfs.idleTimeout=60?
>>>> If so, how do Flume sinks keep track of files they need to roll over
>>>> after idleTimeout?
>>>> In short, what's the exact use of the idleTimeout parameter?
