flume-user mailing list archives

From Connor Woodson <cwoodson....@gmail.com>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Fri, 18 Jan 2013 02:20:56 GMT

For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
property. The default location is /tmp/hadoop-${user.name}. To change this,
you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you
can specify the property in the core-site.xml of the Hadoop installation
that your HADOOP_HOME environment variable points to.

- Connor

On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:

> Whether idleTimeout is lower or higher than rollInterval is a preference.
> Set it lower, and assume you get one message right on the turn of the
> hour: you will then have some part of that hour without any open
> BucketWriters, but if you get another message at the end of the hour, you
> will end up with two files instead of one. Set idleTimeout to be longer and
> you will get just one file, but (in the worst case) you will also have
> twice as many BucketWriters open. So it all depends on how many files you
> want and how much memory you have to spare.
> - Connor
> An aside:
> BucketWriters, after being closed by rollInterval, aren't really a memory
> leak; they are just very rarely useful to keep around (your path could rely
> on hostname, and you could use a rollInterval, and then those BucketWriters
> will still remain useful). And they will get removed eventually; by default,
> after you've created your 5001st BucketWriter, the first (or whichever was
> used longest ago) will be removed.
> And I don't think that's the cause behind FLUME-1850, as he did have an
> idleTimeout set at 15 minutes.
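
The trade-off described above can be made concrete with a small simulation. This is only a rough sketch in Python, not Flume's code: count_files is a hypothetical helper that models a single bucket path, assumes the roll timer starts when a file is opened, and treats a value of 0 as "disabled" for both settings.

```python
def count_files(event_times, roll_interval, idle_timeout):
    """Count how many files one bucket path would produce (toy model).

    event_times   -- arrival times of events, in seconds
    roll_interval -- seconds after opening at which a file is rolled (0 = off)
    idle_timeout  -- seconds without events after which a file is closed (0 = off)
    """
    files = 0
    opened_at = None   # time the current file was opened, None if no file is open
    last_event = None  # time of the most recent event written to the current file

    for t in sorted(event_times):
        if opened_at is not None:
            rolled = roll_interval > 0 and t >= opened_at + roll_interval
            idled = idle_timeout > 0 and t >= last_event + idle_timeout
            if rolled or idled:
                opened_at = None  # the previous file was closed before this event
        if opened_at is None:
            files += 1            # this event has to open a new file
            opened_at = t
        last_event = t
    return files


# One event at the turn of the hour and one just before the hour ends.
events = [0, 3599]
short_idle = count_files(events, roll_interval=3600, idle_timeout=900)
long_idle = count_files(events, roll_interval=3600, idle_timeout=4000)
```

Under these assumptions, the 15-minute idleTimeout closes the first file long before the second event arrives, so two files are written; with an idleTimeout longer than the gap, both events land in the same file.
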
> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>> It's also useful if you want files to get promptly closed and renamed
>> from the .tmp or whatever.
>> We use it with something like a 30-second setting (we have a constant
>> stream of data) and hourly bucketing.
>> There is also the issue that files closed by rollInterval are never
>> removed from the internal linkedList, so it actually causes a small memory
>> leak (which can get big in the long term if you have a lot of files and
>> hourly renames). I believe this is what is causing the OOM Mohit is getting
>> in FLUME-1850.
>> So I personally would recommend using it (with a setting that will close
>> files before rollInterval does).
>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>> Ah I see. Again something useful to have in the flume user guide.
>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com>
>>> wrote:
>>>> The rollInterval will still cause the last 01-17 file to be closed
>>>> eventually. The way the HDFS sink works with the different files is that
>>>> each unique path is handled by a different BucketWriter object. The sink
>>>> can hold as many objects as specified by hdfs.maxOpenFiles (default:
>>>> 5000), and BucketWriters are only removed when you create the 5001st
>>>> writer (5001st unique path). However, generally once a writer is closed
>>>> it is never used again (all of your 01-17 writers will never be used
>>>> again). To avoid keeping them in the sink's internal list of writers,
>>>> the idleTimeout is a specified number of seconds in which no data is
>>>> received by the BucketWriter. After this time, the writer will try to
>>>> close itself and will then tell the sink to remove it, thus freeing up
>>>> everything used by the BucketWriter.
>>>> So the idleTimeout is just a setting to help limit memory usage by the
>>>> HDFS sink. The ideal time for it is longer than the maximum time between
>>>> events (capped at the rollInterval) - if you know you'll receive a
>>>> constant stream of events you might just set it to a minute or something.
>>>> Or if you are fine with having multiple files open per hour, you can set
>>>> it to a lower number, maybe just over the average time between events.
>>>> For me, in just testing, I set it >= rollInterval for the cases when no
>>>> events are received in a given hour (I'd rather keep the object alive for
>>>> an extra hour than create files every 30 minutes or something).
>>>> Hope that was helpful,
>>>> - Connor
>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>>> <bhaskarvk@gmail.com> wrote:
>>>>> Say I have
>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>> hdfs.rollInterval=60
>>>>> Now, if there is a file
>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>>>> up and now it's past 12 midnight, i.e. new day
>>>>> And events start to be written to
>>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>> will the 2013-01-17 file never be rolled over, unless I have something
>>>>> like hdfs.idleTimeout=60?
>>>>> If so, how do Flume sinks keep track of the files they need to roll over
>>>>> after idleTimeout?
>>>>> In short, what's the exact use of the idleTimeout parameter?
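
The writer bookkeeping discussed in this thread (one BucketWriter per unique path, eviction of the least recently used writer once the cap is exceeded, and earlier removal via idleTimeout) can be sketched as a small least-recently-used map. This is a toy model in Python under those assumptions, not Flume's implementation: WriterCache, write, and close_idle are hypothetical names, and the paths are made-up examples.

```python
from collections import OrderedDict


class WriterCache:
    """Toy model of a sink's path -> writer map (not Flume's actual code)."""

    def __init__(self, max_open_files=5000):
        self.max_open_files = max_open_files
        # Maps each bucket path to the time of its last write,
        # ordered from least to most recently used.
        self.writers = OrderedDict()

    def write(self, path, now):
        """Record a write to `path`; return the evicted path, if any."""
        self.writers[path] = now
        self.writers.move_to_end(path)  # mark as most recently used
        if len(self.writers) > self.max_open_files:
            evicted, _ = self.writers.popitem(last=False)  # drop the oldest
            return evicted
        return None

    def close_idle(self, now, idle_timeout):
        """Remove writers with no writes for `idle_timeout` seconds (0 = off)."""
        if idle_timeout <= 0:
            return []
        idle = [p for p, last in self.writers.items()
                if now - last >= idle_timeout]
        for p in idle:
            del self.writers[p]
        return idle


# A cap of 2 stands in for the default of 5000 to keep the example short.
cache = WriterCache(max_open_files=2)
first = cache.write("/flume/events/2013-01-17", now=0)
second = cache.write("/flume/events/2013-01-18", now=10)
third = cache.write("/flume/events/2013-01-19", now=20)   # exceeds the cap
closed = cache.close_idle(now=75, idle_timeout=60)
```

In this sketch the third unique path pushes out the 01-17 writer, which mirrors the "5001st writer" behaviour with a smaller cap, while close_idle removes the 01-18 writer after 60 quiet seconds without waiting for the cap to be hit.
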
