flume-user mailing list archives

From "Bhaskar V. Karambelkar" <bhaska...@gmail.com>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Thu, 17 Jan 2013 21:38:52 GMT
Ah I see. Again something useful to have in the flume user guide.

On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
> the rollInterval will still cause the last 01-17 file to be closed
> eventually. The HDFS sink tracks each unique path with its own
> BucketWriter object. The sink can hold as many of these objects as
> specified by hdfs.maxOpenFiles (default: 5000), and BucketWriters are
> only evicted when you create the 5001st writer (5001st unique path).
> However, once a writer is closed it is generally never used again (all
> of your 01-17 writers will never be used again). To avoid keeping them
> in the sink's internal list of writers, idleTimeout specifies a number
> of seconds during which the BucketWriter receives no data; after that
> time the writer closes itself and tells the sink to remove it, freeing
> everything the BucketWriter held.
> So idleTimeout is just a setting to help limit memory usage by the HDFS
> sink. Ideally it is longer than the maximum time between events (capped
> at the rollInterval) - if you know you'll receive a constant stream of
> events you might just set it to a minute. If you are fine with having
> multiple files open per hour, you can set it lower, perhaps just above
> the average time between events. In my own testing I set it >=
> rollInterval to cover the case where no events arrive in a given hour
> (I'd rather keep the object alive for an extra hour than create files
> every 30 minutes or so).
> Hope that was helpful,
> - Connor
> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
> <bhaskarvk@gmail.com> wrote:
>> Say I have
>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>> hdfs.rollInterval=60
>> Now suppose there is a file
>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>> that is not ready to be rolled over yet, i.e. its 60 seconds are not
>> up, and it is now past midnight, i.e. a new day,
>> and events start to be written to
>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>> Will the 2013-01-17 file never be rolled over unless I have something
>> like hdfs.idleTimeout=60 ?
>> If so, how do Flume sinks keep track of files they need to roll over
>> after idleTimeout?
>> In short, what is the exact use of the idleTimeout parameter?
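The setup discussed in this thread could be sketched as the following sink configuration. This is an illustrative fragment, not from the thread itself: the agent/sink names (a1, k1) and path follow the original question, and the idleTimeout and maxOpenFiles values are example choices.

```properties
# Hypothetical Flume HDFS sink config illustrating the thread's advice.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/

# Roll (close and rename) a file after 60 seconds of writing.
a1.sinks.k1.hdfs.rollInterval = 60

# Close a bucket's file after 70 seconds with no incoming events, so
# yesterday's .tmp file is finalized shortly after midnight instead of
# lingering until its BucketWriter is eventually evicted.
a1.sinks.k1.hdfs.idleTimeout = 70

# Upper bound on BucketWriters the sink tracks at once (default 5000);
# the oldest is evicted when a new unique path exceeds this limit.
a1.sinks.k1.hdfs.maxOpenFiles = 5000
```

Setting idleTimeout slightly above rollInterval, as Connor suggests, keeps one file per bucket per roll period while still cleaning up writers for buckets that stop receiving events.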
