flume-user mailing list archives

From Connor Woodson <cwoodson....@gmail.com>
Subject Re: hdfs.idleTimeout ,what's it used for ?
Date Fri, 18 Jan 2013 04:18:28 GMT
Alright, that makes sense. The takeaway from this conversation for everyone
else:

If you use idleTimeout, be sure to set rollInterval to 0. And if you
don't use idleTimeout, be sure to lower maxOpenFiles to a number in line
with your expected throughput. To use the least memory, you will want to
use idleTimeout; but the result will be that more files are created in HDFS.
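
As a rough sketch of those two patterns (agent/sink names a1/k1 are
placeholders and the thresholds are illustrative), the first would be:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
# time-based rolling off; roll by count/size instead
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.rollSize = 134217728
# close a writer 60s after its last event
a1.sinks.k1.hdfs.idleTimeout = 60

and the second, without idleTimeout:

a1.sinks.k1.hdfs.rollInterval = 3600
# idleTimeout stays at its default of 0 (disabled), so cap the
# number of writers held in memory
a1.sinks.k1.hdfs.maxOpenFiles = 50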

- Connor


On Thu, Jan 17, 2013 at 7:39 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

>  That breaks the use case idleTimeout was originally made for: making
> sure the file is closed promptly after data stops arriving. We use this to
> make sure the files are ready for our batches, which run quite soon after.
> The time at which rollInterval triggers is unpredictable, as it resets
> every time any other type of roll is triggered (event count or size).
>
> By making rollInterval behave properly, all of this is a non-issue. My
> recommendation to users would be not to use rollInterval if they're
> bucketing by time (it's redundant behavior).
>
> Documentation could definitely be improved. Once we sort out the approach
> we want to take, I can write it up to make the difference and usage clearer.
>
>
> On 01/18/2013 12:24 PM, Connor Woodson wrote:
>
> The way idleTimeout works right now is that it's another rollInterval; it
> works best when rollInterval is not set, so its best use is when you don't
> want a rollInterval and just want your bucketwriters to close when no
> events are coming through (caused by a path change or something else; you
> can still roll reliably by either count or size).
>
>  As such, perhaps it would be clearer if idleTimeout were renamed to
> idleRoll or similar?
>
>  And then change idleTimeout to only count the seconds since the writer
> was closed; if a bucketwriter has been closed for long enough, it will
> automatically remove itself. This type of idle would then work well with
> rollInterval, while the current one doesn't (idleRoll + rollInterval
> creates two time-based rollers. There are certainly times for that, but
> not all of the time).
>
>  - Connor
>
>
> On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>>  It seemed neater at the time. It's only an issue because rollInterval
>> doesn't remove the entry in sfWriters. We could change it so that close
>> doesn't cancel it, and have it check whether or not the writer is already
>> closed, but that'd be kind of ugly.
>>
>> @Mohit:
>>
>> When Flume dies unexpectedly, the .tmp file remains. When it restarts,
>> there is some logic in the HDFS sink to recover it (and continue writing
>> from there). I'm not actually sure of the specifics. You may want to just
>> kill -9 a running Flume process on a test machine, start it up again, and
>> see from the logs what happens with the output.
>>
>> If Flume dies cleanly, the file is properly closed.
>>
>>
>> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>>
>> And @ my aside: I hadn't realized that the idleTimeout is canceled when
>> the rollInterval fires. That's annoying. So setting a lower idleTimeout,
>> and drastically decreasing maxOpenFiles to at most twice the number of
>> possibly-open files, is probably necessary.
>>
>>
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>
>>> @Mohit:
>>>
>>>  For the HDFS sink, the .tmp files are placed based on the
>>> hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}.
>>> To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command
>>> line, or you can specify the property in the core-site.xml of wherever
>>> your HADOOP_HOME environment variable points.
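>>>
>>>  For example, something like this (config file and agent name are
>>> placeholders):
>>>
>>> flume-ng agent --conf conf --conf-file flume.conf --name a1 \
>>>   -Dhadoop.tmp.dir=/data/flume-tmp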
>>>
>>>  - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>>
>>>>  Whether idleTimeout is lower or higher than rollInterval is a
>>>> preference. Set it lower and, if you get one message right at the turn
>>>> of the hour, you will have some part of that hour without any
>>>> bucketwriters; but if you get another message at the end of the hour,
>>>> you will end up with two files instead of one. Set idleTimeout to be
>>>> longer and you will get just one file, but also (in the worst case) you
>>>> will have twice as many bucketwriters open; so it all depends on how
>>>> many files you want and how much memory you have to spare.
>>>>
>>>>  - Connor
>>>>
>>>>  An aside:
>>>> bucketwriters, after being closed by rollInterval, aren't really a
>>>> memory leak; they're just very rarely useful to keep around (your path
>>>> could rely on hostname, and you could use a rollInterval, and then those
>>>> bucketwriters would still remain useful). And they will get removed
>>>> eventually; by default, after you've created your 5001st bucketwriter,
>>>> the first (or whichever was used longest ago) will be removed.
>>>>
>>>>  And I don't think that's the cause behind FLUME-1850, as he did have
>>>> an idleTimeout set at 15 minutes.
>>>>
>>>>
>>>>  On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
>>>> juhani_connolly@cyberagent.co.jp> wrote:
>>>>
>>>>> It's also useful if you want files to get promptly closed and renamed
>>>>> from the .tmp or whatever.
>>>>>
>>>>> We use it with something like a 30-second setting (we have a constant
>>>>> stream of data) and hourly bucketing.
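>>>>>
>>>>> Roughly like this, as a sketch (names and path are placeholders):
>>>>>
>>>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
>>>>> # close each bucket 30s after its last event, so the batch jobs
>>>>> # that run soon after the hour can pick the files up promptly
>>>>> a1.sinks.k1.hdfs.idleTimeout = 30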
>>>>>
>>>>> There is also the issue that files closed by rollInterval are never
>>>>> removed from the internal linked list, so it actually causes a small
>>>>> memory leak (which can get big in the long term if you have a lot of
>>>>> files and hourly renames). I believe this is what is causing the OOM
>>>>> Mohit is getting in FLUME-1850.
>>>>>
>>>>> So I personally would recommend using it (with a setting that will
>>>>> close files before rollInterval does).
>>>>>
>>>>>
>>>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>>>
>>>>>> Ah, I see. Again, something useful to have in the Flume user guide.
>>>>>>
>>>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <
>>>>>> cwoodson.dev@gmail.com> wrote:
>>>>>>
>>>>>>> The rollInterval will still cause the last 01-17 file to be closed
>>>>>>> eventually. The way the HDFS sink works with the different files is
>>>>>>> that each unique path is handled by a different BucketWriter object.
>>>>>>> The sink can hold as many of these objects as specified by
>>>>>>> hdfs.maxOpenFiles (default: 5000), and bucketwriters are only removed
>>>>>>> when you create the 5001st writer (the 5001st unique path). However,
>>>>>>> once a writer is closed it is generally never used again (all of your
>>>>>>> 01-17 writers will never be used again). To avoid keeping them in the
>>>>>>> sink's internal list of writers, idleTimeout specifies a number of
>>>>>>> seconds during which no data is received by the BucketWriter. After
>>>>>>> this time, the writer will try to close itself and will then tell the
>>>>>>> sink to remove it, thus freeing up everything the bucketwriter used.
>>>>>>>
>>>>>>> So idleTimeout is just a setting to help limit memory usage by the
>>>>>>> HDFS sink. The ideal time for it is longer than the maximum time
>>>>>>> between events (capped at the rollInterval) - if you know you'll
>>>>>>> receive a constant stream of events you might just set it to a minute
>>>>>>> or so. Or, if you are fine with having multiple files open per hour,
>>>>>>> you can set it to a lower number, maybe just over the average time
>>>>>>> between events. For me, in just testing, I set it >= rollInterval for
>>>>>>> the cases when no events are received in a given hour (I'd rather
>>>>>>> keep the object alive for an extra hour than create files every 30
>>>>>>> minutes or something).
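>>>>>>>
>>>>>>> As a rough sketch of those cases (values illustrative):
>>>>>>>
>>>>>>> # constant stream of events: a minute is plenty
>>>>>>> a1.sinks.k1.hdfs.idleTimeout = 60
>>>>>>> # sparse events with hourly buckets: set it >= rollInterval
>>>>>>> # a1.sinks.k1.hdfs.rollInterval = 3600
>>>>>>> # a1.sinks.k1.hdfs.idleTimeout = 3600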
>>>>>>>
>>>>>>> Hope that was helpful,
>>>>>>>
>>>>>>> - Connor
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>>>>>> <bhaskarvk@gmail.com> wrote:
>>>>>>>
>>>>>>>> Say if I have
>>>>>>>>
>>>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>>>>
>>>>>>>> hdfs.rollInterval=60
>>>>>>>>
>>>>>>>> Now, suppose there is a file
>>>>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are
>>>>>>>> not up, and now it's past midnight, i.e. a new day,
>>>>>>>> and events start to be written to
>>>>>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>>>>>
>>>>>>>> Will the 2013-01-17 file never be rolled over, unless I have
>>>>>>>> something like hdfs.idleTimeout=60?
>>>>>>>> If so, how do Flume sinks keep track of the files they need to roll
>>>>>>>> over after idleTimeout?
>>>>>>>>
>>>>>>>> In short, what's the exact use of the idleTimeout parameter?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
