flume-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: hdfs.idleTimeout, what's it used for?
Date Fri, 18 Jan 2013 05:12:57 GMT


Sent from my iPhone

On Jan 17, 2013, at 6:46 PM, Juhani Connolly <juhani_connolly@cyberagent.co.jp> wrote:

> It seemed neater at the time. It's only an issue because rollInterval doesn't remove the entry in sfWriters. We could change it so that close doesn't cancel it, and have it check whether or not the writer is already closed, but that'd be kind of ugly.
> 
> @Mohit:
> 
> When Flume dies unexpectedly, the .tmp file remains. When it restarts there is some logic in the HDFS sink to recover it (and continue writing from there). I'm not actually sure of the specifics. You may want to just kill -9 a running Flume process on a test machine and then start it up, look at the logs, and see what happens with the output.
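> 
> Something like this, roughly (a sketch; the pgrep match, agent name, and paths are assumptions to adapt to your setup):
> 
> kill -9 $(pgrep -f org.apache.flume.node.Application)
> flume-ng agent -n a1 -f flume.conf     # restart, then watch the sink logs
> hadoop fs -ls /flume/events/           # check whether the .tmp file got recovered and renamed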

Does it also work when there is a long delay before Flume gets restarted? We are bucketing by the hour, so if the restart happens in the next hour but Flume actually died in the previous hour and left a .tmp file, does it still clean up on restart?
> 
> If Flume dies cleanly, the file is properly closed.
> 
> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>> And @ my aside: I hadn't realized that the idleTimeout is canceled when the rollInterval occurs. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most 2 * the number of possibly open files, is probably necessary.
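>> 
>> For illustration, roughly what that tuning could look like (a sketch only; the a1/k1 names follow the config example later in the thread, and the numbers are placeholders, not recommendations):
>> 
>> a1.sinks.k1.hdfs.idleTimeout = 60
>> # bound the sfWriters list to ~2x the number of buckets that can
>> # plausibly be open at once (assumed here to be about 10)
>> a1.sinks.k1.hdfs.maxOpenFiles = 20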
>> 
>> 
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>> @Mohit:
>>> 
>>> For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}. To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you can specify the property in the core-site.xml of wherever your HADOOP_HOME environment variable points to.
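>>> 
>>> For example, either of the following should work (a sketch; the path /data/flume-tmp and the agent name a1 are placeholders):
>>> 
>>> flume-ng agent -n a1 -f flume.conf -Dhadoop.tmp.dir=/data/flume-tmp
>>> 
>>> or, in the core-site.xml under $HADOOP_HOME/conf:
>>> 
>>> <property>
>>>   <name>hadoop.tmp.dir</name>
>>>   <value>/data/flume-tmp</value>
>>> </property>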
>>> 
>>> - Connor
>>> 
>>> 
>>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>>> Whether idleTimeout is lower or higher than rollInterval is a matter of preference. Set it lower and, if you get one message right on the turn of the hour, you will have some part of that hour without any bucket writers; but if you then get another message at the end of the hour, you will end up with two files instead of one. Set idleTimeout to be longer and you will get just one file, but (in the worst case) you will have twice as many bucketwriters open; so it all depends on how many files you want and how much memory you have to spare.
>>>> 
>>>> - Connor
>>>> 
>>>> An aside:
>>>> bucketwriters, after being closed by rollInterval, aren't really a memory leak; they are just very rarely useful to keep around (your path could rely on hostname, and you could use a rollInterval, and then those bucketwriters will still remain useful). And they will get removed eventually; by default, after you've created your 5001st bucketwriter, the first (or whichever was used longest ago) will be removed.
>>>> 
>>>> And I don't think that's the cause behind FLUME-1850, as he did have an idleTimeout set at 15 minutes.
>>>> 
>>>> 
>>>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <juhani_connolly@cyberagent.co.jp> wrote:
>>>>> It's also useful if you want files to get promptly closed and renamed from the .tmp or whatever.
>>>>> 
>>>>> We use it with something like a 30-second setting (we have a constant stream of data) and hourly bucketing.
>>>>> 
>>>>> There is also the issue that files closed by rollInterval are never removed from the internal linked list, so it actually causes a small memory leak (which can get big in the long term if you have a lot of files and hourly renames). I believe this is what is causing the OOM Mohit is getting in FLUME-1850.
>>>>> 
>>>>> So I personally would recommend using it (with a setting that will close files before rollInterval does).
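>>>>> 
>>>>> A sketch of that kind of setup (illustrative; the a1/k1 names follow the config example later in the thread, and the rollInterval value is assumed):
>>>>> 
>>>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
>>>>> a1.sinks.k1.hdfs.rollInterval = 3600
>>>>> # with a constant stream, 30s of silence only happens once the hour
>>>>> # rolls over, so the old bucket's file is closed and renamed promptly
>>>>> a1.sinks.k1.hdfs.idleTimeout = 30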
>>>>> 
>>>>> 
>>>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>>>> Ah, I see. Again, something useful to have in the Flume user guide.
>>>>>> 
>>>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cwoodson.dev@gmail.com> wrote:
>>>>>>> the rollInterval will still cause the last 01-17 file to be closed eventually. The way the HDFS sink works with the different files is that each unique path is handled by a different BucketWriter object. The sink can hold as many of these objects as specified by hdfs.maxOpenFiles (default: 5000), and bucketwriters are only removed when you create the 5001st writer (5001st unique path). However, generally once a writer is closed it is never used again (all of your 01-17 writers will never be used again). To avoid keeping them in the sink's internal list of writers, the idleTimeout is a specified number of seconds in which no data is received by the BucketWriter. After this time, the writer will try to close itself and will then tell the sink to remove it, thus freeing up everything used by the bucketwriter.
>>>>>>> 
>>>>>>> So the idleTimeout is just a setting to help limit memory usage by the HDFS sink. The ideal time for it is longer than the maximum time between events (capped at the rollInterval) - if you know you'll receive a constant stream of events you might just set it to a minute or something. Or if you are fine with having multiple files open per hour, you can set it to a lower number, maybe just over the average time between events. For me, in just testing, I set it >= rollInterval for the cases when no events are received in a given hour (I'd rather keep the object alive for an extra hour than create files every 30 minutes or something).
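>>>>>>> 
>>>>>>> Concretely, that testing preference might look something like this (a sketch with assumed names and hourly bucketing, not an exact config):
>>>>>>> 
>>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
>>>>>>> a1.sinks.k1.hdfs.rollInterval = 3600
>>>>>>> # idleTimeout >= rollInterval: the writer survives a normal hour, but
>>>>>>> # after an hour with no events it closes itself and is removed
>>>>>>> a1.sinks.k1.hdfs.idleTimeout = 3600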
>>>>>>> 
>>>>>>> Hope that was helpful,
>>>>>>> 
>>>>>>> - Connor
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <bhaskarvk@gmail.com> wrote:
>>>>>>>> Say if I have
>>>>>>>> 
>>>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/
>>>>>>>> 
>>>>>>>> hdfs.rollInterval=60
>>>>>>>> 
>>>>>>>> Now, if there is a file
>>>>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>>>> This file is not ready to be rolled over yet, i.e. the 60 seconds are not up, and now it's past midnight, i.e. a new day, and events start to be written to
>>>>>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>>>>> 
>>>>>>>> will the 2013-01-17 file never be rolled over, unless I have something like hdfs.idleTimeout=60?
>>>>>>>> If so, how do Flume sinks keep track of the files they need to roll over after idleTimeout?
>>>>>>>> 
>>>>>>>> In short, what is the exact use of the idleTimeout parameter?
> 
