flume-user mailing list archives

From Roshan Naik <ros...@hortonworks.com>
Subject Re: Flume HDFS sink memory requirement.
Date Tue, 09 Feb 2016 21:21:34 GMT
  1.  The number of open files is configurable; see hdfs.maxOpenFiles. It may be better to set up
your Flume source so that the HDFS sink works on a smaller set of files at a time.
  2.  If I recall correctly, it writes events out immediately and doesn't buffer them. Some
buffering is surely happening in the HDFS client libs. Beyond that it should mostly be bookkeeping
info (open file handles etc.) and any memory used for compression (e.g. a block per open file,
if using block compression). Best to measure it with a test setup: see how the 'in-use' memory
consumption differs with 1 file open, 100 files open, and 1000 files open.
  3.  No. Assuming you are using the file channel, you can try starting with, say, 8GB as the
max heap size for the agent and go from there. Memory consumption of the Memory/Spillable channels
depends on their memory capacity settings.
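
To make the settings above concrete, here is a minimal sketch of an agent config fragment, not a tested setup. The agent and component names (a1, c1, k1), the directories, the "logname" header used for bucketing, and the specific numbers are all placeholders I am assuming for illustration. The source definition is omitted.

```properties
# Hypothetical names and paths throughout; adjust to your own agent.
a1.channels = c1
a1.sinks = k1

# File channel: events are kept on disk rather than on the heap.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Bucket by an event header; "logname" is an assumed header name.
a1.sinks.k1.hdfs.path = /flume/events/%{logname}

# Roll on size only: 0 disables the time-based and count-based rolls.
# 134217728 bytes = 128 MB, an example value.
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0

# Upper bound on concurrently open files. When it is exceeded, the sink
# closes the oldest file, even if rollSize has not been reached.
a1.sinks.k1.hdfs.maxOpenFiles = 1000
```

The max heap size from point 3 is not set in this file. It is normally passed to the agent JVM, for example by adding -Xmx8g to JAVA_OPTS in conf/flume-env.sh.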


From: R P <hadooper@outlook.com>
Reply-To: "user@flume.apache.org" <user@flume.apache.org>
Date: Tuesday, February 9, 2016 at 11:17 AM
To: "user@flume.apache.org" <user@flume.apache.org>
Subject: Flume HDFS sink memory requirement.

Hello All,

  Hope you are all having a great time. Thanks for reading my question; I appreciate any suggestion/reply.

I am evaluating Flume for writing to HDFS. We get sparse data which will be bucketed into thousands
of different logs. As this data is received sporadically throughout the day, we run into the HDFS
small files problem.

To address this problem, one solution is to use file size as the only condition for closing a file,
using hdfs.rollSize. As we might have thousands of files open for hours, I have the following questions:

1. Will Flume keep thousands of files open until the hdfs.rollSize condition is met?

2. How much memory is used by HDFS sink when thousands of files are open at a time?

3. Is the memory used for the HDFS event buffer equal to the data written to HDFS? E.g. if the thousands
of files to be written have a total size of 500 GB, will the Flume sink need 500 GB of memory?

Thanks again for your input.

