Hello All,

Hope you are all having a great time. Thanks for reading my question; I appreciate any suggestions or replies.


I am evaluating Flume for writing to HDFS. We get sparse data that is bucketed into thousands of different logs. Because this data arrives sporadically throughout the day, we run into the HDFS small-files problem.


One way to address this is to make file size the only condition for closing a file, by setting hdfs.rollSize and disabling the time- and count-based roll triggers (a sketch of the configuration I'm considering follows the questions). Since we might have thousands of files open for hours, I have the following questions.


1. Will Flume keep thousands of files open until the hdfs.rollSize condition is met?

2. How much memory does the HDFS sink use when thousands of files are open at a time?

3. Is the memory used for the HDFS sink's event buffers equal to the amount of data written to HDFS? For example, if the files being written total 500 GB, will the Flume sink need 500 GB of memory?
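
For reference, this is roughly the sink configuration I have in mind. The agent, channel, and sink names, the header used for bucketing, and the 128 MB roll size are all placeholders, not our real setup:

# Sink portion of the agent config only; names and sizes are examples.
agent1.sinks = hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = fileChannel
# Bucket events into per-log directories via an event header (hypothetical header name)
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{logName}
# Close a file only when it reaches ~128 MB
agent1.sinks.hdfsSink.hdfs.rollSize = 134217728
# Disable time- and event-count-based rolling
agent1.sinks.hdfsSink.hdfs.rollInterval = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
# Limit on files open at once (I believe the default is 5000)
agent1.sinks.hdfsSink.hdfs.maxOpenFiles = 5000

As I understand it, setting hdfs.rollInterval and hdfs.rollCount to 0 disables those triggers, so files should close only on size (unless an hdfs.idleTimeout is configured). I also noticed hdfs.maxOpenFiles, which seems directly related to question 1.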


Thanks again for your input.


-R