HDFS doesn't work exactly like a regular filesystem: it stores data in blocks, 128 MB each by default if I remember right.
It could be that the NameNode doesn't update the reported file size until a block is closed.
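One way to check this theory: the length the NameNode reports for an open file can lag behind the bytes that are actually on the DataNodes, and reading the file back counts the flushed bytes directly. A rough sketch with the Hadoop FileSystem API (path is a placeholder; needs Hadoop on the classpath and a running cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public class CheckOpenFileLength {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp = new Path("/data/flume/xxx_9320.avro.tmp"); // placeholder path

        // Metadata length from the NameNode -- may be stale while the
        // last block is still open for write
        long reported = fs.getFileStatus(tmp).getLen();

        // Streaming the file reads from the DataNodes, so this counts the
        // bytes that have actually been flushed so far
        long actual = 0;
        try (FSDataInputStream in = fs.open(tmp)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                actual += n;
            }
        }
        System.out.printf("NameNode reports %d bytes, readable: %d bytes%n",
                reported, actual);
    }
}
```

If the readable byte count keeps growing while the listed size stays flat, the data is arriving fine and only the metadata is stale.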

On Mar 21, 2016 8:10 AM, "no jihun" <jeesim2@gmail.com> wrote:
Hello developers.

I am using Flume to write Avro files to HDFS.

Now I am very curious why the size of an .avro.tmp file on HDFS does not increase gradually but only jumps at the last moment (closing time),
even though many batches/transactions happen in between.


For example:

There is continuous high event traffic (5k events/s) coming into the channel, which is sinked to HDFS.

When the '~~9320.avro.tmp' file is created, it starts with a size of 895 KB:
[inline image 2]



But even though the channel keeps sinking events to HDFS, the ~~9320.avro.tmp file never grows in size.
- 5 minutes later
[inline image 6]
- 5 minutes later
[inline image 7]


And lastly a roll happens due to rollSize.
At that moment the 9320.avro file jumps to 113 MB:

[inline image 3]

Also, the next file, 9321.avro.tmp, does not grow until it is rolled.
5 minutes later:
[inline image 8]



I thought, "Maybe the Avro file is buffered on the Flume agent's machine and the entire file is flushed at the last moment, at closing/rolling time."

So I checked the network traffic at the rolling moment,
but it does not spike at that point.

So now I think the HDFS sink does flush each transaction batch to HDFS, but HDFS holds the stream somewhere inside Hadoop and does not write it to disk until file closing time.

Does anybody know the process by which an Avro file is created, flushed, and closed on HDFS?
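For reference, my current understanding of the write lifecycle, sketched with the Hadoop FileSystem API (path is illustrative, not Flume's actual code; this is how I read the API, so please correct me if it's wrong):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLifecycle {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tmp = new Path("/data/flume/events.avro.tmp"); // placeholder path

        // 1. create: the NameNode records a zero-length file, open for write
        try (FSDataOutputStream out = fs.create(tmp)) {
            out.write("event batch".getBytes("UTF-8"));

            // 2. hflush: data is pushed through the DataNode pipeline and
            //    becomes readable, but the NameNode's recorded length is
            //    not updated yet
            out.hflush();
        } // 3. close: the last block is finalized and the NameNode now
          //    reports the true file length

        // 4. rename .tmp to the final name (what the sink does on roll)
        fs.rename(tmp, new Path("/data/flume/events.avro"));
    }
}
```

If this is right, the size jump at roll time is just step 3 making the already-written bytes visible in the listing.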



This is the configuration of the HDFS sink:

hadoop1.sinks.hdfsSk.type = hdfs
hadoop1.sinks.hdfsSk.channel = fileCh1
hadoop1.sinks.hdfsSk.hdfs.fileType = DataStream
hadoop1.sinks.hdfsSk.serializer = avro_event
hadoop1.sinks.hdfsSk.serializer.compressionCodec = snappy
hadoop1.sinks.hdfsSk.hdfs.path = xxxxxxx/data/flume/%{category}/%{type}/%Y/%m/%d/%{partition}/%{hour}
hadoop1.sinks.hdfsSk.hdfs.filePrefix = %{type}_%Y-%m-%d_%H_%{host}
hadoop1.sinks.hdfsSk.hdfs.fileSuffix = .avro
hadoop1.sinks.hdfsSk.hdfs.rollInterval = 3700
hadoop1.sinks.hdfsSk.hdfs.rollSize = 67000000
hadoop1.sinks.hdfsSk.hdfs.rollCount = 0
hadoop1.sinks.hdfsSk.hdfs.batchSize = 10000
hadoop1.sinks.hdfsSk.hdfs.idleTimeout = 300


Thanks!