flume-user mailing list archives

From no jihun <jees...@gmail.com>
Subject About Avro file writing progress on hdfs via Flume.
Date Mon, 21 Mar 2016 08:10:27 GMT
Hello developers.

I am using Flume to write Avro files to HDFS.

I am curious why the size of the .avro.tmp file on HDFS does not grow
gradually, but only jumps at the last moment (when the file is closed),
even though many batches and transactions are committed in the meantime.

For example:

There is continuous high event traffic (5k events/s) coming into the
channel, which the HDFS sink drains to HDFS.

When the '~~9320.avro.tmp' file is created, it starts at a size of 895KB.
[image: inline image 2]

But even though the channel is drained to HDFS without problems, the
~~9320.avro.tmp file never grows in size.
- 5 minutes later
[image: inline image 6]
- 5 minutes later
[image: inline image 7]

Finally, the file is rolled because of rollSize.
At that moment the size of the 9320.avro file jumps to 113MB.

[image: inline image 3]

The next file, 9321.avro.tmp, also does not grow until it is rolled.
- 5 minutes later
[image: inline image 8]

My first guess was: "maybe the Avro file is buffered on the Flume agent's
machine and the entire file is flushed at the last moment, when it is
closed and rolled."

So I checked the network traffic at the moment of rolling,
but the traffic does not spike at that moment.

So now I think the HDFS sink does flush each transaction batch to HDFS, but
HDFS holds the stream somewhere inside Hadoop and does not write it to
disk until the file is closed.

Does anybody know how an Avro file is actually created, flushed, and
closed on HDFS?

This is the configuration of the HDFS sink.

hadoop1.sinks.hdfsSk.type = hdfs
hadoop1.sinks.hdfsSk.channel = fileCh1
hadoop1.sinks.hdfsSk.hdfs.fileType = DataStream
hadoop1.sinks.hdfsSk.serializer = avro_event
hadoop1.sinks.hdfsSk.serializer.compressionCodec = snappy
hadoop1.sinks.hdfsSk.hdfs.path =
hadoop1.sinks.hdfsSk.hdfs.filePrefix = %{type}_%Y-%m-%d_%H_%{host}
hadoop1.sinks.hdfsSk.hdfs.fileSuffix = .avro
hadoop1.sinks.hdfsSk.hdfs.rollInterval = 3700
hadoop1.sinks.hdfsSk.hdfs.rollSize = 67000000
hadoop1.sinks.hdfsSk.hdfs.rollCount = 0
hadoop1.sinks.hdfsSk.hdfs.batchSize = 10000
hadoop1.sinks.hdfsSk.hdfs.idleTimeout = 300
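
For a rough sense of how many batches should have been committed while the
.tmp file size stayed the same, here is a back-of-the-envelope sketch in
Python. It uses only the numbers above (5k events/s, hdfs.batchSize = 10000,
5-minute gaps between the screenshots). The function names are made up for
illustration, and it assumes the sink always takes full batches, which may
not hold in practice.

```python
# Rough estimate only: assumes every sink transaction carries exactly
# batch_size events, which a real Flume agent does not guarantee.

def seconds_per_batch(events_per_sec, batch_size):
    """Time needed for one full batch to accumulate at a steady event rate."""
    return batch_size / events_per_sec


def batches_in_window(events_per_sec, window_sec, batch_size):
    """Number of full batches that fit into an observation window."""
    return (events_per_sec * window_sec) // batch_size


if __name__ == "__main__":
    rate = 5000          # events per second, from the traffic described above
    batch_size = 10000   # hdfs.batchSize from the sink configuration
    window = 5 * 60      # seconds between two of the screenshots

    print("seconds per full batch:", seconds_per_batch(rate, batch_size))
    print("full batches per 5 minutes:", batches_in_window(rate, window, batch_size))
```

Under these assumptions a full batch completes about every 2 seconds, so
roughly 150 batches fall between two screenshots while the reported file
size does not change.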

