flume-user mailing list archives

From Gonzalo Herreros <gherre...@gmail.com>
Subject Re: About Avro file writing progress on hdfs via Flume.
Date Mon, 21 Mar 2016 09:03:30 GMT
HDFS doesn't work exactly like a regular filesystem. It works in blocks, 128MB
by default if I remember right.
It could be that the NameNode doesn't update the size until a block is closed.
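
A quick way to see this in action (an illustrative sketch, not from this
thread; the class name and command-line handling are mine) is to compare the
file length the NameNode reports for the open .tmp file with the number of
bytes a reader can actually pull out of it:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileLength {
    public static void main(String[] args) throws IOException {
        // Path to a file Flume is still writing, e.g. ...9320.avro.tmp
        Path tmp = new Path(args[0]);
        FileSystem fs = FileSystem.get(new Configuration());

        // Length as the NameNode knows it: completed blocks only, so it
        // lags while the current (up to 128MB) block is under construction.
        long reported = fs.getFileStatus(tmp).getLen();

        // Bytes actually readable: data the writer has already flushed to
        // the datanodes is visible to readers before the block closes.
        long readable = 0;
        byte[] buf = new byte[8192];
        try (FSDataInputStream in = fs.open(tmp)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                readable += n;
            }
        }
        System.out.println("NameNode-reported: " + reported
                + " bytes, readable: " + readable + " bytes");
    }
}

While the file is open, the readable count will typically run well ahead of
the reported length; once the file is closed the two converge.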
On Mar 21, 2016 8:10 AM, "no jihun" <jeesim2@gmail.com> wrote:

> Hello developers.
>
> I am using Flume to write Avro files on HDFS.
>
> Now I am very curious why the avro.tmp file size on HDFS does not increase
> gradually but only jumps at the last moment (closing time),
> even though many batches and transactions happen.
>
>
> For example.
>
> There is continuous, high event traffic (5k events/s) coming into the
> channel, which is sinked to HDFS.
>
> When the '~~9320.avro.tmp' file is created, it starts with a size of 895KB:
> [image: inline image 2]
>
>
>
> But even though the channel sinks to HDFS without problems, the
> ~~9320.avro.tmp file never grows in size.
> - 5 minutes later:
> [image: inline image 6]
> - 5 minutes later:
> [image: inline image 7]
>
>
> And finally, rolling happens due to rollSize.
> At that moment the 9320.avro file grows to 113MB.
>
> [image: inline image 3]
>
> Also, the next file, 9321.avro.tmp, does not grow until it is rolled.
> - 5 minutes later:
> [image: inline image 8]
>
>
>
> I thought, "Maybe the Avro file is buffered on the Flume agent's machine
> and the entire file is flushed at the last moment (closing/rolling)."
>
> So I checked the network traffic at the rolling moment,
> but it does not spike at that moment.
>
> Finally, I think the HDFS sink flushes each transaction batch to HDFS, but
> HDFS holds the stream somewhere inside Hadoop and does not write it to disk
> until file closing time (see the hflush() sketch after this message).
>
> Does anybody know the process by which an Avro file is created,
> flushed, and closed on HDFS?
>
>
>
> This is the configuration of the HDFS sink.
>
> hadoop1.sinks.hdfsSk.type = hdfs
> hadoop1.sinks.hdfsSk.channel = fileCh1
> hadoop1.sinks.hdfsSk.hdfs.fileType = DataStream
> hadoop1.sinks.hdfsSk.serializer = avro_event
> hadoop1.sinks.hdfsSk.serializer.compressionCodec = snappy
> hadoop1.sinks.hdfsSk.hdfs.path = xxxxxxx/data/flume/%{category}/%{type}/%Y/%m/%d/%{partition}/%{hour}
> hadoop1.sinks.hdfsSk.hdfs.filePrefix = %{type}_%Y-%m-%d_%H_%{host}
> hadoop1.sinks.hdfsSk.hdfs.fileSuffix = .avro
> hadoop1.sinks.hdfsSk.hdfs.rollInterval = 3700
> hadoop1.sinks.hdfsSk.hdfs.rollSize = 67000000
> hadoop1.sinks.hdfsSk.hdfs.rollCount = 0
> hadoop1.sinks.hdfsSk.hdfs.batchSize = 10000
> hadoop1.sinks.hdfsSk.hdfs.idleTimeout = 300
>
>
> Thanks!
>
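
For reference, the mechanism behind this (an illustrative sketch assuming
standard Hadoop 2.x semantics, not code from Flume itself): on each batch the
HDFS sink flushes the output stream, which ultimately ends in an hflush()/sync()
call. hflush() pushes the buffered bytes down the datanode pipeline and makes
them visible to new readers, but it does not update the file length recorded at
the NameNode; that only changes when a block or the file is closed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/hflush-demo");  // hypothetical test path

        try (FSDataOutputStream out = fs.create(p)) {
            for (int i = 0; i < 100000; i++) {
                out.writeBytes("event " + i + "\n");
            }
            // Make the data visible to readers without closing the file.
            out.hflush();
            // Still shows 0 (or completed blocks only): the NameNode has
            // not been told about the bytes in the open block.
            System.out.println("After hflush: " + fs.getFileStatus(p).getLen());
        } // close() completes the last block and the length catches up
        System.out.println("After close:  " + fs.getFileStatus(p).getLen());
    }
}

This matches what the screenshots show: the .tmp file's listed size only jumps
when a 128MB block completes or when the file rolls and is closed, even though
the flushed events were already readable the whole time.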
