flume-user mailing list archives

From no jihun <jees...@gmail.com>
Subject Re: About Avro file writing progress on hdfs via Flume.
Date Mon, 21 Mar 2016 09:29:32 GMT
Thanks, Gonzalo.

For other people: this is the detailed explanation I found.

https://community.hortonworks.com/questions/6251/practical-limits-on-number-of-simultaneous-open-hd.html
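
In short: the sink's flushes do reach the DataNodes, but the NameNode only
learns a file's new length when a block (or the file itself) is completed, so
the size listed for an open .tmp file stays stale even though the data is
already readable. A minimal sketch for checking this against a live .tmp file,
assuming the standard org.apache.hadoop.fs API (the class name is made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: compare the NameNode-reported length of an open
// .avro.tmp file with the number of bytes actually readable from it.
public class OpenFileLength {
  public static void main(String[] args) throws IOException {
    Path p = new Path(args[0]); // e.g. the ~~9320.avro.tmp path
    FileSystem fs = FileSystem.get(new Configuration());

    // Length as the NameNode reports it; for a file still open for
    // write this excludes the block currently being written.
    long reported = fs.getFileStatus(p).getLen();

    // Length found by reading the stream to the end; data the writer
    // has hflush()ed into the still-open block is visible here.
    long readable = 0;
    byte[] buf = new byte[8192];
    try (FSDataInputStream in = fs.open(p)) {
      int n;
      while ((n = in.read(buf)) > 0) {
        readable += n;
      }
    }
    System.out.println("reported=" + reported + " readable=" + readable);
  }
}

If "readable" runs well ahead of "reported" while the file is open, the events
made it to HDFS and only the metadata is lagging.
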
On Mar 21, 2016 6:03 PM, "Gonzalo Herreros" <gherreros@gmail.com> wrote:

> HDFS doesn't work exactly like a regular filesystem. It works in blocks,
> 128MB by default if I remember right.
> It could be that the NameNode doesn't update the size until a block is closed.
> On Mar 21, 2016 8:10 AM, "no jihun" <jeesim2@gmail.com> wrote:
>
>> Hello developers.
>>
>> I am using Flume to write Avro files on HDFS.
>>
>> Now I am very curious why the size of an .avro.tmp file on HDFS does not
>> increase gradually, but only jumps at the last moment (closing time),
>> even though many batches and transactions happen.
>>
>>
>> For example.
>>
>> There is continuous high event traffic (5k events/s) coming into the
>> channel, which is sinked to HDFS.
>>
>> When the '~~9320.avro.tmp' file is created, it starts at 895KB.
>> [image: inline image 2]
>>
>>
>>
>> But even though the channel keeps sinking events to HDFS, the
>> ~~9320.avro.tmp file never grows.
>> - 5 minutes later
>> [image: inline image 6]
>> - 5 minutes later
>> [image: inline image 7]
>>
>>
>> And finally a roll happens, due to rollSize.
>> At that moment the 9320.avro file jumps to 113MB.
>>
>> [image: inline image 3]
>>
>> Likewise, the next file, 9321.avro.tmp, does not grow until it is rolled.
>> - 5 minutes later
>> [image: inline image 8]
>>
>>
>>
>> I thought, "maybe the Avro file is buffered on the Flume agent's machine
>> and the entire file is flushed at the last moment, on closing/rolling."
>>
>> So I checked the network traffic at the rolling moment,
>> but it does not spike then.
>>
>> So now I think the HDFS sink does flush each transaction batch to HDFS,
>> but HDFS holds the stream somewhere inside Hadoop and does not write it
>> to disk until file closing time.
>>
>> Does anybody know the process by which an Avro file is created, flushed,
>> and closed on HDFS?
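
(Inline, for anyone who finds this thread later: my understanding of the
lifecycle, as a minimal sketch. This is not Flume's actual code, the path is
hypothetical, and it only uses the standard Hadoop FileSystem API.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushVsClose {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/flush-demo.avro.tmp"); // hypothetical path

    FSDataOutputStream out = fs.create(p, true);
    out.write(new byte[1024 * 1024]); // 1MB of payload

    // hflush() pushes the data down the DataNode pipeline and makes it
    // visible to new readers, but it does not complete the block...
    out.hflush();
    // ...so the NameNode still reports the old length (0 here).
    System.out.println("after hflush: " + fs.getFileStatus(p).getLen());

    // close() completes the last block and updates the file metadata,
    // which is why the listed size jumps only at rolling/closing time.
    out.close();
    System.out.println("after close: " + fs.getFileStatus(p).getLen());
  }
}

As I understand it, hflush() is what the sink issues per batch for
fileType = DataStream; hsync() would additionally force the DataNodes to
sync the data to disk.
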
>>
>>
>>
>> This is the configuration of the HDFS sink.
>>
>> hadoop1.sinks.hdfsSk.type = hdfs
>> hadoop1.sinks.hdfsSk.channel = fileCh1
>> hadoop1.sinks.hdfsSk.hdfs.fileType = DataStream
>> hadoop1.sinks.hdfsSk.serializer = avro_event
>> hadoop1.sinks.hdfsSk.serializer.compressionCodec = snappy
>> hadoop1.sinks.hdfsSk.hdfs.path =
>> xxxxxxx/data/flume/%{category}/%{type}/%Y/%m/%d/%{partition}/%{hour}
>> hadoop1.sinks.hdfsSk.hdfs.filePrefix = %{type}_%Y-%m-%d_%H_%{host}
>> hadoop1.sinks.hdfsSk.hdfs.fileSuffix = .avro
>> hadoop1.sinks.hdfsSk.hdfs.rollInterval = 3700
>> hadoop1.sinks.hdfsSk.hdfs.rollSize = 67000000
>> hadoop1.sinks.hdfsSk.hdfs.rollCount = 0
>> hadoop1.sinks.hdfsSk.hdfs.batchSize = 10000
>> hadoop1.sinks.hdfsSk.hdfs.idleTimeout = 300
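
(One more inline note: with hdfs.batchSize = 10000 and fileType = DataStream,
my understanding is that the open stream is flushed once per 10000-event
batch. A self-contained simulation of that pattern — again not Flume's code,
and the path and payload are made up:)

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchFlushDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/batch-demo.avro.tmp"); // hypothetical path
    byte[] event = "{\"k\":\"v\"}\n".getBytes(StandardCharsets.UTF_8);

    try (FSDataOutputStream out = fs.create(p, true)) {
      for (int batch = 0; batch < 5; batch++) {
        for (int i = 0; i < 10000; i++) { // one hdfs.batchSize worth
          out.write(event);
        }
        out.hflush(); // one flush per batch, as the sink does
        // Stays 0 here: well under one 128MB block ever completes.
        System.out.println("batch " + batch
            + " reported len=" + fs.getFileStatus(p).getLen());
      }
    } // close() finally makes the full length visible
    System.out.println("after close: " + fs.getFileStatus(p).getLen());
  }
}

This should print a stale (0) reported length for every batch and the real
length only after close, which matches what the screenshots above show.
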
>>
>>
>> Thanks!
>>
>
