flume-user mailing list archives

From Andrei Stryia <Andrei_Str...@epam.com>
Subject Generate valid Avro file in S3
Date Thu, 20 Aug 2015 09:38:50 GMT
Hi there,

Our system generates a lot of small Avro files, all with the same schema, and sends them
to Flume via Thrift RPC.
Our Flume agent has the following configuration:

agent.channels=ch1
agent.sources=thrift-source1
agent.sinks=s3-sink1
agent.channels.ch1.type=file
agent.channels.ch1.checkpointDir=/flume/ch1/checkpoint
agent.channels.ch1.dataDirs=/flume/ch1/data
agent.sources.thrift-source1.channels=ch1
agent.sources.thrift-source1.type=thrift
agent.sources.thrift-source1.bind=0.0.0.0
agent.sources.thrift-source1.threads=5
agent.sources.thrift-source1.port=1026
agent.sinks.s3-sink1.channel=ch1
agent.sinks.s3-sink1.type=hdfs
agent.sinks.s3-sink1.hdfs.path=s3n://bucket/path/
agent.sinks.s3-sink1.hdfs.filePrefix=documents
agent.sinks.s3-sink1.hdfs.fileSuffix=.avro
agent.sinks.s3-sink1.hdfs.rollInterval=0
agent.sinks.s3-sink1.hdfs.rollSize=20971520
agent.sinks.s3-sink1.hdfs.rollCount=0
agent.sinks.s3-sink1.hdfs.batchSize=10
agent.sinks.s3-sink1.hdfs.fileType=DataStream
agent.sinks.s3-sink1.hdfs.useLocalTimeStamp=true

Currently Flume just concatenates all the Avro files into a single file. As a result, I get
one big file in which the schema and other Avro-specific metadata are written multiple times.
How can I configure Flume to generate a valid Avro container file, where the schema is written
once and which contains the Avro datums (without metadata) from all the small files? The
schema is the same for all files.
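
For illustration, one direction that may be worth exploring is the HDFS sink's `serializer` setting. The fragment below is an untested sketch, not a confirmed fix. It assumes the client is changed to send each record as a single binary-encoded Avro datum per event (not a whole container file), and that every event carries a `flume.avro.schema.url` header, added here with a static interceptor. The interceptor name `i1` and the schema location are hypothetical placeholders.

```properties
# Sketch only: assumes each event body is ONE binary-encoded Avro datum,
# not a complete Avro container file as in the current setup.
agent.sinks.s3-sink1.hdfs.fileType=DataStream
agent.sinks.s3-sink1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder

# The serializer looks up the schema from an event header.
# A static interceptor (hypothetical name "i1") can attach it to every event.
agent.sources.thrift-source1.interceptors=i1
agent.sources.thrift-source1.interceptors.i1.type=static
agent.sources.thrift-source1.interceptors.i1.key=flume.avro.schema.url
# Hypothetical schema location; replace with wherever the .avsc file actually lives.
agent.sources.thrift-source1.interceptors.i1.value=file:///flume/schemas/document.avsc
```

Under these assumptions the sink would write the container header and schema once per rolled file and append the incoming datums as records. If the client keeps sending complete container files, this fragment alone would not help, because each event body would still contain its own header and metadata.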

Thanks,
Andrei.
