flume-user mailing list archives

From Matt Wise <m...@nextdoor.com>
Subject Re: Flume 1.3.0 + HDFS Sink + S3N + avro_event + Hive…?
Date Wed, 08 May 2013 20:21:49 GMT
Eric,
  I found the bug just a little bit ago... 

>  agent.sinks.s3.hdfs.batchSize = 10000
> -agent.sinks.s3.hdfs.serializer = avro_event
> -agent.sinks.s3.hdfs.fileType = SequenceFile
> +agent.sinks.s3.hdfs.writeFormat = Text
> +agent.sinks.s3.hdfs.fileType = DataStream
>  agent.sinks.s3.hdfs.timeZone = UTC
> +agent.sinks.s3.hdfs.filePrefix = FlumeData
> +agent.sinks.s3.hdfs.fileSuffix = .avro
> +agent.sinks.s3.serializer = avro_event

  Essentially, I was setting the serializer in the wrong part of the configuration, and Flume
wasn't letting me know. Once I fixed that, using the avro-tools package on the files created
by this Sink seems to work just fine; a quick programmatic read-back is sketched below. It's
terribly under-documented, but it does seem to work now.
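  For reference, here's a rough, untested sketch of reading one of the rolled files back with
the plain Avro file API (the class name and path are just placeholders, and it assumes the
corrected config above so the sink is writing real Avro Data Files). It prints the same schema
that avro-tools getschema shows, then dumps each event:

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadFlumeAvroFile {
        public static void main(String[] args) throws Exception {
            // One of the files the sink rolled out, e.g. FlumeData.<timestamp>.avro
            File file = new File(args[0]);
            DataFileReader<GenericRecord> reader =
                new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
            try {
                // Same output as "avro-tools getschema"
                System.out.println(reader.getSchema());
                // With the avro_event serializer, each record is one Flume event:
                // a map of headers plus the body bytes.
                while (reader.hasNext()) {
                    System.out.println(reader.next());
                }
            } finally {
                reader.close();
            }
        }
    }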

--Matt

On May 8, 2013, at 1:12 PM, Eric Sammer <esammer@cloudera.com> wrote:

> Matt:
> 
> This is because what you're actually doing is writing Avro records into Hadoop Sequence
> Files. The Avro tools only know how to read Avro Data Files (which are, effectively, meant
> to supersede Sequence Files). The serializer plugin only says "write each event as an Avro
> record." It doesn't say "write these Avro records as an Avro Data File." It's all very
> confusing, admittedly. I don't think we support writing Avro Data Files with the HDFS sink
> today. In other words, you need to use the Sequence File APIs to read the *files* produced
> by Flume. The records within those files will, however, be Avro records.
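> As a rough, untested sketch of that (the class name and path argument are just placeholders),
> plain SequenceFile.Reader usage along these lines should iterate over what the sink wrote:
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Writable;
>     import org.apache.hadoop.util.ReflectionUtils;
> 
>     public class ReadFlumeSeqFile {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>             // One of the rolled files, e.g. FlumeData.1367857371493
>             Path path = new Path(args[0]);
>             SequenceFile.Reader reader =
>                 new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
>             try {
>                 Writable key = (Writable)
>                     ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>                 Writable value = (Writable)
>                     ReflectionUtils.newInstance(reader.getValueClass(), conf);
>                 while (reader.next(key, value)) {
>                     // The value carries the serialized event; with an Avro serializer
>                     // configured, those bytes are Avro-encoded records, not an Avro Data File.
>                     System.out.println(key + "\t" + value);
>                 }
>             } finally {
>                 reader.close();
>             }
>         }
>     }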
> 
> 
> 
> On Wed, May 8, 2013 at 10:42 AM, Matt Wise <matt@nextdoor.com> wrote:
> We're still working on getting our POC of Flume up and running... right now we have log
> events that pass through our Flume nodes via a Syslog input and are happily sent off to
> ElasticSearch for indexing. We're also sending these events to S3, but we're finding that
> they seem to be unreadable with the avro tools.
> 
>> # S3 Output Sink
>> agent.sinks.s3.type = hdfs
>> agent.sinks.s3.channel = fc1
>> agent.sinks.s3.hdfs.path = s3n://XXX:XXX@our_bucket/flume/events/%y-%m-%d/%H
>> agent.sinks.s3.hdfs.rollInterval = 600
>> agent.sinks.s3.hdfs.rollSize = 0
>> agent.sinks.s3.hdfs.rollCount = 10000
>> agent.sinks.s3.hdfs.batchSize = 10000
>> agent.sinks.s3.hdfs.serializer = avro_event
>> agent.sinks.s3.hdfs.fileType = SequenceFile
>> agent.sinks.s3.hdfs.timeZone = UTC
> 
> 
> When we try to look at the avro-serialized files, we get this error:
> 
>> [localhost avro]$ java -jar avro-tools-1.7.4.jar getschema FlumeData.1367857371493
>> Exception in thread "main" java.io.IOException: Not a data file.
>>         at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
>>         at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
>>         at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
>>         at org.apache.avro.tool.DataFileGetSchemaTool.run(DataFileGetSchemaTool.java:48)
>>         at org.apache.avro.tool.Main.run(Main.java:80)
>>         at org.apache.avro.tool.Main.main(Main.java:69)
> 
> At this point we're a bit unclear: how are we supposed to use these FlumeData files with
> normal Avro tools?
> 
> --Matt
> 
> 
> 
> -- 
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com

