flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: Handling malformed data when using custom AvroEventSerializer and HDFS Sink
Date Thu, 02 Jan 2014 15:25:03 GMT
On Tue, Dec 31, 2013 at 8:34 PM, ed <edorsey@gmail.com> wrote:

> Hello,
> We are using Flume v1.4 to load JSON formatted log data into HDFS as Avro.
>  Our flume setup looks like this:
> NXLog ==> (FlumeHTTPSource -> HDFSSink w/ custom EventSerializer)
> Right now our custom EventSerializer (which extends
> AbstractAvroEventSerializer) takes the JSON input from the HTTPSource and
> converts it into an avro record of the appropriate type for the incoming
> log file.  This is working great and we use the serializer to add some
> additional "synthetic" fields to the avro record that don't exist in the
> original JSON log data.
> My question concerns how to handle malformed JSON data (or really any
> error inside of the custom EventSerializer).  It's very likely that as we
> parse the JSON there will be records where something is malformed (either
> the JSON itself, or a field is of the wrong type etc.).
> For example, a "port" field which should always be an Integer might for
> some reason have some ASCII text in it.  I'd like to catch these errors in
> the EventSerializer and then write out the bad JSON to a log file somewhere
> that we can monitor.

Yeah it would be nice to have a better story about this in Flume.

> What is the best way to do this?

Typically people will either log it to a file or send it through another
"flow" to a different HDFS sink.

> Right now, all the logic for catching bad JSON would be inside of the
> "convert" function of the EventSerializer.  Should the convert function
> itself throw an exception that will be gracefully handled upstream

The exception will be logged, but that is it.

> or do I just return a "null" value if there was an error?  Would it be
> appropriate to log errors directly to a database from inside the
> EventSerializer convert method or would this be too slow?

That might be too slow to do directly. If I did that I'd have a separate
thread doing that and then an in-memory queue between the serializer and
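
A rough sketch of the thread-plus-queue idea, in plain Java with no Flume dependencies. The class name BadRecordReporter, the queue capacity, and the slowWriter callback (which would wrap the database or file write) are all hypothetical names for illustration, not part of Flume:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: hands bad records to a background thread via a bounded queue. */
class BadRecordReporter implements AutoCloseable {
  private final BlockingQueue<String> queue;
  private final Thread worker;
  private volatile boolean running = true;

  BadRecordReporter(int capacity, Consumer<String> slowWriter) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.worker = new Thread(() -> {
      try {
        // Keep draining until close() was called and the queue is empty.
        while (running || !queue.isEmpty()) {
          String record = queue.poll(100, TimeUnit.MILLISECONDS);
          if (record == null) {
            continue;
          }
          try {
            slowWriter.accept(record); // the slow part (e.g. a DB insert)
          } catch (RuntimeException e) {
            // Swallow so one failed write does not kill the worker.
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "bad-record-reporter");
    worker.setDaemon(true);
    worker.start();
  }

  /** Never blocks the serializer; returns false if the queue is full and the record was dropped. */
  boolean report(String badRecord) {
    return queue.offer(badRecord);
  }

  /** Stops accepting work and waits for the worker to drain the queue. */
  @Override
  public void close() {
    running = false;
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

The serializer would call report(...) from its error path and carry on. The queue is bounded and uses offer(), so a slow or dead database costs dropped error records rather than a stalled sink; whether that trade-off is acceptable depends on how much you care about every bad record.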

> What are the best practices for this type of error handling?

It looks to me like we'd need to change AbstractAvroEventSerializer to
filter out nulls:


which we could easily do.  Since you don't want to wait for that you could
override the write method to do this.
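
A minimal sketch of that workaround, assuming convert returns null for malformed input. To keep it self-contained it does not use the Flume classes: SketchSerializer is a stand-in that only mimics the convert/write shape, events are plain Strings, and PortSerializer and the rejected list are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the general shape of an Avro serializer base class; NOT the Flume API. */
abstract class SketchSerializer<T> {
  // Stands in for the Avro data file writer.
  protected final List<T> written = new ArrayList<>();

  protected abstract T convert(String body);

  public void write(String body) {
    T record = convert(body);
    if (record == null) {
      // Mimics a writer that cannot append a null record.
      throw new NullPointerException("null record");
    }
    written.add(record);
  }
}

/** Hypothetical subclass: convert() returns null on bad input and write() skips nulls. */
class PortSerializer extends SketchSerializer<Integer> {
  final List<String> rejected = new ArrayList<>();

  @Override
  protected Integer convert(String body) {
    try {
      return Integer.valueOf(body.trim());
    } catch (NumberFormatException e) {
      return null; // signal "malformed" instead of throwing
    }
  }

  @Override
  public void write(String body) {
    if (convert(body) == null) {
      rejected.add(body); // or log it / hand it to another flow
      return;
    }
    // Note: this converts a second time inside the base class.
    super.write(body);
  }
}
```

The double conversion is the price of not touching the base class; if convert is expensive, caching the last converted record in a field would avoid it.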

> Thank you for any assistance!
> Best Regards,
> Ed

Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
