Thank you for your quick response!
I could not easily find the exact log entry that caused the issue, since all I had were 30M of input log files :).
After further debugging, I figured out what the issue was. Here is what happened.
For production, we use the Exec source with 'tail -f'. For my local testing I use a spooling directory. The issue happened when I was using the spooldir source and a log file contained non-UTF-8 characters.
However, the exception I posted did not come from processing the log file itself! The flow was as follows:
1. Flume is started with the spooldir source
2. a log file with non-UTF-8 chars is moved into the spooldir
3. Flume starts processing, encounters a "bad" character, and silently stops (no errors or anything)
4. I kill Flume manually and restart - without cleaning out its .flumespool dir
5. Flume starts up and now chokes on its own .flumespool dir and the left-over file in there; this is where the MalformedInputException came from
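For anyone hitting the same exception: java.nio.charset.MalformedInputException is what a strict Java decoder throws when the input bytes are not valid in the configured charset. As an illustration only (Python rather than Flume's Java, but the same decoding mechanics; the byte values here are made up), a byte like 0xE9, which is 'é' in ISO-8859-1, is an invalid lone byte under UTF-8:

```python
# 0xE9 is 'é' in ISO-8859-1 (Latin-1) but not a valid byte sequence in UTF-8
data = b"caf\xe9"

try:
    # Strict UTF-8 decoding, analogous to the spooldir source's default inputCharset
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Java's analogue of this failure is MalformedInputException
    utf8_ok = False

# Every single byte is a valid ISO-8859-1 character, so this always succeeds
latin1_text = data.decode("iso-8859-1")

print(utf8_ok)       # False
print(latin1_text)   # café
```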
When I processed the same file via the Exec source with a 'tail -n 10000 ..' command, it was processed successfully, which told me the issue is specific to the spooldir source.
The solution was to add this parameter to the spooldir source:
a1.sources.r1.inputCharset = ISO8859-1
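In context, a minimal spooldir source configuration with this setting might look like the sketch below (the agent/source names a1/r1 follow the line above but are placeholders, as is the spool path; decodeErrorPolicy is a separate option in newer Flume releases that replaces or skips undecodable bytes instead of failing, so check your version's user guide before relying on it):

```
# Agent name (a1), source name (r1), and spool path are placeholders
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume-spool
a1.sources.r1.inputCharset = ISO8859-1
# Alternative in newer Flume releases: replace or skip bad bytes instead of failing
# a1.sources.r1.decodeErrorPolicy = REPLACE
```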
From: Jeff Lord <firstname.lastname@example.org>
To: "email@example.com" <firstname.lastname@example.org>; Marina <email@example.com>
Sent: Monday, March 9, 2015 11:17 AM
Subject: Re: MalformedInputException processing logs from Varnish server
Do you have a sample of the characters/data which you believe to be causing this?
Can you just confirm whether you are using the Apache version of Flume or a specific distro?
Also, in your message you mention that you are using tail -f, which would be the exec source, but the stack trace looks like you are actually using the spooldir source.