Hoping to get some insight into where to troubleshoot this issue further. The scenario: our web application accepts URL-encoded UTF-8 characters (Cyrillic text in this instance) and forwards the data to a Flume agent via an HTTPSource with the JSONHandler. That agent in turn sends the event via an Avro sink to another Flume agent, which writes it to HDFS using the HDFS sink.
We initially noticed the data was no longer valid in the HDFS file, and after investigating we have found the following:
- The initial POST is correct, verified via a network trace and inspection of the binary data on the wire.
- The Avro event sent from the first Flume agent is already mangled, again verified via a network trace and inspection of the binary payload.
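For what it's worth, here is a quick sketch of one way this kind of mangling can happen (assuming a charset mix-up somewhere in the pipeline, where UTF-8 bytes get decoded with a single-byte charset such as ISO-8859-1; the sample text is illustrative, not our actual payload):

```python
text = "Привет"
utf8_bytes = text.encode("utf-8")  # the correct bytes seen on the wire in the POST

# If some component decodes these UTF-8 bytes using a single-byte
# charset (e.g. a platform-default ISO-8859-1), each byte becomes
# its own character and the string turns into mojibake:
mangled = utf8_bytes.decode("iso-8859-1")
print(mangled)  # two garbage characters per original Cyrillic letter

# Re-encoding the mojibake recovers the original bytes, which is why
# this class of corruption is hard to spot until the data is rendered.
assert mangled.encode("iso-8859-1") == utf8_bytes
```

The mangled payload we see in the Avro trace looks similar in character to this, but we have not pinned down which component is doing the bad decode.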
We do not explicitly set the Content-Type header on the POST from our application, as the documentation states that if it is not set, UTF-8 is assumed.
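As a next diagnostic step we are considering setting the charset explicitly rather than relying on the documented default. A minimal sketch of what that POST would look like, using Python's standard library (the host/port and event body are placeholders, not our actual configuration):

```python
import json
import urllib.request

# Hypothetical HTTPSource endpoint; substitute your agent's host and port.
FLUME_URL = "http://flume-agent:5140"

# Flume's JSONHandler expects a JSON array of events,
# each with "headers" and "body" fields.
events = [{"headers": {}, "body": "Привет, мир"}]

# Encode the payload as UTF-8 ourselves and declare the charset
# explicitly in the Content-Type header.
payload = json.dumps(events, ensure_ascii=False).encode("utf-8")
request = urllib.request.Request(
    FLUME_URL,
    data=payload,
    headers={"Content-Type": "application/json; charset=UTF-8"},
)
# urllib.request.urlopen(request)  # uncomment to actually send
```

Note `ensure_ascii=False`, so the Cyrillic text stays as raw UTF-8 bytes in the payload instead of being escaped to `\uXXXX` sequences; both forms are valid JSON, but the raw form makes it easier to compare against the network trace.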
Can anyone elaborate on where and why this data is being corrupted?