flume-user mailing list archives

From: Hari Shreedharan <hshreedha...@cloudera.com>
Subject: Re: Flume logs http request info
Date: Wed, 27 Feb 2013 19:08:25 GMT
Thomas, 

Looks like your data is written out as text. It is possible that Flume wrote out the entire event, but your HDFS cluster failed to allocate a fresh block after persisting only half of your row. In that case a dangling partial event can be left behind, and Flume will retry the whole event because HDFS throws an exception back to the sink - so you can end up with a truncated copy of the row in addition to the retried, complete one.

Either use a binary format in which malformed data can be easily identified and discarded, or make sure the job consuming the data can ignore malformed rows. I am not a Hive expert, but I know you can select only the rows of a table that match certain criteria, and making sure your schema has a last column that should never legitimately be null is a good check: if the last column is null, the row may not have been written out completely and can be ignored (select * from table where last_column is not null keeps only the complete rows).
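For example, a rough sketch of that check in HiveQL. The table and column
names here are assumptions on my part (the table name is guessed from your
warehouse path, and "payload" stands in for whatever your last declared
column is), so adjust them to your schema:

    -- Keep only rows whose last column was actually written out;
    -- a NULL there suggests the row was truncated mid-write.
    SELECT *
    FROM user_events
    WHERE payload IS NOT NULL;

Note that in HiveQL a comparison like "payload != null" never matches
anything (comparing to NULL yields NULL, which the WHERE clause treats as
false), so use IS NOT NULL / IS NULL. Along the same lines, writing a
binary container instead of plain text (for example SequenceFile via the
HDFS sink's fileType setting) makes a half-written record detectable
rather than letting it turn into a malformed text row.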


Hope this helps.


Hari 

-- 
Hari Shreedharan


On Wednesday, February 27, 2013 at 3:25 AM, Thomas Adam wrote:

> Hi,
> 
> I have an issue with my Flume agents, which collect JSON data and save
> it to an HDFS store for Hive. Today my daily job broke because of
> malformed rows. I looked into the files to see what happened, and I
> found something like this in one of them:
> 
> ...
> POST / HTTP/1.0
> Host: localhost:50000
> Content-Length: 185
> Content-Type: application/x-www-form-urlencoded
> ...
> 
> And this breaks my JSON SerDe in Hive. IMHO the Flume agents are logging
> this data themselves; I'm sure that I don't send anything like this.
> 
> I have two Flume agents.
> The first one collects data from my application with the HTTPSource:
> 
> http.sources = user_events
> http.channels = user_events
> http.sinks = user_events
> 
> http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
> http.sources.user_events.port = 50000
> http.sources.user_events.interceptors = timestamp
> http.sources.user_events.interceptors.timestamp.type = timestamp
> http.sources.user_events.channels = user_events
> 
> http.channels.user_events.type = memory
> http.channels.user_events.capacity = 100000
> http.channels.user_events.transactionCapacity = 1000
> 
> http.sinks.user_events.type = avro
> http.sinks.user_events.channel = user_events
> http.sinks.user_events.hostname = 10.2.0.190
> http.sinks.user_events.port = 20000
> http.sinks.user_events.batch-size = 100
> 
> And the second agent puts the data into HDFS:
> 
> hdfs.sources = user_events
> hdfs.channels = user_events
> hdfs.sinks = user_events
> 
> hdfs.sources.user_events.type = avro
> hdfs.sources.user_events.channels = user_events
> hdfs.sources.user_events.bind = 10.2.0.190
> hdfs.sources.user_events.port = 20000
> 
> hdfs.channels.user_events.type = memory
> hdfs.channels.user_events.capacity = 100000
> hdfs.channels.user_events.transactionCapacity = 1000
> 
> hdfs.sinks.user_events.type = hdfs
> hdfs.sinks.user_events.channel = user_events
> hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
> hdfs.sinks.user_events.hdfs.filePrefix = flume
> hdfs.sinks.user_events.hdfs.rollInterval = 600
> hdfs.sinks.user_events.hdfs.rollSize = 134217728
> hdfs.sinks.user_events.hdfs.rollCount = 0
> hdfs.sinks.user_events.hdfs.batchSize = 1000
> hdfs.sinks.user_events.hdfs.fileType = DataStream
> 
> It has been working for 3 months without any problems and I haven't
> changed anything in that time.
> I am using Flume 1.3.0 and CDH 4.1.2.
> 
> I hope someone can help me resolve this issue.
> 
> Thanks & Regards
> Thomas
> 
> 


