flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dominik Hübner <cont...@dhuebner.com>
Subject Non overlapping time slice partitioning
Date Wed, 24 Jun 2015 14:13:01 GMT
I am using clouderas flume example to consume the twitter sample stream and store it on HDFS.


https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java
<https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java>

They put the time a tweet was created at as a header to the flume event (“timestamp”).

From working with the twitter stream earlier I noticed that there usually is some lag between
the time I receive a tweet and the time the tweet was created.

My goal is to partition tweets by the time they were created. 

Will this timestamp header take care of this?
(using this configuration: 
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/ <hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/>
)

What happens (in an extreme case) when a tweet arrives a couple of minutes late. Will flume
reopen a file from the past hour and add it there? 
If not, how can I achieve a proper partitioning without overlaps between time slices (hours)?


Best
Dominik


Mime
View raw message