flume-user mailing list archives

From Johny Rufus <jru...@cloudera.com>
Subject Re: Non overlapping time slice partitioning
Date Wed, 24 Jun 2015 17:38:54 GMT
Flume uses the hour in the event's timestamp header to choose the partition
and writes the event to the active file in that partition. If the file in
that partition is still active (not yet renamed from its temporary
inUseSuffix extension), the event goes to that file. If the file has already
been rolled (based on rollInterval, rollCount, rollSize or idleTimeout), the
event is written to a new file in that partition.
As long as the timestamp header points to the correct hour, the event will be
written to that partition.
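
For illustration, here is a rough sketch of how the sink settings mentioned
above could fit together. The agent name, sink name and path are taken from
your configuration; the roll values are only assumed examples, not
recommendations, so adjust them to your data volume.

```properties
# Partition by the event's "timestamp" header (epoch milliseconds),
# not by the time the event reaches the sink.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

# false is the default: %Y/%m/%d/%H are resolved from the event header.
# Setting this to true would partition by arrival time instead.
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = false

# Files that are still being written carry this suffix (.tmp is the default).
TwitterAgent.sinks.HDFS.hdfs.inUseSuffix = .tmp

# Assumed example values: roll by time only, and close a file
# after 120 seconds without new events.
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 120
```

With settings like these, a tweet that arrives a few minutes into the next
hour still lands in the directory for the hour it was created, either in the
file that is still open there or in a new file in the same directory.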


On Wed, Jun 24, 2015 at 7:13 AM, Dominik Hübner <contact@dhuebner.com> wrote:

> I am using Cloudera's Flume example to consume the Twitter sample stream
> and store it on HDFS.
> https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java
> They put the time a tweet was created into a header of the Flume event
> (“timestamp”).
> From working with the Twitter stream earlier, I noticed that there usually
> is some lag between the time I receive a tweet and the time the tweet was
> created.
> My goal is to partition tweets by the time they were created.
> Will this timestamp header take care of this?
> (using this configuration:
> TwitterAgent.sinks.HDFS.type = hdfs
> TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
> )
> What happens (in an extreme case) when a tweet arrives a couple of minutes
> late? Will Flume reopen a file from the past hour and add it there?
> If not, how can I achieve a proper partitioning without overlaps between
> time slices (hours)?
> Best
> Dominik