Flume writes each event to the partition that matches the hour in the event's timestamp header. If the file in that partition is still active (not yet renamed from its temporary inUseSuffix name), the event goes to that active file. If the file in that partition has already been rolled (based on rollInterval, rollCount, rollSize, or idleTimeout), the event is written to a new file in the same partition.
So a late event still lands in the right place: as long as its timestamp header points to the correct hour, it will be written to that hour's partition.
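
To make the mapping concrete, here is a rough Python sketch of how an event's timestamp header (milliseconds since the epoch) picks the hourly directory. It is only an illustration, not Flume's own code: partition_path is a made-up helper, and it assumes UTC, whereas the sink resolves the escape sequences in its own configured time zone.

```python
from datetime import datetime, timezone


def partition_path(base, timestamp_ms):
    """Return the hourly directory for a timestamp header given in epoch millis.

    Hypothetical helper for illustration; assumes UTC.
    """
    created = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    # Same escape sequences as in hdfs.path: %Y/%m/%d/%H
    return base.rstrip("/") + created.strftime("/%Y/%m/%d/%H/")


base = "hdfs://hadoop1:8020/user/flume/tweets"

# A tweet created at 06:58:30 UTC that only reaches the sink after 07:00.
created_at = datetime(2015, 6, 24, 6, 58, 30, tzinfo=timezone.utc)
late_tweet_ms = int(created_at.timestamp() * 1000)

# The arrival time plays no role; only the header value does.
print(partition_path(base, late_tweet_ms))
```

The directory depends only on the header, so the late tweet still resolves to the 06 directory. Whether it is appended to the file that is still open there or starts a new one depends on whether that file has already been rolled.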


On Wed, Jun 24, 2015 at 7:13 AM, Dominik Hübner <> wrote:
I am using Cloudera's Flume example to consume the Twitter sample stream and store it on HDFS.

They put the time a tweet was created into a header on the Flume event (“timestamp”).
From working with the Twitter stream earlier, I noticed that there is usually some lag between the time a tweet was created and the time I receive it.

My goal is to partition tweets by the time they were created. 

Will this timestamp header take care of this?
I am using this configuration:
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

What happens (in an extreme case) when a tweet arrives a couple of minutes late? Will Flume reopen a file from the past hour and add the tweet there?
If not, how can I achieve a proper partitioning without overlaps between time slices (hours)?