I am using clouderas flume example to consume the twitter sample stream and store it on HDFS.
They put the time a tweet was created at as a header to the flume event (“timestamp”).
From working with the twitter stream earlier I noticed that there usually is some lag between the time I receive a tweet and the time the tweet was created.
My goal is to partition tweets by the time they were created.
Will this timestamp header take care of this?
(using this configuration:
What happens (in an extreme case) when a tweet arrives a couple of minutes late. Will flume reopen a file from the past hour and add it there?
If not, how can I achieve a proper partitioning without overlaps between time slices (hours)?