flume-user mailing list archives

From Connolly Juhani <juhani_conno...@cyberagent.co.jp>
Subject Flume NG reliability and failover mechanisms
Date Fri, 13 Jan 2012 01:17:07 GMT

Coming into the new year we've been trying out Flume NG and have run into
some questions. I tried to pick up what was possible from the javadoc and
source, but pardon me if some of these are obvious.

1) The post at http://www.cloudera.com/blog/2011/12/apache-flume-architecture-of-flume-ng-2/
describes the reliability model, but what happens if we lose a node?
1.1) Presumably the data stored in its channel is gone?
1.2) If we restart the node and the channel is a persistent one (file or
JDBC based), will it then happily start feeding data into the sink?
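By a persistent channel I mean something configured along these lines (a sketch only; the agent and channel names are made up, and the paths are illustrative):

```
# Hypothetical agent "agent1" with a file-backed channel, so queued
# events survive a process restart (names and paths illustrative).
agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data
```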

2) Is there some way to deliver data along multiple paths but make sure
it only gets persisted to a sink once, so we avoid losing data if a node dies?
2.1) Will there be something equivalent to the E2E mode of OG?
2.2) Anything else planned, but further down the horizon? I didn't
see much at https://cwiki.apache.org/confluence/display/FLUME/Features+and+Use+Cases
but that page doesn't look very up to date.
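For 2), I did notice a failover sink processor while poking around, which I gather would look something like the sketch below (names made up) — but if I understand it correctly that only fails over to a backup sink when the primary dies, rather than giving once-only delivery across multiple live paths:

```
# Hypothetical sink group: events go to k1 while it is healthy,
# failing over to k2 (higher priority = preferred sink).
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
```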

3) Using the hdfs sink, we're getting tons of really small files. I
suspect this is related to append: having a poke around the source,
it turns out that append is only used if hdfs.append.support is set
to true. The hdfs-default.xml name for this variable is
dfs.support.append. Is this intentional? Should we be adding
hdfs.append.support manually to our config, or is there something else
going on here (regarding all the tiny files)?
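Or is tuning the sink's roll thresholds the intended way to avoid the small files, rather than append? I'm imagining something like this (property names guessed from the source, values purely illustrative):

```
# Roll a new HDFS file every 10 minutes or ~128 MB, whichever comes
# first, and disable event-count based rolling (0 = off).
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = /flume/events
agent1.sinks.hdfs1.hdfs.rollInterval = 600
agent1.sinks.hdfs1.hdfs.rollSize = 134217728
agent1.sinks.hdfs1.hdfs.rollCount = 0
```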

Any help with these issues would be greatly appreciated.
