A setup similar to what we have at Foursquare would be:
Each of the 20 nodes behind a proxy runs a local flume-node. Application code on each node sends logs via Thrift to its local flume-node.
One machine acts as a collector. The 20 nodes send their data to the collector, and the collector writes the data to HDFS/S3/whatever.
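As a rough sketch, the topology above could be written in the Flume 0.9.x dataflow syntax roughly as follows. This is not our actual configuration: the node names (app01, collector01), the Thrift port 35862, the HDFS path, and the file prefix are all placeholders made up for illustration, and the choice of agentSink is an assumption.

```
app01 : thriftSource(35862) | agentSink("collector01", 35853) ;
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/logs/", "applog-") ;
```

The app01 line would be repeated for each of the 20 nodes, with all of them pointing at the same collector.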
This works pretty well, but I'd stress the following things if you plan on using RPCs at all:
1) Use version 0.9.3, or better yet, wait until 0.9.5, since there are a couple of critical RPC bug fixes that are not in version 0.9.4 (we're about to deploy a version we built from the current 0.9.5 trunk).
2) Even version 0.9.3 has a bunch of RPC-related bugs, which mean you'll have to restart nodes whenever you change their config, but this is manageable.
This setup works very well once it's up and running, and version 0.9.5 should make it much more bulletproof.
Generally the local flume-nodes consume minimal resources. In our experience you can hit them hard without them causing an issue, so resource usage should not be a problem.
Hope that helps somewhat.
Foursquare | Software Engineer | Server Engineering Team
On Thursday, September 1, 2011 at 5:03 PM, Avinash Shahdadpuri wrote:
We have recently started using Flume.
We have 20 servers behind a load balancer and want to see if we can avoid running a flume node on all of them.
We are looking at the option of using the flume log4j appender, avroSource, and dedicated flume nodes (machines just running flume).
1. We can use the flume log4j appender to stream logs to a dedicated flume node running a flume agent/collector. In this case, if the flume node goes down, we would lose the messages.
2. The other option is to use the flume log4j appender to stream logs on the same machine. In this case we would need an agent on the flume node to read from the remote server. The avroSource agent doesn't seem to be able to read from a remote machine. Is there something else we can do here?
Has anyone come across this, and do you have any recommendations for handling it?
Please let me know.