flume-user mailing list archives

From: Jay Stricks <...@wapolabs.com>
Subject: Support for Flume Master Instability
Date: Tue, 14 Feb 2012 22:33:26 GMT
Hi everyone,

Our Flume cluster was built around a single master when our load was much
lower, but over the past few months, we've seen orders of magnitude
growth.  We have around 75 agents and five collectors; I'm not sure what
our data throughput is right now. Our main agent sources are server logs
and syslog data, feeding autoE2E sinks. Collectors have autoCollector
sources and HDFS sinks that write to Amazon S3.
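
For reference, the flows we submit look roughly like this (hostnames,
file paths, the syslog port, and the bucket name are simplified
placeholders):

    exec config app-agent-01 'tail("/var/log/app/app.log")' autoE2EChain
    exec config app-agent-02 'syslogTcp(5140)' autoE2EChain
    exec config collector-01 autoCollectorSource 'collectorSink("s3n://OUR-BUCKET/flume/%Y-%m-%d/", "log-")'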

Each night when one of our databases backs up, several of our app boxes
(agents) are terminated due to failing an auto-scaling health test, and
replacements are spun up. Two things are highly correlated with this
cycling: 1) spikes in Disk Write Bytes on our master, from around 300 KB
to ~1.6 MB; and 2) the master crashing. We attribute the failures to the
master being overwhelmed by the number of agents simultaneously trying to
reconnect, mainly because no errors appear in the master's flume log, only
'exec config' statements for the new boxes. It is essentially a silent
failure that we are struggling to understand. Our master is on an AWS
m1.large box (7.5 GB
memory, 4 EC2 Compute Units [2 virtual cores with 2 EC2 Compute Units
each], 850 GB instance storage, 64-bit platform).

We are looking into ways to mitigate the effects of the database backup, so
that the agent boxes don't fail their health checks (assuming that is the
root cause). But in the meantime, we are having to restart the master
almost every night.

We have found that unless we also restart Flume on all of the collectors
and agents when we restart Flume on the master, and resubmit the
configurations to the master, we end up missing data, often even though
Flume is running on the agents and the master's web interface shows the
agents as active. As a safety precaution, until we figure out why that
happens, we are just restarting the entire cluster.
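
In practice the nightly sequence boils down to the following sketch
(master-host is a placeholder for our master's hostname):

    # 1. restart the Flume master daemon on the master box
    # 2. restart the Flume node daemon on every agent and collector
    # 3. reconnect to the master and resubmit the same exec config flows as above
    flume shell -c master-host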

There are a number of questions we are hoping to get help with.

First, has anyone else experienced their master crashing under the strain
of too many agents trying to connect to it? Why would a spike in Disk
Write Bytes be related to this? Does it make sense that too many
simultaneous attempts to register with the master could cause such a
failure?

Are there any configuration changes we can make in a single-master
deployment so that it can handle this load? Would storing the
configuration settings locally on the master help get the system back up
faster?
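
(If I'm reading the docs right, the local-store option would be something
like this in flume-site.xml, switching from the default ZooKeeper-backed
config store to the memory-backed one; I may have the property name or
values wrong:)

    <property>
      <name>flume.master.store</name>
      <!-- our guess: "memory" instead of the default "zookeeper" -->
      <value>memory</value>
    </property>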

Would adding two more masters help with the load? We would have to change
many of our config settings that aren't supported with multiple masters, so
we are hoping that there is a one-master solution.
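
(My understanding is that a three-master setup would mean something like
this in flume-site.xml on each master, with a unique serverid per box,
plus a shared configuration store; please correct me if I have that
wrong:)

    <property>
      <name>flume.master.servers</name>
      <value>masterA,masterB,masterC</value>
    </property>
    <property>
      <name>flume.master.serverid</name>
      <!-- 0 on masterA, 1 on masterB, 2 on masterC -->
      <value>0</value>
    </property>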

Could using an external ZooKeeper ensemble for the master's configuration
store help at all? Any other suggestions?
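
(On the external ZooKeeper question: from my reading of the docs, pointing
the master at an external ensemble would look roughly like this in
flume-site.xml, though the exact property names may be off:)

    <property>
      <name>flume.master.zk.use.external</name>
      <value>true</value>
    </property>
    <property>
      <name>flume.master.zk.servers</name>
      <!-- placeholder hostnames for the external ZooKeeper quorum -->
      <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>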

I really appreciate the help, since I don't have a ton of experience and am
hoping to get some much-needed stability.

Thanks a lot,

Jay
