flume-user mailing list archives

From Jay Stricks <...@wapolabs.com>
Subject Re: Support for Flume Master Instability
Date Thu, 16 Feb 2012 21:48:29 GMT
Update on this thread about Master instability:

I wrote in an earlier post that we were experiencing a sort of silent error
when the master would crash, with little information for us to dig into
after the fact. During an outage the other day, the master's log contained
"java.lang.OutOfMemoryError: unable to create new native thread"
exceptions, though this is not always written to the logs when the master
crashes. At least it was something.

At the next crash, I updated a few of the Java heap parameters and
increased the max user processes for the root and flume users to
unlimited. Much of this advice came from a thread I found about the same
error in Hadoop (
http://www.apacheserver.net/major-hdfs-issues-at1195787.htm) and from
general advice on fine-tuning garbage collection. Despite all of the
changes, the master crashed again; but these were the UOPTS parameters I
changed:

(export UOPTS="-Xmx1g -Xms1g -Xss256k -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -verbose:gc
-Xloggc:/home/flume/gc.log" ; export FLUME_CONF_DIR=/usr/local/flume/conf ;
/usr/local/flume/bin/flume master > /var/log/flume/master_collector.log
2>&1 < /dev/null &)
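
For reference, the max-user-processes change mentioned above went into
/etc/security/limits.conf; a sketch of the entries (the usernames are
assumptions to be adjusted to your setup):

```
root   soft  nproc  unlimited
root   hard  nproc  unlimited
flume  soft  nproc  unlimited
flume  hard  nproc  unlimited
```

A new login session is needed before the limits take effect; verify with
"ulimit -u" as the affected user.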

Fortunately, I had started logging the number of threads running in both
the Watchdog and Flume Master processes. I found that in a span of 40
minutes, the thread count went from about 220 (its normal level) to
22,238--which is where it stood within a minute of the crash.
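
For anyone wanting to do the same, here is a minimal sketch of how such
thread counts can be sampled; the process-match pattern is an assumption,
so adjust it to your own ps output:

```shell
# Sample the thread count (NLWP) of a process found by name.
# 'flume master' is a placeholder pattern, not necessarily exact.
pid=$(pgrep -f 'flume master' | head -n 1)
if [ -n "$pid" ]; then
  printf '%s threads=%s\n' "$(date +%FT%T)" \
    "$(ps -o nlwp= -p "$pid" | tr -d ' ')"
fi
```

Run from cron (or a while/sleep loop) and appended to a log, this gives a
time series you can line up against crashes.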

I'm not sure why the unlimited max user processes setting, which I edited
in /etc/security/limits.conf and confirmed with "ulimit -a", didn't help.
The master logged no errors during that spike to indicate why it crashed.
I still need to diagnose whether garbage collection or heap size issues
are at play--but without any errors in the logs, it's very difficult to
tell.

What happened during those 40 minutes? We terminated six instances serving
as agents and launched six more. In the user-data scripts for these
instances, we install our configured Flume package, start the node, and
have it launch a Flume shell that connects to the master and executes
its machine-specific Flume config statements.
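
As an illustration of that user-data step, a hedged sketch of the kind of
Flume (0.9.x) shell call involved; the master address and the config
statement below are placeholders, not our actual settings:

```shell
# Submit this node's config to the master via the Flume shell.
# MASTER, the tail() source, and the autoE2EChain sink are hypothetical
# examples for illustration only.
MASTER="flume-master.internal"
NODE=$(hostname)
CMD="exec config $NODE 'tail(\"/var/log/app.log\")' 'autoE2EChain'"
# Only attempt the submission if the flume binary is installed:
if command -v flume >/dev/null 2>&1; then
  flume shell -c "$MASTER" -e "$CMD"
fi
```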

The master is now running with several UOPTS, which I adjusted after the
last crash to include some of the settings from FLUME-473:

(export UOPTS="-Xmx1024m -Xms1024m -Xss512k -XX:+DisableExplicitGC
-XX:SurvivorRatio=10 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=30
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -verbose:gc
-Xloggc:/home/flume/gc.log" ; export FLUME_CONF_DIR=/usr/local/flume/conf ;
/usr/local/flume/bin/flume master >> /var/log/flume/master_collector.log
2>&1 &)
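
Once the master has run for a while under these flags, the GC log can be
checked for full collections, e.g. (the path matches the -Xloggc setting
above):

```shell
# Count full GCs and show the most recent GC lines, if the log exists.
GC_LOG=/home/flume/gc.log
if [ -f "$GC_LOG" ]; then
  printf 'Full GCs: %s\n' "$(grep -c 'Full GC' "$GC_LOG")"
  tail -n 5 "$GC_LOG"
fi
```

Frequent full GCs, or long pauses in the timestamps, would point back at
heap sizing rather than thread exhaustion.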

This is the current output of "ulimit -a":

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 59721
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 63536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 1024000
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
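
One back-of-the-envelope check worth noting: at -Xss512k, each Java
thread reserves roughly 512 KB of native stack, so a spike to 22,238
threads implies on the order of 11 GB of stack reservations alone--more
than the m1.large's 7.5 GB of memory, regardless of heap settings. A
quick sketch of the arithmetic:

```shell
# Approximate native stack reserved by the observed thread spike.
# 512 KB matches the -Xss512k flag; 22,238 is the observed thread count.
THREADS=22238
STACK_KB=512
echo "~$(( THREADS * STACK_KB / 1024 )) MB of thread stacks"
# prints "~11119 MB of thread stacks"
```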

I am hoping that these parameters help.

Does the issue mentioned in a prior post titled "Collector node goes crazy
with thousands of threads" apply to masters as well? Is it as simple as
upgrading to 0.9.4? Is there anything else people would recommend trying?

I'm very appreciative of everyone's help.


On Tue, Feb 14, 2012 at 5:33 PM, Jay Stricks <jay@wapolabs.com> wrote:

> Hi everyone,
> Our Flume cluster was built around a single master when our load was much
> lower, but over the past few months, we've seen orders of magnitude
> growth.  We have around 75 agents and five collectors; I'm not sure what
> our data throughput is right now. Our main agent sources are server logs
> and syslog data, with autoE2E sinks. Collectors have autoCollector sources
> and HDFS sinks in Amazon S3.
> Each night when one of our databases backs up, several of our app boxes
> (agents) are terminated due to failing an auto-scaling health test, and
> replacements are spun up. Highly correlated with this cycling are 1) spikes
> in Disk Write Bytes on our master from around 300KB to ~1.6MB; and 2) the
> master crashing. We attribute the failures to being overwhelmed by the
> number of agents simultaneously trying to connect to the master because
> there aren't any errors being logged in the master's flume log; only 'exec
> config' statements for the new boxes. It's sort of a silent failure that we
> are struggling to understand. Our master is on an AWS m1.large box (7.5 GB
> memory, 4 EC2 Compute Units [2 virtual cores with 2 EC2 Compute Units
> each], 850 GB instance storage, 64-bit platform).
> We are looking into ways to mitigate the effects of the database backup,
> so that the agent boxes don't fail their health checks (assuming that is
> the root cause). But in the meantime, we are having to restart the master
> almost every night.
> We have found that if we don't restart Flume on all of the collectors and
> agents when we restart Flume on the master, and if we don't resubmit the
> configurations to the master, we end up missing data--often even though
> Flume is running on the agents and the master's web interface shows the
> agents as active. As a safety precaution until we figure out why that's
> the case, we're just restarting the entire cluster.
> There are a number of questions we are hoping to get help with.
> First: has anyone experienced their master crashing under the strain of
> too many agents trying to connect to it? Why would a spike in Disk Write
> Bytes be related to this? Does it make sense that too many attempts to
> register with the master could cause such a failure?
> Are there any configuration settings we can make in a single master
> deployment so that it can handle this? Would storing the configuration
> settings locally on the master help get the system back up faster?
> Would adding two more masters help with the load? We would have to change
> many of our config settings that aren't supported with multiple masters, so
> we are hoping that there is a one-master solution.
> Could using an external zookeeper configuration help at all? Any other
> suggestions?
> I really appreciate the help, since I don't have a ton of experience and
> am hoping to get some much-needed stability.
> Thanks a lot,
> Jay
