mesos-reviews mailing list archives

From Joseph Wu <>
Subject Review Request 69267: Fixed flaky SchedulerTest.MasterFailover.
Date Wed, 07 Nov 2018 01:26:40 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for mesos, Alexander Rukletsov and Greg Mann.

Bugs: MESOS-6949

Repository: mesos


This test was flaky because there is a double-master-detection race
after the master fails over.  This test uses the Standalone master
detector, which keeps a single Master PID in memory and always returns
that one PID as the leader.  This means there is almost no delay
between failing over the master and detecting a new leader.
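As a rough illustration of why detection is nearly instantaneous here, the
following is a minimal C++17 sketch of a detector that only holds one PID in
memory.  It is not the Mesos implementation: `ToyStandaloneDetector`, `Pid`,
and the address string are hypothetical names made up for this example, and
modeling a failover as "no leader, then the same PID again" is an assumption
of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a master PID; not the libprocess type.
using Pid = std::string;

// Toy detector: stores at most one leader and answers from memory.
class ToyStandaloneDetector {
public:
  // Records `pid` as the leader; std::nullopt means "no leader".
  void appoint(std::optional<Pid> pid) { leader_ = std::move(pid); }

  // Returns the stored leader immediately, with no election or timeout.
  const std::optional<Pid>& detect() const { return leader_; }

private:
  std::optional<Pid> leader_;
};

// Usage: in this sketch a failover is modeled as "no leader" followed by
// the same PID being appointed again.
inline void standaloneDetectorExample() {
  ToyStandaloneDetector detector;
  assert(!detector.detect().has_value());

  detector.appoint(Pid("master@127.0.0.1:5050"));  // Made-up address.
  detector.appoint(std::nullopt);                  // Master goes away.
  detector.appoint(Pid("master@127.0.0.1:5050"));  // Leader is back at once.
  assert(detector.detect() == Pid("master@127.0.0.1:5050"));
}
```

Because nothing in such a detector waits on an election, a scheduler
listening to it would see the new leader almost immediately after the
failover.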

The scheduler in this test sends a SUBSCRIBE call to the master as soon
as the master is detected.  Normally, the test sees only two SUBSCRIBE
calls in total: one before the master failover and one after.  However,
the test also manually appoints the leader after failing over the
master.  This step races against the scheduler's own retry logic and
can cause a third SUBSCRIBE if the second SUBSCRIBE has already
started.
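The counting argument above can be sketched with a toy model in C++17.
This is only an illustration under stated assumptions: `ToyScheduler`,
`countSubscribes`, and the address string are hypothetical, the real race
depends on timing that this deterministic replay does not reproduce, and
the extra detection event simply stands in for the manual appointment
arriving after the second SUBSCRIBE has started.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a master PID; not the libprocess type.
using Pid = std::string;

// Toy scheduler that sends SUBSCRIBE on every detection event that names
// a leader, without checking whether the leader actually changed.
struct ToyScheduler {
  int subscribes = 0;

  void detected(const std::optional<Pid>& leader) {
    if (leader.has_value()) {
      ++subscribes;  // Models sending a SUBSCRIBE call to `leader`.
    }
  }
};

// Replays a sequence of detection events and returns the SUBSCRIBE count.
inline int countSubscribes(const std::vector<std::optional<Pid>>& events) {
  ToyScheduler scheduler;
  for (const auto& event : events) {
    scheduler.detected(event);
  }
  return scheduler.subscribes;
}

inline void subscribeRaceExample() {
  const Pid master = "master@127.0.0.1:5050";  // Made-up address.

  // Intended sequence: detect, lose the master, detect again.
  assert(countSubscribes({master, std::nullopt, master}) == 2);

  // Racy sequence: the manual appointment adds one more detection event.
  assert(countSubscribes({master, std::nullopt, master, master}) == 3);
}
```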

Because the scheduler in this test does not enable checkpointing, the
third SUBSCRIBE will actively disconnect the framework, causing the
master to remove the framework.  This removal also prevents the
framework from ever registering again, so the test times out.

This fixes the test to prevent excess master detection events.

We could also change the HTTP scheduler driver to ignore these extra
master detection events when the master in question has not changed.
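A minimal C++17 sketch of that alternative is shown below.  It is not code
from the HTTP scheduler driver: `DedupingScheduler`, `Pid`, and the address
string are hypothetical, and comparing only against the last detected
leader is an assumption about what "has not changed" would mean.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for a master PID; not the libprocess type.
using Pid = std::string;

// Toy scheduler that remembers the last detected leader and only sends
// SUBSCRIBE when a detection event names a different leader.
class DedupingScheduler {
public:
  void detected(const std::optional<Pid>& leader) {
    if (leader == last_) {
      return;  // Same master (or still no master): ignore the event.
    }
    last_ = leader;
    if (leader.has_value()) {
      ++subscribes_;  // Models sending a SUBSCRIBE call to `leader`.
    }
  }

  int subscribes() const { return subscribes_; }

private:
  std::optional<Pid> last_;
  int subscribes_ = 0;
};

inline void dedupExample() {
  const Pid master = "master@127.0.0.1:5050";  // Made-up address.
  DedupingScheduler scheduler;

  scheduler.detected(master);        // First SUBSCRIBE.
  scheduler.detected(std::nullopt);  // Master fails over.
  scheduler.detected(master);        // Second SUBSCRIBE.
  scheduler.detected(master);        // Duplicate event: ignored.
  assert(scheduler.subscribes() == 2);
}
```

In this model the duplicate detection event is dropped, so the count stays
at two SUBSCRIBE calls even when the leader is appointed twice.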


  src/tests/scheduler_tests.cpp 0ee5b77e5a667e37ac13553e15f634b2cb19ea65 



make check

GLOG_v=1 src/mesos-tests --gtest_filter="*SchedulerTest.MasterFailover*" \
  --gtest_repeat=-1 --gtest_break_on_failure --verbose


Joseph Wu
