mesos-reviews mailing list archives

From "Kevin Klues" <>
Subject Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().
Date Tue, 19 Jan 2016 22:58:05 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for mesos, Ben Mahler and Greg Mann.

Bugs: MESOS-4409

Repository: mesos


Previously, it was possible for join() to return before a SchedDriver
was actually fully stopped or aborted (breaking the semantics of the
join() call). The race came from a short circuit in join(), which
simply checked for status != DRIVER_RUNNING before returning. It appears
this short circuit was introduced to handle cases where initialize() or
start() aborted before the driver ever started. However, it
unintentionally also covers cases where stop() or abort() was called
*after* the driver started running.

The problem is that stop() and abort() will change the status
to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
dispatched stop or abort events (which happen asynchronously in a
libprocess thread). Under normal operation, join() would wait for these
events to trigger a latch that allowed the join() call to return.
However, with the short circuit, join() exits immediately even if the
libprocess thread hasn't yet processed the stop() or abort() events.
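
The ordering problem described above can be illustrated with a small
single-threaded model. This is only a sketch and not the code in
src/sched/sched.cpp: ToyDriver, processStopped, pending, drain() and
joinShortCircuits() are hypothetical names, and the libprocess thread is
replaced by a queue that is drained explicitly so the interleaving is
deterministic.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Status values named after the ones mentioned in the description.
enum Status {
  DRIVER_NOT_STARTED,
  DRIVER_RUNNING,
  DRIVER_STOPPED,
  DRIVER_ABORTED
};

// Hypothetical toy model (an assumption for illustration, not Mesos code).
struct ToyDriver {
  Status status = DRIVER_NOT_STARTED;
  bool processStopped = false;  // set only once the stop event is handled
  std::queue<std::function<void()>> pending;  // dispatched, not yet processed

  void start() { status = DRIVER_RUNNING; }

  // The status flips immediately; the real work is only dispatched.
  void stop() {
    assert(status == DRIVER_RUNNING);  // this toy only stops a running driver
    status = DRIVER_STOPPED;
    pending.push([this] { processStopped = true; });
  }

  // Stand-in for the libprocess thread working through queued events.
  void drain() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }

  // The short circuit: true means join() would return right away.
  bool joinShortCircuits() const { return status != DRIVER_RUNNING; }
};
```

In this model, calling start() and then stop() makes joinShortCircuits()
true while processStopped is still false; only drain() completes the
stop. That window is the race.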

This commit fixes the semantics of the join() call to avoid this race.
We considered removing the latch completely and replacing it with
process.wait(), but, unlike the latch, this wouldn't ensure that stop()
or abort() was ever called in the first place.
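
One possible shape for a latch-based join() is sketched below. It is an
assumption about the general approach, not the actual patch: Latch,
SketchDriver, started_ and cleanedUp() are hypothetical names, and a
plain std::thread stands in for the libprocess thread. join() returns
early only if the driver never started; otherwise it blocks until the
asynchronous stop handler has triggered the latch. abort() is omitted
and would follow the same pattern as stop().

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot latch: await() blocks until trigger() has been called.
class Latch {
public:
  void trigger() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      triggered_ = true;
    }
    cv_.notify_all();
  }

  void await() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return triggered_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_ = false;
};

enum class State { NOT_STARTED, RUNNING, STOPPED };

// Hypothetical driver sketch (an assumption, not the Mesos implementation).
class SketchDriver {
public:
  ~SketchDriver() {
    if (worker_.joinable()) worker_.join();
  }

  void start() {
    std::lock_guard<std::mutex> lock(mutex_);
    assert(state_ == State::NOT_STARTED);  // the sketch supports one start
    state_ = State::RUNNING;
    started_ = true;
  }

  // Changes the state right away, then hands the real work to another
  // thread, which triggers the latch only after the work is done.
  void stop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (state_ != State::RUNNING) return;
    state_ = State::STOPPED;
    worker_ = std::thread([this] {
      cleanedUp_ = true;
      latch_.trigger();
    });
  }

  // Early return only when the driver never ran; otherwise wait on the
  // latch instead of trusting the state alone.
  State join() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!started_) return state_;
    }
    latch_.await();
    std::lock_guard<std::mutex> lock(mutex_);
    return state_;
  }

  bool cleanedUp() const { return cleanedUp_; }

private:
  std::mutex mutex_;
  State state_ = State::NOT_STARTED;
  bool started_ = false;
  Latch latch_;
  std::atomic<bool> cleanedUp_{false};
  std::thread worker_;
};
```

With this shape, a join() that follows start() and stop() cannot return
before cleanedUp() is true, while join() on a driver that was never
started still returns immediately. Because the latch is only triggered
from the stop handler, join() on a running driver keeps blocking until
stop() is actually called.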


  src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 



Ran the entire 'make check' suite with no failures on both Mac OS X and Ubuntu 14.04.


Kevin Klues
