mesos-reviews mailing list archives

From Neil Conway
Subject Review Request 55307: Improved handling of agents that restart but never re-register.
Date Sat, 07 Jan 2017 21:13:21 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for mesos and Vinod Kone.

Bugs: MESOS-6286

Repository: mesos


The master expected that if an agent responds to pings, it will
(eventually) register or re-register. However, if the agent hangs during
recovery, that assumption does not hold: the agent will continue to
respond to pings but won't attempt to re-register until recovery
completes.

To handle this case, the master now expects an agent to re-register
within `agent_reregister_timeout` if the master -> agent socket breaks;
if no re-registration is seen, the master will mark the agent
unreachable. This is a "backup" to handle the case where recovery hangs,
as explained above; more commonly, the agent will re-register (when it
receives a ping and notices the master believes it is disconnected) or
be marked unreachable because it fails to respond to pings.
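
The timeout behavior described above can be sketched roughly as follows.
This is only an illustration, not the code under review: the class
`ReregistrationTracker`, its method names, and the choice to keep the
earliest deadline when the socket breaks more than once are all
assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;
using Seconds = std::chrono::seconds;

// Hypothetical helper (not part of Mesos): tracks agents whose
// master -> agent socket broke and that have not re-registered yet.
class ReregistrationTracker {
public:
  // `timeout` plays the role of `agent_reregister_timeout`.
  explicit ReregistrationTracker(Seconds timeout) : timeout_(timeout) {
    assert(timeout.count() > 0);
  }

  // The socket to the agent broke: start the re-registration deadline.
  // Assumption: a deadline that is already pending is left unchanged.
  void socketBroken(const std::string& agentId, Clock::time_point now) {
    deadlines_.emplace(agentId, now + timeout_);
  }

  // The agent re-registered in time: cancel its deadline.
  void reregistered(const std::string& agentId) {
    deadlines_.erase(agentId);
  }

  // Returns the agents whose deadline has passed and stops tracking
  // them; the caller would then mark these agents unreachable.
  std::vector<std::string> expire(Clock::time_point now) {
    std::vector<std::string> unreachable;
    for (auto it = deadlines_.begin(); it != deadlines_.end();) {
      if (now >= it->second) {
        unreachable.push_back(it->first);
        it = deadlines_.erase(it);
      } else {
        ++it;
      }
    }
    return unreachable;
  }

private:
  Seconds timeout_;
  std::map<std::string, Clock::time_point> deadlines_;
};
```

In this sketch, an agent that hangs during recovery never calls
`reregistered`, so it is returned by `expire` once the timeout elapses,
even though it may still be answering pings.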


  docs/ e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
  src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
  src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
  src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
  src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
  src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
  src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 



`make check`

Ran the new tests a few thousand times on OS X and in a Linux VM to check for flakiness.


Neil Conway
