mesos-reviews mailing list archives

From Neil Conway
Subject Review Request 53239: Changed master to make use of "retired" agent IDs.
Date Thu, 27 Oct 2016 18:41:50 GMT

Review request for mesos and Vinod Kone.

Bugs: MESOS-5396

Repository: mesos


An agent whose ID has been retired will never attempt to re-register
in the future; moreover, any tasks/executors that were managed under
that agent ID are no longer running. We can take advantage of this
knowledge to avoid waiting for `agent_reregister_timeout` to expire
after master failover.
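
As a rough illustration of the idea (not the code in this patch), the
post-failover decision can be sketched as a partition of the recovered
agent IDs. `RecoveryPlan`, `planRecovery`, and the use of plain strings
for agent IDs are hypothetical names chosen for this sketch.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's post-failover bookkeeping.
// None of these names are taken from the actual patch.
struct RecoveryPlan
{
  // Retired IDs: these agents will never re-register, so there is
  // nothing to wait for.
  std::vector<std::string> markNow;

  // All other recovered IDs: the agent may still re-register, so the
  // master keeps waiting for the re-registration timeout.
  std::vector<std::string> waitForTimeout;
};

// Splits the agent IDs recovered after failover into the two groups
// above, based on a set of IDs known to be retired.
RecoveryPlan planRecovery(
    const std::vector<std::string>& recovered,
    const std::set<std::string>& retired)
{
  RecoveryPlan plan;

  for (const std::string& id : recovered) {
    if (retired.count(id) > 0) {
      plan.markNow.push_back(id);
    } else {
      plan.waitForTimeout.push_back(id);
    }
  }

  // Every recovered ID lands in exactly one group.
  assert(plan.markNow.size() + plan.waitForTimeout.size() ==
         recovered.size());

  return plan;
}
```

For example, `planRecovery({"a1", "a2", "a3"}, {"a2"})` puts `a2` in
`markNow` and leaves `a1` and `a3` in `waitForTimeout`.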

This is particularly important when agent removal rate-limiting is in
use. Suppose a power failure causes the master to fail at the same time
that many agent hosts lose power. When power returns, the master will
fail over and all the agents will register anew and receive new agent
IDs. With agent removal rate-limiting, it may take a long time for the
master to mark all the old agent IDs as unreachable; in the meantime,
explicit reconciliation will not return any results for tasks on those
agents, potentially leaving frameworks in limbo for an extended period.
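
To give a sense of scale, here is a back-of-the-envelope sketch under a
deliberately simple model: the limiter grants `permits` removals per
`interval`, evenly spaced, and nothing else delays the master. The
function name `drainTime` and the numbers used below are assumptions
made for illustration, not values from this review.

```cpp
#include <cassert>
#include <chrono>

// Estimates how long a rate limiter that allows `permits` removals per
// `interval` needs to work through `agents` stale agent IDs.
// Hypothetical helper; the real master's timing also depends on when
// the re-registration timeout fires.
std::chrono::seconds drainTime(
    unsigned agents,
    unsigned permits,
    std::chrono::seconds interval)
{
  assert(permits > 0);

  // One removal every (interval / permits), so `agents` removals take
  // roughly agents * interval / permits.
  return interval * agents / permits;
}
```

Under this model, 1000 stale agent IDs at one removal per 10 minutes
take `drainTime(1000, 1, std::chrono::seconds(600))`, i.e. 600000
seconds, or roughly a week.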

Note that we currently mark retired agents as unreachable; in the near
future, that will change to marking such agents "gone", once support for
that feature is completed.


  src/master/master.hpp 87186c6e733a686f96528b1722fda1c287e9c881 
  src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 
  src/tests/slave_recovery_tests.cpp 65fc18bc2732dc53581d39ee23368e347f0b2ca4 



`make check`

NOTE: Current implementation reuses `Master::markUnreachableAfterFailover`, which means we
emit misleading log messages and increment the wrong metrics. Will adjust based on initial
review comments.


Neil Conway
