mesos-reviews mailing list archives

From Neil Conway
Subject Re: Review Request 53239: Changed master to make use of "retired" agent IDs.
Date Thu, 27 Oct 2016 22:46:30 GMT

This is an automatically generated e-mail. To reply, visit:

(Updated Oct. 27, 2016, 10:46 p.m.)

Review request for mesos and Vinod Kone.


Tweak metrics per TODO.

Bugs: MESOS-5396

Repository: mesos


A retired agent ID will never attempt to re-register in the future;
moreover, any tasks/executors being managed by that agent ID are no
longer running. We can take advantage of this knowledge to avoid waiting
for `agent_reregister_timeout` to expire after master failover.
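
To illustrate the idea, here is a minimal sketch and not the code in this patch. It assumes the master has some set of agent IDs it knows to be retired (however it learns of them) and splits the IDs recovered after failover into those that can be marked unreachable right away and those that must still wait for `agent_reregister_timeout`. The names `AgentID`, `RecoveryPlan`, and `planRecovery` are hypothetical stand-ins, not Mesos types or functions.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the agent IDs the master recovers from the
// registry after failover; this is not the real Mesos type.
using AgentID = std::string;

struct RecoveryPlan {
  // Retired IDs: these will never re-register, so there is no reason
  // to wait before marking them unreachable.
  std::vector<AgentID> markNow;

  // Every other recovered ID: keep waiting for
  // `agent_reregister_timeout`, as before.
  std::vector<AgentID> awaitTimeout;
};

// Split the recovered agent IDs by whether they are known to be retired.
inline RecoveryPlan planRecovery(
    const std::vector<AgentID>& recovered,
    const std::unordered_set<AgentID>& retired)
{
  RecoveryPlan plan;

  for (const AgentID& id : recovered) {
    if (retired.count(id) > 0) {
      plan.markNow.push_back(id);
    } else {
      plan.awaitTimeout.push_back(id);
    }
  }

  return plan;
}
```

In this sketch, only the IDs in `awaitTimeout` would remain subject to the re-registration timeout.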

This is particularly important when agent removal rate-limiting is in
use. Suppose a power failure causes the master to fail at the same time
that many agent hosts lose power. When power returns, the master will
fail over and all the agents will register anew and receive new agent
IDs. With agent removal rate-limiting, it may take a long time for the
master to mark all the old agent IDs as unreachable; in the meantime,
explicit reconciliation will not return any results, potentially leaving
frameworks in limbo for an extended period.
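
For a rough sense of scale, the delay can be estimated with simple arithmetic. The sketch below assumes a limiter that permits a fixed number of removals per interval (for example, a rate of 1/10mins) and ignores any burst behaviour of the real limiter. The function `estimateDrainTime` is a hypothetical helper, not part of Mesos.

```cpp
#include <chrono>
#include <cstdint>

// Rough estimate of how long a rate limiter needs to work through
// `agents` old agent IDs when it allows `permits` removals per
// `interval`. Assumes `permits` > 0; the real limiter may behave
// differently (e.g., the first permit may be granted immediately).
inline std::chrono::minutes estimateDrainTime(
    std::int64_t agents,
    std::int64_t permits,
    std::chrono::minutes interval)
{
  // Number of intervals needed, rounded up.
  const std::int64_t rounds = (agents + permits - 1) / permits;

  return interval * static_cast<std::chrono::minutes::rep>(rounds);
}
```

Under these assumptions, 1000 old agent IDs at one removal per 10 minutes give `estimateDrainTime(1000, 1, std::chrono::minutes(10))`, which is 10000 minutes, or roughly a week.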

Note that we currently mark retired agents as unreachable; in the near
future, that will change to marking such agents "gone", once support for
that feature is completed.

Diffs (updated)

  src/master/master.hpp 87186c6e733a686f96528b1722fda1c287e9c881 
  src/master/master.cpp 8692726d21812827f9e1fd9093d80fd260588ecb 
  src/tests/slave_recovery_tests.cpp 65fc18bc2732dc53581d39ee23368e347f0b2ca4 



`make check`

NOTE: The current implementation reuses `Master::markUnreachableAfterFailover`, which means
we emit misleading log messages and increment the wrong metrics. I will adjust this based on
initial review comments.


Neil Conway
