mesos-reviews mailing list archives

From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 53239: Changed master to make use of "retired" agent IDs.
Date Mon, 19 Dec 2016 21:15:40 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53239/
-----------------------------------------------------------

(Updated Dec. 19, 2016, 9:15 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Tweak test.


Bugs: MESOS-5396
    https://issues.apache.org/jira/browse/MESOS-5396


Repository: mesos


Description
-------

A retired agent ID will never attempt to re-register in the future;
moreover, any tasks/executors being managed by that agent ID are no
longer running. We can take advantage of this knowledge to avoid waiting
for `agent_reregister_timeout` to expire after master failover.

This is particularly important when agent removal rate-limiting is in
use: if a power failure causes the master to fail at the same time that
many agent hosts lose power, then when power returns the master will fail
over and all the agents will register anew and receive new agent IDs. With
agent removal rate-limiting, it may take a long time for the master to
mark all the old agent IDs as unreachable; in the meantime, explicit
reconciliation will not return any results, potentially leaving
frameworks in limbo for an extended period.

Note that we currently mark retired agents as unreachable; in the near
future, that will change to marking such agents "gone", once support for
that feature is completed.
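
As a rough illustration of the idea (not the code in this patch), the
recovery-time decision can be sketched as below. Every name here
(`RecoveredAgent`, `RecoveryPlan`, `planRecovery`, the `retired` flag) is
hypothetical and invented for the example; the sketch assumes the master
can tell, for each agent ID it recovers, whether that ID has been retired.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model; none of these types are Mesos APIs.
struct RecoveredAgent {
  std::string id;
  bool retired;  // Assumption: recovery can tell whether the ID was retired.
};

struct RecoveryPlan {
  // Retired IDs: these will never re-register, so they can be marked
  // unreachable right away.
  std::vector<std::string> markUnreachableNow;

  // All other IDs: the master still has to wait up to
  // `agent_reregister_timeout` for them to re-register.
  std::set<std::string> awaitReregistration;
};

// Splits the agent IDs recovered after failover into those that can be
// handled immediately and those that remain subject to the timeout.
RecoveryPlan planRecovery(const std::vector<RecoveredAgent>& recovered) {
  RecoveryPlan plan;
  for (const RecoveredAgent& agent : recovered) {
    if (agent.retired) {
      plan.markUnreachableNow.push_back(agent.id);
    } else {
      plan.awaitReregistration.insert(agent.id);
    }
  }
  return plan;
}
```

In this sketch, a recovered set where most IDs are retired leaves only a
few entries waiting on the timeout, so explicit reconciliation for the
retired IDs would not have to wait for rate-limited removal to catch up.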


Diffs (updated)
-----

  src/master/master.hpp 89b3c394b268a8645885412aeb19980db8d73c69 
  src/master/master.cpp b664550d57ef9805bd23ea35ca7f9cd8f4b1ab78 
  src/tests/slave_recovery_tests.cpp 5b86c06803c59427c826b1b7039a5156a58e141b 

Diff: https://reviews.apache.org/r/53239/diff/


Testing
-------

`make check`

NOTE: The current implementation reuses `Master::markUnreachableAfterFailover`, which means we
emit misleading log messages and increment the wrong metrics. I will adjust this based on initial
review comments.


Thanks,

Neil Conway

