mesos-reviews mailing list archives

From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 50705: Changed master to allow partitioned slaves to reregister.
Date Tue, 06 Sep 2016 14:57:49 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50705/
-----------------------------------------------------------

(Updated Sept. 6, 2016, 2:57 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Address review comments.


Bugs: MESOS-4049
    https://issues.apache.org/jira/browse/MESOS-4049


Repository: mesos


Description
-------

The previous behavior was to shut down partitioned agents that attempted
to reregister, unless the master had failed over, in which case the
reregistration was allowed (when running in "non-strict" mode).

The new behavior is always to allow partitioned agents to reregister.
This is part of a longer-term project to allow frameworks to define
their own policies for handling tasks running on partitioned agents.

In particular, if a framework has the PARTITION_AWARE capability, any
tasks running on the partitioned agent will continue to run after
reregistration. If the framework is not PARTITION_AWARE, any tasks that
were running on such an agent will be killed after the agent reregisters
(unless the master has failed over). This is for backward compatibility
with the previous ("non-strict") behavior. Note that regardless of the
PARTITION_AWARE capability, the agent will not be shut down, which is a
change from the previous Mesos behavior.
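
The rules above can be summarized as a small decision function. This is
only an illustrative sketch, not the code in src/master/master.cpp: the
names TaskAction, Decision, and onPartitionedAgentReregister are
hypothetical, and the two boolean inputs stand in for state the master
would look up itself.

```cpp
#include <cassert>

// Hypothetical types for illustration only; these are not the names
// used in the actual master implementation.
enum class TaskAction { KEEP_RUNNING, KILL };

struct Decision {
  bool shutdownAgent;     // Always false under the new behavior.
  TaskAction taskAction;  // What happens to this framework's tasks.
};

// Decide how to treat one framework's tasks when a partitioned agent
// reregisters. The agent itself is never shut down.
inline Decision onPartitionedAgentReregister(
    bool frameworkIsPartitionAware,
    bool masterFailedOver)
{
  // PARTITION_AWARE frameworks keep their tasks. Tasks of other
  // frameworks are also left alone if the master has failed over,
  // matching the previous "non-strict" behavior.
  if (frameworkIsPartitionAware || masterFailedOver) {
    return Decision{false, TaskAction::KEEP_RUNNING};
  }

  // Framework is not PARTITION_AWARE and the master has not failed
  // over: the tasks are killed, but the agent stays up.
  return Decision{false, TaskAction::KILL};
}
```

In this sketch the decision is made per framework, so a single
reregistering agent could keep the tasks of one framework while the
tasks of another are killed.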

This commit also changes the master so that if an agent is removed and
then the master receives a message from that agent, the master will no
longer attempt to shut down the agent. This is consistent with the goal
of getting the master out of the business of shutting down agents that
we suspect are unhealthy. Such an agent will eventually realize it is
not registered with the master (e.g., because it won't receive any pings
from the master), which will cause it to reregister.
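
As a rough illustration of how an agent might notice that it is no
longer registered, the sketch below tracks the time of the last ping
and reports when a timeout has elapsed. The PingWatchdog class, its
method names, and the timeout value are assumptions made for this
example; they do not describe the actual agent code or its
configuration.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: remembers when the last ping from the master
// arrived and reports whether the agent should try to reregister.
class PingWatchdog {
public:
  PingWatchdog(std::chrono::seconds timeout, Clock::time_point now)
    : timeout_(timeout), lastPing_(now) {}

  // Called whenever a ping from the master is received.
  void onPing(Clock::time_point now) { lastPing_ = now; }

  // True once more than `timeout_` has passed since the last ping.
  bool shouldReregister(Clock::time_point now) const {
    return now - lastPing_ > timeout_;
  }

private:
  std::chrono::seconds timeout_;
  Clock::time_point lastPing_;
};
```

Passing the current time in explicitly keeps the sketch easy to test
without sleeping; a caller would normally supply Clock::now().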


Diffs (updated)
-----

  src/master/master.hpp c32c7e9d859ef73216354e2c03ecc07d0009b12f 
  src/master/master.cpp b2a19a645528e8fc1fd48f5ac9929d38c9a76b49 
  src/tests/master_tests.cpp 6cde15fcd6ca8ec40438c75aed980e83f8de9b86 
  src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00 

Diff: https://reviews.apache.org/r/50705/diff/


Testing
-------

make check


Thanks,

Neil Conway

