mesos-reviews mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 50705: Changed master to allow partitioned slaves to reregister.
Date Mon, 17 Jul 2017 17:52:53 GMT


> On July 15, 2017, 4:31 p.m., David McLaughlin wrote:
> > With the new code path for marking agents unreachable after failover, this change
> > introduced backwards-incompatible behavior: TASK_LOST messages for each task on the
> > agent are no longer sent when the slaveLost message is sent. This means that frameworks
> > (like Aurora) no longer get the signal to schedule replacements for those tasks until
> > they reconcile. Given that the tasks will be marked as LOST as soon as the agent
> > reregisters anyway, it seems easy to maintain backwards compatibility here.

There is a discussion about this on the mailing list. Would you mind posting your feedback
there?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50705/#review180639
-----------------------------------------------------------


On Sept. 12, 2016, 10:05 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50705/
> -----------------------------------------------------------
> 
> (Updated Sept. 12, 2016, 10:05 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-4049
>     https://issues.apache.org/jira/browse/MESOS-4049
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The previous behavior was to shut down partitioned agents that attempt to
> reregister---unless the master has failed over, in which case the
> reregistration is allowed (when running in "non-strict" mode).
> 
> The new behavior is always to allow partitioned agents to reregister.
> This is part of a longer-term project to allow frameworks to define
> their own policies for handling tasks running on partitioned agents.
> 
> In particular, if a framework has the PARTITION_AWARE capability, any
> tasks running on the partitioned agent will continue to run after
> reregistration. If the framework is not PARTITION_AWARE, any tasks that
> were running on such an agent will be killed after the agent reregisters
> (unless the master has failed over). This is for backward compatibility
> with the previous ("non-strict") behavior. Note that regardless of the
> PARTITION_AWARE capability, the agent will not be shut down, which is a
> change from the previous Mesos behavior.
> 
> This commit also changes the master so that if an agent is removed and
> then the master receives a message from that agent, the master will no
> longer attempt to shut down the agent. This is consistent with the goal
> of getting the master out of the business of shutting down agents that
> we suspect are unhealthy. Such an agent will eventually realize it is
> not registered with the master (e.g., because it won't receive any pings
> from the master), which will cause it to reregister.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 4992ab0a0bb5babbf6a4fa3e6eff3577590fc879 
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471 
>   src/tests/master_tests.cpp 6cde15fcd6ca8ec40438c75aed980e83f8de9b86 
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00 
> 
> 
> Diff: https://reviews.apache.org/r/50705/diff/10/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
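
The task-handling policy in the quoted description can be sketched as a small decision function. This is only an illustration of the rules as stated there, not code from the review: `TaskAction`, `ReregistrationOutcome`, and `decide` are hypothetical names, and the real logic in src/master/master.cpp may be structured quite differently.

```cpp
// Hypothetical sketch of the policy described in the review request.
// None of these names come from the Mesos source tree.

// What happens to tasks on a partitioned agent once it reregisters.
enum class TaskAction { KeepRunning, Kill };

struct ReregistrationOutcome {
  bool shutdownAgent;     // Whether the master shuts the agent down.
  TaskAction taskAction;  // What happens to the framework's tasks.
};

// partitionAware:   the framework has the PARTITION_AWARE capability.
// masterFailedOver: the master failed over while the agent was partitioned.
constexpr ReregistrationOutcome decide(bool partitionAware,
                                       bool masterFailedOver) {
  return ReregistrationOutcome{
      // Per the description, the agent is never shut down, regardless of
      // the framework's capabilities.
      false,
      // Tasks are killed only for a non-PARTITION_AWARE framework when the
      // master has not failed over (the previous "non-strict" behavior).
      (partitionAware || masterFailedOver) ? TaskAction::KeepRunning
                                           : TaskAction::Kill};
}

// Compile-time checks of the four cases.
static_assert(decide(true, false).taskAction == TaskAction::KeepRunning,
              "PARTITION_AWARE tasks keep running");
static_assert(decide(true, true).taskAction == TaskAction::KeepRunning,
              "PARTITION_AWARE tasks keep running after failover");
static_assert(decide(false, false).taskAction == TaskAction::Kill,
              "non-PARTITION_AWARE tasks are killed without failover");
static_assert(decide(false, true).taskAction == TaskAction::KeepRunning,
              "non-PARTITION_AWARE tasks survive a master failover");
```

The sketch covers only what happens to the agent and its tasks on reregistration. It does not model which status updates (such as TASK_LOST) are sent to frameworks, which is the point raised in the review comment above.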

