mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 50705: Changed master to allow partitioned slaves to reregister.
Date Wed, 24 Aug 2016 09:45:39 GMT


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1760
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472658#file1472658line1760>
> >
> >     do you want to remove the code related to 
> >     "flags.registry_strict == true" (in a separate review)?

Yep, will be done as part of https://issues.apache.org/jira/browse/MESOS-5951 , but I'm inclined
to do that once the main functional changes are landed.


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5500
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472658#file1472658line5500>
> >
> >     s/unreachableCause/message/

I don't think `message` is a great name, because it isn't clear what "message" refers to (is
it a message from the slave? A message we want to send to frameworks? etc.) `unreachableCause`
is more verbose but I think it is clearer.


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5501
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472658#file1472658line5501>
> >
> >     s/registrarResult/future/

Similarly to above, I don't think `future` is a great name, because it isn't clear what value
`future` corresponds to. I chose `registrarResult` because the variable holds the result of
the registrar operation.


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/tests/partition_tests.cpp, lines 581-583
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472660#file1472660line581>
> >
> >     what does ping timeouts have to do with slave re-registering?

The slave will try to reregister when it hasn't seen a ping from the master in a suitable
length of time (`Slave::pingTimeout`).


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/tests/partition_tests.cpp, line 539
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472660#file1472660line539>
> >
> >     what's the guarantee that the slave will not try to re-register again?

The clock is paused, so AFAIK the agent won't have any reason to spontaneously reregister.
(We rely on this assumption in most of the other tests that partition agents by dropping PONGs,
I believe...).


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/tests/partition_tests.cpp, line 169
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472660#file1472660line169>
> >
> >     This test is huge. Can you split PARTITION_AWARE and non-PARTITION_AWARE into
separate tests?

It's only ~200 lines of code. I think there is some value in keeping the PARTITION_AWARE and
non-PARTITION_AWARE cases in the same test: in most situations, both cases should behave the
same, so we avoid redundancy; it also makes the situations where different behavior is expected
more clear. It would be easy for the two cases to get out of sync if they are written separately.


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 5521-5522
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472658#file1472658line5521>
> >
> >     Is this for backwards compatibility? If so, can you add a NOTE.
> >     
> >     Will there be new metrics for unreachability?

The current implementation still increments the `slave_removals` metrics when a slave is marked
unreachable (similarly to how the non-strict behavior will increment the `slave_removals`
metric when a slave is lost but later the slave might be allowed to reregister). We can rename
the metrics or add new ones, but I'd prefer to do that later, in a separate patch.

I added a `TODO` here for now.


> On Aug. 23, 2016, 12:56 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 4932-4934
> > <https://reviews.apache.org/r/50705/diff/4/?file=1472658#file1472658line4932>
> >
> >     If the slave was removed, `slave->tasks` should be empty?

No. In the current implementation, an unreachable slave appears in `slaves.removed` (because
it has been "removed" from the list of admitted agents). Importantly, we need to check-for
and remove tasks here: if an agent was marked unreachable but then it reregisters, any non-partition-aware
tasks running on the agent will be shutdown and should be removed from the master's in-memory
state.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50705/#review146438
-----------------------------------------------------------


On Aug. 13, 2016, 11:56 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50705/
> -----------------------------------------------------------
> 
> (Updated Aug. 13, 2016, 11:56 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-4049
>     https://issues.apache.org/jira/browse/MESOS-4049
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The previous behavior was to shutdown partitioned agents that attempt to
> reregister---unless the master has failed over, in which case the
> reregistration is allowed (when running in "non-strict" mode).
> 
> The new behavior is always to allow partitioned agents to reregister.
> This is part of a longer-term project to allow frameworks to define
> their own policies for handling tasks running on partitioned agents.
> 
> In particular, if a framework has the PARTITION_AWARE capability, any
> tasks running on the partitioned agent will continue to run after
> reregistration. If the framework is not PARTITION_AWARE, any tasks that
> were running on such an agent will be killed after the agent reregisters
> (unless the master has failed over). This is for backward compatibility
> with the previous ("non-strict") behavior. Note that regardless of the
> PARTITION_AWARE capability, the agent will not be shutdown, which is a
> change from the previous Mesos behavior.
> 
> This commit also changes the master so that if an agent is removed and
> then the master receives a message from that agent, the master will no
> longer attempt to shutdown the agent. This is consistent with the goal
> of getting the master out of the business of shutting down agents that
> we suspect are unhealthy. Such an agent will eventually realize it is
> not registered with the master (e.g., because it won't receive any pings
> from the master), which will cause it to reregister.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 6decff6f4b9c3434de030fd5c06df4c683a7abad 
>   src/master/master.cpp 0bd1a3490a86fede86a3f5f62ce4745b65aae258 
>   src/tests/master_tests.cpp 398164d09b8ef14f916122ed8780023c4a3cd0f6 
>   src/tests/partition_tests.cpp 0a72b345538ca3b9510fccf38ceb68ac71c2b473 
> 
> Diff: https://reviews.apache.org/r/50705/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message