mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam B" <a...@mesosphere.io>
Subject Re: Review Request 29507: Added Configurable Slave Ping Timeouts
Date Mon, 22 Jun 2015 09:58:12 GMT


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > Overall it's looking pretty good! I'd suggest splitting out the tests given the
comments on the source changes are fairly minor.
> > 
> > Echoing Nik, it would be nice to have some sanity bound checks for this flag, to
prevent operators from burning themselves.

Will do. This next revision excludes the new tests, and I will address the test comments in
a subsequent patch/review.

Added flag validation using the cool new lambda style from https://reviews.apache.org/r/34943


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.hpp, lines 100-109
> > <https://reviews.apache.org/r/29507/diff/7/?file=973886#file973886line100>
> >
> >     Ditto Nik's comment for camelCase, but also, could you update these to use `Duration`
rather than directly using a `double`?

Removed `double ping_timeout_seconds` in favor of the nested submessage `connection.ping_timeout_seconds()`
as you suggested later.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 885-888
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line885>
> >
> >     Could you move this up and do it once before the switch?
> >     
> >     I see now why you probably wanted the explicit default, it's too bad we don't
have support to bind in `Option`s for optional fields in install handlers. How about we rely
on the implicit default: https://developers.google.com/protocol-buffers/docs/proto#optional
> >     
> >     Alternatively, you could have a nested message that contains the fields (e.g.
'Connection') which allows you to explicitly check for '`has_ping_timeout_seconds`').

Excellent points. I moved the check up before the switch, and I went with your nested submessage
approach instead of an optional default. I think that reads much better.
And yes, I wanted the explicit default so I could check for a registration from a pre-0.23
master, to maintain the old behavior. I just realized that even this was not enough, in case
this slave had registered with a newer master, then reregistered with an older master, so
I have explicitly added an else case to reset it to the old DEFAULT_MASTER_PING_TIMEOUT().


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/messages/messages.proto, lines 277-288
> > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line277>
> >
> >     The defaults here seem to suggest that we will rely on them (i.e. we don't check
if the field is set, rather we always grab the value). However, 0 is going to be a problematic
value! I'd suggest removing the defaults, and rather, adding a comment here for posterity
that indicates that these were added in 0.23.0.

Addressed with your suggestion of a nested submessage instead.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 923
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line923>
> >
> >     "once re-registered"?

Fixed.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 600-602
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600>
> >
> >     I'm a bit confused by this test, why are you expecting the first ping? Why are
you dropping the pongs?
> >     
> >     I would expect the pings to be dropped entirely here so that advancing the clock
after a slave is registered leads to a re-registration.
> >     
> >     Also, did you know about SlaveTest.PingTimeoutNoPings and SlaveTest.PingTimeoutSomePings
tests? Is there something being tested here that is not captured in these?

This test was intended to simulate a one-way partition where the slave can no longer reach
the master, after the slave initially successfully registered. Hence the PONGs are dropped,
but not the PINGs.


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 647-650
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647>
> >
> >     This sounds like testing two particular behaviors in a single test:
> >     
> >     (1) A registered slave that never receives any Ping messages will attempt to
reregister after the master's ping timeout.
> >     
> >     (2) Master will consider it shutdown after the same timeout.
> >     
> >     I believe we already have tests for both of these. For (1) we could just add
the shutdown message expectation to `PartitionTest.PartitionedSlave`. For (2), we have SlaveTest.PingTimeoutNoPings,
SlaveTest.PingTimeoutNoPings.

Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), and PartitionTest.PartitionedSlave
could be modified to check for the ShutdownMessage() to satisfy (2).
I originally started trying to write unit tests for registered/unregistered slaves that encountered
each type of a one-way partition, with both shorter and longer-than-default ping timeouts,
largely for my own understanding of the process. Testing longer timeouts proved unnecessary,
and testing unregistered slaves didn't actually test my feature. I posted these two since
they worked and explicitly tested the one-way partitions I was trying to test. But this one
can almost certainly be covered by the tests you mentioned. I will update in a follow-up patch.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86461
-----------------------------------------------------------


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout
and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message