mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam B" <a...@mesosphere.io>
Subject Re: Review Request 29507: Added Configurable Slave Ping Timeouts
Date Sat, 27 Jun 2015 00:47:06 GMT


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 647-650
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line647>
> >
> >     This sounds like testing two particular behaviors in a single test:
> >     
> >     (1) A registered slave that never receives any Ping messages will attempt to
reregister after the master's ping timeout.
> >     
> >     (2) Master will consider it shutdown after the same timeout.
> >     
> >     I believe we already have tests for both of these. For (1) we could just add
the shutdown message expectation to `PartitionTest.PartitionedSlave`. For (2), we have SlaveTest.PingTimeoutNoPings,
SlaveTest.PingTimeoutNoPings.
> 
> Adam B wrote:
>     Yes, I see now that SlaveTest.PingTimeoutNoPings already satisfies (1), and PartitionTest.PartitionedSlave
could be modified to check for the ShutdownMessage() to satisfy (2).
>     I originally started trying to write unit tests for registered/unregistered slaves
that encountered each type of a one-way partition, with both shorter and longer-than-default
ping timeouts, largely for my own understanding of the process. Testing longer timeouts proved
unnecessary, and testing unregistered slaves didn't actually test my feature. I posted these
two since they worked and explicitly tested the one-way partitions I was trying to test. But
this one can almost certainly be covered by the tests you mentioned. I will update in a follow-up
patch.

It turns out that the shutdown message expectation is already covered in SlaveRecoveryTest.PartitionedSlave,
which is similar to, but in many ways a superset of PartitionTest.PartitionedSlave. I have
removed my two new tests completely and relied on SlaveRecoveryTest.PartitionedSlave and SlaveTest.PingTimeoutNoPings,
with the latter updated to use custom ping timeout values.
For the updated test, see new RR: https://reviews.apache.org/r/35958


> On June 3, 2015, 2:24 p.m., Ben Mahler wrote:
> > src/tests/partition_tests.cpp, lines 600-602
> > <https://reviews.apache.org/r/29507/diff/7/?file=973888#file973888line600>
> >
> >     I'm a bit confused by this test, why are you expecting the first ping? Why are
you dropping the pongs?
> >     
> >     I would expect the pings to be dropped entirely here so that advancing the clock
after a slave is registered leads to a re-registration.
> >     
> >     Also, did you know about SlaveTest.PingTimeoutNoPings and SlaveTest.PingTimeoutSomePings
tests? Is there something being tested here that is not captured in these?
> 
> Adam B wrote:
>     This test was intended to simulate a one-way partition where the slave can no longer
reach the master, after the slave initially successfully registered. Hence the PONGs are dropped,
but not the PINGs.

Even though we're dropping PONGs instead of PINGs here, it's still only testing that the Slave
reregisters after the timeout. If anything, it should be testing that a master that never
receives PONGs will shutdown the slave. This behavior is already tested by SlaveRecoveryTest.PartitionedSlave.
No new test needed.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86461
-----------------------------------------------------------


On June 26, 2015, 3:12 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated June 26, 2015, 3:12 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md aaf65bf 
>   docs/upgrades.md 73e503c 
>   src/master/constants.hpp 072d59c 
>   src/master/constants.cpp 997b792 
>   src/master/flags.hpp 55ed3a9 
>   src/master/flags.cpp 4377715 
>   src/master/master.cpp 0782b54 
>   src/messages/messages.proto 1c8d79e 
>   src/slave/constants.hpp 84927e5 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp f1cf3b8 
>   src/slave/slave.cpp b3e1ccc 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp e9002e8 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout
and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message