mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adam B" <a...@mesosphere.io>
Subject Re: Review Request 29507: Added Configurable Slave Ping Timeouts
Date Mon, 22 Jun 2015 08:59:05 GMT


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > LGTM - Would it make sense to have sane min/max values for the timeouts/counts?
> > 
> > I wonder it would make sense to have a test to exercise an upgrade path (the timeout
being different in the slaves, than in the master). Maybe I missed that in your first test.

Added flag validation using the cool new lambda style from https://reviews.apache.org/r/34943

It's not so easy to test the upgrade path in a unit test, since as soon as a slave from this
source base gets its SlaveRegistered ack from the master, it then has the same timeout as
the master (as designed). And if the slave never receives its registered ack, it will retry
registration after a /random/ backoff initially capped at 1min (--registration_backoff_factor).
If the random backoff_factor ends up being longer than the slave's ping timeout, when the
ping timeout expires, the slave will anyway re-detect the master and retry registration. As
I revisit the tests (in a follow-up patch), I will brainstorm ways to test the upgrade path.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/messages/messages.proto, line 279
> > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line279>
> >
> >     Why a zero default?

Originally added as a way for the recipient of this message to know that the sender didn't
set it, presumably because the sender is pre-0.23. Since we don't have a way for install handlers
to translate optional fields into Options instead of just pulling in the default value, I
couldn't think of a cleaner option. BenM, however, suggested pushing the ping timeout into
a nested submessage, so I could just check has_ping_timeout_seconds() instead. Went with that
instead.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/slave/slave.cpp, line 864
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line864>
> >
> >     If we removed the 0 default, we'd do:
> >     
> >     if (ping_timeout_seconds.isSome()) {
> >       // ...
> >     }
> >     
> >     Which seems a bit more logical to me :)

Unfortunately, we don't currently translate optional protobuf fields into Options in install
handlers, so I went for an explicit default. BenM suggested using a nested message so I can
check `if (connection.has_ping_timeout_seconds())` instead. Much more readable.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > docs/configuration.md, line 339
> > <https://reviews.apache.org/r/29507/diff/7/?file=973876#file973876line339>
> >
> >     As this will be markdown, how about formatting max_slave_ping_timeouts in ``
or as "--max_slave_ping_timeouts" ?

Fixed.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
-----------------------------------------------------------


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout
and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message