mesos-reviews mailing list archives

From "Adam B" <a...@mesosphere.io>
Subject Re: Review Request 29507: Added Configurable Slave Ping Timeouts
Date Fri, 26 Jun 2015 10:42:53 GMT


> On June 24, 2015, 6:41 p.m., Ben Mahler wrote:
> > Actually, we should think about one more thing, how does this interact with the zookeeper session timeout?

The hardcoded individual ping timeout (15secs) was previously longer than the default ZK session
timeout (10secs). However, the ZK session timeout is already configurable with no bounds or
validation, so users could already set it longer than an individual ping timeout, or even longer
than the total ping timeout. Is this a bad idea? What would this mean?

This is mostly relevant on the slave side (where ZK serves as a leader detector): a slave may
time out and decide to check for a new leading master to reregister with while its ZK session
has recently gone bad but has not yet timed out. Since a ZK session could go bad or time out at
any point along the way, I don't think making the ping timeout configurable will introduce any
new potential errors. Rather, it would just be advisable to keep the ZK session timeout
reasonably small, so that the ZK session is likely to be healthy whenever a slave needs to
detect a new leader.

We could introduce validation that the zk_session_timeout is shorter than the (total) ping
timeout, but I'm not sure that's even necessary.

Does this make sense? I'm no ZK expert, so I defer to those with more experience.
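
To make the validation idea concrete, here is a minimal sketch, not code from this patch. It
assumes the total ping timeout is the per-ping timeout multiplied by the maximum number of
missed pings (15secs * 5 = 75secs with the defaults). The helper names totalPingTimeout and
zkSessionTimeoutIsValid are hypothetical and do not exist in Mesos:

```cpp
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Assumed definition of the "total" ping timeout: the per-ping timeout
// multiplied by the number of consecutive missed pings the master tolerates.
constexpr seconds totalPingTimeout(seconds slavePingTimeout, int maxSlavePingTimeouts)
{
  return slavePingTimeout * maxSlavePingTimeouts;
}

// Hypothetical flag validation: accept the ZK session timeout only if it is
// strictly shorter than the total ping timeout.
bool zkSessionTimeoutIsValid(
    seconds zkSessionTimeout,
    seconds slavePingTimeout,
    int maxSlavePingTimeouts)
{
  // At least one missed ping must be allowed for the total to be meaningful.
  assert(maxSlavePingTimeouts > 0);

  return zkSessionTimeout <
         totalPingTimeout(slavePingTimeout, maxSlavePingTimeouts);
}
```

With the default ping settings, a 10secs ZK session timeout passes this check, while an 80secs
one would be rejected.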


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review89305
-----------------------------------------------------------


On June 26, 2015, 3:12 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated June 26, 2015, 3:12 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md aaf65bf 
>   docs/upgrades.md 73e503c 
>   src/master/constants.hpp 072d59c 
>   src/master/constants.cpp 997b792 
>   src/master/flags.hpp 55ed3a9 
>   src/master/flags.cpp 4377715 
>   src/master/master.cpp 0782b54 
>   src/messages/messages.proto 1c8d79e 
>   src/slave/constants.hpp 84927e5 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp f1cf3b8 
>   src/slave/slave.cpp b3e1ccc 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp e9002e8 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>

