mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joseph Wu <jos...@mesosphere.io>
Subject Re: Review Request 54909: Fixed spurious registration bug in framework and agent.
Date Thu, 22 Dec 2016 11:49:09 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/#review159945
-----------------------------------------------------------


Fix it, then Ship it!




I'm going to add this comment block above each of the four `Clock::cancel`s.
```
// Cancel the pending registration timer to avoid spurious attempts
// at reregistration. `Clock::cancel` is idempotent, so this call
// is safe even if no timer is active or pending.
```


src/sched/sched.cpp (lines 335 - 339)
<https://reviews.apache.org/r/54909/#comment230998>

    We should move the `Clock::cancel` down into the `if (master.isSome())` block, as that
is where the registration loop is started.



src/slave/slave.hpp (line 799)
<https://reviews.apache.org/r/54909/#comment230999>

    Nit: We standardized how we write `reregistered` in comments a while ago.  No hyphens
anymore :)



src/slave/slave.cpp (lines 923 - 924)
<https://reviews.apache.org/r/54909/#comment230997>

    We also need to cancel the agent registration timer here (as mentioned in the commit description)


- Joseph Wu


On Dec. 20, 2016, 7:09 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54909/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 7:09 p.m.)
> 
> 
> Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, and Joseph
Wu.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
> will resolve the issue, and fix tests that surfaced it (as well as some
> associated errors that cause them to be flaky when this symbol is not
> defined).
> 
> Currently when a new master is detected and no credential is provided,
> the agent and framework (which have very similar registration code
> paths) will attempt to (re)register after some random initial `delay`,
> to avoid a "thundering herd" problem.  It is hence possible to have
> spurious double-(re)registrations, since a new master could be detected
> after we add the `delay`d registration, but before we execute it.
> 
> In a degenerate case, suppose a single agent has a registration delay
> of one minute.  A master is brought up, to which, the agent successfully
> registers.  Prior to this commit, the agent will still have a scheduled
> registration loop (`doReliableRegistration`) with a backoff factor.
> If the master goes down and a new master is brought up, the agent
> will race against itself (two ongoing loops of `doReliableRegistration`)
> to register with the new master.  If the first loop reaches the new
> master first, authentication will fail and cause the agent to commit
> suicide.
> 
> To resolve this problem, we store the value of the `Timer` returned by
> `delay` in `doReliableRegistration` and cancel it when we have either
> registered, or need to start a new cycle of registration. This solution
> is implemented in both the framework code and the agent code.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 
>   src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
>   src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
>   src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 
>   src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 
> 
> Diff: https://reviews.apache.org/r/54909/diff/
> 
> 
> Testing
> -------
> 
> `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to catch
intermittent failures, which is how we caught the failing test in `reservation_tests.cpp`.
Note that this bug was discovered when we added a `delay` to the call to `authenticate` in
`slave::detected` (in order to get it to match the behavior of the non-authenticated call
to `doReliableRegistration`.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message