mesos-reviews mailing list archives

From Alex Clemmer <clemmer.alexan...@gmail.com>
Subject Re: Review Request 54909: Fixed spurious registration bug in framework and agent.
Date Wed, 21 Dec 2016 03:09:34 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/
-----------------------------------------------------------

(Updated Dec. 21, 2016, 3:09 a.m.)


Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, and Joseph Wu.


Changes
-------

Let's fix the spurious framework registration bug while we're at it! :)


Summary (updated)
-----------------

Fixed spurious registration bug in framework and agent.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description (updated)
-------

This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
will resolve the issue, and fix tests that surfaced it (as well as some
associated errors that cause them to be flaky when this symbol is not
defined).

Currently when a new master is detected and no credential is provided,
the agent and framework (which have very similar registration code
paths) will attempt to (re)register after some random initial `delay`,
to avoid a "thundering herd" problem.  It is hence possible to have
spurious double-(re)registrations, since a new master could be detected
after we add the `delay`d registration, but before we execute it.

In a degenerate case, suppose a single agent has a registration delay
of one minute.  A master is brought up, and the agent successfully
registers with it.  Prior to this commit, the agent will still have a scheduled
registration loop (`doReliableRegistration`) with a backoff factor.
If the master goes down and a new master is brought up, the agent
will race against itself (two ongoing loops of `doReliableRegistration`)
to register with the new master.  If the first loop reaches the new
master first, authentication will fail and cause the agent to commit
suicide.

To resolve this problem, we store the value of the `Timer` returned by
`delay` in `doReliableRegistration` and cancel it when we have either
registered, or need to start a new cycle of registration. This solution
is implemented in both the framework code and the agent code.
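
As a rough illustration of the cancel-on-redetect idea, here is a minimal,
self-contained sketch. It does not use libprocess: `Clock`, `Timer`, `Agent`,
`detected`, and `simulate` below are hypothetical toy stand-ins, and the
delays are arbitrary ticks rather than the real backoff values. It only shows
why holding on to the timer handle and cancelling it avoids the duplicate
attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Toy timer handle (hypothetical; not the libprocess `Timer`).
struct Timer { std::uint64_t id; };

// Toy clock that runs delayed callbacks when time is advanced manually.
class Clock {
public:
  // Schedule `f` to run `after` ticks from now and return a handle to it.
  Timer delay(std::uint64_t after, std::function<void()> f) {
    Timer t{++nextId_};
    pending_.emplace(t.id, Entry{now_ + after, std::move(f)});
    return t;
  }

  // Cancel a pending timer. Returns false if it already fired, was
  // already cancelled, or was never scheduled.
  bool cancel(const Timer& t) { return pending_.erase(t.id) > 0; }

  // Advance time by `ticks`, firing due callbacks in deadline order.
  void advance(std::uint64_t ticks) {
    const std::uint64_t target = now_ + ticks;
    while (true) {
      auto best = pending_.end();
      for (auto it = pending_.begin(); it != pending_.end(); ++it) {
        if (it->second.due <= target &&
            (best == pending_.end() || it->second.due < best->second.due)) {
          best = it;
        }
      }
      if (best == pending_.end()) break;
      now_ = best->second.due;
      std::function<void()> f = std::move(best->second.callback);
      pending_.erase(best);  // Erase first: `f` may schedule new timers.
      f();
    }
    now_ = target;
  }

private:
  struct Entry { std::uint64_t due; std::function<void()> callback; };
  std::map<std::uint64_t, Entry> pending_;
  std::uint64_t now_ = 0;
  std::uint64_t nextId_ = 0;  // Real ids start at 1, so id 0 is never pending.
};

// Toy agent: every detected master schedules one delayed registration.
class Agent {
public:
  Agent(Clock& clock, bool cancelStaleTimer)
    : clock_(clock), cancelStaleTimer_(cancelStaleTimer) {}

  void detected(std::uint64_t initialDelay) {
    if (cancelStaleTimer_) {
      clock_.cancel(timer_);  // Drop the previous cycle before starting anew.
    }
    timer_ = clock_.delay(initialDelay, [this]() { ++attempts; });
  }

  int attempts = 0;  // Number of registration attempts that actually ran.

private:
  Clock& clock_;
  bool cancelStaleTimer_;
  Timer timer_{0};
};

// Returns how many registration attempts run when a second master is
// detected while the first delayed registration is still pending.
int simulate(bool cancelStaleTimer) {
  Clock clock;
  Agent agent(clock, cancelStaleTimer);
  agent.detected(60);  // First master: registration scheduled for t=60.
  clock.advance(10);
  agent.detected(5);   // New master at t=10: registration scheduled for t=15.
  clock.advance(100);
  return agent.attempts;
}
```

In this toy model, `simulate(false)` lets both delayed registrations run,
while `simulate(true)` runs only the one scheduled for the newest master.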


Diffs (updated)
-----

  src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 
  src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
  src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
  src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 
  src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 

Diff: https://reviews.apache.org/r/54909/diff/


Testing
-------

Ran `make check`, and ran `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to catch intermittent
failures, which is how we caught the failing test in `reservation_tests.cpp`. Note that this
bug was discovered when we added a `delay` to the call to `authenticate` in `slave::detected`
(in order to get it to match the behavior of the non-authenticated call to `doReliableRegistration`).


Thanks,

Alex Clemmer

