mesos-reviews mailing list archives

From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 59685: Fixed flakiness in OneWayPartitionTest.MasterToSlave.
Date Wed, 31 May 2017 18:04:23 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59685/
-----------------------------------------------------------

(Updated May 31, 2017, 6:04 p.m.)


Review request for mesos.


Changes
-------

Add comment.


Repository: mesos


Description (updated)
-------

The test did not pause the clock. This allowed the following sequence of
events to occur, with low probability:

  (1) Agent sends register message M1 to master.
  (2) Agent's register timer expires; agent sends register message M2
      to master.
  (3) Master sees M1 and adds agent with ID A1.
  (4) Agent gets SlaveRegisteredMessage with ID A1.
  (5) Test case injects `exited` event for agent; master marks agent as
      disconnected.
  (6) Master sees M2; since the agent is currently disconnected, the
      master removes A1 and adds the agent with ID A2.
  (7) Agent gets SlaveRegisteredMessage with ID A2. Since this is
      unexpected, it exits ("Registered but got wrong id").

This commit fixes the test case to pause the clock; this prevents the
second registration attempt in step (2) above.
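
The sequence above can be sketched with a small toy model. This is not
Mesos code: `ToyMaster`, `ToyAgent`, and `runScenario` are hypothetical
names invented for illustration, and the handling of a duplicate register
message from a still-connected agent (re-sending the existing ID) is an
assumption. The `clockPaused` flag only models whether the retry timer in
step (2) gets a chance to fire.

```cpp
#include <cassert>
#include <string>

// Toy model of the race; NOT Mesos code. All names are hypothetical.
struct ToyMaster {
  int nextId = 1;
  std::string agentId;     // Empty until the agent is first added.
  bool connected = false;

  // Handles one register message; returns the ID sent back to the agent.
  std::string onRegister() {
    if (!agentId.empty() && connected) {
      // Assumption: a duplicate from a connected agent gets its current ID.
      return agentId;
    }
    // New agent, or known but disconnected agent: add it under a fresh ID.
    agentId = "A" + std::to_string(nextId++);
    connected = true;
    return agentId;
  }

  // Models the `exited` event injected by the test case.
  void onExited() { connected = false; }
};

struct ToyAgent {
  std::string id;          // Empty until the first registration succeeds.
  bool alive = true;

  void onRegistered(const std::string& assigned) {
    if (id.empty()) {
      id = assigned;
    } else if (id != assigned) {
      alive = false;       // "Registered but got wrong id"
    }
  }
};

// Returns true if the agent survives the scenario.
bool runScenario(bool clockPaused) {
  ToyMaster master;
  ToyAgent agent;

  // Steps (1)-(2): M1 is always sent; M2 is sent only if the register
  // timer can expire, which a paused clock prevents.
  const bool m2InFlight = !clockPaused;

  agent.onRegistered(master.onRegister());    // Steps (3)-(4): M1 -> A1.
  master.onExited();                          // Step (5).

  if (m2InFlight) {
    agent.onRegistered(master.onRegister());  // Steps (6)-(7): M2 -> A2.
  }
  return agent.alive;
}
```

In this model, `runScenario(true)` leaves the agent alive, while
`runScenario(false)` ends with the agent exiting on the mismatched ID.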

The scenario described above might occur in an actual Mesos deployment,
albeit with very low probability. This would result in a Mesos agent
shutting down immediately after initial registration. MESOS-7596 has
been created to track this issue.


Diffs (updated)
-----

  src/tests/partition_tests.cpp 4ff428564d1fa6cb96e6f8ec8edc331da88a3eb6 


Diff: https://reviews.apache.org/r/59685/diff/2/

Changes: https://reviews.apache.org/r/59685/diff/1-2/


Testing
-------

`./src/mesos-tests --gtest_filter="OneWayPartitionTest.MasterToSlave" --gtest_repeat=10000 --gtest_break_on_failure`

Without this change, the test fails roughly once every 300 iterations.


Thanks,

Neil Conway

