mesos-reviews mailing list archives

From Guangya Liu <gyliu...@gmail.com>
Subject Review Request 52465: Fixed the race in master update slave.
Date Sat, 01 Oct 2016 01:08:49 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52465/
-----------------------------------------------------------

Review request for mesos and Benjamin Mahler.


Repository: mesos


Description
-------

We need to call `updateSlave` first and rescind offers afterwards
because of a race condition: a batch allocation may be triggered
between the rescind and `updateSlave`. In that case the order is
rescind offer -> batch allocation -> update slave, which causes
problems when the oversubscribed resources have shrunk.

Suppose the oversubscribed resources shrink from 2 to 1. After the
rescind finishes, the batch allocation allocates the old 2
oversubscribed resources again, and only then does update slave set
the total oversubscribed resources to 1. The agent host is then
overcommitted for some time, because tasks can still use 2
oversubscribed resources rather than 1. Once the tasks using the 2
oversubscribed resources finish, everything goes back to normal.

If we update the slave first and then rescind offers, the order is
update slave -> batch allocation -> rescind offer, which causes no
problem when resources shrink. Suppose again that the oversubscribed
resources shrink from 2 to 1. Update slave sets the total
oversubscribed resources to 1 directly. The batch allocation then
does not allocate any oversubscribed resources, since more are
already allocated than the new total. Finally, rescind offer rescinds
all offers that use oversubscribed resources. This does not leave the
agent host overcommitted.
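
The two orderings can be compared with a small toy model. This is only
a sketch and not the Mesos code: the `Agent` class, its fields, and the
`run` helper are hypothetical names made up for illustration, and the
allocator is reduced to a single integer counter per agent.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy model of one agent's oversubscribed resources (hypothetical)."""
    total: int      # total oversubscribed resources known to the allocator
    allocated: int  # oversubscribed resources currently handed out in offers

    def update_slave(self, new_total: int) -> None:
        # The new oversubscribed estimate replaces the old total.
        self.total = new_total

    def rescind_offers(self) -> None:
        # Rescinding returns every offered oversubscribed resource.
        self.allocated = 0

    def batch_allocate(self) -> None:
        # Only hand out what is still free under the current total.
        available = max(0, self.total - self.allocated)
        self.allocated += available

    def overcommitted(self) -> bool:
        return self.allocated > self.total


def run(order):
    """Shrink the estimate from 2 to 1, applying the steps in `order`."""
    agent = Agent(total=2, allocated=2)
    steps = {
        "update": lambda: agent.update_slave(1),
        "allocate": agent.batch_allocate,
        "rescind": agent.rescind_offers,
    }
    for name in order:
        steps[name]()
    return agent


# Old order: the batch allocation runs between rescind and update.
old = run(["rescind", "allocate", "update"])
# New order: the total is updated before anything else happens.
new = run(["update", "allocate", "rescind"])
```

In this model `old.overcommitted()` is true (2 allocated against a
total of 1), while `new.overcommitted()` is false, which matches the
two scenarios described above.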


Diffs
-----

  src/master/master.cpp c83ee2f9fa05372748ff5056229fbe2bf06bfabb 

Diff: https://reviews.apache.org/r/52465/diff/


Testing
-------

Test with https://reviews.apache.org/r/51621/

```
./bin/mesos-tests.sh  --gtest_filter="OversubscriptionTest.RescindRevocableOffer" --gtest_repeat=100
```


Thanks,

Guangya Liu

