mesos-reviews mailing list archives

From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 52465: Fixed the race in master update slave.
Date Fri, 07 Oct 2016 19:16:25 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52465/#review151848
-----------------------------------------------------------


Ship it!

- Benjamin Mahler


On Oct. 6, 2016, 2:23 a.m., Guangya Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52465/
> -----------------------------------------------------------
> 
> (Updated Oct. 6, 2016, 2:23 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-6317
>     https://issues.apache.org/jira/browse/MESOS-6317
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> We need to call `updateSlave` first and then rescind offers because
> of a race condition: a batch allocation may be triggered between
> rescinding the offers and `updateSlave`. In that case, the order is
> rescind offers -> batch allocation -> update slave, which causes
> problems when the oversubscribed resources have decreased.
> 
> Suppose the oversubscribed resources decreased from 2 to 1. After
> the offers are rescinded, the batch allocation allocates the old 2
> oversubscribed resources again, and then update slave sets the total
> oversubscribed resources to 1. The agent host is then overcommitted
> for some time, because tasks can still use 2 oversubscribed
> resources rather than 1. Once the tasks using the 2 oversubscribed
> resources finish, the agent returns to normal.
> 
> If we update the slave first and then rescind offers, the order is
> update slave -> batch allocation -> rescind offers, which causes no
> problem when shrinking resources. Suppose the oversubscribed resources
> shrank from 2 to 1. Update slave sets the total oversubscribed
> resources to 1 directly. The batch allocation then allocates no
> oversubscribed resources, because more are already allocated than the
> new total. Finally, rescind offers rescinds all offers using
> oversubscribed resources. The agent host does not become
> overcommitted.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 02a2fb29bdd8484fc90e5cb033ac29b49a141860 
>   src/tests/oversubscription_tests.cpp 3dd34ea78ac795a6b0d342dcae86642c51841eea 
> 
> Diff: https://reviews.apache.org/r/52465/diff/
> 
> 
> Testing
> -------
> 
> make
> make check
> 
> ```
> ./bin/mesos-tests.sh  --gtest_filter="OversubscriptionTest.RescindRevocableOffer*" --gtest_repeat=20
> ```
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>
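
The ordering argument in the quoted description can be illustrated with a toy model. This is a minimal sketch and not the actual Mesos code: `ToyAgent` and the free functions below are hypothetical stand-ins that only track counters of total and offered oversubscribed resources, and the allocation rule (offer whatever part of the total is not yet offered) is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical toy model of one agent's oversubscribed resources.
struct ToyAgent {
  int total;    // total oversubscribed resources known to the allocator
  int offered;  // oversubscribed resources currently offered out
};

// Stand-in for the master's update: record the new oversubscribed total.
void updateSlave(ToyAgent& agent, int newTotal) {
  assert(newTotal >= 0);
  agent.total = newTotal;
}

// Stand-in for rescinding every outstanding oversubscribed offer.
void rescindOffers(ToyAgent& agent) { agent.offered = 0; }

// Stand-in for a batch allocation: offer whatever is not yet offered.
// Offers nothing when more is already offered than the current total.
void batchAllocate(ToyAgent& agent) {
  agent.offered += std::max(0, agent.total - agent.offered);
}

bool overcommitted(const ToyAgent& agent) {
  return agent.offered > agent.total;
}

// Old order: rescind offers -> batch allocation -> update slave.
ToyAgent rescindThenUpdate(int oldTotal, int newTotal) {
  ToyAgent agent{oldTotal, oldTotal};  // everything is offered at the start
  rescindOffers(agent);
  batchAllocate(agent);  // races in and re-offers the stale total
  updateSlave(agent, newTotal);
  return agent;
}

// New order: update slave -> batch allocation -> rescind offers.
ToyAgent updateThenRescind(int oldTotal, int newTotal) {
  ToyAgent agent{oldTotal, oldTotal};
  updateSlave(agent, newTotal);
  batchAllocate(agent);  // sees the new total and offers nothing extra
  rescindOffers(agent);
  return agent;
}
```

With the numbers from the description (a decrease from 2 to 1), `rescindThenUpdate(2, 1)` ends with 2 offered against a total of 1, while `updateThenRescind(2, 1)` ends with nothing offered against a total of 1.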

