mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 52465: Fixed the race in master update slave.
Date Thu, 06 Oct 2016 00:07:09 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52465/#review151601
-----------------------------------------------------------


Fix it, then Ship it!




Thanks for fixing this! Glad that you added a second test for the decreasing case.

Can you file a bug for this issue?


src/master/master.cpp (lines 5554 - 5573)
<https://reviews.apache.org/r/52465/#comment220022>

    How about moving this up to above the update to the agent's resources?
    
    Also I think we can make this a bit more succinct:
    
    ```
    // NOTE: We must *first* update the agent's resources before we
    // recover the resources. If we recovered the resources first,
    // an allocation could trigger between recovering resources and
    // updating the agent in the allocator. This would lead us to
    // re-send out the stale oversubscribed resources!
    ```



src/tests/oversubscription_tests.cpp (line 443)
<https://reviews.apache.org/r/52465/#comment220023>

    This says "if" but we should say something like "In this test the oversubscribed resources
are increased, so the master will send out ...".



src/tests/oversubscription_tests.cpp (line 515)
<https://reviews.apache.org/r/52465/#comment220034>

    Do you need this WillRepeatedly?



src/tests/oversubscription_tests.cpp (line 523)
<https://reviews.apache.org/r/52465/#comment220026>

    Not yours, but can we move this pause to the top?



src/tests/oversubscription_tests.cpp (lines 533 - 535)
<https://reviews.apache.org/r/52465/#comment220025>

    Mind lining this up against the parenthesis?
    
    ```
    EXPECT_EQ(createRevocableResources("cpus", "2"),
              Resources(offers3.get()[0].resources()));
    ```



src/tests/oversubscription_tests.cpp (lines 542 - 544)
<https://reviews.apache.org/r/52465/#comment220027>

    ```
      // Advance the clock to trigger a batch allocation, this will
      // allocate the oversubscribed resources that were rescinded.
    ```



src/tests/oversubscription_tests.cpp (lines 558 - 563)
<https://reviews.apache.org/r/52465/#comment220029>

    How about s/shrinked/decreased/ ?



src/tests/oversubscription_tests.cpp (lines 560 - 562)
<https://reviews.apache.org/r/52465/#comment220030>

    Same comment here as above about the use of "if".



src/tests/oversubscription_tests.cpp (line 631)
<https://reviews.apache.org/r/52465/#comment220035>

    Do you need this WillRepeatedly?



src/tests/oversubscription_tests.cpp (lines 633 - 634)
<https://reviews.apache.org/r/52465/#comment220031>

    s/shrinked/decreased/



src/tests/oversubscription_tests.cpp (line 639)
<https://reviews.apache.org/r/52465/#comment220032>

    Can this be moved up to the beginning of the test? 
    
    (Eventually we want to force a paused clock for all tests to eliminate implicit reliance
on time progressing, leads to too many flaky tests).



src/tests/oversubscription_tests.cpp (lines 646 - 648)
<https://reviews.apache.org/r/52465/#comment220033>

    Ditto comment from the test above.


- Benjamin Mahler


On Oct. 1, 2016, 9:34 a.m., Guangya Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52465/
> -----------------------------------------------------------
> 
> (Updated Oct. 1, 2016, 9:34 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The reason that we need `updateSlave` first and then rescind offer
> is because of a race condition case: there may be a batch allocation
> triggered between rescind offer and `updateSlave`. In this case, the
> order will be rescind offer -> batch allocation -> update slave. This
> order will cause some issues when the oversubscribed resources was
> shrinked.
> 
> Suppose the oversubscribed resources was shrinked from 2 to 1, then
> after rescind offer finished, the batch allocation will allocate the
> old 2 oversubscribed resources again, then update slave will update
> the total oversubscribed resources to 1. This will cause the agent
> host have some time overcommitted due to the tasks can still use 2
> oversubscribed resources but not 1 oversubscribed resources, once
> the tasks using the 2 oversubscribed resources finished, everything
> goes back.
> 
> If we update slave first then rescind offer, the order will be update
> slave -> batch allocation -> rescind offer, this order will have no
> problem when shrinking resources. Suppose the oversubscribed resources
> was shrinked from 2 to 1, then update slave will update total
> oversubscribed resources to 1 directly, then the batch allocation will
> not allocate any oversubscribed resources since there are more
> allocated than total oversubscribed resources, then rescind offer
> will rescind all offers using oversubscribed resources. This will
> not lead the agent host to be overcommitted.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c83ee2f9fa05372748ff5056229fbe2bf06bfabb 
>   src/tests/oversubscription_tests.cpp 3dd34ea78ac795a6b0d342dcae86642c51841eea 
> 
> Diff: https://reviews.apache.org/r/52465/diff/
> 
> 
> Testing
> -------
> 
> make
> make check
> 
> ```
> GLOG_v=1 ./bin/mesos-tests.sh  --gtest_filter="OversubscriptionTest.RescindRevocableOffer*"
--verbose
> ```
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message