mesos-reviews mailing list archives

From Joseph Wu <jos...@mesosphere.io>
Subject Re: Review Request 67403: Handled race condition when removing maintenance windows.
Date Mon, 04 Jun 2018 18:25:10 GMT


> On May 31, 2018, 9:41 a.m., Vinod Kone wrote:
> > Can you add a unit test for this?
> 
> Benno Evers wrote:
>     It's tricky because we need very precise control over the scheduling, and I'm not sure our testing infrastructure provides it. But I'll look into it.
> 
> Vinod Kone wrote:
>     I see. The bug is in the allocator, so unfortunately you cannot use a mock allocator for control. Consider pausing the clock to have better control in the test.
> 
> Benno Evers wrote:
>     After discussing with Benjamin Bannier, we came to the conclusion that it's currently not possible to write a unit test for this scenario, because we're lacking the capability to intercept a dispatch and re-insert it into the event queue at a later time.
> 
> Joseph Wu wrote:
>     I gave writing the test a shot... and I think it might be possible, but the resulting test would be too fragile to be a regression test.
>     
>     Here's my (not working yet) attempt: https://github.com/kaysoky/mesos/commit/29c6a1807d65d01440b7c67a73062ae9af892afe
> 
> Benno Evers wrote:
>     Do you plan to continue working on that, or should we go ahead and commit the fix?

I'll commit this patch shortly.

The test is more of an experiment to see how bad a test for this scenario would look :D


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67403/#review204121
-----------------------------------------------------------


On June 1, 2018, 7:17 a.m., Benno Evers wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67403/
> -----------------------------------------------------------
> 
> (Updated June 1, 2018, 7:17 a.m.)
> 
> 
> Review request for mesos, Joseph Wu and Vinod Kone.
> 
> 
> Bugs: MESOS-7966
>     https://issues.apache.org/jira/browse/MESOS-7966
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When executing the `Master::inverseOffers()` callback, it
> could happen that the maintenance window the inverse offer
> referred to was already removed by a concurrent call to
> the maintenance endpoint of Mesos.
> 
> In this case, we must not send out an inverse offer, because
> having outstanding inverse offers for an agent without
> any scheduled maintenance window will lead to a crash in
> the allocator when attempting to remove this offer.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp ba3f8746ea393c8655fcd5ceaace099f68df0b19 
> 
> 
> Diff: https://reviews.apache.org/r/67403/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Set up the reproduction environment locally and ran `while :; do python call.py; done` for about a minute. (see linked ticket)
> 
> 
> Thanks,
> 
> Benno Evers
> 
>
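
The race and the guard described in the review's description can be illustrated with a toy, single-threaded event queue. This is only a sketch under stated assumptions, not the patch itself: `ToyMaster`, `runScenario`, the `guard` flag, and the agent name are hypothetical stand-ins, and the real `Master::inverseOffers()` signature and the allocator's bookkeeping are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the master's state; not the real Mesos classes.
struct ToyMaster {
  std::set<std::string> scheduledMaintenance;  // Agents with a maintenance window.
  std::vector<std::string> sentInverseOffers;  // Agents we sent inverse offers for.
  std::deque<std::function<void()>> events;    // Single-threaded event queue.

  // Callback queued on behalf of the (toy) allocator. With `guard` set, it
  // re-checks that the maintenance window still exists before sending.
  void inverseOffers(const std::string& agent, bool guard) {
    if (guard && scheduledMaintenance.count(agent) == 0) {
      return;  // Window was removed in the meantime: drop the inverse offer.
    }
    sentInverseOffers.push_back(agent);
  }

  // Drain the queue in FIFO order.
  void run() {
    while (!events.empty()) {
      std::function<void()> event = std::move(events.front());
      events.pop_front();
      event();
    }
  }
};

// Replays the problematic interleaving and returns how many inverse offers
// were sent for an agent that no longer has a maintenance window.
std::size_t runScenario(bool guard) {
  ToyMaster master;
  master.scheduledMaintenance.insert("agent-1");

  // The maintenance endpoint removes the window first...
  master.events.push_back([&master]() {
    master.scheduledMaintenance.erase("agent-1");
  });

  // ...while the callback, decided on when the window still existed, is
  // already queued behind it.
  master.events.push_back([&master, guard]() {
    master.inverseOffers("agent-1", guard);
  });

  master.run();
  assert(master.events.empty());
  return master.sentInverseOffers.size();
}
```

Calling `runScenario(false)` leaves one inverse offer outstanding for an agent without a scheduled window, which is the state the description says crashes the allocator later; `runScenario(true)` sends none. The toy controls the interleaving simply by choosing the order of the two queued events, which is the kind of control the thread notes is hard to get in a real unit test.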

