mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Mann <g...@mesosphere.io>
Subject Re: Review Request 70778: Fixed a deadlock in libprocess.
Date Thu, 06 Jun 2019 20:18:17 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70778/#review215721
-----------------------------------------------------------




3rdparty/libprocess/include/process/process.hpp
Lines 420-421 (original), 420-423 (patched)
<https://reviews.apache.org/r/70778/#comment302534>

    In order to help prevent issues in the future, consider expanding this comment to note
that for an interface which enqueues events into a process's queue and takes ownership of
the event in all cases, callers can go through the process manager via `ProcessManager::deliver()`.
(Although the `deliver()` interface is a bit ambiguous, since we have two overloads, one of
which deletes un-enqueued events, the other of which does not.



3rdparty/libprocess/src/process.cpp
Lines 377-380 (original), 377-380 (patched)
<https://reviews.apache.org/r/70778/#comment302533>

    Similar to the note on `ProcessBase::enqueue()` saying that callers retain ownership of
un-enqueued events, could you add a comment here noting that the process manager takes ownership
of the event passed into `deliver()`?



3rdparty/libprocess/src/process.cpp
Line 1594 (original), 1594 (patched)
<https://reviews.apache.org/r/70778/#comment302535>

    Does this change have the desired effect? Since it's calling into the `deliver()` overload
which accepts a `ProcessBase*`, the event won't be deleted if it's not successfully enqueued?


- Greg Mann


On June 4, 2019, 8:09 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70778/
> -----------------------------------------------------------
> 
> (Updated June 4, 2019, 8:09 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9808
>     https://issues.apache.org/jira/browse/MESOS-9808
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9808, when we deliver a DispatchEvent to a TERMINATING
> Process, the DispatchEvent is destructed while holding a
> TERMINATING process reference. This can lead to a deadlock
> because the DispatchEvent destruction can run additional code
> (e.g. destructors of objects that have been bound into the
> dispatch) that ultimately invokes terminate (or dispatch, or send)
> while the original TERMINATING process reference is still held. If
> the TERMINATING process referenced is being cleaned up, we may
> deadlock between (1) `ProcessManager::cleanup()` holding the
> `processes_mutex` and spinning for all references to go away and
> (2) the terminate (or dispatch, or send) in ProcessManager::use
> trying to resolve the destination Process by locking
> `processes_mutex`.
> 
> Example from MESOS-9808:
> 
>   Thread 19:
>     Running ReaderProcess
>     ReaderProcess::_consume, sets a Promise
>     Dispatches to mock API subscriber Process. **It's TERMINATING,
>       so DispatchEvent gets deleted while holding a reference to
>       the mock api subscriber!**
>     Destructs the process::Loop in mock API subscriber Process
>     Deletes the Reader, Reader terminates ReaderProcess
>     **Trying to lock `processes_mutex` in `ProcessManager::use()`
>       of ReaderProcess (deadlock)**
> 
>   Thread 5:
>     ProcessManager::cleanup of Mock API Subscriber Process,
>       looping on references to go away (spinning forever)
> 
> Here, thread 5 is holding `processes_mutex` locked and spinning
> waiting for references to the Mock API Subscriber Process to go
> away, but thread 19 is blocked also trying to lock `processes_mutex`
> while holding a reference to the Mock API Subscriber Process!
> 
> The fix here is to not allow arbitrary code to execute while a
> ProcessReference is held. This is already done in other locations
> but this spot was missed. The reason we haven't hit this bug yet
> is that it's rather uncommon for a dispatch to have bound
> objects whose destructors lead to a `terminate()` call (today
> `terminate()` always resolves the process reference, even if
> it's already resolved, whereas `dispatch()` uses cached process
> references).
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/include/process/process.hpp 7c255acc21c695eba37062a3dcf72ce33f650cd0

>   3rdparty/libprocess/src/event_queue.hpp 999d5520dc5824e47ba5cf676cf28cb6ca37ceb2 
>   3rdparty/libprocess/src/process.cpp a8d9151c91892b55d31cc7c3fb4c7243c41d9d20 
> 
> 
> Diff: https://reviews.apache.org/r/70778/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message