mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei Budnik <abud...@mesosphere.com>
Subject Re: Review Request 69684: Fixed the FD leak if containerizer::_launch() failed or discarded.
Date Tue, 08 Jan 2019 14:29:14 GMT


> On Jan. 8, 2019, 8:22 a.m., Gilbert Song wrote:
> > src/slave/containerizer/mesos/containerizer.cpp
> > Lines 1372 (patched)
> > <https://reviews.apache.org/r/69684/diff/2/?file=2118319#file2118319line1372>
> >
> >     Two cases:
> >     
> >     1) if `prepare(containerId, None())` fails, this `.repair` works, since it would
trigger `FDWrapper` to destructs itself.
> >     2) but if `_launch()` fails, `.repair` is no-op. it does not hurt but hard for
contributors to understand since those FD close are implicit in destructors.
> >     
> >     If we do not care `_launch()` is failed or not, we just want to safeguard the
`READY/FAILURE` cases for `prepare(containerId, None())`, should we move the .repair up (e.g.,
right after `prepare(containerId, None())`)?

I've moved `.repair` handler right after the callback that returns `ioSwitchboard->extractContainerIO(containerId)`.
> right after prepare(containerId, None())

In this case, when the future is discarded after calling `defer(self(), <prepare callback>)`,
but before `prepare callback` is executed, then neither `prepare callback` nor `.repair callback`
is called.


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69684/#review211756
-----------------------------------------------------------


On Jan. 7, 2019, 3:27 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69684/
> -----------------------------------------------------------
> 
> (Updated Jan. 7, 2019, 3:27 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6632
>     https://issues.apache.org/jira/browse/MESOS-6632
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> For the period between IOSB server is up and container process exec,
> if the containerizer launch returns failure or discarded, the FD used
> for signaling from the container process to the IOSB finish redirect
> will be leaked, which would cause the IOSB stuck at `io::redirect`
> forever. It would block the containerizer from cleaning up this
> container at IOSB.
> 
> This issue could be exposed if there are frequent check containers
> launch on an agent with heavy loads.
> 
> This patch fixes the issue by handling discard of a `launch` future,
> so the container IO is cleaned up and therefore all FDs are closed.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/containerizer.cpp a5cf2da55c046c5c45e0c2ca3400f64de12de62b

> 
> 
> Diff: https://reviews.apache.org/r/69684/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing:
> 
> Add `return Failure("Test!");` to `IOSwitchboard::_prepare()`: https://github.com/apache/mesos/blob/7a378597d6a4359de0b1eabb563d4c12cdb4290e/src/slave/containerizer/mesos/io/switchboard.cpp#L657
> 
> 1) Compile without this patch applied
> 2) Run `src/mesos-tests --verbose --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to a hanging IOSB (15 seconds timeout) after trying to destroy
IOSB on the HTTP error "Test!"
> 
> 1) Compile with this patch applied
> 2) Run `src/mesos-tests --verbose --gtest_filter=ContentType/AgentAPITest.AttachContainerInputRepeat/0`
> 3) This test fails due to the HTTP error "Test!" and terminates immediately (as expected)
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message