mesos-reviews mailing list archives

From Gilbert Song <songzihao1...@gmail.com>
Subject Re: Review Request 70609: Added missing onDiscard handler in timeout case for `cgroups::destroy`.
Date Thu, 09 May 2019 08:55:00 GMT


> On May 8, 2019, 5:25 p.m., Gilbert Song wrote:
> > src/linux/cgroups.cpp
> > Lines 1605-1616 (original), 1605-1617 (patched)
> > <https://reviews.apache.org/r/70609/diff/1/?file=2143893#file2143893line1605>
> >
> >     Seems like the commit description `onDiscarded` does not align with the implementation `onDiscard`. If we call `onDiscard` here, it will shortcut the `onAny` below. Seems to me both do not make sense here.
> >     
> >     I do not see any `hasDiscard` handler or `onDiscard` callback associated with this future. Probably we should consider removing `_destroy()`?

OK, please ignore my proposal above.

Still, the patch seems to be neither a fix nor a workaround. We cannot simulate the stuck issue that way.

There are two cases:
1. freezer cgroup - it is fine to call future.discard() and rely on .onAny, because the destructor of
internal::destroyer will discard the promise.
2. systemd (our case in this issue) - cgroups::remove() is a blocking call, which means the
.after() 1 min timeout had not been triggered yet. We were stuck in cgroups::remove(): https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1574
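
To make case 1 concrete, here is a minimal toy model of the difference between a discard request and the DISCARDED state. It is not libprocess code: `ToyFuture`, `requestDiscard()`, `markDiscarded()` and `onAnyRuns()` are hypothetical names, and the model only assumes that a discard request leaves the future pending until the producer honors it.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model only, NOT libprocess. All names are hypothetical.
class ToyFuture {
public:
  enum class State { PENDING, DISCARDED };

  void onDiscard(std::function<void()> f) { onDiscard_.push_back(std::move(f)); }
  void onAny(std::function<void()> f) { onAny_.push_back(std::move(f)); }

  // Consumer side: ask the producer to stop. The state stays PENDING.
  void requestDiscard() {
    if (state_ != State::PENDING || discardRequested_) return;
    discardRequested_ = true;
    for (auto& f : onDiscard_) f();
  }

  // Producer side: honor the request and make the state terminal.
  void markDiscarded() {
    if (state_ != State::PENDING) return;
    state_ = State::DISCARDED;
    for (auto& f : onAny_) f();
  }

  State state() const { return state_; }

private:
  State state_ = State::PENDING;
  bool discardRequested_ = false;
  std::vector<std::function<void()>> onDiscard_;
  std::vector<std::function<void()>> onAny_;
};

// Returns true if the onAny callback ran after a discard request.
bool onAnyRuns(bool producerHonorsDiscard) {
  ToyFuture future;
  bool ran = false;
  future.onAny([&ran]() { ran = true; });
  if (producerHonorsDiscard) {
    // Stand-in for a producer that discards its promise when asked.
    future.onDiscard([&future]() { future.markDiscarded(); });
  }
  future.requestDiscard();
  return ran;
}
```

In this model `onAnyRuns(true)` corresponds to the freezer path, where something on the producer side completes the discard, while `onAnyRuns(false)` corresponds to a promise that nobody ever touches again.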

We probably need to do more experiments to understand why the systemd hierarchy gets stuck in our cgroups::remove().
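
A rough sketch of why case 2 cannot be covered by the timeout, assuming (as described above) that the remove runs synchronously before the timeout is attached. This is a toy timeline with a simulated clock, not Mesos code; `Timeline` and `destroyWithTimeout()` are hypothetical names.

```cpp
#include <cassert>

// Toy timeline only, NOT Mesos code. All names here are hypothetical.
struct Timeline {
  long timerArmedAt;  // Simulated second at which the timeout starts counting.
  long deadline;      // Simulated second at which the timeout would fire.
};

// Models a destroy() that performs a synchronous remove first and only then
// attaches a timeout. `clock` is a simulated clock in seconds.
Timeline destroyWithTimeout(long& clock, long removeBlocksFor, long timeout) {
  clock += removeBlocksFor;  // The caller is stuck here; no timer exists yet.
  return Timeline{clock, clock + timeout};
}
```

With a remove that blocks for 300 simulated seconds and a 60 second timeout, the timer is only armed at second 300. If the remove never returns, the timer is never armed at all, so no discard handler on the returned future can help.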


- Gilbert


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70609/#review215143
-----------------------------------------------------------


On May 8, 2019, 6:28 a.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70609/
> -----------------------------------------------------------
> 
> (Updated May 8, 2019, 6:28 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jie Yu, and Qian Zhang.
> 
> 
> Bugs: MESOS-9306
>     https://issues.apache.org/jira/browse/MESOS-9306
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, when cgroup destruction took longer than the given timeout,
> we called discard on the future. However, only an `onAny` callback was
> subscribed on this future, so the DISCARDED state was not handled.
> This patch adds the missing `onDiscarded` handler.
> 
> 
> Diffs
> -----
> 
>   src/linux/cgroups.cpp 73646c9eb39948192acedb67e3d2fb13acb14b30 
> 
> 
> Diff: https://reviews.apache.org/r/70609/diff/1/
> 
> 
> Testing
> -------
> 
> 1. In order to imitate a hanging `cgroups::destroy`, the following code has been added to the beginning of the `destroy()` function:
> ```
> Future<Nothing> destroy(const string& hierarchy, const string& cgroup)
> {
>   Owned<Promise<Nothing>> promise(new Promise<Nothing>());
>   return promise->future()
>   ...
> ```
> 
> Observed behaviour _without_ this patch applied:
>  Container is stuck in `DESTROYING` state, the last message in logs is:
> ```
> I0508 09:12:29.141111 21426 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/e2066055-69b4-4272-8b8d-352e308aaaca'
> ```
> 
> Observed behaviour _with_ this patch applied:
>  Container finishes deinitialization after 1 minute timeout:
> ```
> E0508 09:13:29.143476 21423 slave.cpp:6591] Termination of executor 'a' of framework f7d6437a-beae-45eb-80ab-ef92e839f352-0000 failed: Failed to kill all processes in the container: Timed out after 1mins
> ```
> 
> 2. sudo make check
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>

