mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Rukletsov <ruklet...@gmail.com>
Subject Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.
Date Tue, 20 Mar 2018 12:53:35 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review199540
-----------------------------------------------------------



A high level question I have is the semantics of a task kill. You retry `SIGKILL` but not
`SIGTERM`. Why is it so? Do you expect frameworks to retry kills at some point of time (which
should be less that the kill policy timeout to make sense) if they do not receive a terminal
status update? Since the escalation timer is started anyway, is it correct to transition the
container back to `killing=false`? This can also lead to multiple `SIGKILL` requests going
to the agent for the same container, which is likely not a big deal, but maybe is worth checking?

My suggestion would be to retry everything ourselves and do not expect frameworks to do this
for us. When we faced a similar problem with the docker executor, we decided that once the
task has been transitioned to `killed`, there is no way back, even if the kill attempt failed.


src/launcher/default_executor.cpp
Lines 1242-1243 (original)
<https://reviews.apache.org/r/65695/#comment279827>

    Is this change related to your chain? Or does this fix a different issue with the broken
invariant?



src/launcher/default_executor.cpp
Lines 1278 (patched)
<https://reviews.apache.org/r/65695/#comment279826>

    This now means that the executor sends `TASK_KILLING` status update and then might unset
the `killing` flag. Can you please add a comment for `DefaultExecutor::Container::killing`
saying this?


- Alexander Rukletsov


On March 19, 2018, 4:36 p.m., Gaston Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65695/
> -----------------------------------------------------------
> 
> (Updated March 19, 2018, 4:36 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-8530
>     https://issues.apache.org/jira/browse/MESOS-8530
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The default executor transitions a task to `TASK_KILLING` and marks its
> child container as being killed before posting a `KILL` call to the
> agent.
> 
> The executor ignores kill requests for containers that are marked as
> being killed, and it doesn't remove this mark if the `KILL` call fails.
> This means that it's possible for tasks to get stuck in a `TASK_KILLING`
> state.
> 
> This patch makes the default executor remove the killing mark if a
> `KILL` call fails. That way a scheduler can retry a kill.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 
> 
> 
> Diff: https://reviews.apache.org/r/65695/diff/3/
> 
> 
> Testing
> -------
> 
> `sudo bin/mesos-tests.sh` on GNU/Linux
> 
> 
> Thanks,
> 
> Gaston Kleiman
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message