mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Rukletsov <ruklet...@gmail.com>
Subject Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.
Date Wed, 21 Feb 2018 12:42:31 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review197841
-----------------------------------------------------------




src/docker/executor.cpp
Lines 436-437 (original), 436-437 (patched)
<https://reviews.apache.org/r/65713/#comment278129>

    This comment is not strictly corrected: `run.isSome() == true` only means that we have
launched a docker cli `run` command and nothing more. Can you please update the comment? And
maybe also leave a note how `run` and `inspect` relate to task states where these fields are
declared? Thanks!
    
    As a side note (no call to action), a proper state machine with exlicit states would be
great to have here.



src/docker/executor.cpp
Lines 451-455 (patched)
<https://reviews.apache.org/r/65713/#comment278132>

    So instead of checking that inspect **always** finishes after `DOCKER_INSPECT_TIMEOUT`,
we only check that if we want to kill the container, either via kill or reap. I can see the
reason for this (we may want to give `docker inspect` enough time to crank up its whale machinery
if the user gives us this time), but maybe still print a warning? For example, hang the following
at the inspect call site:
    ```
    inspect
      .after(DOCKER_INSPECT_TIMEOUT, [=](const Future<Nothing>& f) {
        LOG(WARNING) << "docker inspect has not finished after " << DOCKER_INSPECT_TIMEOUT;
        return f;
      });



src/docker/executor.cpp
Line 509 (original), 518-519 (patched)
<https://reviews.apache.org/r/65713/#comment278130>

    "Docker stop timed out after " << KILL_RETRY_INTERVAL;
    
    Also no need to specify the container name if you follow the suggestion below.



src/docker/executor.cpp
Lines 523-527 (patched)
<https://reviews.apache.org/r/65713/#comment278131>

    Instead of this, why not make `stop.onFailed` below `stop.recover`?


- Alexander Rukletsov


On Feb. 20, 2018, 5:11 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2018, 5:11 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and Vinod Kone.
> 
> 
> Bugs: MESOS-8574
>     https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/1/
> 
> 
> Testing
> -------
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version string) (interface{},
error) {
> +       time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +       rand.Seed(time.Now().UTC().UnixNano())
> +       if rand.Intn(2) == 0 {
> +               time.Sleep(20 * time.Second)
> +       }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), &Self::retryKill, task->task_id(), offer.agent_id());`
after https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
>     killTask(taskId, agentId);
>     delay(Seconds(6), self(), &Self::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000"
--work_dir='/home/abudnik/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute --master="127.0.1.1:5050"
--name="a" --containerizer=docker --docker_image="ubuntu:xenial" --command="sleep 9999"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message