mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay Buffington" ...@jaybuff.com>
Subject Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
Date Thu, 07 May 2015 18:47:17 GMT


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/examples/docker_no_executor_framework.cpp, line 109
> > <https://reviews.apache.org/r/33249/diff/4/?file=952333#file952333line109>
> >
> >     is this something you like to commit?

Whops!  That snuck in.  I'll fix.


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/slave/containerizer/docker.cpp, line 1187
> > <https://reviews.apache.org/r/33249/diff/4/?file=952336#file952336line1187>
> >
> >     This seems to not respect anymore the explicit state we try to track what stage
the containerizer was running when we destroy, and what stage it failed.
> >     
> >     Does the error message has clear message saying it failed to fetch/pull/run?

tracking the stage the container is in when we destroyed is done by the container->state
enum now, right?  

Yes, the status message on TASK_FAIL that the scheduler gets back now looks like this:

    Failed to launch container: Failed to 'docker pull bogus-image:latest': exit status =
exited with status 1 stderr = time="2015-05-07T17:45:35Z" level=fatal msg="Error: image library/bogus-image:latest
not found"
    
I tested fetch failures by providing a bogus URI and run failures by starting the slave with
--docker=/tmp/docker which looked like this:

    $ cat /tmp/docker
    #!/bin/bash

    if [[ $1 == "run" ]]; then
        echo "docker run failure!" 1>&2
        exit 123
    fi

    exec docker $*


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/slave/slave.cpp, line 3125
> > <https://reviews.apache.org/r/33249/diff/4/?file=952339#file952339line3125>
> >
> >     We shouldn't need to destroy if none of the containerizers return true.

Ah, that's right.  I will just remove this line.


- Jay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82832
-----------------------------------------------------------


On May 7, 2015, 9:17 a.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated May 7, 2015, 9:17 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
>     https://issues.apache.org/jira/browse/MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently noone noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   src/examples/docker_no_executor_framework.cpp d5385d9295d30a6039bab9938f3270006582d3da

>   src/slave/containerizer/containerizer.hpp 56c088abb036aa26c323c3142fd27ea29ed51d4f

>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 5e5f13ed8a71ff9510b40b6032d00fd16d312622

>   src/slave/containerizer/mesos/containerizer.cpp f2587280dc0e1d566d2b856a80358c7b3896c603

>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message