mesos-reviews mailing list archives

From "Jie Yu" <yujie....@gmail.com>
Subject Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
Date Sat, 02 May 2015 00:34:44 GMT


> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it this way, can we just try to make sure `containerizer->wait`
here will return a failure (or a Termination with some reason) when `containerizer->launch`
fails? That way, `executorTerminated` will properly send status updates to the scheduler
(TASK_LOST/TASK_FAILED).
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems here. The
problem I am referring to is the fact that we don't send a status update to the scheduler if
the containerizer launch fails until the executor reregistration timeout fires. Since, for the docker
containerizer, someone might use a very large timeout value, ideally the slave should send
a status update to the scheduler right after the containerizer launch fails.
>     
>     After chatting with Jay, the problem you guys are referring to is the fact that the scheduler
cannot distinguish between the case where the task has failed and the case where the configuration
of a task is incorrect, because in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a containerizer->destroy(..)
in executorLaunched when containerizer->launch fails. That way, it will trigger
containerizer->wait and thus send a status update to the scheduler.
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO, we should include a reason field in Termination
(https://issues.apache.org/jira/browse/MESOS-2035) and let sendExecutorTerminatedStatusUpdate
propagate the termination reason to the scheduler.
> 
> Timothy Chen wrote:
>     The reason field sounds good; I think what you proposed makes sense. In the docker containerizer,
at least, we also need to make sure the termination message is set correctly, as it currently doesn't
contain all the error information that we pass back to the launch future.
> 
> Jay Buffington wrote:
>     Sorry for the confusion.   There are three problems that are all related.  Yes, we
need to send statusUpdates as soon as containerizer launch fails and yes, we need to set the
reason so the scheduler can distinguish a slave failure from a bad request.  However, my intention
with this patch is not to address either of those two problems.
>     
>     My goal with this patch is to simply send the containerizer launch failure message
back to the scheduler.  I am using Aurora with the docker containerizer.  There are a myriad
of reasons that dockerd could fail to "docker run" a container.  Currently, when that fails
the user sees a useless and incorrect "Abnormal Executor Termination" message in the Aurora
web ui.  With this patch they see the stderr output of "docker run."  This way they can report
meaningful error messages to the operations team.
>     
>     I can update this patch to address the other two issues, but the key is that when
the containerizer launch fails we have to send a statusUpdate with a message that contains
future.failure().  Before this patch we were only logging it.  The scheduler needs to get
that error message.
> 
> Jie Yu wrote:
>     Thanks for clarifying it, Jay! In fact, my proposal above should be able to solve
the third problem cleanly. Check out `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer
should properly set the message and reason fields in the Termination protobuf (basically why
the container gets terminated and what's the error message). The slave will forward the reason
and message to the scheduler through status update.
>     
>     I chatted with BenM about this yesterday, and there are a couple of notes I want
to write down here.
>     
>     1. We probably need multiple levels for TaskStatus::Reason. In other words, we probably
need a "repeated Reason reasons" field in status update message. For instance, for a containerizer
launch failure, we probably need two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED;
2) the second level reason REASON_DOCKER_PULL_FAILURE;
>     
>     2. We probably want to allow extension to TaskStatus::Reason enum. For example, some
framework/executor may want to add customized reasons. We could leverage the protobuf extension
support to achieve that (https://developers.google.com/protocol-buffers/docs/proto#extensions).
>     
>     3. The semantics around Termination are currently broken and we need to clean them up.
For instance, the following code is broken:
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have disk limit as well.
>     
>     Another issue with Termination is that containerizer->wait sometimes returns
a Failure (e.g., launch failed, or destroy failed), which means we cannot get the reason
and message fields from Termination anymore. Right now, if that happens, we always set the
reason to REASON_EXECUTOR_TERMINATED and the message to "Abnormal executor termination"
in the status update. I think this is a hack; IMO, the containerizer should always return
a valid Termination protobuf to the slave so that the slave can send a meaningful status update
to the framework.
> 
> Jay Buffington wrote:
>     Thanks for these comments, Jie.  I'll work on implementing your proposal.
>     
>     I was unaware of the existence of the Termination protobuf.  I am confused by these
comments (which I think are just flat out wrong): https://github.com/apache/mesos/blob/c36d5996327ca765f49c211d489371c99ef8e090/src/slave/slave.cpp#L3177
which say:
>     
>        // A termination failure indicates the containerizer could not destroy a
>        // container.
>        // TODO(idownes): This is a serious error so consider aborting the slave if
>        // this occurs.
>        
>     I don't understand when you would ever return Failure() in your containerizer without
doing container->termination.set(termination);
>     
>     Thanks!
> 
> Jie Yu wrote:
>     > I don't understand when you would ever return Failure() in your containerizer
without doing container->termination.set(termination);
>     
>     Yeah, I think that's tech debt and we should correct it. Let me know if you need
any help with this.
> 
> Jay Buffington wrote:
>     I finally sat down to work through what you said, Jie.
>     
>     Check this out:
>     https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1194-L1202
>     
>     When pull() fails, destroy() is called.  destroy() correctly sets the Termination, but then
it deletes the container!  Then when we call containerizer->wait() in slave.cpp, we get
back a Failure("Unknown container: ...") instead of the termination future like we should.
>     
>     IMHO the solution here is to split up the code from destroy() into two methods:
one that sets the termination (call this launchFailed()), and then make destroy() just erase
the container.
>     
>     Then we should only call launchFailed() from launch (here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L636)

>     
>     When launch fails we should call containerizer->destroy() from slave's executorLaunched
method.

An alternative: can we simply move `containerizer->wait(containerId)` in `Slave::executorLaunched`
to `Framework::launchExecutor`, right after `launch.onAny(defer(slave, &Slave::executorLaunched,
...));`?


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On April 21, 2015, 5:14 p.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated April 21, 2015, 5:14 p.m.)
> 
> 
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> 
> Introduces a new TaskStatus: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added a test case to slave_tests.cpp.  I also tried this with Aurora, supplied a bogus
docker image URL, and saw the "docker pull" failure stderr message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>

