mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jay Buffington" ...@jaybuff.com>
Subject Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
Date Wed, 22 Apr 2015 17:18:40 GMT


> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing that in your way, can we just try to make sure `containerizer->wait`
here will return a failure (or a Termination with some reason) when `containerizer->launch`
fails. In that way, the `executorTerminated` will properly send status updates to the slave
(TASK_LOST/TASK_FAILED).
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems here. The
problem I am refering to is the fact that we don't send status update to the scheduler if
containerizer launch fails until executor reregistration timeout happens. Since for docker
containerizer, someone might use a very large timeout value, ideally, the slave should send
a status update to the scheduler right after containerizer launch fails.
>     
>     After chat with Jay, the problem you guys are refering to is the fact that the scheduler
cannot disinguish between the case where the task has failed vs. the case where the configuration
of a task is not correct, because in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a containerizer->destroy(..)
in executorLaunched when containerizer->launch fails. In that way, it's going to trigger
containerizer->wait and thus send status update to the scheduler.
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO, we should include a reason field in Termination
(https://issues.apache.org/jira/browse/MESOS-2035) and let sendExecutorTerminatedStatusUpdate
to propagate the termination reason to the scheduler.
> 
> Timothy Chen wrote:
>     Reason field sounds good, I think what you proposed makes sense, in docker containerizer
at least we also need to make sure termination message is set correctly as currently it doesn't
contain all the error information that we pass back to the launch future.

Sorry for the confusion.   There are three problems that are all related.  Yes, we need to
send statusUpdates as soon as containerizer launch fails and yes, we need to set the reason
so the scheduler can distinguish a slave failure from a bad request.  However, my intention
with this patch is not to address either of those two problems.

My goal with this patch is to simply send the containerizer launch failure message back to
the scheduler.  I am using Aurora with the docker containerizer.  There are a myriad of reasons
that dockerd could fail to "docker run" a container.  Currently, when that fails the user
sees a useless and incorrect "Abnormal Executor Termination" message in the Aurora web ui.
 With this patch they see the stderr output of "docker run."  This way they can report meaningful
error messages to the operations team.

I can update this patch to address the other two issues, but the key is that when the containerizer
launch fails we have to send a statusUpdate with a message that contains future.failure().
 Before this patch we were only logging it.  The scheduler needs to get that error message.


- Jay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated April 21, 2015, 10:14 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encounted when trying to launch.
> 
> Introduces a new TaskStatus: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added test case to slave_test.cpp.  I also tried this with Aurora, supplied a bogus
docker image url and saw the "docker pull" failure stderr message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message