mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Niklas Nielsen" <...@qni.dk>
Subject Re: Review Request 34720: Added kill executor correction to slave.
Date Thu, 04 Jun 2015 20:06:24 GMT


> On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4400-4413
> > <https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400>
> >
> >     This is getting a little hairy. As the TODO says we really ought bubble this
up via the Termination protobuf. Have you looked into it?
> 
> Jie Yu wrote:
>     +1
>     
>     It really gets nasty here. I remembered that jaybuff said that they plan to fix that.
Maybe you want to sync with them first.
> 
> Niklas Nielsen wrote:
>     Mind adding a few references? Where is the Termination proto?
> 
> Jie Yu wrote:
>     Some references for the context:
>     https://issues.apache.org/jira/browse/MESOS-2035
>     https://reviews.apache.org/r/33249/ (see discussions in the review)
>     https://issues.apache.org/jira/browse/MESOS-2657
> 
> Niklas Nielsen wrote:
>     Can you be a bit more concrete about your concerns and suggested path forward? Are
you suggesting to land 2035 first? Is it the qosKilled bool that concerns you?
> 
> Jie Yu wrote:
>     My concern is that 'killed' field in Termination is overloaded. If it is true, it
means either 1) user initiated killTask 2) killed due to container limit has been reached
(e.g., memory, it's buggy already since we also have disk limit now). 3) killed due to QoS
controller.
>     
>     Therefore, relying on this field to decide what 'reason' and 'status' to send back
is very unintuitive and hard to understand. Just feel that this is a tech debt we should fix
asap.
>     
>     I don't have a suggestion yet. I need to think about it.

There are a few things we need to address:

1) What status should QoS killed tasks have? Thought TASK_LOST is better than TASK_FAILED.
   Adding new terminal statuses requires support in frameworks and my intuition was that we
should try to exercise the retry logic in the frameworks with TASK_LOST.

2) How should the framework be able tell the difference between different events.
   In this case, I thought it deserved it's own reason (as those are fine grained).

Didn't put more thought into this as I mimicked the existing code, so I am open to any change
you guys have in mind.
While I agree with all the above, it seems like a larger issue with the containerizer API
(and not well suited to be addressed here alone?)

> Therefore, relying on this field to decide what 'reason' and 'status' to send back is
very unintuitive and hard to understand. Just feel that this is a tech debt we should fix
asap.

This patch issues a containerizer->destroy() and this is where it ends; in order to distinguish
the cause of kill, it looks like we need to change the containerizer API?


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
-----------------------------------------------------------


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34720/
> -----------------------------------------------------------
> 
> (Updated June 4, 2015, 10:43 a.m.)
> 
> 
> Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and Vinod Kone.
> 
> 
> Bugs: MESOS-2653
>     https://issues.apache.org/jira/browse/MESOS-2653
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
>   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
> 
> Diff: https://reviews.apache.org/r/34720/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message