mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Bannier <benjamin.bann...@mesosphere.io>
Subject Re: Review Request 68147: Added agent support to remove local resource providers.
Date Wed, 31 Oct 2018 14:38:45 GMT


> On Aug. 18, 2018, 12:29 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 7934 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line7934>
> >
> >     `OPERATION_GONE_BY_OPERATOR` is not a terminal state:
> >     https://github.com/apache/mesos/blob/808485da01387bd27a51cb82a90b1f8301d613ee/include/mesos/mesos.proto#L2342-L2349
> 
> Benjamin Bannier wrote:
>     From https://reviews.apache.org/r/68410/#comment_rc290945-207560,
>     
>     > ... It seems that OPERATOR_GONE_BY_OPERATOR does not carry the semantics we
want. Looking at the other possible error codes, it seems we could either use OPERATION_ERROR
or OPERATION_FAILED instead where OPERATION_ERROR possibly better fits our requirements as
the operation's side effects might have materialized. That would mean we should drop this
patch completely.  
>     > WDYT?
> 
> Chun-Hung Hsiao wrote:
>     I'm not sure if either is a good choice:
>     
>     `OPERATION_ERROR` usually means some validation failure before applying the operation,
the consumed resource remains (i.e., can be used by other operations), but the operation is
non-retryable (i.e., the caller must fix the error).
>     `OPERATION_FAILED` usually means something goes wrong after applying the operation,
the consumed resource remains, and the operation is retryable.
>     
>     In the case of mark RP gone, although we can argue that any operation becomes an
non-retryable one and we don't care about the consumed resoure because it will then be removed.
There still are some issues:
>     
>     1. The reason of the operation being non-retryable is not due to any client-side
error, but a server-side "error," and this seems inconsistent to the current `OPERATION_ERROR`
usage.
>     2. We transition an operation to `OPERATION_GONE_BY_OPERATOR` when marking an agent
as gone, but to `OPERATION_ERROR` when marking a local RP as gone. This sounds inconsistent
to me from an API perspective.
>     3. For `OPERATION_ERROR` and `OPERATION_FAILED`, Mesos provides reliable retry (at
least under "normal" circumstances), and expects acks. For `OPERATION_GONE_BY_OPERATOR`, we
could just do a best-effort notification (the implementation is missing though) and don't
need any ack, similar to `TASK_GONE_BY_OPERATOR`. In the case of mark RP as gone, we probably
want to go with the later.
>     
>     So it seems better if we still go with `OPERATION_GONE_BY_OPERATOR` and make the
proper master changes. And we need to make sure that we call `allocator->updateSlave` before
calling `allocator->recoverResources` to avoid the consumed resource being offered to other
framework in between.

I added changes to use `OPERATION_GONE_BY_OPERATOR` here, and to treat it as terminal.


> On Aug. 18, 2018, 12:29 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 8195 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line8195>
> >
> >     Let's validate that there is no task using the resources provided by this RP
before doing the removal, or fail the call.
> 
> Benjamin Bannier wrote:
>     I believe this should be treated as orthogonal issue. It might e.g., happen that
the user triggering the RP removal cannot kill tasks or prevent them from being rescheduled.
I'd suggest to leaves as is.
>     
>     WDYT?
> 
> Chun-Hung Hsiao wrote:
>     The scenario I am worried about is that when the RP is marked as gone, the RP may
fail to tear itself down if there is any task using its resources.
> 
> Benjamin Bannier wrote:
>     We don't seem to have a good story on incompatible resource changes ATM. If we e.g.,
update a RP config in an incompatible way, we make not attempt to detect or rectify this in
any way, either. At least in the agent we were in general we were operating under the assumption
that tasks would die on incompatible changes to their resources.
>     
>     I'd suggest to drop this issue for now and potentially file a JIRA to track draining
and situations where it might be needed.

Dropping this, see above comment.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68147/#review207559
-----------------------------------------------------------


On Oct. 25, 2018, 12:47 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68147/
> -----------------------------------------------------------
> 
> (Updated Oct. 25, 2018, 12:47 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gastón Kleiman, Greg Mann, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8403
>     https://issues.apache.org/jira/browse/MESOS-8403
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds support for triggering permanent removal of local
> resource providers. We also add authorization and tests as part of this
> patch.
> 
> 
> Diffs
> -----
> 
>   include/mesos/agent/agent.proto 74488e873cbf99ca487403b70691912cf3788288 
>   include/mesos/v1/agent/agent.proto 5d1ab6fd7f6f21593444467c66ed35ee90aea07f 
>   src/common/protobuf_utils.cpp 2ae37ac610894652a0214022a0cf04c4365396dd 
>   src/master/master.cpp f88c7c1f03f0de7236aad9e3bf4bfac82e91bc65 
>   src/slave/http.hpp 5b113fa2f7e3421d4219e0ece113010937b204c1 
>   src/slave/http.cpp a4db532cc6477c386fcd9bf563895214e95a475a 
>   src/slave/slave.hpp 0bd340176e2a8cefdfa7ef71e059441fb171aff6 
>   src/slave/slave.cpp 74f6fb9036a9ac4f587f53ec2df04eeb4c167bfb 
>   src/slave/validation.cpp df5e1373dbe497bc859455dcaf4e064e923bd72e 
>   src/tests/api_tests.cpp fdd9f871f75617fc26a28679e2a1e41f506c6133 
> 
> 
> Diff: https://reviews.apache.org/r/68147/diff/10/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message