mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 68147: Added agent support to remove local resource providers.
Date Mon, 20 Aug 2018 18:48:04 GMT


> On Aug. 17, 2018, 10:29 p.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 7934 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line7934>
> >
> >     `OPERATION_GONE_BY_OPERATOR` is not a terminal state:
> >     https://github.com/apache/mesos/blob/808485da01387bd27a51cb82a90b1f8301d613ee/include/mesos/mesos.proto#L2342-L2349
> 
> Benjamin Bannier wrote:
>     From https://reviews.apache.org/r/68410/#comment_rc290945-207560,
>     
>     > ... It seems that OPERATOR_GONE_BY_OPERATOR does not carry the semantics we
want. Looking at the other possible error codes, it seems we could either use OPERATION_ERROR
or OPERATION_FAILED instead where OPERATION_ERROR possibly better fits our requirements as
the operation's side effects might have materialized. That would mean we should drop this
patch completely.  
>     > WDYT?

I'm not sure if either is a good choice:

`OPERATION_ERROR` usually means some validation failure before applying the operation, the
consumed resource remains (i.e., can be used by other operations), but the operation is non-retryable
(i.e., the caller must fix the error).
`OPERATION_FAILED` usually means something goes wrong after applying the operation, the consumed
resource remains, and the operation is retryable.

In the case of mark RP gone, although we can argue that any operation becomes an non-retryable
one and we don't care about the consumed resoure because it will then be removed. There still
are some issues:

1. The reason of the operation being non-retryable is not due to any client-side error, but
a server-side "error," and this seems inconsistent to the current `OPERATION_ERROR` usage.
2. We transition an operation to `OPERATION_GONE_BY_OPERATOR` when marking an agent as gone,
but to `OPERATION_ERROR` when marking a local RP as gone. This sounds inconsistent to me from
an API perspective.
3. For `OPERATION_ERROR` and `OPERATION_FAILED`, Mesos provides reliable retry (at least under
"normal" circumstances), and expects acks. For `OPERATION_GONE_BY_OPERATOR`, we could just
do a best-effort notification (the implementation is missing though) and don't need any ack,
similar to `TASK_GONE_BY_OPERATOR`. In the case of mark RP as gone, we probably want to go
with the later.

So it seems better if we still go with `OPERATION_GONE_BY_OPERATOR` and make the proper master
changes. And we need to make sure that we call `allocator->updateSlave` before calling
`allocator->recoverResources` to avoid the consumed resource being offered to other framework
in between.


> On Aug. 17, 2018, 10:29 p.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 8195 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line8195>
> >
> >     Let's validate that there is no task using the resources provided by this RP
before doing the removal, or fail the call.
> 
> Benjamin Bannier wrote:
>     I believe this should be treated as orthogonal issue. It might e.g., happen that
the user triggering the RP removal cannot kill tasks or prevent them from being rescheduled.
I'd suggest to leaves as is.
>     
>     WDYT?

The scenario I am worried about is that when the RP is marked as gone, the RP may fail to
tear itself down if there is any task using its resources.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68147/#review207559
-----------------------------------------------------------


On Aug. 20, 2018, 6:47 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68147/
> -----------------------------------------------------------
> 
> (Updated Aug. 20, 2018, 6:47 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gastón Kleiman, Greg Mann, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8403
>     https://issues.apache.org/jira/browse/MESOS-8403
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds support for triggering permanent removal of local
> resource providers. We also add authorization and tests as part of this
> patch.
> 
> 
> Diffs
> -----
> 
>   include/mesos/agent/agent.proto 74488e873cbf99ca487403b70691912cf3788288 
>   include/mesos/v1/agent/agent.proto 5d1ab6fd7f6f21593444467c66ed35ee90aea07f 
>   src/slave/http.hpp 78200874cb03a3d5af4ee0443c03a1d94cd494cc 
>   src/slave/http.cpp 1b6d266a1a7821c9de6871cbca43317b3c392a32 
>   src/slave/slave.hpp 0420109ac93e1249906c52437e5859c5ee033fb6 
>   src/slave/slave.cpp 679394a549cdfe84d64a102164c8652ad96f1eb2 
>   src/slave/validation.cpp df5e1373dbe497bc859455dcaf4e064e923bd72e 
>   src/tests/api_tests.cpp 9c9fa9105a27ac7bb7f30d7eb67512d1e1d63d15 
> 
> 
> Diff: https://reviews.apache.org/r/68147/diff/7/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message