mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Bannier <benjamin.bann...@mesosphere.io>
Subject Re: Review Request 68147: Added agent support to remove local resource providers.
Date Wed, 16 Jan 2019 10:51:32 GMT


> On Aug. 18, 2018, 12:29 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Lines 8195 (patched)
> > <https://reviews.apache.org/r/68147/diff/6/?file=2074578#file2074578line8195>
> >
> >     Let's validate that there is no task using the resources provided by this RP
before doing the removal, or fail the call.
> 
> Benjamin Bannier wrote:
>     I believe this should be treated as orthogonal issue. It might e.g., happen that
the user triggering the RP removal cannot kill tasks or prevent them from being rescheduled.
I'd suggest to leaves as is.
>     
>     WDYT?
> 
> Chun-Hung Hsiao wrote:
>     The scenario I am worried about is that when the RP is marked as gone, the RP may
fail to tear itself down if there is any task using its resources.
> 
> Benjamin Bannier wrote:
>     We don't seem to have a good story on incompatible resource changes ATM. If we e.g.,
update a RP config in an incompatible way, we make not attempt to detect or rectify this in
any way, either. At least in the agent we were in general we were operating under the assumption
that tasks would die on incompatible changes to their resources.
>     
>     I'd suggest to drop this issue for now and potentially file a JIRA to track draining
and situations where it might be needed.
> 
> Benjamin Bannier wrote:
>     Dropping this, see above comment.
> 
> Chun-Hung Hsiao wrote:
>     Reopening this. Here's my suggestion:
>     
>     1. Create a ticket for a proper design to drain the resource provider.
>     2. For now, let's fail the call here if `slave->resourceProviders.at(resourceProviderId)->totalResources`
is not empty. In other words, the operator must first reconfigure the resource provider to
"drain" itself before making the `MARK_RESOURCE_PROVIDER_GONE` call.
> 
> Benjamin Bannier wrote:
>     Okay.
>     
>     1. I created https://issues.apache.org/jira/browse/MESOS-9522.
>     2. We had discussed this some time ago, and I think agreed that preventing operators
from marking resource providers as gone whenever a task was running on them was not good (especially
if operators cannot drain the RP). We have prior art in `MARK_AGENT_GONE` which asks operators
to drain agent before marking as gone. I would suggest for now with MESOS-9522 not implemented
we allow operators to mark resource providers as gone even if some of their resources are
in use; if needed they can drain them by filtering all tasks using their resources. This also
wouldn't expose a race.
>     
>     Does this address your concerns?
> 
> Chun-Hung Hsiao wrote:
>     I agreed that checking if there is any task running on resources of an RP when marking
it as gone would lead to some complexity, so I'm suggesting to use a stronger condition: there
must be no resource.
>     
>     I'm not convinced that asking the operator to drain the RP is a bad idea. My main
concern is that, since `MARK_RESOURCE_PROVIDER_GONE` cannot be undone, we should only perform
it when it is safe to do, so the operator can have a chance to fix the problem. Also, this
restriction would ensure that no framework will have an inconsistent view of cluster resources.
WDYT?

Sorry for not reading your comment -- only allowing marking RPs without resources as gone
sounds good to me. Updated this and the follow-up patch & marking as _Fixed_.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68147/#review207559
-----------------------------------------------------------


On Jan. 16, 2019, 11:51 a.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68147/
> -----------------------------------------------------------
> 
> (Updated Jan. 16, 2019, 11:51 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gastón Kleiman, Greg Mann, and Jan Schlicht.
> 
> 
> Bugs: MESOS-8403
>     https://issues.apache.org/jira/browse/MESOS-8403
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds support for triggering permanent removal of local
> resource providers. We also add authorization and tests as part of this
> patch.
> 
> 
> Diffs
> -----
> 
>   include/mesos/agent/agent.proto 74488e873cbf99ca487403b70691912cf3788288 
>   include/mesos/v1/agent/agent.proto 5d1ab6fd7f6f21593444467c66ed35ee90aea07f 
>   src/common/protobuf_utils.cpp a0159fed8d325808c5e8519da06173441debbbbb 
>   src/master/master.cpp 2339207149a85578ea47cf66f28392182f9075f2 
>   src/slave/http.hpp 5b113fa2f7e3421d4219e0ece113010937b204c1 
>   src/slave/http.cpp e0dc09103e9186aa3f7328d2052fc6747f5be9bb 
>   src/slave/slave.hpp 2eadf5fce9a314f1ec0ac5d51820c6381f5f1468 
>   src/slave/slave.cpp 10cbc190acc7eea5734efa0066541168545a66b6 
>   src/slave/validation.cpp df5e1373dbe497bc859455dcaf4e064e923bd72e 
>   src/tests/api_tests.cpp 79a2d2966ff6533629aeeefaaaf19e849b76e0e6 
> 
> 
> Diff: https://reviews.apache.org/r/68147/diff/14/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message