mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.
Date Wed, 25 Apr 2018 17:19:41 GMT


> On April 23, 2018, 10:43 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Line 4512 (original), 4515-4527 (patched)
> > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515>
> >
> >     I was just wondering - perhaps this code belongs in the validation function
for the ACCEPT call? Since this is something that we can easily confirm synchronously in the
top-level scheduler API handler and return 400, rather than calling into the accept handlers.
> 
> Zhitao Li wrote:
>     I'm fine in either case as long as it provides better consistency and less code duplication
(which I think is a problem). I guess the reason there is a split is due to that AuthZ is
async right now.
>     
>     I actualy wonder what happens if we move the slave connectivity checks to `Master::accept()`:
since master to agent connection through network can be broken at any moment, I do not see
checking that in `_accept()` is 100% safe anyway, and `send()` calls to agent need to handle
failures anyway.
>     
>     Alternatively, we can consider moving authorization checks as early as we can, and
condense all other validations after AuthZ returned (assuming AuthZ do not depends on these
validations).
> 
> Zhitao Li wrote:
>     (replied too fast) I feel both work can reduce the code duplicate, and makes it less
ambiguous for future development.
> 
> Chun-Hung Hsiao wrote:
>     IMO in general it's better to have validations before authN/authZ, because validations
is usually more light-weight than authN/authZ. Also, validations should be synchronous, meaning
that if the call is invalid, we will get rid of it faster from the memory (rather than waiting
for the authN/authZ to finish).
> 
> Chun-Hung Hsiao wrote:
>     But in this case since this is temporary, I'm fine with putting the logic here.
> 
> Chun-Hung Hsiao wrote:
>     Question: Are these operations usable with this restriction? It seems to me that
with this restriction, the framework will need to maintain one more extra state: 1. reserve
enough CPUs, mem and free disk (otherwise it is possible that the framework cannot get enough
CPUs and mem after growing the volume); 2. grow the persistent volume; 3. use the grown volume.
If we implement a proper validation right now, 1 and 2 can be combined, and there will be
no need to rewrite the framework in the future. Can you explain again what the concern of
implementing the correct validation logic again?
> 
> Zhitao Li wrote:
>     Re: order of valiation vs AuthZ: I agree they should be consistency, but we already
have splits between `accept()` and `_accept()` and framework could see operations dropped
either before or after AuthZ too. Given that 1) AuthZ will be async by nature and 2) certain
states could change between `accept()` and `_accept()` (offer validaty, agent connectivity),
I feel that only check AuthZ on `accept()` but delay any actual validation to `_accept()`
results in most consistent behavior. Indeed this could increase calls to authorizor but I
don't know whether that can be quantified.
>     
>     Re: usability of framework: I generally believe that framework always needs to model
each state to handle corner cases even if it uses chained operations (because that can have
partial success). That's also why I hope to eventually not support chained operations at all.
In the final form of the code, whatever "correct validation logic" we write now will not be
necessary, and we can implicitly rely on `offeredResources` containment checks chained operations
(because `converted` resources will not be added back immediately). Implementing this right
now would require us to add extra variables like `observedOfferedResources` and `actualOfferedResources`
and check them on each `case` of operations, which is not be as clean or as local as current
patch (which also easier to back out).
> 
> Chun-Hung Hsiao wrote:
>     I'm not sure if I get it. Instead of applying the conversion on `_offeredResources`,
we could just do
>     ```
>     _offeredResources -= consumed;
>     ```
>     Wouldn't this solve the issue? Am I missing anything?
> 
> Zhitao Li wrote:
>     This code block (https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502)
requires `_offeredResources` properly tracks what allocator needs to recover. I believe we
still need to handle `converted` part of `grow` and `shrink`, otherwise allocator could see
inconsitent views.
> 
> Chun-Hung Hsiao wrote:
>     Thanks for point it out Zhitao.
>     
>     Just briefly chatted with Greg, and it seems that I misunderstood what he suggested
:(
>     He suggested to check the size of operations here: https://github.com/apache/mesos/blob/master/src/master/validation.cpp#L547
>     which will lead to the entire call being dropped: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2350
>     And we don't need to send status updates for each LAUNCH and LAUNCH_GROUP. It seems
cleaner to me.
>     
>     re: usability of framework. If your framework team has no problem using the call
this way, or doesn't mind updating their frameworks, then  it's probably fine to have this
restriction ;)
> 
> Zhitao Li wrote:
>     I like the suggestion to perform the validation there, although I want to know how
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2350 handles recovery of
resources from offers if it just drops the operation w/o going through offered resources in
each offers (or maybe it does not ?)
> 
> Chun-Hung Hsiao wrote:
>     Ah, good point. I don't think it will recover the offer. The framework is still holding
the offer in this case. Do you think this is a reasonable behavior?

And sorry I made a mistake on this. For V1 it's not that master.cpp#L2350 will drop the call;
it's here that the master will respond with a 400 Bad Request: https://github.com/apache/mesos/blob/master/src/master/http.cpp#L956


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66568/#review201785
-----------------------------------------------------------


On April 23, 2018, 8:07 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66568/
> -----------------------------------------------------------
> 
> (Updated April 23, 2018, 8:07 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-4965
>     https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> These two operations are intended to be non-speculative eventually but
> are implemented as speculative right now. To avoid frameworks opt-in to
> dangerous behavior, we require that accept can only contain one
> `GROW_VOLUME` or `SHRINK_VOLUME` and no other operations.
> 
> This is implemented by reuse existing code which already drops `LAUNCH`
> or `LAUNCH_GROUP` with proper reason and message.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
> 
> 
> Diff: https://reviews.apache.org/r/66568/diff/6/
> 
> 
> Testing
> -------
> 
> Behavior tested in https://reviews.apache.org/r/66569/.
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message