mesos-reviews mailing list archives

From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 70132: Do not implicitly decline speculatively converted resources.
Date Thu, 25 Apr 2019 22:26:52 GMT


> On April 23, 2019, 10:47 a.m., Benjamin Bannier wrote:
> > docs/scheduler-http-api.md
> > Line 132 (original), 132 (patched)
> > <https://reviews.apache.org/r/70132/diff/5/?file=2140649#file2140649line132>
> >
> >     What do you think of getting rid of "implicitly declined" behavior for "cancelling
> >     operations"?
> >
> >     It seems that behavior is driven more by the implementation than by intuitive API
> >     behavior; e.g., it forces frameworks to reason differently about operations executed
> >     in isolation vs. executed together. Having identical behavior for both cases would be
> >     easier both to explain and to program against. The behavior that makes the most sense
> >     to me would be to only ever implicitly decline "untouched resources"; e.g., if
> >     accepting offered `cpus:4` with `RESERVE(cpus:2, role) && UNRESERVE(cpus:2, role)`, we
> >     would implicitly decline only `cpus:2`.
> 
> Chun-Hung Hsiao wrote:
>     It seems to me that "cancelling operations" both 1. are very rare and 2. make little
>     sense for frameworks, so I would rather deliver a fix for the common cases without
>     making the already-messy code path more complicated. WDYT? Also @bmahler, what's your
>     opinion on @bbannier's suggestion? IIRC you mentioned before that some of these are
>     designed behaviors, but I don't know the context.
> 
> Benjamin Mahler wrote:
>     Thanks for bringing this up, it's certainly a bit of a bizarre use case. I think
>     the more common case is UNRESERVE on its own, where it still seems a bit bizarre that
>     the "untouched" resources are declined with the filter and the UNRESERVE resources are
>     not filtered. That seems a bit arbitrary to me, but I'm not sure what to do about it
>     without allowing the framework to be explicit about which part it wants to "decline
>     and filter" when accepting, and this requires an interface change.
>
>     Personally I would consider RESERVE+UNRESERVE to be "touching" those resources, but
>     I don't think we should worry about it in this patch (I assume that wasn't your intent
>     anyway, and you mainly wanted to raise this topic for discussion?)
> 
> Benjamin Bannier wrote:
>     What I worry about most is that this edge case makes explaining suggested framework
>     behavior harder ("should any of the offer operations in a single accept call cancel
>     each other out you will not get offered the resources again until the default offer
>     filter timeout expires (the timeout isn't up to you here)" -> framework defensively
>     revives after each accept call if it has more work to do). Instead we would like
>     frameworks to focus on getting their offer handling and decline behavior correct and
>     only ever revive in exceptional scenarios (e.g., even "_new_ work arrived").
>
>     Since this patch tries to fix incorrect master behavior, we should make sure to get
>     the behavior somewhat right or else risk that frameworks implement suboptimal behavior
>     which will be hard to unlearn. That being said, the fact that no framework author
>     complained when this bug was introduced makes me worry that they either do not care
>     about how fast offers arrive or already implement an overly pessimistic approach
>     (e.g., revive whenever there is more work to do in their state machine).

The timeout is still controlled by the scheduler even in the case where operations cancel
each other out.

We're trying to move to a world where frameworks are more explicit about what filters they
want to set up. With this in mind, it seems to me that neither the current fix nor what you
suggested fits this goal well, because in both cases we're *guessing* the frameworks' intentions.
Say a framework reserves two CPUs, unreserves those two CPUs, and then reserves two CPUs
again: there is no way to infer whether the framework is trying to use just two CPUs or
four. This could become really messy in terms of both semantics and implementation.

Since, as you said, we're not aware of any complaints about this, I'd say let's keep the logic
simple and determine the declined resources based on the end result of an `ACCEPT` call. Dropping
this for now.
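
To make the end-result rule concrete, here is a minimal sketch in Python. It is not the master's code: the pool model (a mapping from `(name, role)` to a quantity, with `"*"` meaning unreserved), the tuple encoding of operations, and the helper names `apply_ops` and `declined_by_end_result` are all assumptions made for illustration. It only shows that the declined set can be computed from the offered resources and the end result, without guessing which CPUs an intermediate operation "touched".

```python
from collections import Counter

# Hypothetical model: a pool maps (name, role) -> quantity; role "*" is unreserved.
# This is an illustration only, not the Mesos `Resources` class.


def apply_ops(offered, ops):
    """Apply RESERVE/UNRESERVE operations in order and return the end result."""
    pool = Counter(offered)
    for kind, name, amount, role in ops:
        if kind == "RESERVE":
            src, dst = (name, "*"), (name, role)
        elif kind == "UNRESERVE":
            src, dst = (name, role), (name, "*")
        else:
            raise ValueError(f"unsupported operation: {kind}")
        if pool[src] < amount:
            raise ValueError(f"{kind} needs {amount} of {src}, have {pool[src]}")
        pool[src] -= amount
        pool[dst] += amount
    return +pool  # unary plus drops entries whose count is zero


def declined_by_end_result(offered, ops):
    """Decline whatever remains in its originally offered form after all operations."""
    return Counter(offered) & apply_ops(offered, ops)


offered = {("cpus", "*"): 4}
reserve = ("RESERVE", "cpus", 2, "role")
unreserve = ("UNRESERVE", "cpus", 2, "role")

print(dict(declined_by_end_result(offered, [reserve])))
print(dict(declined_by_end_result(offered, [reserve, unreserve])))
print(dict(declined_by_end_result(offered, [reserve, unreserve, reserve])))
```

Under this model, a lone RESERVE leaves the reserved `cpus:2` out of the declined set, RESERVE followed by UNRESERVE declines all of `cpus:4`, and the reserve/unreserve/reserve sequence gives the same answer as a single RESERVE, so no inference about two versus four CPUs is needed.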


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70132/#review214812
-----------------------------------------------------------


On April 25, 2019, 10:14 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70132/
> -----------------------------------------------------------
> 
> (Updated April 25, 2019, 10:14 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, and Meng Zhu.
> 
> 
> Bugs: MESOS-9616
>     https://issues.apache.org/jira/browse/MESOS-9616
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a framework accepts an offer to perform pipelined
> operations, e.g., reserving resources, without a final consumer, the
> converted resources will be implicitly declined. This is undesired
> behavior, as the framework might want to reserve a resource first but
> launch a task later in the next allocation cycle. This patch fixes this
> behavior.
> 
> However, if the framework accepts an offer with multiple operations that
> cancel each other out, the resources consumed by these operations are
> still considered unused and will be declined.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp ad54ae217863a08f4e6d743b39c176b171353084 
>   src/tests/slave_tests.cpp b1c3a01031b917fb9773c8c890a8f88838870559 
> 
> 
> Diff: https://reviews.apache.org/r/70132/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> More testing done in r/70537.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

