mesos-reviews mailing list archives

From "Joris Van Remoortere" <joris.van.remoort...@gmail.com>
Subject Re: Review Request 40351: Quota: Added rescinding offers for set quota requests.
Date Wed, 25 Nov 2015 17:42:00 GMT


> On Nov. 18, 2015, 7:51 a.m., Qian Zhang wrote:
> > src/master/quota_handler.cpp, line 180
> > <https://reviews.apache.org/r/40351/diff/3/?file=1128793#file1128793line180>
> >
> >     Why do we want to rescind the offers that do not contribute to satisfying the quota
request?
> 
> Alexander Rukletsov wrote:
>     Because we may rescind more than necessary to satisfy quota request (remember minimal
agent count). If we have a check in place, this will effectively prevent us from doing so.
Does it make sense to you?
> 
> Qian Zhang wrote:
>     Suppose the quota request is for 20GB of disk for a role, and there is an offer
which only includes 2 CPUs & 2GB memory and has no disk resources at all. Will we rescind
this offer too? This seems a little unfair to me.
>     And can you please clarify a little more about why we want to rescind offers from
at least `numF` agents? Is the reason that we want to ensure each framework in that role will
have a chance to get an offer in the next allocation cycle?
> 
> Alexander Rukletsov wrote:
>     That's correct, we will rescind that offer and yes, it's a bit unfair. Let me explain
why I decided to remove this check. Suppose a quota request is for 6 CPUs for a role with
3 frameworks. The first offer we rescind is 10 CPUs, 10GB MEM. Technically, we have enough
resources to satisfy quota, but we would like to rescind offers from at least 2 more agents.
Having a check in place will prevent us from doing so. Do you think greedy rescinding can
be a problem?
>     
>     Yes, we would like to facilitate allocation for each framework in the role, for which
quota is set.
> 
> Qian Zhang wrote:
>     What is most unclear to me is why we need to rescind offers from at least numF agents,
i.e., in your example above, why do we want to rescind offers from at least 2 more agents
after quota has been satisfied? Can you please let me know the motivation behind it? I think
quota is a kind of global concept which should not have a direct relation to agents and frameworks;
it should stay at the role level. So I am not sure why we want to facilitate allocation for each
framework in the role. Is that something that we mentioned in the design doc? Maybe I forgot ...
:-)
> 
> Alexander Rukletsov wrote:
>     Nope, it wasn't in the design doc, that's something we decided recently. The main
motivation is to improve user experience and simplify debugging. Because the built-in allocator
is used in 99% of clusters, it makes sense to exploit some knowledge about how it works. Because
of coarse-grained allocations, to facilitate fairness we may want to rescind from more agents
than necessary to satisfy quota numbers.

`why do we want to rescind offers from at least 2 more agents after quota has been satisfied?`
Just to be clear: it's not numF or more agents *on top of* quota. It's at least numF agents
in case the quota itself doesn't already rescind offers from that many.
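
To make the stopping condition concrete, here is a minimal sketch. It is not the code in
quota_handler.cpp: the `Offer` struct and `selectOffersToRescind` are hypothetical names, and
it assumes quota is expressed in CPUs only, whereas real offers carry full resource sets.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified offer (assumption: CPUs are the only resource).
struct Offer {
  std::string agentId;
  double cpus;
};

// Walks the outstanding offers in order and keeps rescinding until BOTH
// conditions hold: the rescinded CPUs cover the quota, and offers from at
// least `numF` distinct agents have been rescinded.
// Note there is no check that an individual offer contributes to the quota.
std::vector<Offer> selectOffersToRescind(
    const std::vector<Offer>& outstanding,
    double quotaCpus,
    std::size_t numF)
{
  std::vector<Offer> rescinded;
  std::set<std::string> agents;
  double cpus = 0.0;

  for (const Offer& offer : outstanding) {
    if (cpus >= quotaCpus && agents.size() >= numF) {
      break;  // Quota covered and enough agents touched: stop here.
    }
    rescinded.push_back(offer);
    agents.insert(offer.agentId);
    cpus += offer.cpus;
  }

  return rescinded;
}
```

Under these assumptions, with a 6 CPU quota and `numF` = 3, a first offer of 10 CPUs already
covers the quota, so the loop continues only until offers from three distinct agents have been
rescinded. If covering the quota already takes offers from three or more agents, the `numF`
condition adds nothing.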

I'm not sure this is really "unfair", as these are *offers*, and not *allocations*. We are
not pre-empting tasks. If the resources in the rescinded offers are not needed for
quota, then they will be re-offered using the same fair-sharing logic as before.
In fact, this is *more* fair, as we might end up making better offers due to information that
has changed in the cluster.

The argument for the `numF` condition that Alex is making is one I pushed for. We often end
up debugging clusters around new features, even not so new features. Although the `numF` condition
by no means guarantees that every framework in the role will receive an offer, it does increase
the chances greatly. The fact that they will receive any offer at all means we will see messages
flowing to the framework, and hopefully log lines at the framework after receiving the offer.
If the offer is still too small to launch a task, at least we will see a message at the framework
level to that effect. **What we are optimizing for** is the ability to eliminate quickly (in
most cases) the possibility that there is a bug in quota because the framework didn't receive
any offers.

Please let me know if this is not clear, as I believe it is very important. The more of us
understand why this extra condition is here, the fewer framework writers and cluster operators
will be coming on IRC / dev list with debug logs that don't allow us to easily eliminate quota
as the source of the problem.


- Joris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40351/#review106977
-----------------------------------------------------------


On Nov. 24, 2015, 4:29 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40351/
> -----------------------------------------------------------
> 
> (Updated Nov. 24, 2015, 4:29 p.m.)
> 
> 
> Review request for mesos, Bernd Mathiske, Joerg Schad, Joris Van Remoortere, Joseph Wu,
and Qian Zhang.
> 
> 
> Bugs: MESOS-3912
>     https://issues.apache.org/jira/browse/MESOS-3912
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp e5e0ed01a56d869cc535687c8dbb6b99f6295b66 
>   src/master/quota_handler.cpp b8e501be43de6bc02aebfa5bd415b4212a96da31 
> 
> Diff: https://reviews.apache.org/r/40351/diff/
> 
> 
> Testing
> -------
> 
> make check (Mac OS X 10.10.4)
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>

