mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Rukletsov <ruklet...@gmail.com>
Subject Re: Review Request 42222: Added a comment on allocator recovery.
Date Tue, 02 Feb 2016 14:22:43 GMT


> On Jan. 15, 2016, 6:50 a.m., Ben Mahler wrote:
> > Thanks Alex! I ended up going over the structure of the recover code and left some
higher level comments. There also appears to be a bug that will crash the master that I marked
as an issue :)
> > 
> > Have we convinced ourselves that we want to act on partial state when quota is *not*
set? It seems to complicate things here, and acting on partial state w/o quota has some issues
as well?
> 
> Alexander Rukletsov wrote:
>     Thanks a bunch for a thorough review, Ben! I have opened a separate ticket to track
the bug and other comments you had left: https://issues.apache.org/jira/browse/MESOS-4417.

> Have we convinced ourselves that we want to act on partial state when quota is not set?

I think it's fine, because the allocator does not provide any guarantees without quota, neither
it enforces the sharing scheme computed by the sorter. There can be other scenarios other
than recovery that may lead to biased cluster sharing. Do you have any specific use cases
in mind why shall we *always* recover?


> On Jan. 15, 2016, 6:50 a.m., Ben Mahler wrote:
> > src/master/allocator/mesos/hierarchical.cpp, line 149
> > <https://reviews.apache.org/r/42222/diff/1/?file=1195000#file1195000line149>
> >
> >     It now seems odd to start the allocation loop within initialize rather than
within recover, we'll do the following:
> >     
> >     
> >     intialize -> unpauses, starts allocation loop but the allocation loop should
do nothing
> >     recover -> pauses allocations, waits for condition to be met, resumes allocation.
> >     
> >     
> >     It seems a simpler way to think about this would be to just start allocating
after the recovery condition is met. This would consist of (1) started = true* and (2) the
initial delay to 'batch' that starts the delay loop.
> >     
> >     * Although keeping the pause concept may be useful if we want to support pausing/resuming
allocation from endpoints.

When we were thinking about how to implement it in the most consistent way, we have figured
out, that the widely used pattern is:
`initialize()` -> [optional] `recover()`. Both `initialize()` and `recover()` are called
from the client code. Hence we decided it would be great to ensure things work if `recover()`
has not been called. We can also always call `recover()` even though there is no quota to
recover.

Anyway I favour to keep the pause concept regardless which workflow we opt for.


> On Jan. 15, 2016, 6:50 a.m., Ben Mahler wrote:
> > src/master/allocator/mesos/hierarchical.cpp, line 168
> > <https://reviews.apache.org/r/42222/diff/1/?file=1195000#file1195000line168>
> >
> >     Should this comment be moved down?
> >     
> >     I'm also not quite sure what it's expressing, is it a re-hash of the lifecycle
documentation?
> >     
> >     ```
> >        * Because it is hard to define recovery for a running allocator, this
> >        * method should be called after `initialize()`, but before actual
> >        * allocation starts (i.e. `addSlave()` is called).
> >     ```
> >     
> >     For example:
> >     
> >     ```
> >     CHECK(initialized);
> >     
> >     // Recovery should be called after initialize and before other calls.
> >     CHECK_EQ(0u, slaves.size());
> >     CHECK_EQ(0, quotaRoleSorter->count());
> >     ```
> >     
> >     Could we extend our existing 'initialized' state boolean into an enum to express
UNINITIALIZED, INITIALIZED, RECOVERED? As far as I can tell, the existing methods with `CHECK(initialized)`
are also interested in checking that recover was called?

I think the question here is whether we want recovery be optional or not. If it is optional,
tracking state transitions may become tedious.


> On Jan. 15, 2016, 6:50 a.m., Ben Mahler wrote:
> > src/master/allocator/mesos/hierarchical.cpp, lines 171-172
> > <https://reviews.apache.org/r/42222/diff/1/?file=1195000#file1195000line171>
> >
> >     Not for this review, but looks like we should update `Sorter::count` to return
a size_t.
> >     
> >     The following CHECK implies that `_expectedAgentCount` should just be a size_t?
> >     
> >     ```
> >     CHECK(_expectedAgentCount >= 0);
> >     ```
> >     
> >     With a size_t we don't have to do this kind of check :)

This is a general question: shall we use `size_t` for non-negative integers or follow google
style guide (and protobufs!) and use int. I'm not sure we have an agreement on this one.


> On Jan. 15, 2016, 6:50 a.m., Ben Mahler wrote:
> > src/master/allocator/mesos/hierarchical.cpp, line 468
> > <https://reviews.apache.org/r/42222/diff/1/?file=1195000#file1195000line468>
> >
> >     Hm.. did you need the static cast here?

I believe so, one is `size_t` and the other one is `int`.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42222/#review114628
-----------------------------------------------------------


On Jan. 12, 2016, 11:06 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42222/
> -----------------------------------------------------------
> 
> (Updated Jan. 12, 2016, 11:06 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Joris Van Remoortere.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.cpp df8bccaf2b8cfc0cb5ca18d4867371ae7a84c12f

> 
> Diff: https://reviews.apache.org/r/42222/diff/
> 
> 
> Testing
> -------
> 
> None: not a functional change.
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message