mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 64930: Unified the marking agent unreachable logic in the master.
Date Tue, 09 Jan 2018 08:06:07 GMT


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > The structure of the new `markUnreachable()` seems to be `[recover slave info depending
on bool parameter] -> common code path to apply registry operation -> [notification/cleanup
depending on bool parameter]`. Intuitively I'd try to get rid of the bool parameter and only
put the shared part into `markUnreachable()`, but on the other hand the code might get even
more complicated when you attempt to do this, so it should be fine to leave the general structure
as is.

Thanks for the review!

Yes, it's a little unfortunate, I tried to unify as much as possible (you'll notice that it
now consistently checks against agent states, for example). I initially considered getting
rid of the boolean by inferring the state from whether the agent is in `recovered` or `registered`,
however since it's asynchronously called it seems like there may be some risky cases (e.g.
agent moved from `recovered` to `registered` in the interim, so it should be canceled, but
we infer we're marking a registered agent as unreachable!).

My thinking was that these can potentially go away longer term if we treat recovered agents
more consistently vs steady state agents by storing a `Slave*` for them (similar to how we
now store a `Framework*` for frameworks recovered from agent information).

Let me know if you have any thoughts!


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.hpp
> > Lines 512 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930013#file1930013line512>
> >
> >     Whats actually the reasoning behind crashing the master instead of returning
a failed future?

Currently any registry failure is fatal: if we cannot persist any information, we can't make
progress on agent lifecycle events, configuration updates, etc. Rather than trying to ensure
that each caller handles the future failure in a consistent way, I opted to just consistently
abort within this function.

Thoughts?


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 2024 (original), 2025 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line2026>
> >
> >     Double space after 'within'.

Thanks for catching these!


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8277 (original), 8218 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8307>
> >
> >     Do we need an `onDiscarded()` handler on an `undiscardable()` future?

The undiscardable wrapper just prevents the discard request from propagating inward / breaking
any .then chains. It's still possible for there to be a discarded future returned, so I felt
it was safer to handle it.


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8304 (original), 8235 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8336>
> >
> >     After a similar a change I got the following comment from @dzhuk, which I'll
relay here because I think it's justified:
> >     
> >     > I think `CHECK` should be used to validate invariants that we have from
some other context. but error here is something that can happen during normal operation. and
the original version would produce a meaningful message in log

Hm.. I think the current invariant is actually that these should always succeed given the
way we apply registry operations in the master, is that not the case? Do you have a concrete
suggestion for a follow up? Are you suggesting logging and continuing? Probably warrants a
more subtantial conversation outside of this review.


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Lines 8251 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8352>
> >
> >     Not sure what our coding style says about this, but we should think about preferring
`.find()` over `.contains()` to avoid the double lookup. (which gcc  cannot optimize away
even at `-O3`, according to some quick local testing)

We tend to avoid `.find` since it results in less readable code (` == _.end()` and iterator
indirection), but we have made the optimization in some performance critical code paths. This
doesn't seem to be one so I tried to make it more readable and consistent with our usage of
`contains` throughout the code base.

There's also `Option<T> hashmap::get`, which if we update to return a reference should
be as efficient as `.find`? Anyway, we can discuss separately from this review.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64930/#review194946
-----------------------------------------------------------


On Jan. 3, 2018, 10:48 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64930/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2018, 10:48 p.m.)
> 
> 
> Review request for mesos, Benno Evers and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The logic for marking an agent unreachable in the master had two
> very similar code paths that differed slightly across failover
> and steady state cases. This patch uses a single code path.
> 
> Unfortunately, some slight differences were necessary, and a
> failover boolean was introduced to accomodate them. Specifically,
> the failover and steady state cases expect the agent to reside
> in the recovered and registered lists, respectively.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 03eb178fa1af7d55ae387e6cb42cdc8d721a2196 
>   src/tests/slave_tests.cpp fb49077d7cb81db450d9fa24f75dbd9c79ef645c 
> 
> 
> Diff: https://reviews.apache.org/r/64930/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message