mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benno Evers <bev...@mesosphere.com>
Subject Re: Review Request 64930: Unified the marking agent unreachable logic in the master.
Date Wed, 10 Jan 2018 17:54:40 GMT


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > The structure of the new `markUnreachable()` seems to be `[recover slave info depending
on bool parameter] -> common code path to apply registry operation -> [notification/cleanup
depending on bool parameter]`. Intuitively I'd try to get rid of the bool parameter and only
put the shared part into `markUnreachable()`, but on the other hand the code might get even
more complicated when you attempt to do this, so it should be fine to leave the general structure
as is.
> 
> Benjamin Mahler wrote:
>     Thanks for the review!
>     
>     Yes, it's a little unfortunate, I tried to unify as much as possible (you'll notice
that it now consistently checks against agent states, for example). I initially considered
getting rid of the boolean by inferring the state from whether the agent is in `recovered`
or `registered`, however since it's asynchronously called it seems like there may be some
risky cases (e.g. agent moved from `recovered` to `registered` in the interim, so it should
be canceled, but we infer we're marking a registered agent as unreachable!).
>     
>     My thinking was that these can potentially go away longer term if we treat recovered
agents more consistently vs steady state agents by storing a `Slave*` for them (similar to
how we now store a `Framework*` for frameworks recovered from agent information).
>     
>     Let me know if you have any thoughts!

Sounds good to me.


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.hpp
> > Lines 512 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930013#file1930013line512>
> >
> >     Whats actually the reasoning behind crashing the master instead of returning
a failed future?
> 
> Benjamin Mahler wrote:
>     Currently any registry failure is fatal: if we cannot persist any information, we
can't make progress on agent lifecycle events, configuration updates, etc. Rather than trying
to ensure that each caller handles the future failure in a consistent way, I opted to just
consistently abort within this function.
>     
>     Thoughts?

It seems to me like now you have the opposite problem, i.e. ensuring that each caller actually
wants to consistently abort. But I was mostly just curious.


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8277 (original), 8218 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8307>
> >
> >     Do we need an `onDiscarded()` handler on an `undiscardable()` future?
> 
> Benjamin Mahler wrote:
>     The undiscardable wrapper just prevents the discard request from propagating inward
/ breaking any .then chains. It's still possible for there to be a discarded future returned,
so I felt it was safer to handle it.

Ok, then. The naming of `undiscardable()` seems a bit unfortunate in this case :D


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Line 8304 (original), 8235 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8336>
> >
> >     After a similar a change I got the following comment from @dzhuk, which I'll
relay here because I think it's justified:
> >     
> >     > I think `CHECK` should be used to validate invariants that we have from
some other context. but error here is something that can happen during normal operation. and
the original version would produce a meaningful message in log
> 
> Benjamin Mahler wrote:
>     Hm.. I think the current invariant is actually that these should always succeed given
the way we apply registry operations in the master, is that not the case? Do you have a concrete
suggestion for a follow up? Are you suggesting logging and continuing? Probably warrants a
more subtantial conversation outside of this review.

I agree that we should crash here, I was just referring to reporting the error via `LOG(FATAL)`
with a human-readable error message vs. `CHECK()`-ing here. If I understand the registrar
code correctly, the future could be failed for external reasons (e.g. disk full), which probably
shouldn't translate to assertion failures in the logs.


> On Jan. 8, 2018, 5:31 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Lines 8251 (patched)
> > <https://reviews.apache.org/r/64930/diff/1/?file=1930014#file1930014line8352>
> >
> >     Not sure what our coding style says about this, but we should think about preferring
`.find()` over `.contains()` to avoid the double lookup. (which gcc  cannot optimize away
even at `-O3`, according to some quick local testing)
> 
> Benjamin Mahler wrote:
>     We tend to avoid `.find` since it results in less readable code (` == _.end()` and
iterator indirection), but we have made the optimization in some performance critical code
paths. This doesn't seem to be one so I tried to make it more readable and consistent with
our usage of `contains` throughout the code base.
>     
>     There's also `Option<T> hashmap::get`, which if we update to return a reference
should be as efficient as `.find`? Anyway, we can discuss separately from this review.

>From a consistency point of view, I would actually argue that since we use `find()` in
some places anyways, and will not remove from these places for performance reasons, we might
as well use it everywhere. But I agree that this isn't in the scope of this review.


- Benno


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64930/#review194946
-----------------------------------------------------------


On Jan. 9, 2018, 8:06 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64930/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2018, 8:06 a.m.)
> 
> 
> Review request for mesos, Benno Evers and Vinod Kone.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The logic for marking an agent unreachable in the master had two
> very similar code paths that differed slightly across failover
> and steady state cases. This patch uses a single code path.
> 
> Unfortunately, some slight differences were necessary, and a
> failover boolean was introduced to accomodate them. Specifically,
> the failover and steady state cases expect the agent to reside
> in the recovered and registered lists, respectively.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 1610f4331225235e85f5a83e5338870cef99a5c8 
>   src/tests/slave_tests.cpp 159192053087e971746943a1bc17a76143a2d9af 
> 
> 
> Diff: https://reviews.apache.org/r/64930/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message