mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 53202: Avoided CHECK failure with pre-1.0 agents.
Date Thu, 27 Oct 2016 18:25:25 GMT


> On Oct. 26, 2016, 2:43 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp, lines 6036-6037
> > <https://reviews.apache.org/r/53202/diff/1/?file=1546421#file1546421line6036>
> >
> >     Assuming frameworks are not partition-aware based on the agent verison doesn't
feel right.
> >     
> >     Ultimately it doesn't seem to make a difference in terms of messages Mesos sends:
if the framework is not connected, no update is sent in this method. Later when it reconnects
and reconciles, Master checks its capability and decides: `(5) Task is unknown, slave is unreachable:
TASK_UNREACHABLE` or `TASK_LOST` if the frameworks is not partition-aware.
> >     
> >     Would it make sense to set the state to `TASK_UNREACHABLE` in this case? Looks
like the only differences it makes are:
> >     
> >     - - metrics: Regardless of framework capabilities, the agent is indeed unreachable:
`TASK_UNREACHABLE` is more in line with the (1.1) master's logic and the metrics don't reflect
100% of what the master sends out anyways.
> >     - documentation: we set the state because it makes sense to the master and not
by guessing the framework's capabilities. also worth-mentioning is the fact that this doesn't
violate the API semantics: partition-awareness is checked at reconciliaton time.
> 
> Neil Conway wrote:
>     I don't think either option is ideal. I opted for `TASK_LOST` because:
>     
>     (a) sending `TASK_LOST` is still the default behavior; partition-awareness is an
"experimental" feature in 1.1, and in any case is guarded behind a capability.
>     (b) if you have (very) old agents, you're more likely to have old frameworks that
haven't enabled partition-awareness.
> 
> Jiang Yan Xu wrote:
>     OK I am not adamant about `TASK_LOST` vs `TASK_UNREACHABLE` as long as it's not sent
out but w.r.t the rationale:
>     
>     a) I don't feel this has to do with the feature being "experimental", imagine we
are at 1.2 and we've promoted this feature as stable and we have marked the following
>     
>     ```
>       // In Mesos 1.2, this will only be sent when the framework does NOT
>       // opt-in to the PARTITION_AWARE capability.
>       TASK_LOST = 5;
>     ```
>     we still have this problem, would we solve it differently? If not, we might as well
don't take "experimental" into consideration here.
>     
>     b) this is more about documentation but even if we use `TASK_LOST`, can we avoid
saying we are basing our decision on the likelyhood of old frameworks because of old agents?
i.e., my original comment: *we set the state because it makes sense to the master and not
by guessing the framework's capabilities. also worth-mentioning is the fact that this doesn't
violate the API semantics: partition-awareness is checked at reconciliaton time*.
> 
> Neil Conway wrote:
>     Re: (a), we can only guarantee that `TASK_LOST` will not be sent to partition-aware
frameworks if all the agents in the cluster have been upgraded to Mesos >= 1.2 -- otherwise,
an old agent might generate a `TASK_LOST` status update directly (e.g., due to an agent-local
task launch failure).
>     
>     Re: (b) sure, that was only a minor point -- but I think we should err on the side
of not breaking old frameworks if we can't know the right thing to do, at least for 1.1.

a) Ok I only just realized this now. Thanks for the clarification!
b) True, we should err on the side of not breaking old frameworks but here we don't break
them either way. Can we update the comment to not say "assuming the framework isn't partition-aware"
but rather something like "because TASK_LOST is still the default state (as of Mesos 1.1)
in this situation, even if the disconnected framework is partition-aware"?

I just think the decision here is really made but by assuming the framework is not partition-aware.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53202/#review153939
-----------------------------------------------------------


On Oct. 27, 2016, 10:16 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53202/
> -----------------------------------------------------------
> 
> (Updated Oct. 27, 2016, 10:16 a.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6483
>     https://issues.apache.org/jira/browse/MESOS-6483
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> We don't guarantee compatibility with pre-1.0 agents. However, since it
> is easy to avoid a CHECK failure in the master when an old agent
> re-registers, it seems worth doing so.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 
> 
> Diff: https://reviews.apache.org/r/53202/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Disabled the code that fills-in `frameworks.recovered`; verified that `PartitionTest.DisconnectedFramework`
dies with a `CHECK` failure if this RR is not applied but passes this with RR applied.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message