mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 53202: Avoided CHECK failure with pre-1.0 agents.
Date Wed, 26 Oct 2016 22:32:40 GMT


> On Oct. 26, 2016, 2:43 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp, lines 6036-6037
> > <https://reviews.apache.org/r/53202/diff/1/?file=1546421#file1546421line6036>
> >
> >     Assuming frameworks are not partition-aware based on the agent verison doesn't
feel right.
> >     
> >     Ultimately it doesn't seem to make a difference in terms of messages Mesos sends:
if the framework is not connected, no update is sent in this method. Later when it reconnects
and reconciles, Master checks its capability and decides: `(5) Task is unknown, slave is unreachable:
TASK_UNREACHABLE` or `TASK_LOST` if the frameworks is not partition-aware.
> >     
> >     Would it make sense to set the state to `TASK_UNREACHABLE` in this case? Looks
like the only differences it makes are:
> >     
> >     - - metrics: Regardless of framework capabilities, the agent is indeed unreachable:
`TASK_UNREACHABLE` is more in line with the (1.1) master's logic and the metrics don't reflect
100% of what the master sends out anyways.
> >     - documentation: we set the state because it makes sense to the master and not
by guessing the framework's capabilities. also worth-mentioning is the fact that this doesn't
violate the API semantics: partition-awareness is checked at reconciliaton time.
> 
> Neil Conway wrote:
>     I don't think either option is ideal. I opted for `TASK_LOST` because:
>     
>     (a) sending `TASK_LOST` is still the default behavior; partition-awareness is an
"experimental" feature in 1.1, and in any case is guarded behind a capability.
>     (b) if you have (very) old agents, you're more likely to have old frameworks that
haven't enabled partition-awareness.

OK I am not adamant about `TASK_LOST` vs `TASK_UNREACHABLE` as long as it's not sent out but
w.r.t the rationale:

a) I don't feel this has to do with the feature being "experimental", imagine we are at 1.2
and we've promoted this feature as stable and we have marked the following

```
  // In Mesos 1.2, this will only be sent when the framework does NOT
  // opt-in to the PARTITION_AWARE capability.
  TASK_LOST = 5;
```
we still have this problem, would we solve it differently? If not, we might as well don't
take "experimental" into consideration here.

b) this is more about documentation but even if we use `TASK_LOST`, can we avoid saying we
are basing our decision on the likelyhood of old frameworks because of old agents? i.e., my
original comment: *we set the state because it makes sense to the master and not by guessing
the framework's capabilities. also worth-mentioning is the fact that this doesn't violate
the API semantics: partition-awareness is checked at reconciliaton time*.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53202/#review153939
-----------------------------------------------------------


On Oct. 26, 2016, 12:51 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53202/
> -----------------------------------------------------------
> 
> (Updated Oct. 26, 2016, 12:51 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6483
>     https://issues.apache.org/jira/browse/MESOS-6483
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> We don't guarantee compatibility with pre-1.0 agents. However, since it
> is easy to avoid a CHECK failure in the master when an old agent
> re-registers, it seems worth doing so.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 
> 
> Diff: https://reviews.apache.org/r/53202/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Disabled the code that fills-in `frameworks.recovered`; verified that `PartitionTest.DisconnectedFramework`
dies with a `CHECK` failure if this RR is not applied but passes this with RR applied.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message