mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 53202: Avoided CHECK failure with pre-1.0 agents.
Date Thu, 27 Oct 2016 14:29:39 GMT


> On Oct. 26, 2016, 9:43 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp, lines 6036-6037
> > <https://reviews.apache.org/r/53202/diff/1/?file=1546421#file1546421line6036>
> >
> >     Assuming frameworks are not partition-aware based on the agent verison doesn't
feel right.
> >     
> >     Ultimately it doesn't seem to make a difference in terms of messages Mesos sends:
if the framework is not connected, no update is sent in this method. Later when it reconnects
and reconciles, Master checks its capability and decides: `(5) Task is unknown, slave is unreachable:
TASK_UNREACHABLE` or `TASK_LOST` if the frameworks is not partition-aware.
> >     
> >     Would it make sense to set the state to `TASK_UNREACHABLE` in this case? Looks
like the only differences it makes are:
> >     
> >     - - metrics: Regardless of framework capabilities, the agent is indeed unreachable:
`TASK_UNREACHABLE` is more in line with the (1.1) master's logic and the metrics don't reflect
100% of what the master sends out anyways.
> >     - documentation: we set the state because it makes sense to the master and not
by guessing the framework's capabilities. also worth-mentioning is the fact that this doesn't
violate the API semantics: partition-awareness is checked at reconciliaton time.
> 
> Neil Conway wrote:
>     I don't think either option is ideal. I opted for `TASK_LOST` because:
>     
>     (a) sending `TASK_LOST` is still the default behavior; partition-awareness is an
"experimental" feature in 1.1, and in any case is guarded behind a capability.
>     (b) if you have (very) old agents, you're more likely to have old frameworks that
haven't enabled partition-awareness.
> 
> Jiang Yan Xu wrote:
>     OK I am not adamant about `TASK_LOST` vs `TASK_UNREACHABLE` as long as it's not sent
out but w.r.t the rationale:
>     
>     a) I don't feel this has to do with the feature being "experimental", imagine we
are at 1.2 and we've promoted this feature as stable and we have marked the following
>     
>     ```
>       // In Mesos 1.2, this will only be sent when the framework does NOT
>       // opt-in to the PARTITION_AWARE capability.
>       TASK_LOST = 5;
>     ```
>     we still have this problem, would we solve it differently? If not, we might as well
don't take "experimental" into consideration here.
>     
>     b) this is more about documentation but even if we use `TASK_LOST`, can we avoid
saying we are basing our decision on the likelyhood of old frameworks because of old agents?
i.e., my original comment: *we set the state because it makes sense to the master and not
by guessing the framework's capabilities. also worth-mentioning is the fact that this doesn't
violate the API semantics: partition-awareness is checked at reconciliaton time*.

Re: (a), we can only guarantee that `TASK_LOST` will not be sent to partition-aware frameworks
if all the agents in the cluster have been upgraded to Mesos >= 1.2 -- otherwise, an old
agent might generate a `TASK_LOST` status update directly (e.g., due to an agent-local task
launch failure).

Re: (b) sure, that was only a minor point -- but I think we should err on the side of not
breaking old frameworks if we can't know the right thing to do, at least for 1.1.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53202/#review153939
-----------------------------------------------------------


On Oct. 26, 2016, 7:51 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53202/
> -----------------------------------------------------------
> 
> (Updated Oct. 26, 2016, 7:51 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6483
>     https://issues.apache.org/jira/browse/MESOS-6483
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> We don't guarantee compatibility with pre-1.0 agents. However, since it
> is easy to avoid a CHECK failure in the master when an old agent
> re-registers, it seems worth doing so.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 23ddb995b4ad0fcdb589974308a2e81ececdad31 
> 
> Diff: https://reviews.apache.org/r/53202/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Disabled the code that fills-in `frameworks.recovered`; verified that `PartitionTest.DisconnectedFramework`
dies with a `CHECK` failure if this RR is not applied but passes this with RR applied.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message