mesos-reviews mailing list archives

From Joseph Wu <jos...@mesosphere.io>
Subject Re: Review Request 69980: Modified when master responds to operation status updates.
Date Wed, 20 Feb 2019 00:48:46 GMT


> On Feb. 14, 2019, 12:31 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 8686-8708 (original), 8686-8712 (patched)
> > <https://reviews.apache.org/r/69980/diff/1/?file=2125184#file2125184line8686>
> >
> >     Consider the case of a terminal-but-unacknowledged operation which has been
> >     sent to the master by a reregistered agent and which has its ID set. Since we
> >     only place non-terminal operations in `orphanedOperations`, we will get
> >     `frameworkWillAcknowledge == true` here. If this framework never reregisters,
> >     then I think we could end up in a state where the agent retries terminal
> >     updates for that operation forever.
> >     
> >     For such updates, I think the master needs to either:
> >     1) have a way to determine that this is a terminal-but-unacknowledged orphaned
> >        operation (i.e. place it in `orphanedOperations`), or
> >     2) fall back to default behavior of acknowledging updates for operations that
> >        it doesn't recognize.
> >     
> >     WDYT?
> 
> Joseph Wu wrote:
>     This is a good point.  Orphan operations must include both terminal and
>     non-terminal operations.  With the chain as it is right now, the only way to
>     produce a terminal orphan operation is to add a non-terminal orphan and then
>     receive a terminal update for it.  This is a bit weird, since an
>     UpdateSlaveMessage can contain a terminal operation belonging to an unknown
>     framework.  As long as the `Master::updateSlave()` method marks such an
>     operation as an orphan, the master will be able to adopt it.  (This is
>     choice 1.)

Fixed with a change here: https://reviews.apache.org/r/69960/diff/3/
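
As a rough illustration of choice 1, here is a minimal sketch of marking every operation from an unknown framework as an orphan, terminal or not. The types and the `markOrphans` helper are hypothetical stand-ins, not the actual Mesos code or the change in r/69960; only the name `orphanedOperations` comes from the discussion above.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the Mesos types.
struct Operation {
  std::string uuid;
  std::string frameworkId;
  bool terminal;  // true if the latest status is terminal
};

struct AgentState {
  // UUIDs of operations whose framework is unknown to the master.
  std::set<std::string> orphanedOperations;
};

// Sketch of the idea behind `Master::updateSlave()`: an operation that
// belongs to an unknown framework is recorded as an orphan regardless of
// whether its latest status is terminal.
void markOrphans(
    AgentState& agent,
    const std::vector<Operation>& reported,
    const std::set<std::string>& knownFrameworks)
{
  for (const Operation& op : reported) {
    if (knownFrameworks.count(op.frameworkId) == 0) {
      // No check on `op.terminal`: terminal-but-unacknowledged
      // operations are tracked too.
      agent.orphanedOperations.insert(op.uuid);
    }
  }
}
```

With this shape, a terminal operation reported by a reregistering agent ends up in the orphan set immediately, so the master can later recognize and adopt it.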


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69980/#review212838
-----------------------------------------------------------


On Feb. 19, 2019, 4:47 p.m., Joseph Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69980/
> -----------------------------------------------------------
> 
> (Updated Feb. 19, 2019, 4:47 p.m.)
> 
> 
> Review request for mesos, Benno Evers, Gastón Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-9542
>     https://issues.apache.org/jira/browse/MESOS-9542
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When dealing with orphaned operation status updates, there are two
> cases the master must deal with:
> - The simple case is when the master knows the framework is completed.
>   These status updates can be acknowledged by the master.
> - However, a completed framework can be rotated out of the master's
>   memory.  In addition, after master failover, if an agent reregisters
>   before the framework, an operation can appear to be orphaned until
>   the framework reregisters.
> 
> This adds a fixed delay between agent reregistration and when the
> master acknowledges operation status updates from unknown frameworks.
> The delay should give frameworks ample time to reregister.
> 
> The delay is based on agent reregistration in order to mitigate the
> delay of acknowledging status updates of frameworks rotated out of
> the completed frameworks buffer.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp b0ab9187b8c672180e2ffb8b63cb7349dbe43ac4 
>   src/master/master.cpp 106d924bf16231b3bda3fb719db68c01d73644ee 
> 
> 
> Diff: https://reviews.apache.org/r/69980/diff/2/
> 
> 
> Testing
> -------
> 
> TODO: This case needs unit tests.
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>
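
The delay described in the quoted review request could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the enum, the function name, and the 10-minute value are all hypothetical (the constant actually added to `src/master/constants.hpp` is not shown in this message), and the sketch assumes that frameworks known to the master acknowledge their own updates.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical value, chosen only for illustration.
constexpr std::chrono::minutes ORPHAN_ACK_DELAY(10);

// Hypothetical classification of the framework that owns an operation.
enum class FrameworkState { KNOWN_ACTIVE, KNOWN_COMPLETED, UNKNOWN };

// Returns true if the master should acknowledge an operation status
// update itself rather than wait for the framework to do so.
bool masterShouldAcknowledge(
    FrameworkState framework,
    Clock::time_point agentReregistered,
    Clock::time_point now)
{
  switch (framework) {
    case FrameworkState::KNOWN_ACTIVE:
      // Assumption: a known, active framework acknowledges on its own.
      return false;
    case FrameworkState::KNOWN_COMPLETED:
      // The simple case: the framework is known to be completed.
      return true;
    case FrameworkState::UNKNOWN:
      // The framework may still reregister after a master failover, or
      // it may have been rotated out of the completed frameworks
      // buffer. Wait a fixed delay, measured from agent reregistration.
      return now - agentReregistered >= ORPHAN_ACK_DELAY;
  }
  return false;
}
```

Measuring from agent reregistration rather than from receipt of each update means that, once the delay has elapsed for an agent, updates for frameworks rotated out of the buffer are acknowledged without further waiting.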

