mesos-reviews mailing list archives

From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 54495: Ensured master always relinks during scheduler re-registration.
Date Wed, 07 Dec 2016 22:29:21 GMT


> On Dec. 7, 2016, 9:53 p.m., Joseph Wu wrote:
> > src/master/master.cpp, lines 2841-2843
> > <https://reviews.apache.org/r/54495/diff/1/?file=1579042#file1579042line2841>
> >
> >     Do you want to force a relink too?
> >     
> >     i.e. give this as the second argument: `process::RemoteConnection::RECONNECT`

Per discussion on Slack with Joseph, it seems we don't need to force a reconnect here, because
the master will promptly send a (re-)registered message to the framework. If the socket is
half-open, that send should eventually fail with a socket error. The failure will trigger
another `exited` event, at which point we'll correctly mark the framework as disconnected
again and send it another `FrameworkErrorMessage`.
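
As a rough illustration of that reasoning, here is a toy model. It is not Mesos or libprocess code: `Socket`, `Framework`, `Master`, and all of their members are hypothetical names made up for this sketch, and the send failure is modeled synchronously even though in practice it would only surface some time later.

```cpp
#include <string>
#include <vector>

// Toy model only; none of these types are the real Mesos/libprocess API.
struct Socket {
  bool halfOpen = false;  // Peer is gone, but this side has not noticed yet.
};

struct Framework {
  bool connected = false;
  Socket socket;
};

struct Master {
  Framework framework;
  std::vector<std::string> log;  // Records what happened, in order.

  // Models a non-forced relink: the existing socket is reused as-is.
  void link() { log.push_back("link"); }

  // Models a send. On a half-open socket the send fails, and the
  // failure surfaces as an `exited` event.
  void send(const std::string& message) {
    if (framework.socket.halfOpen) {
      log.push_back("send failed: " + message);
      exited();
      return;
    }
    log.push_back("sent: " + message);
  }

  // Models the handling of an `exited` event: mark the framework
  // disconnected and make a best-effort attempt to notify it.
  void exited() {
    framework.connected = false;
    log.push_back("exited");
    log.push_back("FrameworkErrorMessage (best effort)");
  }

  // Models re-registration without the "force" flag.
  void reregister() {
    framework.connected = true;
    link();
    send("reregistered");
  }
};
```

In this model, calling `reregister()` with a healthy socket leaves the framework connected, while calling it with `halfOpen = true` ends with the framework marked disconnected again, which is the recovery path described above.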


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/#review158408
-----------------------------------------------------------


On Dec. 7, 2016, 8:04 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54495/
> -----------------------------------------------------------
> 
> (Updated Dec. 7, 2016, 8:04 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6676
>     https://issues.apache.org/jira/browse/MESOS-6676
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> In the following scenario:
>   * the master sees a re-registration attempt from a PID-based scheduler,
>   * the scheduler was previously registered with the master, and
>   * the "force" flag is not set,
> 
> the master neglected to re-link with the scheduler. For example, this
> might happen if:
> 
>   * The master sees an ExitedEvent for the framework and marks it
>     disconnected.
>   * The master sends a FrameworkErrorMessage to the framework but this
>     message is dropped, e.g., due to a transient network failure.
>   * The scheduler attempts to re-register with the master, e.g., because
>     it detects (spuriously) that the current leading master has changed.
> 
> This is problematic, because it might leave the master -> scheduler
> connection using an ephemeral socket.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de 
> 
> Diff: https://reviews.apache.org/r/54495/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Note that it would be _great_ to write a unit test for this situation (as well as a class
> of related failure conditions), but the current testing infrastructure doesn't make that easy.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

