mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Mann <g...@mesosphere.io>
Subject Re: Review Request 71285: Fixed recovery of agent resources and operations after crash.
Date Thu, 22 Aug 2019 20:41:44 GMT


> On Aug. 22, 2019, 2:27 a.m., James Peach wrote:
> > src/slave/slave.cpp
> > Line 7888 (original)
> > <https://reviews.apache.org/r/71285/diff/2/?file=2161848#file2161848line7920>
> >
> >     Can you explain this more?
> >     
> >     In `Slave::applyOperation()`, there is a call to `checkpointResourceState()`
before the operation is applied. If the agent crashed after writing the operation checkpoint,
wouldn't we need to apply it here? It's not clear to me how we can tell whether it needs to
be applied or not ...

Thank you for this - I had eliminated this line after encountering an issue with doubly-applied-operations
during my testing, but the assumptions I was making were incorrect. It turns out I had forgotten
that now that we're doing the two-phase-commit of checkpointed operations as well, we need
to recover operations from the target file when it's present. I've updated the review accordingly.

thanks James!


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71285/#review217373
-----------------------------------------------------------


On Aug. 22, 2019, 8:40 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71285/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2019, 8:40 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, James Peach, and Joseph Wu.
> 
> 
> Bugs: MESOS-9875
>     https://issues.apache.org/jira/browse/MESOS-9875
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fixes an issue where the agent may incorrectly send an
> OPERATION_FINISHED update for a failed offer operation
> following agent failover and recovery.
> 
> The agent previously relied on the difference between the
> set of checkpointed operations and the set of operation
> IDs recovered from the operation status update manager to
> apply any operations which had not been applied due to an
> ill-timed agent failover.
> 
> However, this approach did not work in the case where a
> persistent volume failed to be successfully created by
> syncCheckpointedResources(). In order to handle this
> case, this patch changes the agent code to continue with
> the old approach of a two-phase-commit of persistent
> volumes to disk, where the agent will fail to complete
> recovery if syncCheckpointedResources() cannot be
> executed successfully after failover.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp e077587fd02bd8e35fee7ce12ae436e3dca25e47 
>   src/slave/paths.cpp 28a7cf9f9c70fb31eeefe2e823cd7e19ffcf126a 
>   src/slave/slave.cpp 1d0ec9d2428c3ffa28ad3e960b74f171013cf0c2 
>   src/slave/state.cpp cd3fac72dd57da21ed5ac46b17066531af26d42a 
> 
> 
> Diff: https://reviews.apache.org/r/71285/diff/3/
> 
> 
> Testing
> -------
> 
> Tested manually by doing the following:
> 
> 1) Run a master
> 2) Run an agent with statically reserved resources for use in the following step
> 3) Run the persistent volume example framework to create a volume on the agent, which
leads to checkpointing of resources
> 4) Use `chattr -R +i /agent/volumes/dir` to make the agent's persistent volume directory
immutable
> 5) Run the persistent volume example framework again; it will fail and the agent will
crash
> 6) Restart the agent; confirm that it continues to crash
> 7) Use `chattr -R -i /agent/volumes/dir` to remove the immutable attribute
> 8) Restart the agent; confirm that it recovers successfully, with the persistent volume
created on disk
> 9) Run the persistent volume example framework again; it should succeed
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message