mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mesos Reviewbot <revi...@mesos.apache.org>
Subject Re: Review Request 71285: Fixed recovery of agent resources and operations after crash.
Date Fri, 23 Aug 2019 10:59:35 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71285/#review217398
-----------------------------------------------------------



Patch looks great!

Reviews applied: [71284, 71285]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose
--disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1';
./support/docker-build.sh

- Mesos Reviewbot


On Aug. 22, 2019, 8:40 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71285/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2019, 8:40 p.m.)
> 
> 
> Review request for mesos, Gastón Kleiman, James Peach, and Joseph Wu.
> 
> 
> Bugs: MESOS-9875
>     https://issues.apache.org/jira/browse/MESOS-9875
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fixes an issue where the agent may incorrectly send an
> OPERATION_FINISHED update for a failed offer operation
> following agent failover and recovery.
> 
> The agent previously relied on the difference between the
> set of checkpointed operations and the set of operation
> IDs recovered from the operation status update manager to
> apply any operations which had not been applied due to an
> ill-timed agent failover.
> 
> However, this approach did not work in the case where a
> persistent volume failed to be successfully created by
> syncCheckpointedResources(). In order to handle this
> case, this patch changes the agent code to continue with
> the old approach of a two-phase-commit of persistent
> volumes to disk, where the agent will fail to complete
> recovery if syncCheckpointedResources() cannot be
> executed successfully after failover.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp e077587fd02bd8e35fee7ce12ae436e3dca25e47 
>   src/slave/paths.cpp 28a7cf9f9c70fb31eeefe2e823cd7e19ffcf126a 
>   src/slave/slave.cpp 1d0ec9d2428c3ffa28ad3e960b74f171013cf0c2 
>   src/slave/state.cpp cd3fac72dd57da21ed5ac46b17066531af26d42a 
> 
> 
> Diff: https://reviews.apache.org/r/71285/diff/3/
> 
> 
> Testing
> -------
> 
> Tested manually by doing the following:
> 
> 1) Run a master
> 2) Run an agent with statically reserved resources for use in the following step
> 3) Run the persistent volume example framework to create a volume on the agent, which
leads to checkpointing of resources
> 4) Use `chattr -R +i /agent/volumes/dir` to make the agent's persistent volume directory
immutable
> 5) Run the persistent volume example framework again; it will fail and the agent will
crash
> 6) Restart the agent; confirm that it continues to crash
> 7) Use `chattr -R -i /agent/volumes/dir` to remove the immutable attribute
> 8) Restart the agent; confirm that it recovers successfully, with the persistent volume
created on disk
> 9) Run the persistent volume example framework again; it should succeed
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message