mesos-reviews mailing list archives

From Greg Mann <g...@mesosphere.io>
Subject Review Request 71285: Fixed recovery of agent resources and operations after crash.
Date Wed, 14 Aug 2019 00:53:34 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71285/
-----------------------------------------------------------

Review request for mesos, Gastón Kleiman, James Peach, and Joseph Wu.


Bugs: MESOS-9875
    https://issues.apache.org/jira/browse/MESOS-9875


Repository: mesos


Description
-------

Fixes an issue where the agent may incorrectly send an
OPERATION_FINISHED update for a failed offer operation
following agent failover and recovery.

The agent previously relied on the difference between the
set of checkpointed operations and the set of operation
IDs recovered from the operation status update manager to
apply any operations which had not been applied due to an
ill-timed agent failover.
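
As a rough illustration of the set difference described above, the
following is a minimal sketch rather than the actual agent code. The
function name `unappliedOperations` and the use of plain strings as
operation IDs are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of Mesos: returns the IDs that appear in
// the checkpointed operations but were not recovered from the operation
// status update manager. Under the old approach, these would be treated
// as operations that still need to be applied after failover.
std::vector<std::string> unappliedOperations(
    const std::set<std::string>& checkpointed,
    const std::set<std::string>& recoveredFromManager)
{
  std::vector<std::string> result;

  // std::set iterates in sorted order, which std::set_difference requires.
  std::set_difference(
      checkpointed.begin(),
      checkpointed.end(),
      recoveredFromManager.begin(),
      recoveredFromManager.end(),
      std::back_inserter(result));

  return result;
}
```

Note that a difference like this only says which operations have no
recovered status update; it carries no information about whether
applying them will succeed.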

However, this approach did not work when
`syncCheckpointedResources()` failed to create a persistent
volume. To handle this case, this patch changes the agent
code to retain the old approach of a two-phase commit of
persistent volumes to disk, in which the agent fails to
complete recovery if `syncCheckpointedResources()` cannot
be executed successfully after failover.
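
The fail-fast recovery described above might be modeled roughly as
follows. This is a sketch under stated assumptions, not the code in this
patch: the `Checkpoints` struct, the `RecoveryResult` enum, and the
`recover` function are hypothetical, and the two phases are assumed to be
"record the intended resources" followed by "apply them and promote the
record".

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical model of the agent's checkpointed state; not Mesos types.
struct Checkpoints
{
  std::string committed;  // Resources known to have been applied to disk.
  bool hasTarget;         // True if phase one wrote a pending target.
  std::string target;     // Resources that still have to be applied.
};

enum class RecoveryResult { OK, FAILED };

// `sync` stands in for `syncCheckpointedResources()`: it tries to make
// the disk match the given resources and reports whether it succeeded.
RecoveryResult recover(
    Checkpoints& checkpoints,
    const std::function<bool(const std::string&)>& sync)
{
  if (!checkpoints.hasTarget) {
    return RecoveryResult::OK;  // No interrupted commit to finish.
  }

  if (!sync(checkpoints.target)) {
    // Leave the target in place and refuse to complete recovery, so a
    // failed operation is never reported as finished.
    return RecoveryResult::FAILED;
  }

  // Phase two: promote the target to the committed state.
  checkpoints.committed = checkpoints.target;
  checkpoints.hasTarget = false;
  checkpoints.target.clear();
  return RecoveryResult::OK;
}
```

In this model a failed sync leaves the pending target untouched, so a
later recovery attempt retries the same work.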


Diffs
-----

  src/slave/paths.hpp e077587fd02bd8e35fee7ce12ae436e3dca25e47 
  src/slave/paths.cpp 28a7cf9f9c70fb31eeefe2e823cd7e19ffcf126a 
  src/slave/slave.cpp 74eb45744d6603b91676e812ed008a7b1ab39a49 
  src/slave/state.cpp cd3fac72dd57da21ed5ac46b17066531af26d42a 


Diff: https://reviews.apache.org/r/71285/diff/1/


Testing
-------


Thanks,

Greg Mann

