mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joseph Wu <jos...@mesosphere.io>
Subject Re: Review Request 69977: Improved agent operation recovery process.
Date Fri, 22 Feb 2019 01:19:35 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69977/#review213064
-----------------------------------------------------------


Fix it, then Ship it!




Noticed an orthogonally related documentation/comment discrepancy.  It would be fine to fix
in a separate patch.


src/slave/slave.cpp
Lines 7337 (patched)
<https://reviews.apache.org/r/69977/#comment298936>

    The SLRP uses this method like this:
    ```
    Try<list<string>> operationPaths =
      slave::paths::getOperationPaths(
          slave::paths::getResourceProviderPath(
              metaDir, slaveId, info.type(), info.name(), info.id()));
    ```
    
    Which results in a directory tree (taken from `slave/paths.hpp`) like:
    ```
    //   |   |-- slaves
    //   |       |-- latest (symlink)
    //   |       |-- <slave_id>
    //   |           |-- slave.info
    //   |           |-- resource_providers
    //   |           |   |-- <type>
    //   |           |       |-- <name>
    //   |           |           |-- latest (symlink)
    //   |           |           |-- <resource_provider_id>
    //   |           |               |-- resource_provider.state
    //   |           |               |-- operations
    //   |           |                   |-- <operation_uuid>
    //   |           |                       |-- operation.updates
    ```
    
    But this usage suggests the existence of another directory structure like:
    ```
    //   |   |-- slaves
    //   |       |-- latest (symlink)
    //   |       |-- <slave_id>
    //   |           |-- slave.info
    //   |           |-- resource_providers
    //   |           |   |-- ...
    //   |           |-- operations
    //   |           |   |-- <operation_uuid>
    //   |           |       |-- operation.updates
    ```
    
    This is presumably created by the agent's `operationStatusUpdateManager`.  But `slave/paths.hpp`
does not have the updated directory structure.  It would also be helpful to note if there
is or is not any duplication between these two directories.



src/slave/slave.cpp
Lines 7358-7359 (patched)
<https://reviews.apache.org/r/69977/#comment298937>

    It would be nice to put the explanation from your commit message:
    ```
    This prevents the agent from asking the operation status update manager
    to recover streams that haven't been created yet.
    ```
    Into this logic block somewhere.  i.e.
    ```
    // NOTE: Operations found in the `ResourceState`, but not in the
    // operation update streams directory are streams to be created.
    // These operations have nothing to recover.
    ```


- Joseph Wu


On Feb. 20, 2019, 2:58 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69977/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 2:58 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8054
>     https://issues.apache.org/jira/browse/MESOS-8054
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch makes the agent walk the operation status update streams
> directories in order to generate the list of streams to recover, instead
> of generating it from the checkpointed `ResourceState` message.
> 
> This prevents the agent from asking the operation status update manager
> to recover streams that haven't been created yet.
> 
> The patch also makes the agent garbage collect operation status update
> streams if no correspondng operation is present in the checkpointed
> state. This can happen after the agent fails over while processing the
> acknowledgement of a terminal operation status update.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69977/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message