mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gastón Kleiman <gas...@mesosphere.io>
Subject Re: Review Request 69977: Improved agent operation recovery process.
Date Fri, 22 Feb 2019 01:55:59 GMT


> On Feb. 21, 2019, 5:19 p.m., Joseph Wu wrote:
> > src/slave/slave.cpp
> > Lines 7337 (patched)
> > <https://reviews.apache.org/r/69977/diff/2/?file=2125997#file2125997line7341>
> >
> >     The SLRP uses this method like this:
> >     ```
> >     Try<list<string>> operationPaths =
> >       slave::paths::getOperationPaths(
> >           slave::paths::getResourceProviderPath(
> >               metaDir, slaveId, info.type(), info.name(), info.id()));
> >     ```
> >     
> >     Which results in a directory tree (taken from `slave/paths.hpp`) like:
> >     ```
> >     //   |   |-- slaves
> >     //   |       |-- latest (symlink)
> >     //   |       |-- <slave_id>
> >     //   |           |-- slave.info
> >     //   |           |-- resource_providers
> >     //   |           |   |-- <type>
> >     //   |           |       |-- <name>
> >     //   |           |           |-- latest (symlink)
> >     //   |           |           |-- <resource_provider_id>
> >     //   |           |               |-- resource_provider.state
> >     //   |           |               |-- operations
> >     //   |           |                   |-- <operation_uuid>
> >     //   |           |                       |-- operation.updates
> >     ```
> >     
> >     But this usage suggests the existence of another directory structure like:
> >     ```
> >     //   |   |-- slaves
> >     //   |       |-- latest (symlink)
> >     //   |       |-- <slave_id>
> >     //   |           |-- slave.info
> >     //   |           |-- resource_providers
> >     //   |           |   |-- ...
> >     //   |           |-- operations
> >     //   |           |   |-- <operation_uuid>
> >     //   |           |       |-- operation.updates
> >     ```
> >     
> >     This is presumably created by the agent's `operationStatusUpdateManager`.  But
`slave/paths.hpp` does not have the updated directory structure.  It would also be helpful
to note if there is or is not any duplication between these two directories.

I noticed yesterday that the streams that correspond to operations affecting agent's default
resouces are currently created under `meta/operations/` instead of under `meta/slaves/<slave_id>/operations/`.

After discussing it with Greg, we both agreed that the operations should be stored in the
latter directory.

I will follow up with a patch that changes the place in which streams are stored; that patch
will also update the ascii drawing in `slave/paths.hpp`: https://issues.apache.org/jira/browse/MESOS-9597


> On Feb. 21, 2019, 5:19 p.m., Joseph Wu wrote:
> > src/slave/slave.cpp
> > Lines 7358-7359 (patched)
> > <https://reviews.apache.org/r/69977/diff/2/?file=2125997#file2125997line7362>
> >
> >     It would be nice to put the explanation from your commit message:
> >     ```
> >     This prevents the agent from asking the operation status update manager
> >     to recover streams that haven't been created yet.
> >     ```
> >     Into this logic block somewhere.  i.e.
> >     ```
> >     // NOTE: Operations found in the `ResourceState`, but not in the
> >     // operation update streams directory are streams to be created.
> >     // These operations have nothing to recover.
> >     ```

I agree, so I added a long comment at the beginning of the block that generates the list of
IDs to recover.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69977/#review213064
-----------------------------------------------------------


On Feb. 21, 2019, 5:52 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69977/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 5:52 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8054
>     https://issues.apache.org/jira/browse/MESOS-8054
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch makes the agent walk the operation status update streams
> directories in order to generate the list of streams to recover, instead
> of generating it from the checkpointed `ResourceState` message.
> 
> This prevents the agent from asking the operation status update manager
> to recover streams that haven't been created yet.
> 
> The patch also makes the agent garbage collect operation status update
> streams if no correspondng operation is present in the checkpointed
> state. This can happen after the agent fails over while processing the
> acknowledgement of a terminal operation status update.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69977/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message