mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 48313: Creation and deletion of persistent volumes across agent restart.
Date Mon, 20 Jun 2016 17:16:25 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------



In terms of testing, if we don't crash the agent within `syncCheckpointedResources()` but
rather return a failure when its fails during recovery, we can capture this in `Slave::__recover`
and verify the failed future right?


src/slave/paths.hpp (lines 87 - 88)
<https://reviews.apache.org/r/48313/#comment203484>

    Could you add the new file to this figure?



src/slave/slave.hpp (lines 426 - 428)
<https://reviews.apache.org/r/48313/#comment203490>

    Preceding underscores are mostly used for separating similarly named continuations.



src/slave/slave.cpp (line 2506)
<https://reviews.apache.org/r/48313/#comment203489>

    Nit: the master and the agent?



src/slave/slave.cpp (line 2511)
<https://reviews.apache.org/r/48313/#comment203666>

    "checkpoint resources target"



src/slave/slave.cpp (line 2513)
<https://reviews.apache.org/r/48313/#comment203491>

    The first argument is already a member variable and doesn't need to be passed around right?
    
    It's unclear from the method name whether the "sync" implies syncing the two arguments
or syncing the disk state and `newCheckpointedResources`. Also it's a bit unintuitive that
the resources info file is checkpointed in the method but member variable `checkpointedResources`
is assigned outside instead.
    
    Would the following semantics be clearer?
    
    1. Use this method to do one thing: bring the filesystem in sync with the newCheckpointedResources
and return an Error if anything goes wrong.
    
    ```
    Try<Nothing> syncCheckpointedResources(const Resources& newCheckpointedResources);
    ```
    
    2. On the caller side: if the result is an error, EXIT. Otherwise checkpoint the resource
info file and remove the target file. Note that we need to do similar things in `recover()`
but in there you would fail the future instead of directly EXIT.



src/slave/slave.cpp (lines 2518 - 2523)
<https://reviews.apache.org/r/48313/#comment203505>

    Use `os::rm()` to remove?



src/slave/slave.cpp (lines 2520 - 2521)
<https://reviews.apache.org/r/48313/#comment203815>

    If this fails, we don't know how it has failed and if the agent crashes next and restarts,
it's possible for the agent to recover some non-empty resources from the target and erroneously
starts deleting directories. Possible?



src/slave/slave.cpp (lines 2552 - 2553)
<https://reviews.apache.org/r/48313/#comment203818>

    Can we directly return in case of an error instead of using variables like this?



src/slave/slave.cpp (line 2564)
<https://reviews.apache.org/r/48313/#comment203817>

    No cleanups actually, just checking if it's empty right?



src/slave/slave.cpp (lines 2583 - 2585)
<https://reviews.apache.org/r/48313/#comment203819>

    This comment is supposed to be revised together with /r/48315/?



src/slave/slave.cpp (line 4736)
<https://reviews.apache.org/r/48313/#comment203543>

    We need to differentiate between the target being empty (everthing is unreserved) and
there's no target (changes are processed and committed) right?



src/slave/slave.cpp (line 4741)
<https://reviews.apache.org/r/48313/#comment203833>

    Here we don't want recovery to crash the agent here. Everything else returns a falure
in recovery and the agent exits in `__recover()`. We should do the same for reattempting the
sync.
    
    Have `syncCheckpointedResources()` return a Try can help with this.



src/slave/slave.cpp (lines 4748 - 4752)
<https://reviews.apache.org/r/48313/#comment203663>

    Same as in `checkpointResources()`.



src/slave/state.hpp (line 278)
<https://reviews.apache.org/r/48313/#comment203534>

    Here we can have 
    
    ```
    Resources resources;
    Option<Resources> target;
    ```
    
    If ResourcesState is some, it always has `resources` (which could be empty), it doesn't
always have `target`, which only exists before the requested target is committed.
    
    Right?



src/slave/state.hpp (lines 305 - 307)
<https://reviews.apache.org/r/48313/#comment203533>

    It's a bit strange to say that the entire state consists of the slave state, the resources
state and the target resources state.
    
    IMO both the info and the target should belong to `ResourcesState resources` (in the filesystem
they are both under "resources" directory. It'll be great if the in-memory state and the filesystem
hierarchy are structured similarly.



src/slave/state.cpp (lines 704 - 707)
<https://reviews.apache.org/r/48313/#comment203537>

    Instead of this, we can factor out the common code that reads the `Resources` objects.


- Jiang Yan Xu


On June 15, 2016, 4 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 15, 2016, 4 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this messages, we commit
> the checkpoint on the disk, and clear the target checkpoint.
> 
> However, incase of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry to sync the target and committed checkpoint.
> On success, we reregister the agent with the master, but in case it
> fails, we do not commit the checkpoint and the agent exists.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 3ed026e4b7e886ae738567369ee5750591ef97d9 
>   src/slave/slave.cpp 0af04d6fe53f92e03905fb7b3bec72b09d5e8e57 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message