mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anindya Sinha <anindya_si...@apple.com>
Subject Re: Review Request 48313: Creation and deletion of persistent volumes across agent restart.
Date Fri, 01 Jul 2016 21:39:52 GMT


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > In terms of testing, if we don't crash the agent within `syncCheckpointedResources()`
but rather return a failure when its fails during recovery, we can capture this in `Slave::__recover`
and verify the failed future right?
> 
> Jiang Yan Xu wrote:
>     Echoing Neil on /r/48315/, can we add a test?

Added in https://reviews.apache.org/r/48315/.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2520-2521
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2520>
> >
> >     If this fails, we don't know how it has failed and if the agent crashes next
and restarts, it's possible for the agent to recover some non-empty resources from the target
and erroneously starts deleting directories. Possible?
> 
> Anindya Sinha wrote:
>     is it possible? `os::rm()` for files calls `unlink()` which clears the inode entry
for that file (assuming no other links to that file), although the data may still exist on
disk (until another write operation overwrites the actual bits). And isn't clearing the inode
entry an atomic operation? So I do not think there is a case where the data is partial.
> 
> Jiang Yan Xu wrote:
>     I was referring to the use of `state::checkpoint(paths::getResourcesTargetPath(metaDir),
Resources())` to clean up target checkpointed resources. If we use `os::rm()` it should be
no problem.

Ok. Using os::rm().


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2552-2553
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2552>
> >
> >     Can we directly return in case of an error instead of using variables like this?
> 
> Anindya Sinha wrote:
>     There are 3 error conditions here so to consolidate return from one place which I
believe is better than return with the same failures from 3 places... hence, the variables.
I can certainly consolidate `isFailed` and `additionalReason` into `Option<string> additionalReason
= None()` to capture any errors (and not need `isFailed` in this case).
>     
>     I just realized this RR has 1 condition only, but when the changes for RR 48315 is
added, there are 3 conditions. Let me know if you want me to remove `additionalReason` from
this RR and add it in RR 48315. Again, all the RRs 48313 through to 48315 go together.
> 
> Jiang Yan Xu wrote:
>     The thing is that the 3 places don't return the same failure, right? They are errors
from three different actions. Combining with /r/48315/, the three errors are (forgive my line
length):
>     
>     ```
>     return Error("Failed to create the persistent volume" + volume.disk().persistence().id()
+ "at '" + path + "': " + mkdir.error());
>     
>     return Error("Failed to determine the disk status of persistent volume" + volume.disk().persistence().id()
+ "at '" + path + "': " + empty.error());
>     
>     return Error("New persistent volume" + volume.disk().persistence().id() + "at '"
+ path + "' already exists on disk and isn't empty"
>     ```
>     
>     The only thing they share is `"persistent volume " + volume.disk().persistence().id()
+ "at '" + path + "'` so I would use a variable for that:
>     
>     `string volumeDescription = "persistent volume " + volume.disk().persistence().id()
+ "at '" + path + "'`;
>     
>     As a reader I think this is clearer than tracing `additionalReason.isSome()` or `isFailed`
in multiple branches to determine where they lead to.
>     
>     Speaking of the relationship between /r/48313/ and /r/48315/, mutliple places in
this review point to the later review and you have to use a TODO. Would it be cleaner to pull
out /r/48314/ as the first review in the chain and combine the other two? Up to you but if
it's more effort to seperate reviews cleanly then leaving them as one is OK too (and it makes
it easier to review).

TBH, I do not see how using `additionalReason` to return `Error()` from a single place is
confusing at all. The multiple branches are setting the actual error which is returned from
a single place which I think is a very common pattern.
Obviously, your recommendation of returning `Error()` directly from the 3 conditions is fine
also.
To avoid unnecessary to and fro on this, I updated the code though.

Regarding the sequence of RRs, let us keep the 3 RRs like we have been doing.


> On June 20, 2016, 5:16 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp, lines 2583-2591
> > <https://reviews.apache.org/r/48313/diff/4/?file=1420538#file1420538line2583>
> >
> >     This comment is supposed to be revised together with /r/48315/?
> 
> Anindya Sinha wrote:
>     Since RRs 48313 through to 48315 go together, do  you want to change this comment
in this RR?
> 
> Jiang Yan Xu wrote:
>     I should say that the comment about "directory should be empty" seems to not only
belong to review /r/48315/ but also right above where the code is? How about just remove the
TODO here?

Ok. Removed this comment from the `DESTROY` section. The `CREATE` section has a comment as
to why the directory may exist even if it is a new `CREATE` operation.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48313/#review138314
-----------------------------------------------------------


On June 20, 2016, 11:41 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48313/
> -----------------------------------------------------------
> 
> (Updated June 20, 2016, 11:41 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5448
>     https://issues.apache.org/jira/browse/MESOS-5448
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the agent receives CheckpointedResourcesMessage, we store the
> target checkpoint on disk. On successful create and destroy of
> persistent volumes as a part of handling this messages, we commit
> the checkpoint on the disk, and clear the target checkpoint.
> 
> However, incase of any failure we do not commit the checkpoint to
> disk, and exit the agent. When the agent restarts and there is a
> target checkpoint present on disk which differs from the committed
> checkpoint, we retry to sync the target and committed checkpoint.
> On success, we reregister the agent with the master, but in case it
> fails, we do not commit the checkpoint and the agent exists.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 339e539863c678b6ed4d4670d75c7ff4c54daa79 
>   src/slave/paths.cpp 03157f93b1e703006f95ef6d0a30afae375dcdb5 
>   src/slave/slave.hpp 58ff2bfac6918d989ab36b67cf6ba2f3657c8356 
>   src/slave/slave.cpp 4bf01f2b020f5e975fb57cffcd19865d7431eac2 
>   src/slave/state.hpp 0de2a4ee4fabaad612c4526166157b001c380bdb 
>   src/slave/state.cpp 04c3d42040f186507a0e484a3ee616f1b1a77ea8 
> 
> Diff: https://reviews.apache.org/r/48313/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message