mesos-reviews mailing list archives

From "Vinod Kone" <vinodk...@gmail.com>
Subject Re: Review Request 40177: Re-checkpoint frameworks after agent recovery.
Date Tue, 17 Nov 2015 19:35:10 GMT


> On Nov. 11, 2015, 9:28 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4244-4247
> > <https://reviews.apache.org/r/40177/diff/1/?file=1122973#file1122973line4244>
> >
> >     why do it here instead of in recoverFramework() #4363? that feels more consistent
> >     with #1345.
> 
> James Peach wrote:
>     I did this after recovery because the original code did not write framework checkpoints
>     if the slave was in RECOVERING state. I did not see a reason for that, but decided to preserve
>     the behavior as much as possible just in case.

Originally, the slave didn't checkpoint the framework during the recovery stage because it was not
needed: if it is creating a framework object during recovery, it is because it read the checkpointed
data, so there is no need to checkpoint it again.

But given the compatibility issue you found, the slave may need to re-checkpoint the framework info
during recovery, because the framework info is *updated*. So I would recommend moving this down to
#1345 and re-checkpointing there if necessary.
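
Roughly what I have in mind, as a minimal sketch only. The types and names below (FrameworkInfo, CheckpointStore, this recoverFramework signature) are hypothetical stand-ins for illustration, not the actual Mesos classes or the code at #1345:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the checkpointed framework info.
// The id is optional because a checkpoint written by 0.22 or
// earlier may not contain it.
struct FrameworkInfo {
  std::string name;
  std::optional<std::string> id;
};

// Hypothetical stand-in for the on-disk checkpoint; it only
// counts writes and remembers the last value written.
struct CheckpointStore {
  int writes = 0;
  FrameworkInfo last;

  void checkpoint(const FrameworkInfo& info) {
    ++writes;
    last = info;
  }
};

// Recovers a framework from its checkpointed info. The info is
// re-checkpointed only if recovery had to update it; otherwise the
// data on disk is already what we just read, so no write is needed.
// Returns true if a re-checkpoint happened.
bool recoverFramework(
    const std::string& frameworkId,
    FrameworkInfo info,
    CheckpointStore& store)
{
  assert(!frameworkId.empty());

  bool updated = false;

  // Compatibility path: fill in the missing framework id.
  if (!info.id.has_value()) {
    info.id = frameworkId;
    updated = true;
  }

  if (updated) {
    store.checkpoint(info);
  }

  return updated;
}
```

With this shape, a checkpoint that already carries the id is recovered without any write, and an old one is rewritten exactly once.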


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40177/#review106142
-----------------------------------------------------------


On Nov. 12, 2015, 5:41 a.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40177/
> -----------------------------------------------------------
> 
> (Updated Nov. 12, 2015, 5:41 a.m.)
> 
> 
> Review request for mesos, Kapil Arya and Vinod Kone.
> 
> 
> Bugs: MESOS-3834
>     https://issues.apache.org/jira/browse/MESOS-3834
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When performing an upgrade cycle, it is possible for a 0.24 and
> later agent to recover from a framework checkpoint written by 0.22
> or earlier. In this case, we need to compatibly accept a missing
> FrameworkID, and then rewrite the framework checkpoint so that
> subsequent upgrades don't hit the same problem.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp ec2dfa99e6b553e2bcd82d12db915ae8625075a1 
>   src/slave/slave.cpp ac2d0e0153721a66495cd6539b25f5b3cee9d979 
> 
> Diff: https://reviews.apache.org/r/40177/diff/
> 
> 
> Testing
> -------
> 
> make check on CentOS 6.7.
> Manual testing with a rolling upgrade from 0.22
> 
> 
> Thanks,
> 
> James Peach
> 
>

