mesos-reviews mailing list archives

From Megha Sharma <mshar...@apple.com>
Subject Re: Review Request 56895: Allow agents to recover slave state post a reboot.
Date Fri, 09 Jun 2017 04:38:34 GMT


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > No tests!?

I have added a new test, so this change now has two tests in total: one verifies that, when
recovery finishes without errors after the agent host reboots, the state is recovered correctly
and the agentId is retained; the other verifies that, when recovery fails after a reboot, no
state is recovered and only the agentId is retained.
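The two post-reboot outcomes that the tests check can be summarized in a small C++ sketch.
The types `CheckpointedState` and `RecoveredState` and the function `recoverAfterReboot` are
hypothetical simplifications, not the actual recovery code in src/slave/slave.cpp or
src/slave/state.cpp; `frameworkIds` merely stands in for the full checkpointed
framework/executor state.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified model of the agent's checkpointed state.
// The real Mesos types are much richer; these names are illustrative only.
struct CheckpointedState {
  std::string agentId;
  std::vector<std::string> frameworkIds;  // stand-in for framework/executor state
};

// Hypothetical result of post-reboot recovery.
struct RecoveredState {
  std::optional<std::string> agentId;
  std::vector<std::string> frameworkIds;
};

// Sketch of the two outcomes exercised by the tests:
//  - recovery succeeds: keep the agentId and the checkpointed state;
//  - recovery fails:    keep only the agentId, discard everything else.
RecoveredState recoverAfterReboot(const CheckpointedState& checkpoint,
                                  bool recoveryOk) {
  // Assumption for this sketch: the agent registered before the reboot,
  // so there is an agentId to retain.
  assert(!checkpoint.agentId.empty());

  RecoveredState result;
  result.agentId = checkpoint.agentId;  // retained in both cases

  if (recoveryOk) {
    result.frameworkIds = checkpoint.frameworkIds;
  }

  return result;
}
```

For example, calling `recoverAfterReboot` on a checkpoint with two frameworks returns both
frameworks when `recoveryOk` is true and none when it is false, with the same agentId either way.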


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5956 (original), 5967 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5967>
> >
> >     Add a comment here saying that we do this for backwards compatibility, i.e.,
> >     in Mesos <= 1.3 a rebooted agent did not recover checkpointed disk and registered as a
> >     new agent.

Fixed


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 237 (original), 237 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693977#file1693977line237>
> >
> >     why this change in this review? looks independent.

Actually, this was done to address Neil's comment that the variable name was too generic,
which seemed quite reasonable. See his comment below:

`Can we rename _ack to something that identifies we're waiting for the agent to see the status
update acknowledgment?`


- Megha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
-----------------------------------------------------------


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> -----------------------------------------------------------
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
>     https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> With partition awareness, agents are now allowed to re-register
> after they have been marked Unreachable. Because the executors are
> terminated on the agent when it reboots anyway, there is no harm in
> letting the agent keep its SlaveID, re-register with the master,
> and reconcile the lost executors. This is a prerequisite for
> supporting persistent/restartable tasks in Mesos.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/composing.cpp a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp f3e6210eccd4a6b445ffd4447e69526d424ea36d 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

