mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Megha Sharma <mshar...@apple.com>
Subject Re: Review Request 56895: Allow agents to recover slave state post a reboot.
Date Fri, 09 Jun 2017 13:35:46 GMT


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5954 (original), 5965 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5965>
> >
> >     Previously, looks like any errors during the partial recovery after a reboot
resulted in agent flapping (e.g., checkpointed resources recovery failure, fetcher recovery
failure), but now looks like the agent will remove the symlink and move along? That seems
like a change in behavior

Now we are only cleaning the symlink and continuing if the failure was because of slaveInfo
mismatch. There's no recovery failure caused by fetcher recovery anymore (removed in ToT)
so the only class of failure left is checkpointed resources failure which will not be affected
by agent symlink cleanup so no cleanup has been performed and it will cause the agent to flap
so no change in behavior here.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5962 (original)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5978>
> >
> >     So, we are continuing if there is a recovery error and the agent has not rebooted?
That doesn't sound right?

Fixed to continue after symlink cleanup if the recovery error is because of slaveInfo mismatch,
any other errors will cause the agent to behave same as before.


- Megha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
-----------------------------------------------------------


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> -----------------------------------------------------------
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
>     https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/composing.cpp a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp f3e6210eccd4a6b445ffd4447e69526d424ea36d

>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message