mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 56895: Allow agents to recover slave state post a reboot.
Date Tue, 13 Jun 2017 00:15:32 GMT


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5954 (original), 5965 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5965>
> >
> >     Previously, looks like any errors during the partial recovery after a reboot
resulted in agent flapping (e.g., checkpointed resources recovery failure, fetcher recovery
failure), but now looks like the agent will remove the symlink and move along? That seems
like a change in behavior
> 
> Megha Sharma wrote:
>     Now we are only cleaning the symlink and continuing if the failure was because of
slaveInfo mismatch. There's no recovery failure caused by fetcher recovery anymore (removed
in ToT) so the only class of failure left is checkpointed resources failure which will not
be affected by agent symlink cleanup so no cleanup has been performed and it will cause the
agent to flap so no change in behavior here.

What is `ToT` ?

If this issue is fixed, then please mark it so. If you are unsure then feel free to ask further
questions here that will help us in resolving it.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Line 5962 (original)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693973#file1693973line5978>
> >
> >     So, we are continuing if there is a recovery error and the agent has not rebooted?
That doesn't sound right?
> 
> Megha Sharma wrote:
>     Fixed to continue after symlink cleanup if the recovery error is because of slaveInfo
mismatch, any other errors will cause the agent to behave same as before.

please mark it fixed. see above.


> On May 23, 2017, 9:32 p.m., Vinod Kone wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 237 (original), 237 (patched)
> > <https://reviews.apache.org/r/56895/diff/4-6/?file=1693977#file1693977line237>
> >
> >     why this change in this review? looks independent.
> 
> Megha Sharma wrote:
>     Actually, this was done to address Neil's comment about the variable name being too
generic which seemed quite reasonable. See the comment below.
>     
>     `Can we rename _ack to something that identifies we're waiting for the agent to see
the status update acknowledgment?`

please do such changes in a different review. see my comment in the next review.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review175852
-----------------------------------------------------------


On June 9, 2017, 4:27 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> -----------------------------------------------------------
> 
> (Updated June 9, 2017, 4:27 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
>     https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> With partition awareness, the agents are now allowed to re-register
> after they have been marked Unreachable. The executors are anyway
> terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master
> and reconcile the lost executors. This is a pre-requisite for
> supporting persistent/restartable tasks in mesos.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/composing.cpp a003e1b80dc9b4dec5b3fbbadb2daecf855c90c7 
>   src/slave/containerizer/docker.cpp 9f84109d7de22a39ace6e44e0c7d8d501bcb24de 
>   src/slave/containerizer/mesos/containerizer.cpp f3e6210eccd4a6b445ffd4447e69526d424ea36d

>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
>   src/tests/reservation_tests.cpp 6e9c215382ef41700921a673669ac1a7975e9b7f 
>   src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message