mesos-reviews mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 51653: Handled agents failing health checks multiple times.
Date Mon, 12 Sep 2016 23:01:07 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51653/#review148619
-----------------------------------------------------------


Fix it, then Ship it!





src/master/master.cpp (line 5835)
<https://reviews.apache.org/r/51653/#comment216077>

    s/WARNING/INFO/ because this is expected?



src/tests/partition_tests.cpp (lines 1983 - 1984)
<https://reviews.apache.org/r/51653/#comment216078>

    I wonder if it confuses users that there are 2 slave unreachable operations scheduled but only 1 slave got removed.


- Vinod Kone


On Sept. 12, 2016, 4:01 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51653/
> -----------------------------------------------------------
> 
> (Updated Sept. 12, 2016, 4:01 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-5965
>     https://issues.apache.org/jira/browse/MESOS-5965
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Now that we wait for the agent to be removed from the registry before
> stopping the SlaveObserver, it is possible for an agent to fail health
> checks multiple times if the registry operation takes longer than
> `agent_ping_timeout`.
> 
> This commit updates the master logic to handle this by ignoring health
> check failures while the registry operation to mark the agent
> unreachable is still in progress.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471 
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00 
> 
> Diff: https://reviews.apache.org/r/51653/diff/
> 
> 
> Testing
> -------
> 
> make check on OSX and Linux.
> 
> `./src/mesos-tests --gtest_filter="Strict/PartitionTest.FailHealthChecksTwice/0" --gtest_repeat=1000 --gtest_break_on_failure`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
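
The behaviour described in the quoted request (ignore further health check failures while the registry operation that marks the agent unreachable is still in progress) can be sketched roughly as follows. This is a minimal standalone illustration, not the code under review: `MasterSketch`, `onHealthCheckFailure`, `onRegistryOperationDone`, and the counters are hypothetical names, and the real master handles this asynchronously against the registrar.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the master. All names here are illustrative
// assumptions and are not taken from the Mesos source.
class MasterSketch {
public:
  // Called each time the observer reports that an agent failed health checks.
  // Returns true if a new "mark unreachable" registry operation was started.
  bool onHealthCheckFailure(const std::string& agentId) {
    if (markingUnreachable.count(agentId) > 0) {
      // A registry operation for this agent is already in flight, so this
      // repeated failure is expected and is ignored.
      ++ignoredFailures;
      return false;
    }

    markingUnreachable.insert(agentId);
    ++startedOperations;
    return true;
  }

  // Called when the registry operation for the agent completes.
  void onRegistryOperationDone(const std::string& agentId) {
    if (markingUnreachable.erase(agentId) > 0) {
      removedAgents.push_back(agentId);
    }
  }

  int startedOperations = 0;
  int ignoredFailures = 0;
  std::vector<std::string> removedAgents;

private:
  // Agents whose "mark unreachable" operation has not finished yet.
  std::unordered_set<std::string> markingUnreachable;
};
```

In this sketch, two calls to `onHealthCheckFailure("agent-1")` before `onRegistryOperationDone("agent-1")` start only one operation and remove the agent once; the second call is counted as ignored.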

