mesos-reviews mailing list archives

From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 51653: Handled agents failing health checks multiple times.
Date Tue, 13 Sep 2016 13:10:54 GMT


> On Sept. 12, 2016, 11:01 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5835
> > <https://reviews.apache.org/r/51653/diff/4/?file=1496870#file1496870line5835>
> >
> >     s/WARNING/INFO/ because this is expected?

I opted for `WARNING` because, although this situation can occur, we expect it to occur quite
rarely in practice. So it doesn't _necessarily_ indicate a problem, but if you see it more
than once in the logs, it probably bears investigating. By comparison, much of what we log
at `INFO` is generally not very important for admins to pay attention to.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51653/#review148619
-----------------------------------------------------------


On Sept. 12, 2016, 4:01 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51653/
> -----------------------------------------------------------
> 
> (Updated Sept. 12, 2016, 4:01 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-5965
>     https://issues.apache.org/jira/browse/MESOS-5965
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Now that we wait for the agent to be removed from the registry before
> stopping the SlaveObserver, it is possible for an agent to fail health
> checks multiple times if the registry operation takes longer than
> `agent_ping_timeout`.
> 
> This commit updates the master logic to handle this by ignoring health
> check failures while the registry operation to mark the agent
> unreachable is still in progress.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 1dcce6cd66804990af238176c61aca03bb5c9471 
>   src/tests/partition_tests.cpp f3142ad8d50daafcdb70ad9dbb2772f8ba30db00 
> 
> Diff: https://reviews.apache.org/r/51653/diff/
> 
> 
> Testing
> -------
> 
> make check on OSX and Linux.
> 
> `./src/mesos-tests --gtest_filter="Strict/PartitionTest.FailHealthChecksTwice/0" --gtest_repeat=1000 --gtest_break_on_failure`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
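
For readers without the diff at hand, the behaviour described in the quoted review request can be sketched roughly as follows. This is a minimal illustration in standard C++, not the code from the patch: the class `HealthCheckTracker`, its method names, and the use of `std::clog` in place of the master's logging macros are all hypothetical. It assumes the master records which agents have a "mark unreachable" registry operation in flight and ignores further health check failures for those agents.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the master's bookkeeping; NOT the Mesos code.
class HealthCheckTracker {
public:
  // Called when an agent fails a health check.
  // Returns true if the caller should start a registry operation to mark
  // the agent unreachable, or false if the failure is ignored because such
  // an operation is already in progress for this agent.
  bool onHealthCheckFailure(const std::string& agentId) {
    if (markingUnreachable.count(agentId) > 0) {
      // Expected to be rare, hence logged as a warning in this sketch.
      std::clog << "WARNING: ignoring repeated health check failure for agent "
                << agentId << "; registry operation still in progress"
                << std::endl;
      return false;
    }

    markingUnreachable.insert(agentId);
    return true;
  }

  // Called when the registry operation for the agent has completed.
  void onRegistryOperationDone(const std::string& agentId) {
    markingUnreachable.erase(agentId);
  }

private:
  // Agents with a "mark unreachable" registry operation in flight.
  std::unordered_set<std::string> markingUnreachable;
};
```

In this sketch, a second call to `onHealthCheckFailure` for the same agent returns `false` until `onRegistryOperationDone` is called for it, which mirrors the case where the registry operation takes longer than `agent_ping_timeout`.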

