mesos-reviews mailing list archives

From Zhitao Li <zhitaoli...@gmail.com>
Subject Re: Review Request 65954: Add a gauge for how long agent recovery takes.
Date Wed, 14 Mar 2018 00:04:52 GMT


> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/metrics.cpp
> > Lines 259 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972384#file1972384line259>
> >
> >     I don't know that I like the idea of a metric that is absent and then present.
> >     I'd prefer that we just published a `0.0` until recovery is complete.
> >     
> >     Suggest we keep the recovery timestamp in the `Slave` and just publish that.

I thought about that too, but I actually like the idea of the metric being absent while the
value is not available yet. A zero value could confuse downstream aggregation.

For example, our team wants to compute the average recovery time across our cluster of thousands
of agents, and zero values from agents that are still recovering would skew that calculation.
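
To make the aggregation concern concrete, here is a minimal sketch. It is not code from the patch. It assumes a hypothetical downstream aggregator that holds one optional sample per agent, with `std::optional` modelling a metric that is absent until recovery finishes. Both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: averages only the samples that are present.
// An agent that has not finished recovery contributes std::nullopt.
inline std::optional<double> averagePresent(
    const std::vector<std::optional<double>>& samples)
{
  double sum = 0.0;
  std::size_t count = 0;

  for (const std::optional<double>& sample : samples) {
    if (sample.has_value()) {
      sum += *sample;
      ++count;
    }
  }

  if (count == 0) {
    return std::nullopt; // No agent has reported yet.
  }

  return sum / static_cast<double>(count);
}

// Hypothetical helper: what the aggregator would compute if agents
// published 0.0 as a placeholder until recovery completes.
inline double averageWithZeroPlaceholder(
    const std::vector<std::optional<double>>& samples)
{
  assert(!samples.empty());

  double sum = 0.0;
  for (const std::optional<double>& sample : samples) {
    sum += sample.value_or(0.0);
  }

  return sum / static_cast<double>(samples.size());
}
```

With two agents that recovered in 30 and 90 seconds and a third still recovering, the first helper averages over two samples while the second divides by three, so the placeholder pulls the reported average down.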

I think Mesos already has some precedent for absent-then-present metrics. For instance, metrics
under `allocator/mesos/roles/<role>/...` can show up when a framework under a new role registers
after the master has started.

Let me know what you think.


> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/slave.cpp
> > Lines 7322 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972385#file1972385line7322>
> >
> >     Since the gauge is being published in seconds, you need to use `Duration::secs`
> >     to convert.

I prefer that the API call work on `Duration` and perform the `secs()` conversion as late as possible,
as I've seen many cases where people pass the wrong time unit when an API takes an integer/float.
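
A rough sketch of the idea, using `std::chrono` as a stand-in for stout's `Duration` so that it compiles on its own. `RecoveryMetrics`, `setRecoveryTime` and `recoverSecs` are hypothetical names for illustration, not the actual code in the patch.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the metrics holder; not the real Mesos class.
struct RecoveryMetrics
{
  // Accepts any chrono duration, so the unit travels with the value and
  // the caller cannot accidentally pass milliseconds where seconds are
  // expected.
  template <typename Rep, typename Period>
  void setRecoveryTime(const std::chrono::duration<Rep, Period>& elapsed)
  {
    assert(elapsed.count() >= 0); // Recovery cannot take negative time.

    // Convert to floating-point seconds only at the publishing boundary.
    recoverSecs = std::chrono::duration<double>(elapsed).count();
  }

  double recoverSecs = 0.0; // Value a gauge would publish, in seconds.
};
```

A caller can then pass `std::chrono::milliseconds(1500)` or `std::chrono::seconds(2)` and the stored value comes out in seconds either way, which is the property I would like the real API to keep.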


- Zhitao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65954/#review198952
-----------------------------------------------------------


On March 7, 2018, 11:20 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65954/
> -----------------------------------------------------------
> 
> (Updated March 7, 2018, 11:20 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Greg Mann, Jason Lai, and James Peach.
> 
> 
> Bugs: MESOS-8609
>     https://issues.apache.org/jira/browse/MESOS-8609
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The new metric `slave/recover_secs` can be used to tell us how long
> the Mesos agent needed to finish its recovery cycle. This is an important
> metric on agent machines which have a lot of completed executor
> sandboxes.
> 
> Note that the metric 1) will only be available after recovery has succeeded
> and 2) will never change its value across the agent process lifecycle afterwards.
> 
> 
> Diffs
> -----
> 
>   src/slave/metrics.hpp 3fc933ca65690d6fad63156398ad9c2c53789296 
>   src/slave/metrics.cpp 0eb2b59ed67e14e73b29d7592c239441df0008d5 
>   src/slave/slave.cpp e2facb3c15a2f907f6497c58a36842ed707f2c70 
> 
> 
> Diff: https://reviews.apache.org/r/65954/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>

