mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 68706: Added master failover reregistration progress metrics.
Date Fri, 02 Nov 2018 08:24:23 GMT


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number of reregistered
agent is reached we are guaranteed to only set each timer once (but we may need to set multiple
timers in one call) right? This alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered agents count
being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
>     This is actaully fixed and marked as dropped somehow. We used equality operator to
0.0 to check whether the percentile was reached or not; The reason we used push gauge not
timer is explained in push gauge vs timer comments section

The point here is "This alleviates the need to check if the timer is already set!". It's still
in the new revision. I made another comment about it.


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that
it doesn't export any value if you haven't set it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all the timers
have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all of the
metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all
0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you but that's
a minor conveninence.
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
>     Sorry about this, I did make the comments but it must be in one of draft which was
not saved. 
>     
>     I agree that metrics should be zero but not absent when certain percentige were not
reached. I did tried both PushGauge and Timer implementation and tested in our clusters.
>     
>     If we used the timer, when the value was not set, that metric is actually missing.
PushGauge is set with the initial value 0.0 and we can tell whether it's set yet, the metric
will always exist no matter that percentile reached or not, and it has better performance.
>     
>     The brand new cluster case was tested and the results were included in the test results.

My point was the opposite. When a percentage is not reached, I feel that the value being absent
is preferrable and I used the above two cases to demonstrate it. (There is a subtle difference
between the two IMO).

It's in fact the current practice used by metrics like `state_fetch`.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message