mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xudong Ni via Review Board <nore...@reviews.apache.org>
Subject Re: Review Request 68706: Added master failover reregistration progress metrics.
Date Mon, 29 Oct 2018 21:30:17 GMT


> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number of reregistered
agent is reached we are guaranteed to only set each timer once (but we may need to set multiple
timers in one call) right? This alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered agents count
being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

This is actaully fixed and marked as dropped somehow. We used equality operator to 0.0 to
check whether the percentile was reached or not; The reason we used push gauge not timer is
explained in push gauge vs timer comments section


> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that
it doesn't export any value if you haven't set it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all the timers
have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all of the
metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all
0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you but that's
a minor conveninence.
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

Sorry about this, I did make the comments but it must be in one of draft which was not saved.


I agree that metrics should be zero but not absent when certain percentige were not reached.
I did tried both PushGauge and Timer implementation and tested in our clusters.

If we used the timer, when the value was not set, that metric is actually missing. PushGauge
is set with the initial value 0.0 and we can tell whether it's set yet, the metric will always
exist no matter that percentile reached or not, and it has better performance.

The brand new cluster case was tested and the results were included in the test results.


- Xudong


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message