mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 53897: Changed how master represents "recovered" frameworks.
Date Sun, 04 Dec 2016 02:01:52 GMT


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 2478
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line2478>
> >
> >     Can you add a comment on when we are here. It is not very obvious. Maybe copy
paste the comment from #2708?

I think the "it is now safe" comment isn't very helpful/accurate. I replaced this with:

```
    // The framework has previously been registered with this master;
    // it may or may not currently be connected.
```

Which matches a similar comment in the driver-based scheduler code path.


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 7124
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7124>
> >
> >     CHECK_NOTNULL(framework);

Is there a general rule for when to add `CHECK_NOTNULL` for internal functions that take pointer
parameters? `Master::failoverFramework`, `Master::_failoverFramework`, `Master::_exited(Framework*)`,
`Master::drop`, and `Master::authorizeTask` could all have `CHECK_NOTNULL`s added, for example.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 2, 2016, 8:56 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2016, 8:56 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
>   include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp 5114e9d21929157e09a4a9fc4dbfeca260fb6b45 
>   src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message