mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 53897: Changed how master represents "recovered" frameworks.
Date Sat, 03 Dec 2016 02:12:59 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------




src/master/master.hpp (line 672)
<https://reviews.apache.org/r/53897/#comment228472>

    s/newPid/pid/



src/master/master.cpp (line 1277)
<https://reviews.apache.org/r/53897/#comment228477>

    s/frameworks/framework's/



src/master/master.cpp (line 2474)
<https://reviews.apache.org/r/53897/#comment228486>

    i think this would be better as:
    
    ```
    if (framework == nullptr) {
      //
      recoverFramework(frameworkInfo);
      
      framework = getFramework(frameworkInfo.id());
    }
    
    CHECK_NOTNULL(framework);
    if (!framework->recovered())
    ```



src/master/master.cpp (line 2478)
<https://reviews.apache.org/r/53897/#comment228483>

    Can you add a comment on when we are here. It is not very obvious. Maybe copy paste the
comment from #2708?



src/master/master.cpp (lines 2707 - 2714)
<https://reviews.apache.org/r/53897/#comment228487>

    see comments above.



src/master/master.cpp (line 2714)
<https://reviews.apache.org/r/53897/#comment228485>

    move this outside the if statement.



src/master/master.cpp (lines 5488 - 5490)
<https://reviews.apache.org/r/53897/#comment228490>

    yea, i think this should be
    
    ```
    if (framework != nullptr && framework->connected()) {
    
    }
    ```



src/master/master.cpp (line 5512)
<https://reviews.apache.org/r/53897/#comment228491>

    shouldn't we send a ShutdownFrameworkMessage in this case?



src/master/master.cpp (lines 5854 - 5857)
<https://reviews.apache.org/r/53897/#comment228492>

    how about
    
    ```
    string status = framework == nullptr ? "unknown" : "disconnected";
    ```



src/master/master.cpp (lines 6046 - 6049)
<https://reviews.apache.org/r/53897/#comment228493>

    see above.



src/master/master.cpp (line 7021)
<https://reviews.apache.org/r/53897/#comment228494>

    CHECK_NOTNULL(framework);



src/master/master.cpp (line 7028)
<https://reviews.apache.org/r/53897/#comment228502>

    also want to check framework->pid and framework->http are None()?



src/master/master.cpp (lines 7029 - 7034)
<https://reviews.apache.org/r/53897/#comment228498>

    hmm. it's a bit weird that activating a framework also updates it. this should be done
by the callers before calling this method.



src/master/master.cpp (line 7087)
<https://reviews.apache.org/r/53897/#comment228505>

    do we have to send FrameworkReregisteredMessage for HTTP? IIRC Registered or Reregistered
both end up as `Subscribed` Event.
    
    if no need to make the distinction, can we just call `_failoverFramework` here to avoid
duplicating a bunch of code?



src/master/master.cpp (line 7095)
<https://reviews.apache.org/r/53897/#comment228506>

    given the above comments, do we still want to distinguish between DISCONNECTED and RECOVERED
states? if we can get away with it, that would be conceptually easy to understand.


- Vinod Kone


On Dec. 2, 2016, 8:56 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2016, 8:56 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
>   include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp 5114e9d21929157e09a4a9fc4dbfeca260fb6b45 
>   src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message