mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 61473: Do not kill non partition aware tasks.
Date Fri, 29 Sep 2017 18:19:49 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-----------------------------------------------------------




src/master/http.cpp
Lines 324 (patched)
<https://reviews.apache.org/r/61473/#comment263308>

    Leave a blank line between these if blocks. Here and below.



src/master/http.cpp
Lines 350 (patched)
<https://reviews.apache.org/r/61473/#comment263309>

    Leave a blank line between these if blocks.



src/master/http.cpp
Line 4203 (original), 4220-4226 (patched)
<https://reviews.apache.org/r/61473/#comment263501>

    NPA tasks already have the state `TASK_LOST` right? Could you check that?
    
    The difference is that previously, if a framework was PA (and its unreachable tasks were
stored in  `unreachableTasks` as `UNREACHABLE`) and later it changed to NPA again, the code
here made no attempt to double check to change it to `TASK_LOST`. The new code would do it
though.
    
    It's debatable which is more right but I'd say it's safe to maintain the existing behavior?



src/master/master.hpp
Lines 2475-2477 (patched)
<https://reviews.apache.org/r/61473/#comment263317>

    Because this method takes a pointer, this mutation could affect future uses of it. Even
though right now nothing that cares about the state follows the call of `addUnreachableTask`,
it may still be good to not propagate the change.
    
    How about doing it in `addUnreachableTask`?
    
    ```
    void addUnreachableTask(const Task& _task)
    {
      Task* task = new Task(_task);
    
      // We have to use TASK_LOST for non-partition-aware frameworks
      // for backwards compatibility.
      if (!capabilities.partitionAware) {
        task->set_state(TASK_LOST);
      }
    
      unreachableTasks.set(task.task_id(), process::Owned<Task>(task));
    }
    ```



src/master/master.hpp
Line 2844 (original), 2840 (patched)
<https://reviews.apache.org/r/61473/#comment263318>

    s/non-/Non-/ to be consistent with other instances of `NOTE`s.



src/master/master.cpp
Line 6448 (original), 6439 (patched)
<https://reviews.apache.org/r/61473/#comment263319>

    The decision is pretty simple right now: all tasks are added unless the framework is completed.
So this separate introduction paragraph seems redudant and can be removed now:
    
    ```
    // Decide how to handle the tasks running on the agent:
    //
    ```



src/master/master.cpp
Line 6450 (original), 6441 (patched)
<https://reviews.apache.org/r/61473/#comment263320>

    "All tasks" except for those belonging to completed frameworks, so to summarize the block
of code below, we should probably mention this.



src/master/master.cpp
Lines 6453-6456 (original), 6443-6446 (patched)
<https://reviews.apache.org/r/61473/#comment263321>

    "no further cleanup is required" is true for all  circumstances now so I think this comment
is not important anymore.
    
    If we comment about the code below ignoring the history of how we got here but just as
it is, we could probably just remove the whole sentence:
    
    ```
      // If the master has failed
      // over since the agent was marked unreachable then the master shouldn't
      // have any record of the tasks running on the agent, so no further
      // cleanup is required.
    ```



src/master/master.cpp
Lines 7188-7190 (original), 7144-7146 (patched)
<https://reviews.apache.org/r/61473/#comment263411>

    Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little different than elsewhere
so I think this warrants a bit of explanation.
    
    e.g., 
    ```
    // Transition tasks to TASK_UNREACHABLE and remove (archive) them.
    // We convert the task state to TASK_LOST if is the framework is not partition aware.
    // However we only do the conversion right before the status update is sent out or the
    // task is archived because the processing prior to then requires tasks to be of the 
    // correct state TASK_UNREACHABLE.
    ```
    
    Does this sound right?



src/master/master.cpp
Lines 7169-7173 (patched)
<https://reviews.apache.org/r/61473/#comment263406>

    You have created this `newUpdate` but are not immediately using it, and I have to pay
attention to notice that it is `update` instead of `newUpdate` that is passed to `updateTask()`.
    
    So perhaps defer this change to after `updateTask` and `removeTask` are called? At that
point because the only remaining use of the `update` is to send it out, you don't even need
to make a copy any more. Just change the field.



src/master/master.cpp
Lines 7171-7172 (patched)
<https://reviews.apache.org/r/61473/#comment263399>

    This could be shortened to one line.
    
    ```
    update.mutable_status()->set_state(TASK_LOST);
    ```



src/master/master.cpp
Line 7218 (original), 7174 (patched)
<https://reviews.apache.org/r/61473/#comment263402>

    Looking into the method `updateTask`, passing in `update` here will cause the operator
API subscribers to receive `TASK_UNREACHABLE` for NPA task. So we should probably handle it
inside.



src/master/master.cpp
Lines 8989-8990 (original), 8945-8946 (patched)
<https://reviews.apache.org/r/61473/#comment263403>

    This is going to send `TASK_UNREACHABLE` to the operator API subscribers even for NPA
framework tasks. 
    
    We should probably be consistent and send `TASK_LOST`.


- Jiang Yan Xu


On Sept. 24, 2017, 3:46 p.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Sept. 24, 2017, 3:46 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 
>   src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 
>   src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 
>   src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message