mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 61473: Do not kill non partition aware tasks.
Date Thu, 26 Oct 2017 08:23:11 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review188966
-----------------------------------------------------------




src/master/http.cpp
Line 317 (original), 317 (patched)
<https://reviews.apache.org/r/61473/#comment266299>

    This comment isn't about this line but it seems my previous comment at the top of my last
review was overlooked so I am raising an issue here :)
    
    For this patch we should also update the comments above PARTITION_AWARE API to reflect
the change.
    https://github.com/apache/mesos/blob/6eefc685ccf304d0fb8ed4ff9bc314197d77f078/include/mesos/mesos.proto#L336
    
    Also please search the docs for necessary changes.



src/master/http.cpp
Lines 322-324 (patched)
<https://reviews.apache.org/r/61473/#comment266286>

    Seems like if the framework is NPA, we don't have to iterate through the list? i.e., put
the check outside the `foreachvalue` loop?



src/master/http.cpp
Lines 345-346 (patched)
<https://reviews.apache.org/r/61473/#comment266288>

    Given that we now store all NPA tasks in TASK_LOST (instead of being converted to TASK_LOST
on the fly here), is it more straightfoward to say 
    
    ```
    // Unreachable tasks belonging to a non-partition-aware framework have been stored as
TASK_LOST for backward compatibility so we should export them as completed tasks.
    ```
    
    ?



src/master/http.cpp
Lines 348-350 (patched)
<https://reviews.apache.org/r/61473/#comment266285>

    The reason we are here is that tasks in `framework_->unreachableTasks` could be in
`TASK_LOST` state and we should export those as "completed_tasks".
    
    So I think the condition check isn't right? `if (framework_->capabilities.partitionAware)`
is going to skip all tasks previously stored as `TASK_LOST`.
    
    Like I suggested in another place, should we be checking against `if (task.get()->state()
!= TASK_LOST)` directly?



src/master/http.cpp
Lines 4225 (patched)
<https://reviews.apache.org/r/61473/#comment266293>

    Is is more straightfoward to do `if (task.get()->state() == TASK_UNREACHABLE)` than
the inequality check?



src/master/http.cpp
Lines 4228-4229 (patched)
<https://reviews.apache.org/r/61473/#comment266295>

    Ditto on the update of the comment.



src/master/master.hpp
Lines 2591 (patched)
<https://reviews.apache.org/r/61473/#comment265908>

    You made a redundant copy here but I understand this line may go away anyways. :)



src/master/master.hpp
Line 2588 (original), 2594 (patched)
<https://reviews.apache.org/r/61473/#comment266296>

    Sorry we have gone back and forth but for now let's use this API:
    
    ```
    void removeTask(Task* task, bool unreachable)
    ```
    
    I originally hoped to solely rely on `task->state()` in the workflow but it turned
out to be pretty difficult. So here let's add a comment above the method explaning the reason
for the boolean.
    
    e.g.,
    
    ```
    // Removes the task. `unreachable` indicates whether the task is removed due to being
unreachable. Note that we cannot rely on the task state because it may not reflect unreachability
due to being set to TASK_LOST for backwards compatibility.
    ```



src/master/master.cpp
Lines 9683-9684 (original), 9637-9639 (patched)
<https://reviews.apache.org/r/61473/#comment266297>

    If the framework is PA, it can still contain tasks in `TASK_LOST`. For now let's follow
the implementation in `Master::_tasks_killing()`: check the state of each task.



src/tests/partition_tests.cpp
Lines 2044-2045 (patched)
<https://reviews.apache.org/r/61473/#comment266298>

    A comment shouldn't split the chain. I think 
    
    ```
        .WillRepeatedly(Return()); // The agent may resend status updates.
    ```
    
    fits in one line?


- Jiang Yan Xu


On Oct. 16, 2017, 1:59 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2017, 1:59 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/8/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message