mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 58898: Send task kill for non-Partition Aware frameworks.
Date Wed, 10 May 2017 23:24:52 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58898/#review174194
-----------------------------------------------------------



My apologies for the delay in reviewing this.

High-level comments:

(a) Can we improve the description of the problem in the commit summary? It took me quite
a while to figure out what is actually going on here. My understanding is:

(1) Agent re-registers
(2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the agent
(3) Master offers agent resources to framework
(4) Framework launches new task on agent _before the agent has finished shutting down the
framework_
(5) Agent ignores launch task because it believes the framework is still terminating.

This is a race condition, because if the agent finished shutting down the framework (in #4)
before the launch task was received, there would not be a problem. Is my understanding correct?

(2) I'd prefer a new unit test that constructs exactly this scenario, rather than changing
existing unit tests.

(3) The new behavior is that the framework will receive `TASK_KILLED` for any non-PA tasks
running on a partitioned agent that re-registers. This does not seem ideal, because `TASK_KILLED`
_normally_ corresponds to a framework-initiated `killTask` operation.

(4) Thinking out loud, what if a custom executor ignores the `killTask` request? When shutting
down a framework, it seems we forcibly destroy the container (via `containerizer->destroy()`),
if the executor doesn't exit promptly upon receiving the framework shutdown request. AFAIK
we don't have similar logic for `killTask` requests.

3 + 4 suggests that maybe we want a different way to terminate the task on the agent -- let's
discuss.


src/master/master.cpp
Lines 6034 (patched)
<https://reviews.apache.org/r/58898/#comment247713>

    `tasksToKill`



src/master/master.cpp
Lines 6057 (patched)
<https://reviews.apache.org/r/58898/#comment247313>

    "Keep"



src/master/master.cpp
Lines 6058 (patched)
<https://reviews.apache.org/r/58898/#comment247312>

    "separate data structure"



src/master/master.cpp
Line 6097 (original), 6102 (patched)
<https://reviews.apache.org/r/58898/#comment247314>

    "launched by"



src/tests/partition_tests.cpp
Lines 693 (patched)
<https://reviews.apache.org/r/58898/#comment247742>

    Why did you move this to before the agent re-registers? The point of doing explicit reconciliation
is to check the master's state after the agent re-registers.
    
    There's no harm in _also_ reconciling before the agent re-registers, but if you want to
add that, let's do it in a separate review.



src/tests/partition_tests.cpp
Line 754 (original), 786 (patched)
<https://reviews.apache.org/r/58898/#comment247330>

    `EXPECT_FALSE`.



src/tests/partition_tests.cpp
Line 794 (original), 826 (patched)
<https://reviews.apache.org/r/58898/#comment247746>

    The point of this part of the test is to check that the agent's metrics are correct after
re-registration (and before starting any additional tasks). I'd prefer to check the metrics,
then start another task on the agent afterward.



src/tests/partition_tests.cpp
Line 2049 (original)
<https://reviews.apache.org/r/58898/#comment247745>

    Removing this results in two gmock warnings:
    
    ```
    GMOCK WARNING:
    Uninteresting mock function call - returning directly.
        Function call: killTask(0x7fbcf1c5fa20, @0x7fbcf1c54640 1)
    ```
    
    ```
    GMOCK WARNING:
    Uninteresting mock function call - returning directly.
        Function call: shutdown(0x7fbcf1c5fa20)
    ```


- Neil Conway


On May 1, 2017, 11:58 p.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58898/
> -----------------------------------------------------------
> 
> (Updated May 1, 2017, 11:58 p.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Mesos is now sending ShutdownFrameworkMessages to the agent for all
> non-partition-aware frameworks (including the ones that are still
> registered). This is problematic. The offer from this agent can
> still go to the same framework which can then launch new tasks.
> The agent then receives tasks of the same framework and ignores
> them because it thinks the framework is shutting down. The framework
> is not shutting down of course, so from the master and the scheduler's
> perspective the task is pending in STAGING forever until the next agent
> reregistration, which could happen much later. This commit fixes
> the problem by sending a task kill instead of ShutdownFrameworkMessage,
> when the agent re-registers after being unreachable, for non-partition
> aware framewworks.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 31a7a2fcf905c0c35e80692a69c290d4094deded 
>   src/tests/partition_tests.cpp 4ff428564d1fa6cb96e6f8ec8edc331da88a3eb6 
> 
> 
> Diff: https://reviews.apache.org/r/58898/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message