-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58898/#review174194
-----------------------------------------------------------
My apologies for the delay in reviewing this.
High-level comments:
(a) Can we improve the description of the problem in the commit summary? It took me quite
a while to figure out what is actually going on here. My understanding is:
(1) Agent re-registers
(2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the agent
(3) Master offers agent resources to framework
(4) Framework launches new task on agent _before the agent has finished shutting down the
framework_
(5) Agent ignores launch task because it believes the framework is still terminating.
This is a race condition, because if the agent finished shutting down the framework (in #4)
before the launch task was received, there would not be a problem. Is my understanding correct?
(2) I'd prefer a new unit test that constructs exactly this scenario, rather than changing
existing unit tests.
(3) The new behavior is that the framework will receive `TASK_KILLED` for any non-PA tasks
running on a partitioned agent that re-registers. This does not seem ideal, because `TASK_KILLED`
_normally_ corresponds to a framework-initiated `killTask` operation.
(4) Thinking out loud, what if a custom executor ignores the `killTask` request? When shutting
down a framework, it seems we forcibly destroy the container (via `containerizer->destroy()`),
if the executor doesn't exit promptly upon receiving the framework shutdown request. AFAIK
we don't have similar logic for `killTask` requests.
3 + 4 suggests that maybe we want a different way to terminate the task on the agent -- let's
discuss.
src/master/master.cpp
Lines 6034 (patched)
<https://reviews.apache.org/r/58898/#comment247713>
`tasksToKill`
src/master/master.cpp
Lines 6057 (patched)
<https://reviews.apache.org/r/58898/#comment247313>
"Keep"
src/master/master.cpp
Lines 6058 (patched)
<https://reviews.apache.org/r/58898/#comment247312>
"separate data structure"
src/master/master.cpp
Line 6097 (original), 6102 (patched)
<https://reviews.apache.org/r/58898/#comment247314>
"launched by"
src/tests/partition_tests.cpp
Lines 693 (patched)
<https://reviews.apache.org/r/58898/#comment247742>
Why did you move this to before the agent re-registers? The point of doing explicit reconciliation
is to check the master's state after the agent re-registers.
There's no harm in _also_ reconciling before the agent re-registers, but if you want to
add that, let's do it in a separate review.
src/tests/partition_tests.cpp
Line 754 (original), 786 (patched)
<https://reviews.apache.org/r/58898/#comment247330>
`EXPECT_FALSE`.
src/tests/partition_tests.cpp
Line 794 (original), 826 (patched)
<https://reviews.apache.org/r/58898/#comment247746>
The point of this part of the test is to check that the agent's metrics are correct after
re-registration (and before starting any additional tasks). I'd prefer to check the metrics,
then start another task on the agent afterward.
src/tests/partition_tests.cpp
Line 2049 (original)
<https://reviews.apache.org/r/58898/#comment247745>
Removing this results in two gmock warnings:
```
GMOCK WARNING:
Uninteresting mock function call - returning directly.
Function call: killTask(0x7fbcf1c5fa20, @0x7fbcf1c54640 1)
```
```
GMOCK WARNING:
Uninteresting mock function call - returning directly.
Function call: shutdown(0x7fbcf1c5fa20)
```
- Neil Conway
On May 1, 2017, 11:58 p.m., Megha Sharma wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58898/
> -----------------------------------------------------------
>
> (Updated May 1, 2017, 11:58 p.m.)
>
>
> Review request for mesos, Neil Conway and Jiang Yan Xu.
>
>
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Mesos is now sending ShutdownFrameworkMessages to the agent for all
> non-partition-aware frameworks (including the ones that are still
> registered). This is problematic. The offer from this agent can
> still go to the same framework which can then launch new tasks.
> The agent then receives tasks of the same framework and ignores
> them because it thinks the framework is shutting down. The framework
> is not shutting down of course, so from the master and the scheduler's
> perspective the task is pending in STAGING forever until the next agent
> reregistration, which could happen much later. This commit fixes
> the problem by sending a task kill instead of ShutdownFrameworkMessage,
> when the agent re-registers after being unreachable, for non-partition
> aware framewworks.
>
>
> Diffs
> -----
>
> src/master/master.cpp 31a7a2fcf905c0c35e80692a69c290d4094deded
> src/tests/partition_tests.cpp 4ff428564d1fa6cb96e6f8ec8edc331da88a3eb6
>
>
> Diff: https://reviews.apache.org/r/58898/diff/2/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Megha Sharma
>
>
|