mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 65109: Fixed a bug relating to lingering executors.
Date Sat, 27 Jan 2018 02:09:40 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65109/#review195957
-----------------------------------------------------------



It would for posterity to highlight the race in question, and to also mention the failover
case. Here's a suggestion:

```
An executor should be shutdown if all of its tasks are
killed while the executor is launching.

This patch fixes and issue where the executor is left
running when the task(s) get killed between the executor
registration/subscription and `Slave::___run()`. See
MESOS-8411 for more details. There is an additional race
in the agent failover case that is addressed in this patch.

The fix here is to fix the race by checking an executor's various
tasks queues during task kill and executor (re-)registration,
and shutting down executors that had never received any tasks.
```

One issue is that we probably need to be checking for SOURCE_SLAVE for the dropped case in
status update?


src/slave/constants.hpp
Lines 90-94 (patched)
<https://reviews.apache.org/r/65109/#comment275403>

    "correctly infer" here seems a little misleading, maybe we should mention the false positives?
    
    How about:
    
    // NOTE: This should be greater than zero because the agent looks
    // for completed tasks to determine (with false positives) whether
    // an executor ever received tasks. See MESOS-8411.
    //
    // TODO(mzhu): Remove this note once we determine whether an
    // executor ever received tasks without looking through the
    // completed tasks.



src/slave/slave.hpp
Lines 865 (patched)
<https://reviews.apache.org/r/65109/#comment275404>

    whoops, extra newline here



src/slave/slave.hpp
Lines 866-885 (patched)
<https://reviews.apache.org/r/65109/#comment276012>

    How about the following?
    
    ```
    // Returns true if the agent ever sent any tasks to this executor.
    // More precisely, this function returns whether either:
    //
    //  (1) There are terminated/completed tasks with a
    //      SOURCE_EXECUTOR status.
    //
    //  (2) `launchedTasks` is not empty.
    //
    // If this function returns false and there are no queued tasks,
    // we probably (see TODO below) have killed or dropped all of its
    // initial tasks, in which case the agent will shut down the executor.
    //
    // TODO(mzhu): Note however, that since we look through the cache
    // of completed tasks, we can have false positives when a task
    // that was delivered to the executor has been evicted from the
    // completed task cache by tasks that have been killed by the
    // agent before delivery. This should be fixed.
    //
    // NOTE: This function checks whether any tasks has ever been sent,
    // this does not necessarily mean the executor has ever received
    // any tasks. Specifically, tasks in `launchedTasks` may be dropped
    // before received by the executor if the agent restarts.
    ```



src/slave/slave.hpp
Lines 874-875 (patched)
<https://reviews.apache.org/r/65109/#comment275406>

    Lost is a little unclear here, perhaps we should highlight the kill / drop cases?
    
    ```
      // If this function returns false and there are no queued tasks,
      // we must have killed or dropped all of its initial tasks, in
      // which case the agent will shut down the executor.
    ```



src/slave/slave.hpp
Lines 879 (patched)
<https://reviews.apache.org/r/65109/#comment275407>

    Let's say "dropped" now that we've clarified "lost". "Lost" can mean many things whereas
dropped now identifies a particular case of "lost" where we've dropped a message.



src/slave/slave.hpp
Lines 880 (patched)
<https://reviews.apache.org/r/65109/#comment275409>

    agent restart, since agent failure is just a particular case



src/slave/slave.cpp
Lines 2820-2822 (patched)
<https://reviews.apache.org/r/65109/#comment275402>

    Thanks for clarifying this!
    
    How about:
    
    ```
    // At this point, we must have either sent some tasks to the running
    // executor or there are queued tasks that need to be delivered.
    // Otherwise, the executor state would have been synchronously
    // transitioned to TERMINATING when the queued tasks were killed.
    ```



src/slave/slave.cpp
Lines 3417-3421 (patched)
<https://reviews.apache.org/r/65109/#comment275416>

    I was thinking we should also be doing this in the REGISTERING case above? After talking
to vinod, I guess it's a bit complicated since we can't talk to the executor yet so we'd have
to do the forceful shutdown. Probably just need a TODO in the registering case to consider
shutting it down early.



src/slave/slave.cpp
Lines 3423-3424 (patched)
<https://reviews.apache.org/r/65109/#comment275419>

    How about: "because it has never been sent a task and all of its queued tasks have been
killed before delivery"?



src/slave/slave.cpp
Lines 4225-4232 (original), 4242-4243 (patched)
<https://reviews.apache.org/r/65109/#comment275427>

    subscribing executor .. because it was never sent a task and has no tasks to run



src/slave/slave.cpp
Line 4452 (original), 4463 (patched)
<https://reviews.apache.org/r/65109/#comment275422>

    registering executor



src/slave/slave.cpp
Lines 4695-4696 (patched)
<https://reviews.apache.org/r/65109/#comment275423>

    re-registering executor
    because it has no tasks to run and has never been sent a task



src/slave/slave.cpp
Lines 5011-5014 (patched)
<https://reviews.apache.org/r/65109/#comment276010>

    How about:
    
    ```
    // Also if the task is in `launchedTasks` but was dropped by the
    // agent, we know that the task did not reach the executor. We
    // will synchronously transition the task to ensure that the
    // agent re-registration logic can call `everSentTask()` after
    // dropping tasks.
    ```



src/slave/slave.cpp
Lines 5021 (patched)
<https://reviews.apache.org/r/65109/#comment275430>

    Shall we also be checking for SOURCE_SLAVE?
    
    let's line this up with the open paren for readability



src/slave/slave.cpp
Lines 5022 (patched)
<https://reviews.apache.org/r/65109/#comment275429>

    indented just 2



src/slave/slave.cpp
Lines 8983-8987 (patched)
<https://reviews.apache.org/r/65109/#comment276011>

    Ditto comment from above:
    
    ```
    // NOTE: This should be greater than zero because the agent looks
    // for completed tasks to determine (with false positives) whether
    // an executor ever received tasks. See MESOS-8411.
    //
    // TODO(mzhu): Remove this note once we determine whether an
    // executor ever received tasks without looking through the
    // completed tasks.
    ```



src/slave/slave.cpp
Lines 8988-8989 (patched)
<https://reviews.apache.org/r/65109/#comment276013>

    ```
    static_assert(
        MAX_COMPLETED_TASKS_PER_EXECUTOR > 0,
        "Max completed tasks per executor should be greater than zero");
    ```



src/slave/slave.cpp
Lines 9315 (patched)
<https://reviews.apache.org/r/65109/#comment275401>

    Why auto here?


- Benjamin Mahler


On Jan. 26, 2018, 11:13 p.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65109/
> -----------------------------------------------------------
> 
> (Updated Jan. 26, 2018, 11:13 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Vinod Kone.
> 
> 
> Bugs: MESOS-8411
>     https://issues.apache.org/jira/browse/MESOS-8411
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> An executor should be shutdown if it has never got any tasks
> i.e. all of its initial tasks are killed before launching.
> See MESOS-8411.
> 
> This patch ensures this by checking an executor's various
> tasks queues during task kill and executor registration,
> and shutting down executors that had never received any tasks.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp e6cb7cc0ccdaaf981eb66defa21b38720f4e1de9 
>   src/slave/slave.hpp 09c01ebd1b5e8008ba9e7d412042f1db76a2c5a5 
>   src/slave/slave.cpp 43c7955237655accdf869db1b6494a35f77acdb3 
> 
> 
> Diff: https://reviews.apache.org/r/65109/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> new tests in #65111
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message