mesos-reviews mailing list archives

From Benjamin Mahler <bmah...@apache.org>
Subject Review Request 61639: Fixed a bug where the agent kills and still launches a task.
Date Tue, 15 Aug 2017 07:00:33 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61639/
-----------------------------------------------------------

Review request for mesos, Anand Mazumdar and Vinod Kone.


Bugs: MESOS-7744 and MESOS-7865
    https://issues.apache.org/jira/browse/MESOS-7744
    https://issues.apache.org/jira/browse/MESOS-7865


Repository: mesos


Description
-------

The following race leads to the agent both killing and launching a task:

  (1) Slave::__run completes, task is now within Executor::queuedTasks.
  (2) Slave::killTask locates the executor based on the task ID residing
      in queuedTasks, calls Slave::statusUpdate() with TASK_KILLED.
  (3) Slave::___run assumes that killed tasks have been removed from
      Executor::queuedTasks, but this now occurs asynchronously in
      Slave::_statusUpdate. So, the agent still sees the queued task
      and delivers it and adds the task to Executor::launchedTasks.
  (4) Slave::_statusUpdate runs, removes the task from
      Executor::launchedTasks and adds it to Executor::terminatedTasks.

The fix applied here is to synchronously transition queued tasks to
a terminal state when statusUpdate is called. This can be done because
for queued tasks, we do not need to retrieve the container status (the
task never reached the container).
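
To make the ordering concrete, here is a minimal toy model of the race and
of the synchronous fix. It is only a sketch and not the Mesos code: ToyAgent,
ToyExecutor, queueTask, deliverQueuedTasks, runDeferred, terminate and
launchesAfterKill are hypothetical names, and a plain deque of callbacks
stands in for the asynchronous dispatch to Slave::_statusUpdate.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Toy stand-in for the executor's task bookkeeping (not Mesos code).
struct ToyExecutor {
  std::set<std::string> queuedTasks;
  std::set<std::string> launchedTasks;
  std::set<std::string> terminatedTasks;
};

// Toy stand-in for the agent. `deferred` models work that runs later,
// the way the terminal transition runs later in the description above.
struct ToyAgent {
  ToyExecutor executor;
  std::deque<std::function<void()>> deferred;
  bool synchronousFix;
  int launches = 0;

  explicit ToyAgent(bool fix) : synchronousFix(fix) {}

  // Step (1): the task is queued on the executor.
  void queueTask(const std::string& id) { executor.queuedTasks.insert(id); }

  // Step (2): a kill arrives for a task that is still queued.
  void killTask(const std::string& id) {
    if (executor.queuedTasks.count(id) == 0) return;
    if (synchronousFix) {
      terminate(id);  // Fix: transition the queued task immediately.
    } else {
      // Old behaviour in this model: the transition happens later.
      deferred.push_back([this, id]() { terminate(id); });
    }
  }

  // Step (3): everything still queued is delivered and counted as launched.
  void deliverQueuedTasks() {
    for (const std::string& id : executor.queuedTasks) {
      executor.launchedTasks.insert(id);
      ++launches;
    }
    executor.queuedTasks.clear();
  }

  // Step (4): the deferred status update handling finally runs.
  void runDeferred() {
    while (!deferred.empty()) {
      std::function<void()> f = std::move(deferred.front());
      deferred.pop_front();
      f();
    }
  }

  // Moves a task to the terminal set from wherever it currently lives.
  void terminate(const std::string& id) {
    executor.queuedTasks.erase(id);
    executor.launchedTasks.erase(id);
    executor.terminatedTasks.insert(id);
  }
};

// Replays steps (1) to (4) in the racy order and returns how many
// launches happened for the killed task.
int launchesAfterKill(bool synchronousFix) {
  ToyAgent agent(synchronousFix);
  agent.queueTask("task-1");
  agent.killTask("task-1");
  agent.deliverQueuedTasks();
  agent.runDeferred();
  // In both modes the task ends up terminated; only the launch differs.
  assert(agent.executor.terminatedTasks.count("task-1") == 1);
  return agent.launches;
}
```

In this model, launchesAfterKill(false) counts one launch for the killed
task, while launchesAfterKill(true) counts none, because the task has
already left queuedTasks by the time delivery runs.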


Diffs
-----

  src/slave/slave.hpp 1fe93dab1b2bef24721cc1bcffebe1b259e96d79 
  src/slave/slave.cpp 7381530515f86faf4c3e8f82bcd9483f6cf0498b 


Diff: https://reviews.apache.org/r/61639/diff/1/


Testing
-------

make check

SlaveTest.KillQueuedTaskDuringExecutorRegistration captures this case, but it did not delay
retrieving the container status. This test could have been updated previously to delay the
container status, but now there is no container status to delay, so I've left the test alone.


Thanks,

Benjamin Mahler

