mesos-reviews mailing list archives

From Alex Clemmer <clemmer.alexan...@gmail.com>
Subject Re: Review Request 55313: Windows: Fixed the unkillable task bug, lit up executor tests.
Date Wed, 18 Jan 2017 20:49:55 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55313/
-----------------------------------------------------------

(Updated Jan. 18, 2017, 8:49 p.m.)


Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, and Joseph Wu.


Changes
-------

Address Joseph's comments.


Bugs: MESOS-6698, MESOS-6839 and MESOS-6870
    https://issues.apache.org/jira/browse/MESOS-6698
    https://issues.apache.org/jira/browse/MESOS-6839
    https://issues.apache.org/jira/browse/MESOS-6870


Repository: mesos


Description
-------

MESOS-6839 tracks a bug that causes the current implementation of the
default executor to be unable to delete any processes associated with a
task. Understanding why requires some knowledge of the differences
between the process models of Windows and Unix.

In Unix, there is a robust notion of a process tree, with well-defined
process groups, sessions, signal delivery on the tree, and so on.
Windows lacks a robust notion of a process hierarchy, and therefore
largely has no equivalents to these constructs (including, notably,
signal semantics).

One of the problems this mismatch causes Mesos is that it complicates
killing a task, which is at its core a group of processes. On Windows,
the simplest way to make a process and all its descendants killable is
to track these processes in a Job Object, which is a Windows kernel
construct similar in principle to Linux's control groups (though with
different ideas of process namespacing).

There is some subtlety in making sure _all_ processes associated with a
task are captured inside a Job Object. The most important consideration
is that we need to make sure to add any process to the Job Object before
it has a chance to create any child processes; if we fail to do this,
the children will not be captured in the Job Object.

The solution to this is fairly simple on Windows. The process creation
API allows users to trivially create a process in a suspended state, so
that the Windows kernel scheduler does not schedule the process to run
until the user explicitly resumes the main thread. This allows us to
create the process and add it to a Job Object before it has a chance to
create children, and then start the process.
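
The ordering argument above can be illustrated with a small toy model.
This is only a sketch and does not call the Windows API: `ToyJob` and
`ToySystem` are hypothetical names invented for this example, and the
only behavior modeled is that a child joins the job when its parent is
already a member at creation time. On Windows itself, the corresponding
steps would be `CreateProcess` with the `CREATE_SUSPENDED` flag,
`AssignProcessToJobObject`, and then `ResumeThread`.

```cpp
#include <cassert>
#include <set>

// Toy model, NOT the Windows API: a "job" is just a set of pids.
struct ToyJob {
  std::set<int> members;
  void assign(int pid) { members.insert(pid); }
  bool contains(int pid) const { return members.count(pid) > 0; }
};

struct ToySystem {
  ToyJob job;
  int nextPid = 1;

  // Creates a root process that is not yet in the job.
  int create() { return nextPid++; }

  // Creates a child. The child joins the job only if its parent is
  // already a member at the moment the child is created.
  int spawnChild(int parent) {
    assert(parent > 0 && parent < nextPid);  // The parent must exist.
    int child = nextPid++;
    if (job.contains(parent)) {
      job.assign(child);
    }
    return child;
  }
};

// The process runs (and spawns a child) before it is assigned to the
// job, so the child escapes the job.
bool childCapturedWhenAssignedLate() {
  ToySystem sys;
  int root = sys.create();
  int child = sys.spawnChild(root);
  sys.job.assign(root);
  return sys.job.contains(child);
}

// The process is assigned while it is still suspended, so any child it
// spawns after being resumed is captured.
bool childCapturedWhenAssignedWhileSuspended() {
  ToySystem sys;
  int root = sys.create();           // Suspended: it cannot spawn yet.
  sys.job.assign(root);
  int child = sys.spawnChild(root);  // Only possible after resuming.
  return sys.job.contains(child);
}
```

In this model, `childCapturedWhenAssignedLate()` returns false and
`childCapturedWhenAssignedWhileSuspended()` returns true, which is the
reason the process has to be created suspended.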

This commit accomplishes this by changing `PosixLauncher::fork` to
use the Subprocess parent hooks API, which implements exactly these
semantics. Essentially, the launcher will launch the containerizer
process, which will inspect the TaskInfo or the environment for a task
to launch, and then launch it. Using the parent hooks API, Subprocess
will create the containerizer process on Windows in a suspended state,
and then the parent hook supplied by the launcher will add that process
to a Job Object before it has a chance to run. Finally, Subprocess will
mark the process as runnable, and return.
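
The sequence described above (create suspended, run the parent hooks,
then resume) can be sketched as follows. This is a simplified model and
not the actual Subprocess implementation: `ToyProcess`, `ToyParentHook`,
and `launchWithHooks` are hypothetical stand-ins introduced only to show
the order of the steps.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are real Mesos or libprocess types.
enum class State { Suspended, Running };

struct ToyProcess {
  int pid;
  State state;
};

// A parent hook runs in the parent, against a child that has been
// created but has not started running yet.
using ToyParentHook = std::function<void(const ToyProcess&)>;

// Creates the process suspended, runs every parent hook, and only then
// marks the process as running. Each step is recorded in `log`.
ToyProcess launchWithHooks(
    int pid,
    const std::vector<ToyParentHook>& hooks,
    std::vector<std::string>& log) {
  ToyProcess process{pid, State::Suspended};
  log.push_back("created suspended");

  for (const ToyParentHook& hook : hooks) {
    // The child must not have run before any hook sees it.
    assert(process.state == State::Suspended);
    hook(process);
  }

  process.state = State::Running;
  log.push_back("resumed");
  return process;
}
```

A launcher in this model would pass a single hook that adds the process
to its job. The hook always observes the process in the suspended state,
so the process cannot create children before it has been added.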

This commit resolves MESOS-6839. It also lights up the executor tests;
together, these changes address the other bugs listed above, MESOS-6698
and MESOS-6870.


Diffs (updated)
-----

  src/slave/containerizer/mesos/launcher.cpp a6a8c01cb39f35f8174fcb5af0ef18de2da5ee78 
  src/tests/command_executor_tests.cpp 4d5c21ec427ebaac053e56ae554cb466dfeb0b8b 
  src/tests/default_executor_tests.cpp ec3e854ed58a0fbb3bfad0bd21eb0e2974548865 

Diff: https://reviews.apache.org/r/55313/diff/


Testing
-------


Thanks,

Alex Clemmer

