mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 54183: Improved management of unreachable and completed tasks in master.
Date Wed, 18 Jan 2017 19:18:43 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54183/
-----------------------------------------------------------

(Updated Jan. 18, 2017, 7:18 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Remove helper functions.


Bugs: MESOS-6619
    https://issues.apache.org/jira/browse/MESOS-6619


Repository: mesos


Description
-------

Before partition-awareness, when an agent failed health checks, the
master removed the agent from the registry, marked all of its tasks
TASK_LOST, and moved them to the `completedTasks` list in the master's
memory. Although "lost" tasks might still be running, partitioned agents
would only be allowed to re-register if the master failed over, in which
case the `completedTasks` map would be emptied.

When partition-awareness was introduced, we initially followed the same
scheme, with the only difference that partition-aware tasks are marked
TASK_UNREACHABLE, not TASK_LOST.

This scheme has a few shortcomings. First, partition-aware tasks might
resume running when the partitioned agent re-registers. Second, we
re-added non-partition aware tasks when the agent re-registered but then
marked them completed when the framework is shutdown, resulting in two
entries in `completedTasks`.

This commit introduces a separate bounded map, `unreachableTasks`. These
tasks are reported separately via the HTTP endpoints, because they have
different semantics (unlike completed tasks, unreachable tasks can
resume running). The size of this map is limited by a new master flag,
`--max_unreachable_tasks_per_framework`. This commit also changes the
master to omit re-adding non-partition-aware tasks on re-registering
agents (unless the master has failed over): those tasks will shortly be
shutdown anyway.

Finally, this commit fixes a minor bug in the previous code: the
previous coding neglected to shutdown non-partition-aware frameworks
running on pre-1.0 Mesos agents that re-register with the master after
a network partition.


Diffs (updated)
-----

  docs/configuration.md 2113d06f58dddc0a28ae1241a24096266fe39801 
  include/mesos/master/master.proto 03203c8ddd05876339b1c56789b7bb140499d49d 
  include/mesos/v1/master/master.proto f8edf39a68752c8601cece345f52bce8b9a8a68b 
  src/master/constants.hpp 900b69460de9280f30b16b86925e436054e080ca 
  src/master/flags.hpp 6a17b763dc76daa10073394f416b049e97a44238 
  src/master/flags.cpp 124bfbca6be9d5d60597b9a1793bd807202da8ec 
  src/master/http.cpp a44621f39cb059e654a56f57f75b38947f3a4230 
  src/master/master.hpp 8e8a9037af11cf95961b6498540a0fd486ed091b 
  src/master/master.cpp 73159328ce3fd838e02eba0e6a30cf69efc319ba 
  src/tests/partition_tests.cpp 72013d1bfee275c6f3cb90173f0c408d55e0bc5d 

Diff: https://reviews.apache.org/r/54183/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message