mesos-reviews mailing list archives

From Neil Conway
Subject Re: Review Request 54183: Improved management of unreachable and completed tasks in master.
Date Fri, 02 Dec 2016 00:24:53 GMT

This is an automatically generated e-mail. To reply, visit:

(Updated Dec. 2, 2016, 12:24 a.m.)

Review request for mesos and Vinod Kone.


Fix a few places where we neglected to check `unreachableTasks`; improve tests.

Bugs: MESOS-6619

Repository: mesos


Before partition-awareness, when an agent failed health checks, the
master removed the agent from the registry, marked all of its tasks
TASK_LOST, and moved them to the `completedTasks` list in the master's
memory. Although "lost" tasks might still be running, partitioned agents
would only be allowed to re-register if the master failed over, in which
case the `completedTasks` list would be emptied.

When partition-awareness was introduced, we initially followed the same
scheme, with the only difference that partition-aware tasks are marked
TASK_UNREACHABLE rather than TASK_LOST.

This scheme has a few shortcomings. First, partition-aware tasks might
resume running when the partitioned agent re-registers, so listing them
as completed is misleading. Second, we re-added non-partition-aware
tasks when the agent re-registered but then marked them completed when
the framework was shut down, resulting in two entries in
`completedTasks`.
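
To make the second shortcoming concrete, here is a toy model of the old
bookkeeping. It is only a sketch: `ToyFramework`, its methods, and
`completedCount` are hypothetical names invented for illustration and
are not Mesos code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of per-framework task bookkeeping under the
// old scheme. Tasks are represented by their IDs only.
struct ToyFramework
{
  std::vector<std::string> runningTasks;
  std::vector<std::string> completedTasks;

  // The agent failed health checks: the task is marked lost and moved
  // from the running list to `completedTasks`.
  void markLost(const std::string& taskId)
  {
    runningTasks.erase(
        std::remove(runningTasks.begin(), runningTasks.end(), taskId),
        runningTasks.end());
    completedTasks.push_back(taskId);
  }

  // The agent re-registered: the task is re-added as running, but the
  // earlier entry in `completedTasks` is left in place.
  void reAdd(const std::string& taskId)
  {
    runningTasks.push_back(taskId);
  }

  // The framework is shut down: every running task is marked completed.
  void shutdown()
  {
    for (const std::string& taskId : runningTasks) {
      completedTasks.push_back(taskId);
    }
    runningTasks.clear();
  }
};

// Counts how many entries for `taskId` appear in `completedTasks`.
std::size_t completedCount(const ToyFramework& f, const std::string& taskId)
{
  return static_cast<std::size_t>(
      std::count(f.completedTasks.begin(), f.completedTasks.end(), taskId));
}
```

Calling `markLost`, then `reAdd`, then `shutdown` for the same task
leaves that task in `completedTasks` twice in this model.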

This commit introduces a separate bounded map, `unreachableTasks`. These
tasks are reported separately via the HTTP endpoints, because they have
different semantics (unlike completed tasks, unreachable tasks can
resume running). The size of this map is limited by a new master flag,
`--max_unreachable_tasks_per_framework`. This commit also changes the
master so that it no longer re-adds non-partition-aware tasks when an
agent re-registers (unless the master has failed over): those tasks
will shortly be shut down anyway.
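
As a rough illustration of what a bounded map might look like, the C++
sketch below evicts the oldest entry once a fixed capacity is reached.
The class name `BoundedMap`, its methods, and the oldest-first eviction
policy are assumptions made for illustration; they are not taken from
the patch.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a bounded, insertion-ordered map. This is not
// the data structure used by the patch; names are illustrative only.
template <typename K, typename V>
class BoundedMap
{
public:
  explicit BoundedMap(std::size_t capacity) : capacity_(capacity) {}

  // Inserts or overwrites `key`. When inserting a new key into a full
  // map, the entry that was inserted earliest is evicted first.
  void set(const K& key, const V& value)
  {
    if (capacity_ == 0) {
      return; // A zero-capacity map stores nothing.
    }

    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = value; // Overwrite; keep original position.
      return;
    }

    if (entries_.size() == capacity_) {
      index_.erase(entries_.front().first);
      entries_.pop_front();
    }

    entries_.emplace_back(key, value);
    index_[key] = std::prev(entries_.end());
  }

  bool contains(const K& key) const { return index_.count(key) > 0; }

  // Removes `key` if present, e.g. when a task becomes reachable again.
  bool erase(const K& key)
  {
    auto it = index_.find(key);
    if (it == index_.end()) {
      return false;
    }
    entries_.erase(it->second);
    index_.erase(it);
    return true;
  }

  std::size_t size() const { return entries_.size(); }

private:
  using Entries = std::list<std::pair<K, V>>;

  std::size_t capacity_;
  Entries entries_;                                        // Oldest first.
  std::unordered_map<K, typename Entries::iterator> index_; // Key lookup.
};
```

With a capacity of 2, setting three distinct task IDs in this sketch
drops the first one inserted, which is the kind of bound a per-framework
limit would impose.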

Finally, this commit fixes a minor bug: the previous code neglected to
shut down non-partition-aware frameworks running on pre-1.0 Mesos
agents that re-register with the master after a network partition.

Diffs (updated)

  docs/ efe3e9bd9d203a7ba44adf4ead24f14b8b577637 
  include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
  include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
  src/master/constants.hpp 5dd0667f62d2c0617cc0d5aed8cc005bd8344c88 
  src/master/flags.hpp 6a17b763dc76daa10073394f416b049e97a44238 
  src/master/flags.cpp 9bfb40e22820c3ced40b128280ec63288fea8b41 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp 877ca9010d0d6efc97f3d71fbd27272a255409d0 
  src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 



`make check`


Neil Conway
