mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Conway <neil.con...@gmail.com>
Subject Re: Review Request 54183: Improved management of unreachable and completed tasks in master.
Date Wed, 11 Jan 2017 00:10:18 GMT


> On Jan. 3, 2017, 10:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5507
> > <https://reviews.apache.org/r/54183/diff/6/?file=1588710#file1588710line5507>
> >
> >     can we inline this?
> 
> Neil Conway wrote:
>     We could, but personally I find it more readable to make it a separate function.
The logic is a bit involved here (since we need to check two different data structures due
to backward compatibility requirements), and there's already a lot going on in `Master::_reregisterSlave`.
> 
> Vinod Kone wrote:
>     In general, we tend to factor out something into a function if it is reusable and
generic, not to reduce the number of lines someone has to read in a function. In fact the
argument was to inline in such cases because it will avoid the need to context switch. 
>     
>     Both `findPartitonAwareFrameworks` and `filterPartitionTasks` seem very specific
to the `_reregisterSlave` method, hence the suggestion.
>     
>     For example, instead of looping through all the agent's tasks 3 times to figure out
which tasks to add to `Slave*`, it's probably easier to do it in one inline loop here?

My goal isn't to reduce the # of lines to read the function, but to raise the level of abstraction
a little bit. For example, `findPartitonAwareFrameworks` has to check _two_ data structures,
depending on both agent version and whether the master has failed over. That is quite strange
and takes a bit of thought to validate. With a separate function, the reader of `_reregisterSlave`
doesn't need to think about those details if they don't want to -- instead they can just think
"okay, we find all the partition-aware frameworks on the agent and then ..."

I'm fine with looping over the agent's tasks three times, because each of the three things
we want to do is logically independent -- it takes a bit of thought to validate whether it
is actually correct to combine them into a single loop (because each loop iteration would
access a collection that is modified by the loop itself. It would actually be okay, but IMO
current approach is safer).


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54183/#review160019
-----------------------------------------------------------


On Jan. 9, 2017, 10:39 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54183/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 10:39 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6619
>     https://issues.apache.org/jira/browse/MESOS-6619
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Before partition-awareness, when an agent failed health checks, the
> master removed the agent from the registry, marked all of its tasks
> TASK_LOST, and moved them to the `completedTasks` list in the master's
> memory. Although "lost" tasks might still be running, partitioned agents
> would only be allowed to re-register if the master failed over, in which
> case the `completedTasks` map would be emptied.
> 
> When partition-awareness was introduced, we initially followed the same
> scheme, with the only difference that partition-aware tasks are marked
> TASK_UNREACHABLE, not TASK_LOST.
> 
> This scheme has a few shortcomings. First, partition-aware tasks might
> resume running when the partitioned agent re-registers. Second, we
> re-added non-partition aware tasks when the agent re-registered but then
> marked them completed when the framework is shutdown, resulting in two
> entries in `completedTasks`.
> 
> This commit introduces a separate bounded map, `unreachableTasks`. These
> tasks are reported separately via the HTTP endpoints, because they have
> different semantics (unlike completed tasks, unreachable tasks can
> resume running). The size of this map is limited by a new master flag,
> `--max_unreachable_tasks_per_framework`. This commit also changes the
> master to omit re-adding non-partition-aware tasks on re-registering
> agents (unless the master has failed over): those tasks will shortly be
> shutdown anyway.
> 
> Finally, this commit fixes a minor bug in the previous code: the
> previous coding neglected to shutdown non-partition-aware frameworks
> running on pre-1.0 Mesos agents that re-register with the master after
> a network partition.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 2113d06f58dddc0a28ae1241a24096266fe39801 
>   include/mesos/master/master.proto 03203c8ddd05876339b1c56789b7bb140499d49d 
>   include/mesos/v1/master/master.proto f8edf39a68752c8601cece345f52bce8b9a8a68b 
>   src/master/constants.hpp 900b69460de9280f30b16b86925e436054e080ca 
>   src/master/flags.hpp 6a17b763dc76daa10073394f416b049e97a44238 
>   src/master/flags.cpp 124bfbca6be9d5d60597b9a1793bd807202da8ec 
>   src/master/http.cpp 75dcd6199dbfcca6260baed49b838a45312bb667 
>   src/master/master.hpp 368ee1d5e97784fa54e0f141906405ee8f104317 
>   src/master/master.cpp 0c2210bbcd0622f704497daf78fe959cde99ff7f 
>   src/tests/partition_tests.cpp 72013d1bfee275c6f3cb90173f0c408d55e0bc5d 
> 
> Diff: https://reviews.apache.org/r/54183/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message