mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Mann <g...@mesosphere.io>
Subject Re: Review Request 70335: Sent FrameworkInfo to agents when applying operations.
Date Tue, 02 Apr 2019 22:09:42 GMT


> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.hpp
> > Lines 889 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136016#file2136016line889>
> >
> >     Why do we need this hashmap? I see that the code inserts/removes entries, but
nothing ever gets anything from it.

Yea we can nuke this.


> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.cpp
> > Lines 4439-4443 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136017#file2136017line4439>
> >
> >     Should we also unschedule the garbage collection of the framework work directory?
> >     
> >     It gets scheduled for GC here: https://github.com/apache/mesos/blob/1acb38c27306326a53f866b5386b5e28a6dc9314/src/slave/slave.cpp#L6834-L6838

My thought was that we don't need to unschedule the work dir because, in the case of non-launch
operations, we won't be creating any executor/task directories in the framework's work dir.
However, the following sequence seems possible:
1) All of a framework's tasks on an agent go terminal, so the framework is removed and its
meta/work dirs are scheduled for GC
2) The framework submits a non-launch operation on this agent, and we create a new framework
in this patch but do NOT unschedule GC on the framework's work dir
3) The framework launches a task on this agent, we create a new executor dir in it's work
dir, and then later on the framework's work dir is incorrectly GC'd because we never unscheduled
it in #2

So, I think you're right. We should unschedule GC on the framework's work dir.


> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.cpp
> > Lines 4459 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136017#file2136017line4459>
> >
> >     I noticed that `Slave::run()` only checkpoints the framework if `frameworkInfo.checkpoint()`
returns `true`.
> >     
> >     Should we maybe not checkpoint operations/frameworkInfos of frameworks that
set `checkpoint` to `false`?

I think the case of operations is different. Since non-launch operations are not coupled to
the framework lifecycle the way tasks are, we want to retain the FrameworkInfo even if the
framework does not have checkpointing enabled. For example, a framework could submit a CREATE_DISK
operation which takes a long time, then immediately tear itself down. If checkpointing is
not enabled, we would still want to have the FrameworkInfo checkpointed so that if the master
fails over and the agent reregisters, we can create the Framework struct in the master and
track the operation.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70335/#review214247
-----------------------------------------------------------


On March 28, 2019, 11:27 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70335/
> -----------------------------------------------------------
> 
> (Updated March 28, 2019, 11:27 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, Chun-Hung Hsiao, Gastón Kleiman,
Joseph Wu, and Meng Zhu.
> 
> 
> Bugs: MESOS-8582
>     https://issues.apache.org/jira/browse/MESOS-8582
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch updates the master to send a framework's full
> `FrameworkInfo` to the agent in the `ApplyOperationMessage`.
> The agent is updated to checkpoint frameworks when applying
> operations, and to unschedule GC on the meta directory when
> a new framework is created.
> 
> The test `TerminalOrphanOperationAfterMasterFailover` is
> removed since this patch eliminates the case of orphan
> operations relevant to that test.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp acc67d3763ddee9027e6cf375f1d495ff5805026 
>   src/messages/messages.proto e30ad34cc9212b05f85ba5e1d4fcfc9e49ae92c0 
>   src/slave/slave.hpp 2bffdc4b1e282d3c6dae2dcf23584a8a4550bf94 
>   src/slave/slave.cpp 5373cee5d30c2403497939eeba2ee5405117237e 
>   src/tests/persistent_volume_tests.cpp 7e929a5a3a92e16a5dec10206f37caebc20d66a8 
>   src/tests/reservation_tests.cpp cd84cd24d3587fafc01ae1861f22c47262f2d7e9 
>   src/tests/storage_local_resource_provider_tests.cpp 55c9389453754227de31e0a76d32ba663cc8ca7c

> 
> 
> Diff: https://reviews.apache.org/r/70335/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message