mesos-reviews mailing list archives

From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.
Date Fri, 01 Mar 2019 22:58:03 GMT


> On March 1, 2019, 10:15 p.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Line 8777 (original), 8785 (patched)
> > <https://reviews.apache.org/r/70081/diff/2/?file=2128176#file2128176line8801>
> >
> >     We potentially publish to multiple resource providers here, and each could fail
> >     independently. Do we need to perform any cleanup if we were able to publish to all
> >     but one RP? By using `collect` (which is dictated by the return type) we seem to make
> >     it hard for the caller to perform such cleanup, but I am unsure we can perform the
> >     cleanup here (it might also fail).
> >     
> >     Am I missing something that makes this a non-issue, or would we need to implement
> >     the above `TODO` to make this work?

Resource allocation lifecycles are tied to task lifecycles: as soon as a task completes,
the master immediately offers out any resource provider resources previously used by the
task. Doing cleanups asynchronously (I assume you're talking about things like
`NodeUnpublishVolume`) would therefore be very complicated, as they might race with another
`PUBLISH_RESOURCES`. Instead, all cleanups are done *lazily*, only when necessary. This is
also related to why we cannot use diff-based resource publishing.

Moreover, we currently don't do any `NodeUnpublishVolume` until the disk is destroyed. A subsequent
task would just bind-mount the already-published volume into its sandbox.

So the bottom line is that we intentionally make `PUBLISH_RESOURCES` an operation that requires
no cleanup.
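
To illustrate the idea, here is a minimal sketch. It is not the actual Mesos code: `Resource`, `Provider`, and this `publishResources` are simplified, hypothetical stand-ins. It assumes that publishing is idempotent and that nothing is unpublished on failure, so a partial failure leaves nothing to roll back, and only the providers of the task's own resources are asked to publish.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a resource; not the Mesos type.
struct Resource {
  std::string providerId;  // Empty for agent-default resources.
  std::string volumeId;
};

// Hypothetical stand-in that tracks what one provider has published.
struct Provider {
  bool healthy = true;
  std::set<std::string> published;

  // Idempotent: publishing an already-published volume is a no-op.
  // Nothing is unpublished here, so a failure needs no rollback.
  bool publish(const std::vector<std::string>& volumes) {
    if (!healthy) {
      return false;
    }
    published.insert(volumes.begin(), volumes.end());
    return true;
  }
};

// Publishes only through providers that own some of the task's resources.
// Returns false if any of those providers is unknown or fails. Volumes
// already published by other providers stay in place for later tasks.
bool publishResources(
    std::map<std::string, Provider>& providers,
    const std::vector<Resource>& taskResources) {
  // Group the task's volumes by the provider that owns them.
  std::map<std::string, std::vector<std::string>> byProvider;
  for (const Resource& resource : taskResources) {
    if (!resource.providerId.empty()) {
      byProvider[resource.providerId].push_back(resource.volumeId);
    }
  }

  bool ok = true;
  for (const auto& entry : byProvider) {
    auto it = providers.find(entry.first);
    if (it == providers.end() || !it->second.publish(entry.second)) {
      ok = false;
    }
  }
  return ok;
}
```

In this sketch, a task that only uses resources from a healthy provider succeeds even when another provider is down, and a failed call leaves the already-published volumes untouched for a subsequent task to reuse.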


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213337
-----------------------------------------------------------


On March 1, 2019, 8:11 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 8:11 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks the providers of
> the task's resources to publish those resources, so a failed resource
> provider would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

