mesos-reviews mailing list archives

From Jie Yu <yujie....@gmail.com>
Subject Re: Review Request 61946: Added validation of resource provider operations.
Date Wed, 06 Sep 2017 21:41:47 GMT


> On Aug. 28, 2017, 9:31 p.m., Jie Yu wrote:
> > src/master/validation.cpp
> > Lines 2205 (patched)
> > <https://reviews.apache.org/r/61946/diff/1/?file=1806110#file1806110line2205>
> >
> >     I think `checkpointedResources` should not be used for Resource Provider provided
resources. It should only apply to agent default resources. The checkpointing should be done
by the corresponding resource provider, not the agent for RP provided resources.
> >     
> >     As a result, for operations like RESERVE/UNRESERVE/CREATE/DESTROY, we need to
send the operation to the corresponding resource provider as well. This does make sense. If we
ask the agent to persist that information, what will be the semantics if the resource provider
is marked as gone?
> >     
> >     However, this does get complicated if we want to guarantee ordering for operations
in one `acceptOffers` call (for backwards compatibility), and we do want to allow frameworks
to launch a task right after reserve operation (the current semantics).
> >     
> >     To support that, I think we need to speculatively assume the operation will
be successful (thus allowing a subsequent launch immediately at the master side). However, when
the checkpointing fails, we need a way to abort the subsequent launch at the agent side. This
is essentially why we CHECK fail if the checkpointing fails at the agent previously for `checkpointedResources`.
> >     
> >     For the resource provider case, we should do the same thing. We can abort the
agent if a checkpointing fails. However, this only applies to the local resource provider
that lives in the agent process. If an LRP is outside of the agent process, how to abort the
subsequent task launch if a previous operation fails is something we should think about. For
instance, always reject operations from the agent's RP manager if the operation is for a stale
stream ID?
> 
> Jan Schlicht wrote:
>     Fully agreed, thanks for bringing up the challenges with handling `RESERVE`/`UNRESERVE`/`CREATE`/`DESTROY`
with local and external resource providers. An idea for solving this with external resource
providers could be to rescind a launch, similar to how we rescind offers. E.g. an ERP would
send a rescind message to the master which then instructs the agent to stop the launch.

Chatted with bmahler and chun on this. Here is what I would propose to handle legacy operations:
1) Speculatively assume at the master side that the legacy operation will succeed, meaning
that the master applies the operation immediately.
2) Always perform a `contains` check when the agent receives a task (task_group). Currently,
we only do a `contains` check for checkpointed resources (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2199-L2215).
This is not sufficient when an RP or resource estimator updates the agent's total
resources, so we should extend that `contains` check to all of the agent's resources.
3) Because of the check in 2), if the speculative operation fails, the subsequent task launch
will fail. We should then trigger an agent re-registration so that the master can reconcile with
the actual agent state and correct the failed speculation.
4) Similar to 2), we should perform a `contains` check for operations too. Think about `RESERVE`
followed by `UNRESERVE`: how do we fail the `UNRESERVE` if the `RESERVE` failed? Right now, we rely
on the agent aborting. I am not sure if it's possible to infer the *source* of the operation from
`CheckpointResourcesMessage`. That makes me think we should probably add an agent capability
for resource-provider-capable agents, so that the master sends the agent an explicit message
about an operation instead of just a `CheckpointResourcesMessage`.
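
A minimal sketch of the agent-side check in 2) and 3), under loud assumptions: `Quantities`,
`contains`, `handleLaunch` and `LaunchResult` are hypothetical stand-ins made up for illustration,
not the Mesos `Resources` API and not the code in slave.cpp. Resources are reduced to scalar
amounts keyed by a "role/name" string.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the agent's resources: scalar
// amounts keyed by "role/name" (e.g. "role/cpus"). Not the Mesos type.
using Quantities = std::map<std::string, double>;

// Returns true if `total` holds at least the requested amount of every
// entry in `requested`.
bool contains(const Quantities& total, const Quantities& requested)
{
  for (const auto& entry : requested) {
    auto it = total.find(entry.first);
    if (it == total.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}

// Hypothetical outcome of an agent-side launch attempt.
enum class LaunchResult
{
  LAUNCHED,
  REJECTED_NEEDS_REREGISTRATION
};

// Step 2): check the task's resources against ALL of the agent's
// resources, not only the checkpointed ones.
// Step 3): if the check fails, the master's speculation was wrong, so
// reject the launch and signal that a re-registration is needed.
LaunchResult handleLaunch(
    const Quantities& agentTotal,
    const Quantities& taskResources)
{
  if (!contains(agentTotal, taskResources)) {
    return LaunchResult::REJECTED_NEEDS_REREGISTRATION;
  }
  return LaunchResult::LAUNCHED;
}
```

In this sketch, if a speculative `RESERVE` never took effect, the agent total still holds only
unreserved resources, so a task asking for the reserved resources is rejected instead of launched.
The same `contains` helper could guard operations as in 4), e.g. failing an `UNRESERVE` whose
reserved resources are not in the agent total.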


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61946/#review183988
-----------------------------------------------------------


On Sept. 1, 2017, 10:23 a.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61946/
> -----------------------------------------------------------
> 
> (Updated Sept. 1, 2017, 10:23 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier and Jie Yu.
> 
> 
> Bugs: MESOS-7594
>     https://issues.apache.org/jira/browse/MESOS-7594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added validation of resource provider operations.
> 
> 
> Diffs
> -----
> 
>   src/master/validation.hpp f4925752f20ae8ca4de1d9b4a3d5ffc394db9585 
>   src/master/validation.cpp 7c3247d407c9e6aa8cce457d6c6be0c39f4b532f 
> 
> 
> Diff: https://reviews.apache.org/r/61946/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>

