mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Bannier <bbann...@apache.org>
Subject Re: Review Request 63732: Reconciled offer operations between agent and master.
Date Tue, 21 Nov 2017 21:11:11 GMT


> On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote:
> > src/master/master.cpp
> > Lines 7107-7121 (patched)
> > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7107>
> >
> >     Do you need to also update allocator for added or removed new operations?
> >     
> >     For instance, the allocator currently think the new operation A uses 2cpus.
Now, if A is removed (because it's dropped), do we need to tell the allocator that the 2cpus
are no longer used and they can be allocated to others?

Yes, this was missing. I added a method to inform the allocator of allocations after the fact
and adjusted this patch.


> On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote:
> > src/master/master.cpp
> > Lines 7108 (patched)
> > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7108>
> >
> >     Think about the case where agent crashes and restarts, not all RP has re-registered
yet. In that case, some operation from some not yet re-registered RP will not be part of this
operation list.
> >     
> >     I don't think we want to remove those operations just yet. I think we should
remove those operations only if the corresponding RP has re-registered with the agent and
show up in the offer operation list (or total resources).
> >     
> >     This is similar to we don't remove tasks when agent disconnects. In fact, you
should follow the similar patter in reconcileKnownSlave here. If an operation is unknown,
instead of calling `removeOfferOperation` directly, we should probably send a `reconcileOfferOperationMessage`
to the agent to ask the RP to generate a status update, and rely on status update handler
to properly handle the resource accounting.
> >     
> >     Also realized that we probably should also do the same in the agent code. Instead
of directly calling removeOfferOperation and addOfferOperation, send a `RECONCILE` message
to RP to asking the RP to generate a status udpate. If it's unknown to the RP, RP will send
OFFER_OPERATION_DROPPED, which is terminal.
> 
> Benjamin Bannier wrote:
>     I am not sure it should be the master's responsibility to keep these operations alive
since it e.g., has no visibility into whether or when a resource provider might come back.
It seems to me that we should instead let the agent decide when to prune operations from removed
resource providers from the list of operations processed here. While it currently would not
send operations for not yet resubscribed RPs after agent failover, these would appear together
with the RP resources once it resubscribes, so I think the difference here is purely semantic.
For the case of RP failover the agent will send operations from disconnected RPs with the
list processed here.
>     
>     In the end I feel the scenario of agent failover's task analogue is rather a master
failover with some agent not yet registered. There we would not be able to provide information
about tasks running on disconnected agents.

As we discussed offline, we will now only remove disappeared operations in the master if the
resource provider is registered with the agent (we infer that from the total resources the
agent reports). I adjusted the patch accordingly.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63732/#review191135
-----------------------------------------------------------


On Nov. 21, 2017, 10:09 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63732/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 10:09 p.m.)
> 
> 
> Review request for mesos, Jie Yu and Jan Schlicht.
> 
> 
> Bugs: MESOS-8207
>     https://issues.apache.org/jira/browse/MESOS-8207
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 7417b5d641fd4bb6d91cb0e6456c60201bbc8206 
> 
> 
> Diff: https://reviews.apache.org/r/63732/diff/4/
> 
> 
> Testing
> -------
> 
> `make check`, tested as part of https://reviews.apache.org/r/63843/.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message