mesos-reviews mailing list archives

From Joseph Wu <jos...@mesosphere.io>
Subject Re: Review Request 70014: Removed operations when removing resource providers.
Date Fri, 22 Feb 2019 00:08:04 GMT


> On Feb. 20, 2019, 5:09 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 8372-8387 (patched)
> > <https://reviews.apache.org/r/70014/diff/1/?file=2125873#file2125873line8372>
> >
> >     Do we recover the resources associated with non-terminal, non-speculative
> >     operations in this code path? This function uses `allocator->updateSlave()`
> >     to change the agent's total resources, but I don't think that affects the
> >     allocation. Will we end up "leaking" allocations in the allocator here?
> >     Maybe we should just use `Master::removeOperation()`?
> 
> Joseph Wu wrote:
>     The allocator updates the sorters by calling `sorter->remove(agent, oldtotal);`
>     and then `sorter->add(agent, newtotal);`, but these do not change the
>     allocation. So yes, this would leak allocations.
>     
>     Looks like I erroneously avoided `Master::removeOperation()` because that
>     method does not consider orphans.
>     ```
>       // If the operation was not speculated and is not terminal we
>       // need to also recover its used resources in the allocator.
>       if (!protobuf::isSpeculativeOperation(operation->info()) &&
>           !protobuf::isTerminalState(operation->latest_status().state())) {
>         Try<Resources> consumed = protobuf::getConsumedResources(operation->info());
>         CHECK_SOME(consumed);
>     
>         allocator->recoverResources(
>             operation->framework_id(),
>             operation->slave_id(),
>             consumed.get(),
>             None());
>       }
>     ```
>     
>     I need to extend the conditional so that we skip `allocator->recoverResources()`
>     when the operation is an orphan (since orphans are not tracked by the
>     allocator). Once that modification is done, `Master::removeOperation()`
>     should be usable in this location.

I've made part of this fix here:
https://reviews.apache.org/r/69962/diff/2-3/

The rest is in this review.


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70014/#review213003
-----------------------------------------------------------


On Feb. 21, 2019, 4:07 p.m., Joseph Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70014/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 4:07 p.m.)
> 
> 
> Review request for mesos, Benno Evers, Gastón Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-9542
>     https://issues.apache.org/jira/browse/MESOS-9542
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a resource provider explicitly disconnects from the agent, the
> agent will send an `UpdateSlaveMessage` to the master, telling the
> master to remove the resource provider.  If there are any operations
> associated with the resource provider, they must be removed too,
> because there is no way to make forward progress on resource provider
> operations without a resource provider.
> 
> This removes a potential memory leak in the master's Framework structs.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 106d924bf16231b3bda3fb719db68c01d73644ee 
> 
> 
> Diff: https://reviews.apache.org/r/70014/diff/2/
> 
> 
> Testing
> -------
> 
> Fixes a couple of (flaky) issues with
> `OperationReconciliationTest.AgentPendingOperationAfterMasterFailover`
> when combined with the rest of the orphan operation chain.
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>
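The review request description says that operations belonging to a removed resource provider must be dropped too; otherwise they stay in the master's `Framework` structs indefinitely. A minimal C++ sketch of that bookkeeping follows, assuming a hypothetical `ToyMaster` that indexes operations per framework by an operation ID. The names are illustrative, not Mesos APIs, and the sketch deliberately omits the agent-side and allocator-side cleanup discussed earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical record of an operation the master is tracking.
struct TrackedOperation {
  std::string resourceProviderId;  // provider the operation targets
};

// Hypothetical stand-in for the master's per-framework bookkeeping.
struct ToyMaster {
  // framework ID -> (operation ID -> operation)
  std::map<std::string, std::map<std::string, TrackedOperation>> frameworks;

  // Called when an agent reports that a resource provider is gone: drop
  // every operation that targeted it, since nothing can make progress on
  // those operations anymore. Returns the number of operations removed.
  std::size_t removeResourceProvider(const std::string& providerId) {
    std::size_t removed = 0;
    for (auto& framework : frameworks) {
      auto& operations = framework.second;
      for (auto it = operations.begin(); it != operations.end();) {
        if (it->second.resourceProviderId == providerId) {
          it = operations.erase(it);  // erase() returns the next iterator
          ++removed;
        } else {
          ++it;
        }
      }
    }
    return removed;
  }
};
```

Erasing through the iterator returned by `std::map::erase()` keeps the loop valid while entries are removed; operations on other providers are left untouched.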

