mesos-reviews mailing list archives

From Megha Sharma <mshar...@apple.com>
Subject Re: Review Request 61473: Do not kill non partition aware tasks.
Date Mon, 16 Oct 2017 09:01:34 GMT


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7188-7190 (original), 7144-7146 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line7188>
> >
> >     Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little different than elsewhere so I think this warrants a bit of explanation.
> >     
> >     e.g., 
> >     ```
> >     // Transition tasks to TASK_UNREACHABLE and remove (archive) them.
> >     // We convert the task state to TASK_LOST if the framework is not
> >     // partition aware. However, we only do the conversion right before
> >     // the status update is sent out or the task is archived, because the
> >     // processing prior to then requires tasks to be in the correct
> >     // state, TASK_UNREACHABLE.
> >     ```
> >     
> >     Does this sound right?

+1
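
A minimal sketch of the ordering that comment describes, using simplified stand-in types. `Task`, `Framework`, and `outwardState` are made up for illustration and are not the actual master.cpp code:

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for illustration only; not the real Mesos types.
enum class TaskState { TASK_RUNNING, TASK_LOST, TASK_UNREACHABLE };

struct Framework {
  bool partitionAware;  // Stand-in for the PARTITION_AWARE capability.
};

struct Task {
  std::string id;
  TaskState state;
};

// Hypothetical helper: returns the state to put into the outgoing status
// update or the archived task. The stored task is left untouched, so all
// earlier processing still sees TASK_UNREACHABLE.
TaskState outwardState(const Task& task, const Framework& framework)
{
  // The processing before this point relies on the unreachable state.
  assert(task.state == TaskState::TASK_UNREACHABLE);

  return framework.partitionAware ? TaskState::TASK_UNREACHABLE
                                  : TaskState::TASK_LOST;
}
```

The point is only that the conversion happens at the last step, when the update is built or the task is archived, rather than when the task is first transitioned.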


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 8989-8990 (original), 8945-8946 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line8994>
> >
> >     This is going to send `TASK_UNREACHABLE` to the operator API subscribers even for NPA framework tasks.
> >     
> >     We should probably be consistent and send `TASK_LOST`.

Right, I missed it. One way to solve it is to keep the state as TASK_LOST for NPA framework
tasks and change it to TASK_UNREACHABLE just before calling removeTask(), so that the task
goes into the unreachable tasks data structure.
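
Roughly, and only as a sketch under assumptions: `MasterSketch`, `markAgentUnreachable`, and the two maps below are hypothetical stand-ins, and this `removeTask()` is a simplified model that files a task by its state, not the real function in master.cpp.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for illustration only; not the real Mesos types.
enum class TaskState { TASK_RUNNING, TASK_LOST, TASK_UNREACHABLE };

struct Task {
  std::string id;
  TaskState state;
};

struct MasterSketch {
  // States that operator API subscribers (and the framework) get to see.
  std::vector<TaskState> sentToSubscribers;

  std::map<std::string, Task> unreachableTasks;
  std::map<std::string, Task> completedTasks;

  // Assumed behaviour: the task is filed according to its current state.
  void removeTask(const Task& task)
  {
    if (task.state == TaskState::TASK_UNREACHABLE) {
      unreachableTasks.insert({task.id, task});
    } else {
      completedTasks.insert({task.id, task});
    }
  }

  void markAgentUnreachable(Task task, bool partitionAware)
  {
    // NPA frameworks and subscribers consistently see TASK_LOST.
    task.state = partitionAware ? TaskState::TASK_UNREACHABLE
                                : TaskState::TASK_LOST;
    sentToSubscribers.push_back(task.state);

    // Flip to TASK_UNREACHABLE only right before removal, so the task
    // still ends up among the unreachable tasks.
    task.state = TaskState::TASK_UNREACHABLE;
    removeTask(task);
  }
};
```

With this ordering an NPA task is reported as TASK_LOST everywhere it is visible, while the master's bookkeeping still treats it as unreachable.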


- Megha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-----------------------------------------------------------


On Oct. 16, 2017, 8:59 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2017, 8:59 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master no longer kills the tasks of non-partition-aware frameworks
> when an unreachable agent re-registers with the master.
> The master used to send a ShutdownFrameworkMessage to the agent
> to kill the tasks of non-partition-aware frameworks, including
> frameworks that are still registered. This was problematic because offers
> from this agent could still go to the same framework, which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down, of course, so from the master
> and the scheduler’s perspective the task stays pending in STAGING
> until the next agent re-registration, which could happen much later.
> This commit fixes the problem by not shutting down non-partition-aware
> frameworks on such an agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/8/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

