mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
Date Fri, 05 Jan 2018 19:11:19 GMT


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10062>
> >
> >     Perhaps move all the logic around determining the unreachable task down here.
> >     
> >     ```
> >     // We need to inform `removeTask` whether the task became unreachable due to
the agent being removed. Note that terminal tasks are not considered unreachable. We can not
rely on the task's state because it may have been set to `TASK_LOST` for backwards-compatibility.
> >     bool unreachable =
> >       unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
> >     ```
> >     
> >     Inlining `unreachable` is probably fine too but separating it helps clarify
it with comments?
> 
> James Peach wrote:
>     You have to sample the task state before the update because it might transition to
a terminal state at that time (and LOST is terminal).

Right, so put the line above `updateTask`? The main thing I am commenting on is that we can
consolidate the logic without first defining `bool unreachable = true;` much far above?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7637 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7637>
> >
> >     You could just wait for the the status update (being droppped) instead of capaturing
`Slave::executorTerminated` and risk the race condition of the master not receiving the status
update before the agent is deemed partitioned?
> 
> James Peach wrote:
>     Dealing with all the status updates was less reliable.

Do you see the race condition I was trying to point out?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7686 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7686>
> >
> >     No need to settle?
> 
> James Peach wrote:
>     This is settling to ensure that events following from marking the agent down are
fully processed.

Oh ok that's right.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
-----------------------------------------------------------


On Jan. 5, 2018, 11:06 a.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
> 
> (Updated Jan. 5, 2018, 11:06 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod Kone, and Jiang
Yan Xu.
> 
> 
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> -------
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message