mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
Date Fri, 12 Jan 2018 08:52:14 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195241
-----------------------------------------------------------



LGTM modulo Vinod's comments. Sorry we have gone back and forth on this but this time I think
we are close to reach what's shippable to unblock 1.5.


src/master/http.cpp
Lines 360-362 (original)
<https://reviews.apache.org/r/64940/#comment274470>

    So, as reminded by Vinod, for backward compatibility let's keep the checks. However here
we should change the condition to
    
    ```
    if (task->state() == TASK_UNREACHABLE) {
      continue;
    }
    ```
    
    because other terminal tasks that are not TASK_LOST could also be in this map...



src/master/master.hpp
Lines 2326-2327 (original)
<https://reviews.apache.org/r/64940/#comment274417>

    The following CHECK looks correct?
    
    ```
    CHECK(Master::isRemovable(task->state()) << task->state();
    ```
    
    Although failing CHECKs are annoying, it helped us uncover the inconsistency in our task
state transitions so it's better to correct it?



src/master/master.cpp
Lines 11023-11036 (original), 11021-11034 (patched)
<https://reviews.apache.org/r/64940/#comment274415>

    I was going to suggest that we remove the `if (task->state() == TASK_UNREACHABLE)`
check but per Vinod's comment we shouldn't.



src/tests/mesos.hpp
Lines 3511 (patched)
<https://reviews.apache.org/r/64940/#comment274471>

    s/a/an/?



src/tests/mesos.hpp
Line 3511 (original), 3518 (patched)
<https://reviews.apache.org/r/64940/#comment274472>

    s/a/an/?



src/tests/partition_tests.cpp
Lines 2401 (patched)
<https://reviews.apache.org/r/64940/#comment274477>

    We have to s/unreachable/completed/ and s/but/and/ ?



src/tests/partition_tests.cpp
Lines 2402 (patched)
<https://reviews.apache.org/r/64940/#comment274478>

    With all the discussions about teminology it seems  s/Completed/Terminal/ is more accurate?



src/tests/partition_tests.cpp
Lines 2440-2442 (patched)
<https://reviews.apache.org/r/64940/#comment274473>

    Nit: declare them in the order we are expecting on them? i.e., starting -> running
-> finished.



src/tests/partition_tests.cpp
Lines 2474-2475 (patched)
<https://reviews.apache.org/r/64940/#comment274474>

    Without a paused clock, there is a chance of race here right? If the same status update
is resent after a timeout before we acknowledge, it's possible that the resent update is going
to be caught by the 2nd expectation?
    
    Would it be possible to hoist the Clock::pause() statement?



src/tests/partition_tests.cpp
Lines 2527 (patched)
<https://reviews.apache.org/r/64940/#comment274475>

    Remove this debugging log.



src/tests/partition_tests.cpp
Lines 2532-2534 (patched)
<https://reviews.apache.org/r/64940/#comment274476>

    Sorry now you have to expect it to be in the completed_task section...


- Jiang Yan Xu


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod Kone, and Jiang
Yan Xu.
> 
> 
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> -------
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message