mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 70368: Initialized resource provider manager earlier when recovering.
Date Wed, 03 Apr 2019 05:01:35 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70368/#review214312
-----------------------------------------------------------



The fix looks good but the test could be flaky.

BTW although I know you prefer having the unit test and the fix together in one commit, it
seems to me that splitting them would make it easier to apply the test alone to verify that
the test make MESOS-9667 manifest ;)


src/tests/slave_tests.cpp
Lines 11570 (patched)
<https://reviews.apache.org/r/70368/#comment300522>

    Does a driver-based (v0) executor have the same problem? Or is there other consideration
for using an HTTP executor?



src/tests/slave_tests.cpp
Lines 11583 (patched)
<https://reviews.apache.org/r/70368/#comment300521>

    I can run the test without fixing the agent ID. If this is needed can you provide a reason?



src/tests/slave_tests.cpp
Lines 11681 (patched)
<https://reviews.apache.org/r/70368/#comment300519>

    This can be
    ```
    ASSERT_FALSE(resources.filter(&v1::Resources::hasResourceProvider).empty())
    ```
    Or if you prefer using `any_of`:
    ```
    ASSERT_TRUE(std::any_of(
        resources.begin(), resources.end(), &v1::Resources::hasResourcePRovider))
    ```



src/tests/slave_tests.cpp
Lines 11759 (patched)
<https://reviews.apache.org/r/70368/#comment300520>

    It seems that this `TASK_RUNNING` is just a coincident. In one of my run, after the agent
restarted, I saw:
    ```
    E0403 04:41:31.329046 32563 slave.cpp:3118] Failed to update resources for container 1dc3370c-875b-4a36-a84d-971b5a517b73
of executor 'default' of framework f23d7633-9a52-4263-b6a6-c03fca6d8b98-0000, destroying container:
Collect failed: Failed to publish resources 'disk(allocated: foo)[RAW]:200' for container
1dc3370c-875b-4a36-a84d-971b5a517b73: Resource provider 3eabc11c-5300-4831-ac4b-1f4d300ae877
is not subscribed
    ```
    Then the executor was killed and the task was transitioned to `TASK_LOST`:
    ```
    I0403 04:41:31.565970 32522 slave.cpp:5526] Handling status update TASK_LOST (Status UUID:
a82a0fdb-398a-4703-8356-6b2447f67b19) for task b25366b9-09a2-407f-98a5-d79403387624 of framework
f23d7633-9a52-4263-b6a6-c03fca6d8b98-0000 from @0.0.0.0:0
    ```
    
    The reason that we saw `TASK_RUNNING` was because the master only process `ACKNOWLEDGE`
once, and that was for `TASK_STARTING`. So, the agent tried to resend `TASK_RUNNING` after
restart. If `TASK_RUNNING` was acked faster, we would have see `TASK_LOST` here.
    
    I verified the above through adding an `os::sleep(Seconds(1))` at L11730.


- Chun-Hung Hsiao


On April 2, 2019, 8:35 a.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70368/
> -----------------------------------------------------------
> 
> (Updated April 2, 2019, 8:35 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9667
>     https://issues.apache.org/jira/browse/MESOS-9667
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When recovering and reusing the same agent ID the resource provider
> manager can be initialized before e.g., recovering executors. This patch
> move the initialization to such an earlier point. This e.g., allows to
> successfully publish resources via the manager when HTTP-based executors
> resubscribe which previously ran into an assertion failure.
> 
> If the agent ID is not reused we still need to wait for the agent to
> register with the master which would assign an agent ID. In that case we
> do not expect any executors to resubscribe.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 5373cee5d30c2403497939eeba2ee5405117237e 
>   src/tests/slave_tests.cpp 528a25a837513f153de2a5e89897440144385633 
> 
> 
> Diff: https://reviews.apache.org/r/70368/diff/1/
> 
> 
> Testing
> -------
> 
> * `make check`
> * the test fails without the agent change
> * ran the test for 17000 iterations without failures (failure rate <1% with 66% certainty)
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message