mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gastón Kleiman <gas...@mesosphere.com>
Subject Re: Review Request 55901: Added support for command health checks to the default executor.
Date Sat, 28 Jan 2017 12:40:01 GMT


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 480
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line480>
> >
> >     launch nested container session returns a streaming response, how come you are
calling `post()` helper here which expects a non-streaming response?
> >     
> >     probably one of the reasons why your test is hanging.

The `post()` helper delegates to `process::http::request()`, which takes boolean flag (`streamedResponse`).
If this flag is set to `false`, libprocess will convert a PIPE (streaming) response into a
BODY (non-streaming) response. This means that the `Future` returned by the helper will not
be completed until the server closes the connection.

Relevant links:
 - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/include/process/http.hpp#L953-L955
 - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1191-L1197
 - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1005-L1022

In these particular cases, it means that the `Future` will not be completed until the container
exits, which is exactly what we need.


Regarding the test hanging in Linux, some further debugging seem to indicate that test hangs
on `process::Clock::settle()`, because there's a race + deadlock in `RateLimiter` that leaves
a process stuck in `RUNNING`. I'll dig deeper on Monday, but here's some evidence:

# Snippet from a successful test run that "leaks" a running process

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HealthCheckTest
[ RUN      ] HealthCheckTest.DefaultExecutorCmdHealthCheck

[...]

E0127 11:18:06.607446  8665 limiter.hpp:123] !!!! LIMITER _acquire
E0127 11:18:06.607471  8665 limiter.hpp:129] !!!! LIMITER There are 1 promises
E0127 11:18:06.608064  8665 limiter.hpp:186] !!!! LIMITER destroy

**** DEADLOCK DETECTED! ****
You are waiting on process __limiter__(1419)@192.99.40.208:41947 that it is currently executing.

[...]

[       OK ] HealthCheckTest.DefaultExecutorCmdHealthCheck (592 ms)
[----------] 1 test from HealthCheckTest (593 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (603 ms total)
[  PASSED  ] 1 test.

Repeating all tests (iteration 474) . . .
```

# Snippet from a hung run

```
[...]
E0127 11:18:06.878636  8640 cluster.cpp:358] !!! Settling the clock
E0127 11:18:06.878646  8640 process.cpp:3491] !!! Attempting to acquire mutex
E0127 11:18:06.878654  8640 process.cpp:3494] !!! !runq.empty()?
E0127 11:18:06.878659  8640 process.cpp:3502] !!! running.load() > 0?
E0127 11:18:06.878667  8640 process.cpp:3504] !!! 1 processes still running
E0127 11:18:06.878676  8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state:
3
E0127 11:18:06.878684  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893)
state: 3
E0127 11:18:06.878691  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894)
state: 3
E0127 11:18:06.878697  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895)
state: 3
E0127 11:18:06.878705  8640 process.cpp:3509] !!!! Process: __gc__ state: 3
E0127 11:18:06.878711  8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
E0127 11:18:06.878718  8640 process.cpp:3509] !!!! Process: __processes__ state: 3
E0127 11:18:06.878725  8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
E0127 11:18:06.878731  8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state:
3
E0127 11:18:06.878738  8640 process.cpp:3509] !!!! Process: files state: 3
E0127 11:18:06.878744  8640 process.cpp:3509] !!!! Process: help state: 3
E0127 11:18:06.878751  8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state:
3
E0127 11:18:06.878757  8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state:
3
E0127 11:18:06.878764  8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
E0127 11:18:06.878772  8640 process.cpp:3509] !!!! Process: logging state: 3
E0127 11:18:06.878777  8640 process.cpp:3509] !!!! Process: master state: 3
E0127 11:18:06.878792  8640 process.cpp:3509] !!!! Process: metrics state: 3
E0127 11:18:06.878800  8640 process.cpp:3509] !!!! Process: profiler state: 3
E0127 11:18:06.878806  8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
E0127 11:18:06.878813  8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420)
state: 3
E0127 11:18:06.878820  8640 process.cpp:3509] !!!! Process: system state: 3
E0127 11:18:06.878828  8640 process.cpp:3509] !!!! Process: version state: 3
E0127 11:18:06.878834  8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
E0127 11:18:06.878840  8640 process.cpp:3491] !!! Attempting to acquire mutex
E0127 11:18:06.878846  8640 process.cpp:3494] !!! !runq.empty()?
E0127 11:18:06.878852  8640 process.cpp:3502] !!! running.load() > 0?
E0127 11:18:06.878859  8640 process.cpp:3504] !!! 1 processes still running
E0127 11:18:06.878867  8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state:
3
E0127 11:18:06.878873  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893)
state: 3
E0127 11:18:06.878880  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894)
state: 3
E0127 11:18:06.878887  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895)
state: 3
E0127 11:18:06.878895  8640 process.cpp:3509] !!!! Process: __gc__ state: 3
E0127 11:18:06.878901  8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
E0127 11:18:06.878907  8640 process.cpp:3509] !!!! Process: __processes__ state: 3
E0127 11:18:06.878913  8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
E0127 11:18:06.878921  8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state:
3
E0127 11:18:06.878927  8640 process.cpp:3509] !!!! Process: files state: 3
E0127 11:18:06.878933  8640 process.cpp:3509] !!!! Process: help state: 3
E0127 11:18:06.878940  8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state:
3
E0127 11:18:06.878947  8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state:
3
E0127 11:18:06.878953  8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
E0127 11:18:06.878960  8640 process.cpp:3509] !!!! Process: logging state: 3
E0127 11:18:06.878967  8640 process.cpp:3509] !!!! Process: master state: 3
E0127 11:18:06.878973  8640 process.cpp:3509] !!!! Process: metrics state: 3
E0127 11:18:06.878979  8640 process.cpp:3509] !!!! Process: profiler state: 3
E0127 11:18:06.878986  8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
E0127 11:18:06.878993  8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420)
state: 3
E0127 11:18:06.878999  8640 process.cpp:3509] !!!! Process: system state: 3
E0127 11:18:06.879006  8640 process.cpp:3509] !!!! Process: version state: 3
E0127 11:18:06.879014  8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
[...] (repeats ad nauseam)
```


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, lines 481-485
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line481>
> >
> >     The identation seems wrong? do we do it like this elsehwhere?
> >     
> >     i would've done so
> >     
> >     ```
> >     .after(checkTimeout,
> >            defer(self(),
> >                  ...)
> >     ```

This was what `clang-format` suggested, but I adopted your suggestion.


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 531
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line531>
> >
> >     s/has not returned/timed out/

I had taken the failure message from the "normal" cmd check code. I'm updating both.


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, lines 534-536
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line534>
> >
> >     why repair?

To fail the future/check with a nice descriptive message that will later be logged.


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 538
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line538>
> >
> >     why not just return failure?

The compiler complains if I do that, so I have to choose between:

```
    .then([failure](const Option<int>& status) -> Future<Response> {
        return failure;
    });
```

And:

```
    .then([failure](const Option<int>& status) {
        return Future<Response>(failure);
    });
```

Which one do we prefer?


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, lines 426-429
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line426>
> >
> >     why a `.repair()` here?

To fail the future/check with a nice descriptive message that will later be logged.


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 435
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line435>
> >
> >     const & ?

This connection is passed to `nestedCommandHealthCheckTimedOut()`, which calls the non-const
method `connection.disconnect()`.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55901/#review163363
-----------------------------------------------------------


On Jan. 28, 2017, 12:39 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55901/
> -----------------------------------------------------------
> 
> (Updated Jan. 28, 2017, 12:39 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Anand Mazumdar, haosdent huang, and Vinod
Kone.
> 
> 
> Bugs: MESOS-6280
>     https://issues.apache.org/jira/browse/MESOS-6280
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added support for command health checks to the default executor.
> 
> 
> Diffs
> -----
> 
>   src/checks/health_checker.hpp 95da1ff7dd6b222a93076633eb3757ec9aa43cf6 
>   src/checks/health_checker.cpp e70bd7936752613a4f92c70c4c61cd7cdf7c4ee5 
>   src/launcher/default_executor.cpp 97eee05cac8cb1f62d43e2aecc08a8e54e49eac3 
>   src/tests/health_check_tests.cpp 710cb66eff6c4447caa22772f0cdc97cfa582c50 
> 
> Diff: https://reviews.apache.org/r/55901/diff/
> 
> 
> Testing
> -------
> 
> Introduced a new test: `HealthCheckTest.DefaultExecutorCmdHealthCheck`. It passes on
Linux, but not on macOS.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message