mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <bmah...@apache.org>
Subject Re: Review Request 63174: Added a benchmark for agent reregistration during master failover.
Date Tue, 24 Oct 2017 21:55:29 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63174/#review189093
-----------------------------------------------------------



A couple of suggestions for speeding up the benchmark overhead:

(1) Upgrade protobuf to 3.4.x, this comes with move support and rvalue setters for fields.
Which will avoid some copies in the benchmark code and improve performance elsewhere too :)
In the interim, you could manually use `Swap(T*)` but it means we'd probably want to re-write
the code once move support is available (so that doesn't seem like a good option).

(2) You could try using an arena for the test fixture, although I don't know if it's worth
the complexity. Probably just reducing copying is simpler.

(3) We can avoid re-parsing resources for each task and agent.


src/tests/master_benchmarks.cpp
Lines 63 (patched)
<https://reviews.apache.org/r/63174/#comment266020>

    Can you avoid parsing resources for each agent?



src/tests/master_benchmarks.cpp
Lines 84 (patched)
<https://reviews.apache.org/r/63174/#comment266018>

    Can you avoid parsing resources for every task?



src/tests/master_benchmarks.cpp
Lines 91-92 (patched)
<https://reviews.apache.org/r/63174/#comment266019>

    Code written this way is nice because it will automatically benefit from move support
when we upgrade protobuf to 3.4.x. :)
    
    Maybe you can write more of the test in such a manner that it would benefit from an upgrade
to 3.4.x? I would be happy to review a 3.4.x upgrade since we need it for other performance
improvements. We can see who wants to pick that up, I think Dmitry might be interested.



src/tests/master_benchmarks.cpp
Lines 139-140 (patched)
<https://reviews.apache.org/r/63174/#comment266015>

    Here's an example of where you could move into `message.frameworks` if you upgrade to
protobuf 3.4.x:
    
    ```
    message.mutable_frameworks()->Add(createFrameworkInfo(frameworkId));
    ```
    
    Alternatively, pre-3.4.x, you can swap:
    
    ```
    message.add_frameworks()->Swap(&createFrameworkInfo(frameworkId));
    
    // maybe you need to do:
    
    FrameworkInfo f = createFrameworkInfo(frameworkId);
    message.add_frameworks()->Swap(&f);
    ```



src/tests/master_benchmarks.cpp
Lines 143-147 (patched)
<https://reviews.apache.org/r/63174/#comment266016>

    Ditto copying here.



src/tests/master_benchmarks.cpp
Lines 163-167 (patched)
<https://reviews.apache.org/r/63174/#comment266017>

    Ditto copying here and elsewhere.



src/tests/master_benchmarks.cpp
Lines 241-243 (patched)
<https://reviews.apache.org/r/63174/#comment266013>

    Comment about why you're using the replicated log here?



src/tests/master_benchmarks.cpp
Lines 261 (patched)
<https://reviews.apache.org/r/63174/#comment266012>

    I'm a little concerned about this pattern, because if the test were to fail an assertion,
the process would be destructed without terminating / waiting on it.
    
    Can you use a wrapper around the process that terminates and waits?
    
    Alternatively, if we had a SCOPE_EXIT { ... } abstraction (I had a review but never committed
it), we could just do:
    
    ```
    SCOPE_EXIT { process::terminate(pid); wait(pid); };
    ```
    
    E.g. https://github.com/facebook/folly/blob/v2017.10.23.00/folly/ScopeGuard.h#L285-L287


- Benjamin Mahler


On Oct. 24, 2017, 6:05 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63174/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2017, 6:05 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Dmitry Zhuk, and Ilya Pronin.
> 
> 
> Bugs: MESOS-8098
>     https://issues.apache.org/jira/browse/MESOS-8098
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The current benchmark is very simple: without framework involvement and without agent
retries but it's possible to add a number of others so I am creating a new file for them.
> 
> 
> Diffs
> -----
> 
>   src/Makefile.am b60a54a031260de6f1fb43584ae5083df2dc7e31 
>   src/tests/CMakeLists.txt 386e0473c93d0a993248c7818067071d0c761c76 
>   src/tests/master_benchmarks.cpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/63174/diff/2/
> 
> 
> Testing
> -------
> 
> Benchmark based off https://github.com/apache/mesos/commit/41193181d6b75eeecae2729bf98007d9318e351a
(close to current HEAD).
> 
> ```
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks
in 11.188008209secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
(22404 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in
20.868372615secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
(37981 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks
in 15.354579251secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
(33766 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (94151
ms total)
> 
> 
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks
in 11.045441129secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
(19959 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in
21.324309077secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
(38490 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks
in 14.68607521secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
(32073 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (90523
ms total)
> 
> ```
> 
> Benchmark based off https://github.com/apache/mesos/commit/d9c90bf1d9c8b3a7dcc47be0cb773efff57cfb9d
(before https://issues.apache.org/jira/browse/MESOS-7713 was merged)
> 
> ```
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks
in 23.217901878secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
(38327 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in
46.158610597secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
(75280 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks
in 38.56781112secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
(68006 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (181613
ms total)
> 
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks
in 25.752844224secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
(43509 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in
45.190859035secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
(73966 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks
in 36.322992753secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
(66946 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (184421
ms total)
> ```
> 
> The recently patches cut down the time by over 50%. These were built with `--enable-optimize
--enable-lock-free-run-queue --enable-lock-free-event-queue --enable-last-in-first-out-fixed-size-semaphore`.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message