mesos-reviews mailing list archives

From Jiang Yan Xu <...@jxu.me>
Subject Re: Review Request 63174: Added a benchmark for agent reregistration during master failover.
Date Tue, 24 Oct 2017 18:09:41 GMT


> On Oct. 19, 2017, 6:38 p.m., Benjamin Mahler wrote:
> > Thanks Yan! I will dig in soon.
> > 
> > Just some quick questions:
> > 
> > (1) I thought during the meeting you said it was taking a minute, but looking at all the benchmark timings they're all under a second? Is it only the benchmark setup that's expensive here?
> > (2) Is this with the lock free event & run queues? If not, how much do they help?
> > (3) As an aside, it has come up before, but it would be useful to be able to force the messages to go through the remote stack rather than the local stack. No need to think about this yet, but just something to keep in mind as not being accurate in this benchmark.
> 
> Jiang Yan Xu wrote:
>     1) Yeah, looks like it. I used to include the setup time, so it was large.
>     2) Yeah, I have used `--enable-optimize --enable-lock-free-run-queue --enable-lock-free-event-queue --enable-last-in-first-out-fixed-size-semaphore`. I could compare with the perf without them.
>     3) Right, I think we should keep that in mind, and we should have tests that cover the remote stack. For the case here I thought it would be a simple and good-enough start, since the local stack already covers the proto (de)serialization and the rest of the libprocess optimizations that we have recently made.

Haha... actually the sub-second numbers in revision 1 were totally meaningless. I did `process::await(reregistered)`
instead of `process::await(reregistered).await();` when I intended to wait for the results...
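
To illustrate the pitfall, here is a minimal sketch that uses `std::future` from the C++ standard library as a stand-in for libprocess futures. The `Work` helper and the `measure` function are hypothetical names made up for this example and are not part of the benchmark; the point is only that a timer stopped without blocking on the future measures the time to start the work, not the time to finish it.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for asynchronous work (e.g. reregistration).
// Not a libprocess type; it only mimics "a future that completes later".
struct Work {
  std::promise<void> done;
  std::thread worker;

  // Starts the work on a background thread and returns a future that
  // becomes ready once `duration` has elapsed.
  std::future<void> start(Millis duration) {
    assert(!worker.joinable());  // start() may only be called once.
    std::future<void> finished = done.get_future();
    worker = std::thread([this, duration] {
      std::this_thread::sleep_for(duration);
      done.set_value();
    });
    return finished;
  }

  ~Work() {
    if (worker.joinable()) {
      worker.join();
    }
  }
};

// Measures the elapsed time around the asynchronous work. If `block` is
// false the future is obtained but never waited on, which mirrors the
// mistake described above.
Millis measure(bool block) {
  Work work;
  const Clock::time_point begin = Clock::now();
  std::future<void> finished = work.start(Millis(200));
  if (block) {
    finished.wait();  // Without this, the timer stops too early.
  }
  return std::chrono::duration_cast<Millis>(Clock::now() - begin);
}
```

Under these assumptions, `measure(true)` reports at least the 200 ms the work takes, while `measure(false)` returns almost immediately and so reports a number that says nothing about the work itself.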

I did some optimization in rev 2 (e.g., parallelizing the message preparation and allocating from
the stack instead of the heap), but I had to reduce the number of tasks to keep the benchmark from
running too long.

PTAL.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63174/#review188799
-----------------------------------------------------------


On Oct. 24, 2017, 11:05 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63174/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2017, 11:05 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Dmitry Zhuk, and Ilya Pronin.
> 
> 
> Bugs: MESOS-8098
>     https://issues.apache.org/jira/browse/MESOS-8098
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The current benchmark is very simple: it involves no frameworks and no agent retries. It's possible to add a number of other benchmarks later, so I am creating a new file for them.
> 
> 
> Diffs
> -----
> 
>   src/Makefile.am b60a54a031260de6f1fb43584ae5083df2dc7e31 
>   src/tests/CMakeLists.txt 386e0473c93d0a993248c7818067071d0c761c76 
>   src/tests/master_benchmarks.cpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/63174/diff/2/
> 
> 
> Testing
> -------
> 
> Benchmark based off https://github.com/apache/mesos/commit/41193181d6b75eeecae2729bf98007d9318e351a (close to current HEAD).
> 
> ```
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks in 11.188008209secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (22404 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in 20.868372615secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (37981 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks in 15.354579251secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (33766 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (94151 ms total)
> 
> 
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks in 11.045441129secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (19959 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in 21.324309077secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (38490 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks in 14.68607521secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (32073 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (90523 ms total)
> 
> ```
> 
> Benchmark based off https://github.com/apache/mesos/commit/d9c90bf1d9c8b3a7dcc47be0cb773efff57cfb9d (before https://issues.apache.org/jira/browse/MESOS-7713 was merged)
> 
> ```
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks in 23.217901878secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (38327 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in 46.158610597secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (75280 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks in 38.56781112secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (68006 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (181613 ms total)
> 
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 100000 running tasks and 100000 completed tasks in 25.752844224secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/0 (43509 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1
> Starting reregistration for all agents
> Reregistered 2000 agents with a total of 200000 running tasks and 0 completed tasks in 45.190859035secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/1 (73966 ms)
> [ RUN      ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2
> Starting reregistration for all agents
> Reregistered 20000 agents with a total of 100000 running tasks and 0 completed tasks in 36.322992753secs
> [       OK ] AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test.AgentReregistrationDelay/2 (66946 ms)
> [----------] 3 tests from AgentFrameworkTaskCount/MasterFailover_BENCHMARK_Test (184421 ms total)
> ```
> 
> The recent patches cut down the time by over 50%. These were built with `--enable-optimize --enable-lock-free-run-queue --enable-lock-free-event-queue --enable-last-in-first-out-fixed-size-semaphore`.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>
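
The "over 50%" figure in the quoted testing notes can be checked against the reported reregistration times. The sketch below is only an illustration: it takes the first run of each configuration from the two quoted logs (before MESOS-7713 and close to HEAD), and the `Timing`, `kTimings`, and `reduction` names are made up for this example.

```cpp
#include <array>

// One benchmark configuration: reregistration time in seconds before
// MESOS-7713 and close to HEAD, copied from the first run in each log.
struct Timing {
  double before;
  double after;
};

constexpr std::array<Timing, 3> kTimings = {{
  {23.217901878, 11.188008209},  // 2000 agents, 100000 running + 100000 completed tasks
  {46.158610597, 20.868372615},  // 2000 agents, 200000 running tasks
  {38.56781112, 15.354579251},   // 20000 agents, 100000 running tasks
}};

// Fraction of the original time that was saved, e.g. 0.5 means "halved".
double reduction(const Timing& timing) {
  return 1.0 - timing.after / timing.before;
}
```

With these inputs, `reduction` comes out above 0.5 for all three configurations (roughly 52%, 55%, and 60%), which is consistent with the claim.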

