mesos-reviews mailing list archives

From Guangya Liu <gyliu...@gmail.com>
Subject Re: Review Request 51027: Track allocation candidates to bound allocator.
Date Sat, 24 Sep 2016 02:04:37 GMT


> On September 23, 2016, 2:40 a.m., Guangya Liu wrote:
> > It is really strange that the performance of `SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/7`
> > does not improve much for `addSlave`; we need to investigate why `addSlave` takes the same time.
> > Without the fix, `addSlave` calls `allocate` once for each agent, but with the fix, `allocate`
> > is called only once....
> > 
> > ```
> > without fix:
> > [==========] Running 1 test from 1 test case.
> > [----------] Global test environment set-up.
> > [----------] 1 test from SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test
> > [ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/7
> > Using 1000 agents and 6000 frameworks
> > Added 6000 frameworks in 122268us
> > Added 1000 agents in 42.037104secs
> > 
> > With fix:
> > [==========] Running 1 test from 1 test case.
> > [----------] Global test environment set-up.
> > [----------] 1 test from SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test
> > [ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/7
> > Using 1000 agents and 6000 frameworks
> > Added 6000 frameworks in 116107us
> > Added 1000 agents in 41.615396secs
> > ```
> 
> Guangya Liu wrote:
>     Jacob, I ran more tests with the code from Aug 23 (when I posted some results in this
>     RR) and found that the test results differ. I did the following to check out the Aug 23 code:
>     
>     ```
>     LiuGuangyas-MacBook-Pro:build gyliu$ git checkout 2f78a440ef4201c5b11fb92c225694e84a60369c
>     
>     LiuGuangyas-MacBook-Pro:build gyliu$ git log -1
>     commit 2f78a440ef4201c5b11fb92c225694e84a60369c
>     Author: Gilbert Song <songzihao1990@gmail.com>
>     Date:   Mon Aug 22 13:00:58 2016 -0700
>     
>         Fixed potential flakiness in ROOT_RecoverOrphanedPersistentVolume.
>     
>         Review: https://reviews.apache.org/r/51271/
>     ```
>     
>     The test result still seems the same as now (without your patch, with the code checked
>     out from Aug 23):
>     
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test
>     [ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/7
>     Using 1000 agents and 6000 frameworks
>     Added 6000 frameworks in 144272us
>     Added 1000 agents in 43.107001secs
>     ```
>     
>     But in any case, I think we need to find out why the performance of `addSlave` did not
>     improve with your patch.
> 
> Jacob Janco wrote:
>     Yes agreed, per our Slack discussions, I'll look into this. Thanks for posting the
followup.
> 
> Benjamin Mahler wrote:
>     `addSlave()` is asynchronous and we do not wait for all of the `addSlave()` futures
>     to complete, so any speedup in `addSlave()` will only affect the next caller that
>     waits for a result from the allocator.
> 
> Benjamin Mahler wrote:
>     Ah I missed that we do a `Clock::settle()`, nevermind :)
> 
> Guangya Liu wrote:
>     Some thoughts on why `addSlave` does not improve much...
>     
>     Without Jacob's patch, the logic would be:
>     
>     ```
>     addSlave -> allocate the single slave
>     addSlave -> allocate the single slave
>     addSlave -> allocate the single slave
>     ...
>     addSlave -> allocate the single slave
>     ```
>     
>     With Jacob's patch, the logic would be:
>     
>     ```
>     addSlave
>     addSlave
>     addSlave
>     ...
>     addSlave - > allocate for **all** of the slaves
>     ```
>     
>     The time elapsed by `allocate a single slave N times` versus `allocate N slaves in one
>     allocate call` should not differ much; the only difference is that one loops through
>     the event queue while the other loops inside the allocator. That's why there is not
>     much performance change here.
>     
>     But this will matter a lot when adding frameworks, or for any other allocator event
>     that calls `allocate(slaves)`. One proposal is to add new benchmark test cases with
>     the following logic, which without Jacob's patch triggers `allocate(slaves)` for each
>     `addFramework` operation, but with Jacob's patch calls `allocate(slaves)` only once:
>     
>     ```
>     1) Add slaves first
>     2) Add frameworks
>     ```
>     
>     We should see some performance improvement with the above case.
>     
>     Currently, all of the benchmark tests use
>     
>     ```
>     1) Add frameworks
>     2) Add agents
>     ```
>     
>     That's why there is not much performance improvement...
> 
> Jacob Janco wrote:
>     This makes sense, Guangya. I'm in the process of creating a minimal benchmark that adds
>     a set of slaves and then adds frameworks. I'll post here if the results are interesting.

Jacob, just FYI, your fix does help a lot for the case I listed above!

Here is a simple new benchmark test:

```
TEST_P(HierarchicalAllocator_BENCHMARK_Test, AddSlaveAndFrameworks)
{
  size_t slaveCount = std::tr1::get<0>(GetParam());
  size_t frameworkCount = std::tr1::get<1>(GetParam());

  // Pause the clock because we want to manually drive the allocations.
  Clock::pause();

  struct OfferedResources {
    FrameworkID   frameworkId;
    SlaveID       slaveId;
    Resources     resources;
  };

  vector<OfferedResources> offers;

  auto offerCallback = [&offers](
      const FrameworkID& frameworkId,
      const hashmap<SlaveID, Resources>& resources_)
  {
    foreach (auto resources, resources_) {
      offers.push_back(
          OfferedResources{frameworkId, resources.first, resources.second});
    }
  };

  cout << "Using " << slaveCount << " agents and "
       << frameworkCount << " frameworks" << endl;

  vector<SlaveInfo> slaves;
  slaves.reserve(slaveCount);

  vector<FrameworkInfo> frameworks;
  frameworks.reserve(frameworkCount);

  initialize(master::Flags(), offerCallback);

  Stopwatch watch;

  const Resources agentResources = Resources::parse(
      "cpus:24;mem:4096;disk:4096;ports:[31000-32000]").get();

  // Each agent has a portion of it's resources allocated to a single
  // framework. We round-robin through the frameworks when allocating.
  Resources allocation = Resources::parse("cpus:16;mem:2014;disk:1024").get();

  Try<::mesos::Value::Ranges> ranges = fragment(createRange(31000, 32000), 16);
  ASSERT_SOME(ranges);
  ASSERT_EQ(16, ranges->range_size());

  allocation += createPorts(ranges.get());

  // Create the framework infos up front: the `used` map below references
  // them, so they must exist before any agents are added.
  for (size_t i = 0; i < frameworkCount; i++) {
    frameworks.push_back(createFrameworkInfo("*"));
  }

  watch.start();

  for (size_t i = 0; i < slaveCount; i++) {
    slaves.push_back(createSlaveInfo(agentResources));

    // Add some used resources on each slave. Let's say there are 16 tasks,
    // each allocated 1 cpu and a slice of the port range.
    hashmap<FrameworkID, Resources> used;
    used[frameworks[i % frameworkCount].id()] = allocation;
    allocator->addSlave(
        slaves[i].id(), slaves[i], None(), slaves[i].resources(), used);
  }

  // Wait for all the `addSlave` operations to be processed.
  Clock::settle();

  watch.stop();

  cout << "Added " << slaveCount << " agents in "
       << watch.elapsed() << endl;

  watch.start();

  for (size_t i = 0; i < frameworkCount; i++) {
    allocator->addFramework(frameworks[i].id(), frameworks[i], {});
  }

  // Wait for all the `addFramework` operations to be processed.
  Clock::settle();

  watch.stop();

  cout << "Added " << frameworkCount << " frameworks in "
       << watch.elapsed() << endl;
}
```

With your fix:

```
./bin/mesos-tests.sh --benchmark --gtest_filter="SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddSlaveAndFrameworks/7"
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddSlaveAndFrameworks/7
Using 1000 agents and 6000 frameworks
Added 1000 agents in 141352us
Added 6000 frameworks in 12.710913secs
[       OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddSlaveAndFrameworks/7 (12902 ms)
[----------] 1 test from SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test (12902 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (12912 ms total)
[  PASSED  ] 1 test.
```

Without your fix, I did not wait for the test run to finish: after waiting almost 10 minutes,
the test still had not completed....
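
The candidate-batching idea in this review can be sketched in isolation: events such as `addSlave` accumulate allocation candidates, but a new `allocate` is dispatched only when no allocation is already pending in the queue, so N back-to-back events collapse into one batched allocation. Below is a minimal standalone sketch; `BatchingAllocator`, `settle`, and the explicit event queue are illustrative stand-ins for the libprocess actor queue, not Mesos APIs.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <set>
#include <string>

// Minimal sketch of candidate batching: events mark agents as allocation
// candidates, but enqueue an `allocate` dispatch only when none is pending.
class BatchingAllocator
{
public:
  void addSlave(const std::string& slaveId)
  {
    candidates.insert(slaveId);

    // Only dispatch a new allocation if none is pending; further
    // `addSlave` calls before the queue drains just grow `candidates`.
    if (!allocationPending) {
      allocationPending = true;
      eventQueue.push([this]() { allocate(); });
    }
  }

  // Drain the event queue, simulating the actor's run loop
  // (analogous to `Clock::settle()` in the benchmark).
  void settle()
  {
    while (!eventQueue.empty()) {
      std::function<void()> event = eventQueue.front();
      eventQueue.pop();
      event();
    }
  }

  int allocateCalls = 0;

private:
  void allocate()
  {
    allocationPending = false;
    allocateCalls++;
    candidates.clear(); // All accumulated candidates handled in one pass.
  }

  bool allocationPending = false;
  std::set<std::string> candidates;
  std::queue<std::function<void()>> eventQueue;
};
```

With this scheme, adding 1000 agents before the queue drains results in a single `allocate` call instead of 1000, which mirrors why events such as `addFramework` that trigger `allocate(slaves)` benefit so much from the patch.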


- Guangya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51027/#review150123
-----------------------------------------------------------


On September 23, 2016, 4:32 p.m., Jacob Janco wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51027/
> -----------------------------------------------------------
> 
> (Updated September 23, 2016, 4:32 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Guangya Liu, James Peach, Klaus Ma, and Jiang
Yan Xu.
> 
> 
> Bugs: MESOS-3157
>     https://issues.apache.org/jira/browse/MESOS-3157
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> - Triggered allocations dispatch allocate() only
>   if there is no pending allocation in the queue.
> - Allocation candidates are accumulated and only
>   cleared when enqueued allocations are processed.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 2c31471ee0f5d6836393bf87ff9ecfd8df835013
>   src/master/allocator/mesos/hierarchical.cpp 2d56bd011f2c87c67a02d0ae467a4a537d36867e
> 
> Diff: https://reviews.apache.org/r/51027/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> note: check without filters depends on https://reviews.apache.org/r/51028
> 
> With new benchmark https://reviews.apache.org/r/49617: 
> Sample output without 51027:
> [ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/22
> Using 10000 agents and 3000 frameworks
> Added 3000 frameworks in 57251us
> Added 10000 agents in 3.21345353333333mins
> allocator settled after  1.61236038333333mins
> [       OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/22 (290578 ms)
> 
> Sample output with 51027:
> [ RUN      ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/22
> Using 10000 agents and 3000 frameworks
> Added 3000 frameworks in 39817us
> Added 10000 agents in 3.22860541666667mins
> allocator settled after  25.525654secs
> [       OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.FrameworkFailover/22 (220137 ms)
> 
> 
> Thanks,
> 
> Jacob Janco
> 
>

