mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guangya Liu <gyliu...@gmail.com>
Subject Re: Review Request 50017: WIP: Validated the resources when parsing it.
Date Mon, 18 Jul 2016 10:13:58 GMT


> On 七月 14, 2016, 5:56 p.m., Benjamin Mahler wrote:
> > include/mesos/resources.hpp, lines 135-147
> > <https://reviews.apache.org/r/50017/diff/1/?file=1443931#file1443931line135>
> >
> >     This suggests a change of semantics to `Resources`.
> >     
> >     Currently: `Resources` always refers to a valid collection resources. This function
would never return Some(Error) unless we change the semantics.
> >     
> >     If we want to remove the unnecessary validation, we can do this by making '`operator
+= (const Resources& that)`' avoid validating '`that`' since it's already valid. Currently
it just calls '`operator += (const Resource& r)`' which is unaware of '`r`' already being
valid since it comes from a '`Resources`'.
> >     
> >     However, we have to keep the validation in '`operator += (const Resource&
r)`', since '`r`' is just a protobuf coming from an arbitrary caller, it may be invalid and
we need to validate.
> >     
> >     Make sense?
> 
> Guangya Liu wrote:
>     Hi Ben, does there are any example in mesos code for `Resource` object come from
protobuf? I found that we are always using `Resource::parse` to get a `Resource` or `Resources`
object and I have added some validation logic in `Resource::parse` in when getting `Resource`
or `Resources` object.
> 
> Guangya Liu wrote:
>     Ben, got your point. We can set `Resource` object as following:
>     
>     ```javascript 
>     Resource scalar;
>     scalar.set_name("cpus");
>     scalar.set_type(Value::SCALAR);
>     scalar.mutable_scalar()->set_value(1.234);
>     ```
>     
>     So if we igore the validation for `'operator += (const Resource& r)'`, there
might be some erros if the resource was not constructed correctly. This is not a good way
to create `Resource` object but we cannot guard the usage for this now.
>     
>     I found that the API `'operator += (const Resource& r)'` was not called directly
too frequently, but we are using this API `'operator += (const Resources& that)'` for
the most of the time, so seems it is good to keep the validation for `'operator += (const
Resource& r)'` as it was not called directly too much. 
>     
>     The problem is that `'operator += (const Resources& that)'` is calling `'operator
+= (const Resource& r)'`, I did not found a good way to keep validation for `'operator
+= (const Resource& r)'` but ignore the validation for `'operator += (const Resources&
that)'`, still checking how we can make this work, any comments/suggestions for this?
> 
> Guangya Liu wrote:
>     Did more test by removing validation, the performance of resources operations with
`port` resources increased more than 5x.
>     
>     With validation
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
>     [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
>     Took 11.212199secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     Took 8.145903secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     Took 11.154993secs to perform 1000 'total = total + r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     Took 7.974004secs to perform 1000 'total = total - r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (38490 ms)
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (38490 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (38512 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Without validation
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
>     [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
>     Took 2.90743secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
>     Took 9166us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11,
13-14, 16-17, 1...
>     Took 3.038083secs to perform 1000 'total = total + r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     Took 9253us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (5966 ms)
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (5966 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (5988 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Also after more checking, I think that we can remove the validation from resource
operations, as we already have resource validation when launching tasks, creating resurces
etc. Even though application developer may create resources using protobuf directly, but if
the resource definition is not right, the validation for task resource will be failed.
> 
> Guangya Liu wrote:
>     Sorter test, the sorter add allocation increased 10x. It is strange that the benchmark
test for allocation test in hierarchical test does not improve much, still checking...
>     
>     With validation:
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
>     [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
>     Using 50000 agents and 1000 clients
>     Added 1000 clients in 19572us
>     Added 50000 agents in 915932us
>     Added allocations for 50000 agents in 40.303108secs
>     Full sort of 1000 clients took 32702us
>     No-op sort of 1000 clients took 314us
>     [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (42449 ms)
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (42449 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (42466 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Without validation:
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
>     [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
>     Using 50000 agents and 1000 clients
>     Added 1000 clients in 22764us
>     Added 50000 agents in 862959us
>     Added allocations for 50000 agents in 3.934831secs
>     Full sort of 1000 clients took 30187us
>     No-op sort of 1000 clients took 322us
>     [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (6031 ms)
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (6031 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (6049 ms total)
>     [  PASSED  ] 1 test.
>     ```

Hi Ben, now there are 4 places where we can constrcut resoures.

1) From framework, the resources here will be validated when launching task, so it is ok even
if the framework developer made some mistake to construct some invalid resources.
2) From agent flags, we can validate this resource when parsing it.
3) From agent resources estimator, we can validate those resources when got those resources
from estimator in agent. We need to update agent code https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5035
to validate the resources.
4) From agent resources hook, we can validate this in hook manager when got those resoures
from resources hooks. We need to update agent code here https://github.com/apache/mesos/blob/master/src/hook/manager.cpp#L366
to validate the resources.

After the above finished, we can simply remove the `validate()` from `operator+=` and `operator-=`
for `Resource`, what do you think?


- Guangya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50017/#review142257
-----------------------------------------------------------


On 七月 18, 2016, 10:13 a.m., Guangya Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50017/
> -----------------------------------------------------------
> 
> (Updated 七月 18, 2016, 10:13 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Klaus Ma.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The "validation" API was called in a huge number of times, but this
> was only needed when parsing resources and we do not need to do
> other validation when doing resources operations, such as add etc.
> 
> 
> Diffs
> -----
> 
>   include/mesos/resources.hpp a557e97c65194d4aad879fb88d8edefd1c95b8d8 
>   include/mesos/v1/resources.hpp a5ba8fec4c9c3643646308f75a4b28cefe0b3df3 
>   src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968 
>   src/tests/resources_tests.cpp 40d290ac540d26373c5fb7c2a93d27d1aa61d722 
>   src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a 
> 
> Diff: https://reviews.apache.org/r/50017/diff/
> 
> 
> Testing
> -------
> 
> The `qcachegrind` result here: https://docs.google.com/document/d/1oilen04e8trIOgYbj-jxAd7esTRDaySll8y2BQ6p5nU/edit?usp=sharing
> 
> After the fix, the performance of `adding` resources increased 30+% when adding 50000
agents to the cluster.
> 
> Before fix:
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/30
> Using 50000 agents and 1 clients
> Added 1 clients in 47us
> **Added 50000 agents in 1.312497secs**
> Sorted 1 clients in 43us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/30 (1321 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/31
> Using 50000 agents and 50 clients
> Added 50 clients in 948us
> **Added 50000 agents in 1.325987secs**
> Sorted 50 clients in 1165us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/31 (1340 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/32
> Using 50000 agents and 100 clients
> Added 100 clients in 1697us
> **Added 50000 agents in 1.409478secs**
> Sorted 100 clients in 2876us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/32 (1432 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/33
> Using 50000 agents and 200 clients
> Added 200 clients in 4553us
> **Added 50000 agents in 1.371473secs**
> Sorted 200 clients in 5371us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/33 (1412 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/34
> Using 50000 agents and 500 clients
> Added 500 clients in 8836us
> **Added 50000 agents in 1.304245secs**
> Sorted 500 clients in 14697us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/34 (1387 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
> Using 50000 agents and 1000 clients
> Added 1000 clients in 19508us
> **Added 50000 agents in 1.270555secs**
> Sorted 1000 clients in 32575us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (1433 ms)
> 
> 
> After the fix:
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/30
> Using 50000 agents and 1 clients
> Added 1 clients in 42us
> **Added 50000 agents in 891266us**
> Sorted 1 clients in 59us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/30 (902 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/31
> Using 50000 agents and 50 clients
> Added 50 clients in 933us
> **Added 50000 agents in 885006us**
> Sorted 50 clients in 1220us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/31 (899 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/32
> Using 50000 agents and 100 clients
> Added 100 clients in 1879us
> **Added 50000 agents in 903112us**
> Sorted 100 clients in 2800us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/32 (922 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/33
> Using 50000 agents and 200 clients
> Added 200 clients in 3893us
> **Added 50000 agents in 881240us**
> Sorted 200 clients in 5802us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/33 (912 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/34
> Using 50000 agents and 500 clients
> Added 500 clients in 10712us
> **Added 50000 agents in 877442us**
> Sorted 500 clients in 17887us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/34 (949 ms)
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
> Using 50000 agents and 1000 clients
> Added 1000 clients in 21472us
> **Added 50000 agents in 916653us**
> Sorted 1000 clients in 37369us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (1057 ms)
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message