mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Guangya Liu <gyliu...@gmail.com>
Subject Re: Review Request 50017: WIP: Validated the resources when parsing it.
Date Mon, 18 Jul 2016 22:53:31 GMT


> On 七月 14, 2016, 5:56 p.m., Benjamin Mahler wrote:
> > include/mesos/resources.hpp, lines 135-147
> > <https://reviews.apache.org/r/50017/diff/1/?file=1443931#file1443931line135>
> >
> >     This suggests a change of semantics to `Resources`.
> >     
> >     Currently: `Resources` always refers to a valid collection resources. This function
would never return Some(Error) unless we change the semantics.
> >     
> >     If we want to remove the unnecessary validation, we can do this by making '`operator
+= (const Resources& that)`' avoid validating '`that`' since it's already valid. Currently
it just calls '`operator += (const Resource& r)`' which is unaware of '`r`' already being
valid since it comes from a '`Resources`'.
> >     
> >     However, we have to keep the validation in '`operator += (const Resource&
r)`', since '`r`' is just a protobuf coming from an arbitrary caller, it may be invalid and
we need to validate.
> >     
> >     Make sense?
> 
> Guangya Liu wrote:
>     Hi Ben, does there are any example in mesos code for `Resource` object come from
protobuf? I found that we are always using `Resource::parse` to get a `Resource` or `Resources`
object and I have added some validation logic in `Resource::parse` in when getting `Resource`
or `Resources` object.
> 
> Guangya Liu wrote:
>     Ben, got your point. We can set `Resource` object as following:
>     
>     ```javascript 
>     Resource scalar;
>     scalar.set_name("cpus");
>     scalar.set_type(Value::SCALAR);
>     scalar.mutable_scalar()->set_value(1.234);
>     ```
>     
>     So if we igore the validation for `'operator += (const Resource& r)'`, there
might be some erros if the resource was not constructed correctly. This is not a good way
to create `Resource` object but we cannot guard the usage for this now.
>     
>     I found that the API `'operator += (const Resource& r)'` was not called directly
too frequently, but we are using this API `'operator += (const Resources& that)'` for
the most of the time, so seems it is good to keep the validation for `'operator += (const
Resource& r)'` as it was not called directly too much. 
>     
>     The problem is that `'operator += (const Resources& that)'` is calling `'operator
+= (const Resource& r)'`, I did not found a good way to keep validation for `'operator
+= (const Resource& r)'` but ignore the validation for `'operator += (const Resources&
that)'`, still checking how we can make this work, any comments/suggestions for this?
> 
> Guangya Liu wrote:
>     Did more test by removing validation, the performance of resources operations with
`port` resources increased more than 5x.
>     
>     With validation
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
>     [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
>     Took 11.212199secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     Took 8.145903secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     Took 11.154993secs to perform 1000 'total = total + r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     Took 7.974004secs to perform 1000 'total = total - r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (38490 ms)
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (38490 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (38512 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Without validation
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
>     [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
>     Took 2.90743secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
>     Took 9166us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11,
13-14, 16-17, 1...
>     Took 3.038083secs to perform 1000 'total = total + r' operations on ports(*):[1-2,
4-5, 7-8, 10-11, 13-14, 16-17, 1...
>     Took 9253us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
>     [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (5966 ms)
>     [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (5966 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (5988 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Also after more checking, I think that we can remove the validation from resource
operations, as we already have resource validation when launching tasks, creating resurces
etc. Even though application developer may create resources using protobuf directly, but if
the resource definition is not right, the validation for task resource will be failed.
> 
> Guangya Liu wrote:
>     Sorter test, the sorter add allocation increased 10x. It is strange that the benchmark
test for allocation test in hierarchical test does not improve much, still checking...
>     
>     With validation:
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
>     [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
>     Using 50000 agents and 1000 clients
>     Added 1000 clients in 19572us
>     Added 50000 agents in 915932us
>     Added allocations for 50000 agents in 40.303108secs
>     Full sort of 1000 clients took 32702us
>     No-op sort of 1000 clients took 314us
>     [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (42449 ms)
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (42449 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (42466 ms total)
>     [  PASSED  ] 1 test.
>     ```
>     
>     Without validation:
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
>     [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
>     Using 50000 agents and 1000 clients
>     Added 1000 clients in 22764us
>     Added 50000 agents in 862959us
>     Added allocations for 50000 agents in 3.934831secs
>     Full sort of 1000 clients took 30187us
>     No-op sort of 1000 clients took 322us
>     [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (6031 ms)
>     [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (6031 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (6049 ms total)
>     [  PASSED  ] 1 test.
>     ```
> 
> Guangya Liu wrote:
>     Hi Ben, now there are 4 places where we can constrcut resoures.
>     
>     1) From framework, the resources here will be validated when launching task, so it
is ok even if the framework developer made some mistake to construct some invalid resources.
>     2) From agent flags, we can validate this resource when parsing it.
>     3) From agent resources estimator, we can validate those resources when got those
resources from estimator in agent. We need to update agent code https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5035
to validate the resources.
>     4) From agent resources hook, we can validate this in hook manager when got those
resoures from resources hooks. We need to update agent code here https://github.com/apache/mesos/blob/master/src/hook/manager.cpp#L366
to validate the resources.
>     
>     After the above finished, we can simply remove the `validate()` from `operator+=`
and `operator-=` for `Resource`, what do you think?
> 
> Benjamin Mahler wrote:
>     It is brittle to remove validation from `operator += (const Resource& r)` and
`operator -= (const Resource& r)` because `r` can be an invalid protobuf if a caller constructs
one and calls `+=` or `-=`.
>     
>     My suggestion from earlier was to do something like the following:
>     
>     ```
>     operator += (const Resource& r)
>     {
>       if (validate(r).isNone()) {
>         add(r);
>       }
>     }
>     
>     operator += (const Resources& that)
>     {
>       foreach (const Resource& r, that) {
>         add(r);
>       }
>     }
>     
>     private:
>     
>     void add(const Resource& r)
>     {
>       // Assumes r is valid, does no validation!
>     }
>     ```
>     
>     This way, we avoid validation in `operator += (const Resources& that)` because
`Resources` always contains valid `Resource` objects. But in `operator += (const Resource&
r)` we continue to validate since a caller could easily provide invalid protobuf.

Thanks Ben, I also thinked this solution before as you proposed, but my thinking at that time
is the framework developer should always do some resource validation after construct a `Resource`
object, just like what we did for `calloc/malloc` in C, so we do not need do the `validation`
in `operator += (const Resources& that)`, yes, this is brittle. 

As the API `operator += (const Resources& that)` was not called so frequently directly,
so I think adding `validation` for `operator += (const Resources& that)` can both improve
the performance and make the logic robust, will follow up a patch later accoding to your suggestion.
Thanks.


- Guangya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50017/#review142257
-----------------------------------------------------------


On 七月 18, 2016, 2:24 p.m., Guangya Liu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50017/
> -----------------------------------------------------------
> 
> (Updated 七月 18, 2016, 2:24 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Klaus Ma.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The "validation" API was called in a huge number of times, but this
> was only needed when parsing resources and we do not need to do
> other validation when doing resources operations, such as add etc.
> 
> This patch is a WIP patch and was now only removing the `validation`
> from `operator+=` and `operator-=` for `Resource`.
> 
> Now there are 4 places where we can constrcut resoures.
> 
> 1) From framework, the resources here will be validated when launching
> task, so it is ok even if the framework developer made some mistake to
> construct some invalid resources.
> 2) From agent flags, we can validate this resource when parsing it.
> 3) From agent resources estimator, we can validate those resources when
> got those resources from estimator in agent. We need to update agent code https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5035
> to validate the resources.
> 4) From agent resources hook, we can validate this in hook manager when
> got those resoures from resources hooks. We need to update agent code here https://github.com/apache/mesos/blob/master/src/hook/manager.cpp#L366
> to validate the resources.
> 
> After the above finished, we can simply remove the `validate()` from
> `operator+=` and `operator-=` for `Resource`
> 
> 
> Diffs
> -----
> 
>   include/mesos/resources.hpp a557e97c65194d4aad879fb88d8edefd1c95b8d8 
>   include/mesos/v1/resources.hpp a5ba8fec4c9c3643646308f75a4b28cefe0b3df3 
>   src/common/resources.cpp f6ff92b591c15bc8e93fd85e1896349c3a7bb968 
>   src/tests/resources_tests.cpp 40d290ac540d26373c5fb7c2a93d27d1aa61d722 
>   src/v1/resources.cpp 8c3f2d1c1529915a59d47fe37bb3fc7a3267079a 
> 
> Diff: https://reviews.apache.org/r/50017/diff/
> 
> 
> Testing
> -------
> 
> The `qcachegrind` result here: https://docs.google.com/document/d/1oilen04e8trIOgYbj-jxAd7esTRDaySll8y2BQ6p5nU/edit?usp=sharing
> 
> Sorter test, the sorter add allocation increased 10x.
> 
> With validation:
> ```
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
> Using 50000 agents and 1000 clients
> Added 1000 clients in 19572us
> Added 50000 agents in 915932us
> Added allocations for 50000 agents in 40.303108secs
> Full sort of 1000 clients took 32702us
> No-op sort of 1000 clients took 314us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (42449 ms)
> [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (42449 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (42466 ms total)
> [  PASSED  ] 1 test.
> ```
> 
> Without validation:
> ```
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test
> [ RUN      ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35
> Using 50000 agents and 1000 clients
> Added 1000 clients in 22764us
> Added 50000 agents in 862959us
> Added allocations for 50000 agents in 3.934831secs
> Full sort of 1000 clients took 30187us
> No-op sort of 1000 clients took 322us
> [       OK ] AgentAndClientCount/Sorter_BENCHMARK_Test.FullSort/35 (6031 ms)
> [----------] 1 test from AgentAndClientCount/Sorter_BENCHMARK_Test (6031 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (6049 ms total)
> [  PASSED  ] 1 test.
> ```
> 
> The performance of resources operations with port resources operations.
> 
> With validation
> ```
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
> [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
> Took 11.212199secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
> Took 8.145903secs to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
> Took 11.154993secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
> Took 7.974004secs to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
> [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (38490 ms)
> [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (38490 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (38512 ms total)
> [  PASSED  ] 1 test.
> ```
> 
> Without validation
> ```
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test
> [ RUN      ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2
> Took 2.90743secs to perform 1000 'total += r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
> Took 9166us to perform 1000 'total -= r' operations on ports(*):[1-2, 4-5, 7-8, 10-11,
13-14, 16-17, 1...
> Took 3.038083secs to perform 1000 'total = total + r' operations on ports(*):[1-2, 4-5,
7-8, 10-11, 13-14, 16-17, 1...
> Took 9253us to perform 1000 'total = total - r' operations on ports(*):[1-2, 4-5, 7-8,
10-11, 13-14, 16-17, 1...
> [       OK ] ResourcesOperators/Resources_BENCHMARK_Test.Arithmetic/2 (5966 ms)
> [----------] 1 test from ResourcesOperators/Resources_BENCHMARK_Test (5966 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (5988 ms total)
> [  PASSED  ] 1 test.
> ```
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message