mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei Sekretenko <asekrete...@mesosphere.io>
Subject Re: Review Request 72786: Implemented regex-based offer constraints.
Date Fri, 21 Aug 2020 14:50:27 GMT


> On Aug. 20, 2020, 11:33 p.m., Benjamin Mahler wrote:
> > Could you split this into two logical changes?
> > 
> > 1. Add the regex support w/o any caching
> > 2. Add caching of constructed regexes as a memory / performance optimization
> > 
> > It seems like the caching in itself warrants some thought, and so it would be easier
to review them in isolation.

Makes sense; will do that.


> On Aug. 20, 2020, 11:33 p.m., Benjamin Mahler wrote:
> > src/master/allocator/mesos/offer_constraints_filter.cpp
> > Lines 47-49 (patched)
> > <https://reviews.apache.org/r/72786/diff/4/?file=2238488#file2238488line47>
> >
> >     Some documentation on these and their values would be helpful.
> >     
> >     Wow, max_mem is rather complicated, and it looks hard to be confident about
our choice of value:
> >     
> >     https://github.com/google/re2/blob/2020-08-01/re2/re2.h#L591-L617
> >     
> >     Although that reads as though there is some caching / flexibility on the memory
usage, I assume there is also a hard error if it can't construct the appropriately initial
data structures?
> >     
> >     Given this, it's a little scary to not have a tunable flag for it, should we
run into surprising results in practice.
> >     
> >     Similarly, for max program size of 100, it's really hard to see how this value
is produced, and while 100 seems like a standard upper bound (according to envoy docs), it's
a little scary to not have a tunable flag for this one too.
> >     
> >     Thoughts?

Yes, there is a hard error if RE2 cannot be constructed without breaching the limit.

Not having max_mem / program size as configurable flags indeed seems scary, I'll change that.
 
My initial thinking was that having these configurable makes it more complex for the frameworks
to validate the regexps on their side (so that the frameworks can avoid failing subscription/update).
With those limits being configurable, frameworks will need to somehow know the exact values
of these limits (or Mesos will have to provide a validation API).

Probably you are right, and having those configurable right from the start is a lesser evil.


> On Aug. 20, 2020, 11:33 p.m., Benjamin Mahler wrote:
> > src/master/allocator/mesos/offer_constraints_filter.cpp
> > Lines 60-62 (patched)
> > <https://reviews.apache.org/r/72786/diff/4/?file=2238488#file2238488line60>
> >
> >     Thinking out loud here:
> >     
> >     It's really hard for me to tell from here, but the weak cache will cover the
typical path of a scheduler just updating to the same constraint info as before, right?
> >     
> >     Caching wise, it seems most important to (1) avoid the duplication of regexes
within a scheduler, and (2) secondarily avoid the duplication of regexes across time and across
schedulers.
> >     
> >     I can see how this covers the second case, since there will clearly be other
references to regexes used by other schedulers. But the first case is a little less clear
because it relies on the caller further up the stack to not throw away the old constraints
object before making the new one, right? Is that ensured because the allocator will hold a
copy, and the master constructs the new constraints before giving to the allocator to overwrite
the old? This seems like a lot of non-local reasoning needed.
> >     
> >     Just makes me wonder if there's a more direct way to cache (where the caching
behavior is clear locally, with less assumption about the caller baked in). We could consider
caching only across a single scheduler's updates if that's very simple to do and reason about.

Ideally, caching of RE2s should accomplish two goals:
1. Minimize the memory footprint of the used regexps.
2. Avoid re-creating the same regex while keeping the excess memory usage low.

Caching of weak references serves the first goal; it guarantees the minimum possible total
memory usage of RE2s.
I'm not sure if it is reasonable to assume that different schedulers will use different regexps,
especially the complex ones.
I can easily imagine a complex regexp selecting 97 agents out of 500 written down somewhere
on the operator's internal Wiki and copy-pasted from one scheduler to another.

On the other hand, the second goal is always a kind of CPU/memory tradeoff.
I fully agree that the fact that weak reference caching also helps with this one looks like
a coincidence.

Worse, some frameworks in some situations (for example, Marathon with a crashlooping app)
will find themselves constantly switching a constraint (or, more likely, a whole constraint
group) on/off. 

Weak caching does not help at all in this case; this is why I'm using an LRU, but probably
it is worth looking for something better than a global LRU with an arbitrary fixed size of
1024.

I would expect that the constraints are extremely unlikely to "jump" from one framework to
another, so this indeed can be solved by a form of per-framework caching (another option would
be making the filter mutable and copyable, but I would rather avoid that...)

In any case, to accomplish the second goal we need eviction criteria.
One of them should be the following: we do not evict a RE2 until after a new "version" of
the framework's filter is created. As you say, it's better to keep this porperty explicit.
This basically means caching all RE2s used for creating the latest version of a framework's
filter.

Probably, to address the question of "blinking" constraints, we should consider caching RE2s
for several recent versions (looks like a resizeable per-framework LRU?)

What do you think?


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72786/#review221667
-----------------------------------------------------------


On Aug. 17, 2020, 8:23 p.m., Andrei Sekretenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72786/
> -----------------------------------------------------------
> 
> (Updated Aug. 17, 2020, 8:23 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-10173
>     https://issues.apache.org/jira/browse/MESOS-10173
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Implemented regex-based offer constraints.
> 
> 
> Diffs
> -----
> 
>   include/mesos/allocator/allocator.hpp c700528e14bb42f6cea37f42dd7b72eb76a1a6b9 
>   src/master/allocator/mesos/offer_constraints_filter.cpp PRE-CREATION 
>   src/master/master.hpp 214307fcae47905672260758a1b96a034ed80257 
>   src/master/master.cpp 6a013e334b19bd108791d1c5fd0864c710aae8cb 
>   src/tests/master/offer_constraints_filter_tests.cpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/72786/diff/4/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message