mesos-reviews mailing list archives

From Chun-Hung Hsiao <chhs...@apache.org>
Subject Re: Review Request 69812: Implemented the RPC retry logic for SLRP.
Date Wed, 23 Jan 2019 22:49:31 GMT


> On Jan. 23, 2019, 3:33 p.m., James DeFelice wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1904 (patched)
> > <https://reviews.apache.org/r/69812/diff/1/?file=2121403#file2121403line1904>
> >
> >     There are a few other calls in the spec that might return `RESOURCE_EXHAUSTED`,
> >     which is also mitigated by backoff. Consider adding that case as well.
> >     
> >     Furthermore, some calls may return `NOT_FOUND`, which may also be mitigated
> >     by a retry. It's not clear that the SLRP has enough information for a retry in every such
> >     case. Needs more thought.

Right, there are a couple of other error statuses we could consider retrying. However, taking `RESOURCE_EXHAUSTED`
as an example, it is a retryable error for `CreateVolume` and `ControllerPublishVolume`
given some pre-conditions, but not a retryable error for `CreateSnapshot` in the latest CSI
spec. In the future we could build up a per-call retry policy that contains a list of retryable
errors with their associated pre-conditions. But for now I'm being conservative and sticking
with what https://grpc.io/grpc/cpp/namespacegrpc.html#aff1730578c90160528f6a8d67ef5c43b states,
as a guideline for retries in general. Dropping. Please reopen it if you feel we should address
this right now.
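
For illustration only, here is a minimal sketch of the conservative policy described above: retry only on `DEADLINE_EXCEEDED` or `UNAVAILABLE`, and pick a random delay under an exponentially growing bound. This is not the code in the patch. The `StatusCode` enum is a hypothetical stand-in for the gRPC status codes, and the names `isRetryable` and `backoff`, the cap on the delay, and the shift limit are all assumptions made for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical stand-in for the gRPC status codes discussed in this thread.
enum class StatusCode
{
  OK,
  DEADLINE_EXCEEDED,
  NOT_FOUND,
  RESOURCE_EXHAUSTED,
  UNAVAILABLE
};

// Conservative general policy: only these two statuses are retried,
// regardless of which CSI call returned them.
inline bool isRetryable(StatusCode code)
{
  return code == StatusCode::DEADLINE_EXCEEDED ||
         code == StatusCode::UNAVAILABLE;
}

// Returns a random delay in [0, min(base * 2^attempt, cap)].
// Assumes `base` and `cap` are non-negative and small enough that
// `base * 2^20` fits in 64 bits.
template <typename Rng>
std::chrono::milliseconds backoff(
    unsigned attempt,
    std::chrono::milliseconds base,
    std::chrono::milliseconds cap,
    Rng& rng)
{
  // Clamp the exponent so the doubling cannot overflow.
  const unsigned shift = std::min(attempt, 20u);

  const std::int64_t upper = std::min<std::int64_t>(
      static_cast<std::int64_t>(base.count()) << shift,
      static_cast<std::int64_t>(cap.count()));

  std::uniform_int_distribution<std::int64_t> dist(0, upper);
  return std::chrono::milliseconds(dist(rng));
}
```

A caller would loop while `isRetryable(status)` holds, sleeping for `backoff(attempt++, base, cap, rng)` between attempts. A per-call policy would presumably replace `isRetryable` with a lookup keyed by the RPC, so that `RESOURCE_EXHAUSTED` could be retried for `CreateVolume` but not for `CreateSnapshot`.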


> On Jan. 23, 2019, 3:33 p.m., James DeFelice wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1916 (patched)
> > <https://reviews.apache.org/r/69812/diff/1/?file=2121403#file2121403line1916>
> >
> >     what about a metric for call retries?

`resource_providers/<type>.<name>/csi_plugin/rpcs/<rpc>/errors` should be
a good approximation. I'll create a follow-up patch for finer-grained error metrics, but probably
won't backport it. Is that good enough?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69812/#review212242
-----------------------------------------------------------


On Jan. 23, 2019, 7:10 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69812/
> -----------------------------------------------------------
> 
> (Updated Jan. 23, 2019, 7:10 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9517
>     https://issues.apache.org/jira/browse/MESOS-9517
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the CSI plugin returns a retryable error (i.e., `DEADLINE_EXCEEDED`
> or `UNAVAILABLE`) for `CreateVolume` or `DeleteVolume` CSI calls, SLRP
> will now retry indefinitely with a random exponential backoff.
> 
> 
> Diffs
> -----
> 
>   src/csi/client.hpp 5d40d54c2abbd03993ce8835d37db23e209c7554 
>   src/csi/client.cpp 61ed410985099828a2f58b1527ab57daa4b379df 
>   src/resource_provider/storage/provider.hpp 331f7b785b14b814c2889488effd53f3a48a1b95 
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
> 
> 
> Diff: https://reviews.apache.org/r/69812/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> A unit test will be added later in the chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

