mesos-reviews mailing list archives

From Andrei Budnik <abud...@mesosphere.com>
Subject Re: Review Request 72029: Changed termination logic of the default executor.
Date Mon, 03 Feb 2020 12:40:25 GMT


> On Jan. 31, 2020, 12:50 p.m., Qian Zhang wrote:
> > src/launcher/default_executor.cpp
> > Lines 1089-1098 (original), 1095-1104 (patched)
> > <https://reviews.apache.org/r/72029/diff/4/?file=2210076#file2210076line1095>
> >
> >     I see `_shutdown` will be called in some error cases, like:
> >     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
> >     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L1041:L1044
> >     So for such cases, the previous behavior was to self-terminate after sleeping for
> >     1 second, but with your patch the executor now sleeps for 60 seconds. I do not think
> >     we should sleep that long before self-termination in those cases.
> 
> Andrei Budnik wrote:
>     Updated.
> 
> Qian Zhang wrote:
>     I see you have updated `_shutdown` to:
>     ```
>       void _shutdown()
>       {
>         if (unacknowledgedUpdates.empty()) {
>           terminate(self());
>         } else {
>           // This is a fail safe in case the agent doesn't send an ACK for
>           // a status update for some reason.
>           const Duration duration = Seconds(60);
>     
>           LOG(INFO) << "Terminating after " << duration;
>     
>           delay(duration, self(), &Self::__shutdown);
>         }
>       }
>     ```
>     That's also what I thought, and I think it can handle the following cases well.
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L390:L392
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L406:L408
>     
>     But what about the cases like below?
>     https://github.com/apache/mesos/blob/1.9.0/src/launcher/default_executor.cpp#L559:L565
>     
>     In such cases, `unacknowledgedUpdates` is likely not empty and the agent has failed
>     (i.e., no ACKs can be sent to the executor), so the executor will sleep for 60s before
>     self-termination. I think the executor should self-terminate immediately in this case
>     instead. What do you think?

I think it makes sense to wait for 1 minute before terminating in this particular case. If
the connection is lost due to an agent restart, there is a high chance that the agent will
reconnect to the executor later. So it'd be nice to give the executor a chance to resend all
unacknowledged status updates (e.g., TASK_STARTING). Also, I'd say that this case happens rarely.

There are also a few cases where `_shutdown` is called because of an internal error or a bug.
If there are unacknowledged status updates in those cases, we'd better give the executor a
chance to send them as well.
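
To make the rule concrete, here is a minimal standalone sketch of the decision, not the actual Mesos code. `ShutdownPlan` and `planShutdown` are hypothetical names, the set of strings is only a stand-in for the executor's `unacknowledgedUpdates`, and the 60-second default mirrors the fail-safe duration from the patch:

```cpp
#include <cassert>
#include <chrono>
#include <set>
#include <string>

// Hypothetical result type: either terminate right away, or wait for
// `delay` so that pending status updates can still be acknowledged.
struct ShutdownPlan
{
  bool terminateNow;
  std::chrono::seconds delay;
};

// Hypothetical helper modeling the branch in `_shutdown`. The set of
// update IDs is an assumption standing in for `unacknowledgedUpdates`.
ShutdownPlan planShutdown(
    const std::set<std::string>& unacknowledgedUpdates,
    std::chrono::seconds failSafe = std::chrono::seconds(60))
{
  if (unacknowledgedUpdates.empty()) {
    // Nothing is waiting for an ACK, so there is no reason to linger.
    return ShutdownPlan{true, std::chrono::seconds(0)};
  }

  // Fail-safe: the agent may reconnect and acknowledge the updates, so
  // wait up to `failSafe` before terminating anyway.
  return ShutdownPlan{false, failSafe};
}
```

With this model, a lost agent connection and an internal error behave the same way: the wait happens only when updates are still pending, and it is bounded by the fail-safe duration.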


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219448
-----------------------------------------------------------


On Jan. 30, 2020, 3:28 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> -----------------------------------------------------------
> 
> (Updated Jan. 30, 2020, 3:28 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-8537
>     https://issues.apache.org/jira/browse/MESOS-8537
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor once all status updates have
> been acknowledged by the agent and no running containers are left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c 
> 
> 
> Diff: https://reviews.apache.org/r/72029/diff/5/
> 
> 
> Testing
> -------
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>

