mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Qian Zhang <zhang...@cn.ibm.com>
Subject Re: Review Request 46187: Terminate when receiving the ACK of terminal status update.
Date Wed, 27 Apr 2016 14:31:14 GMT


> On April 15, 2016, 12:49 a.m., Vinod Kone wrote:
> > src/launcher/http_command_executor.cpp, line 749
> > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749>
> >
> >     Looking at slave::statusUpdate() code there are several scenarios where the
slave ignores a status update sent by the executor; this means this executor could end up
not terminating forever.
> >     
> >     Can you do the following:
> >     
> >     --> Enque a message in the queue to self terminate after some timeout (you
can use the delay() function) to be safe.
> >     
> >     --> Add a TODO that we do this to be safe and also because slave sometimes
doesn't ACK a status update. Link to a ticket that fixes the slave status update semantics
to always ACK a status update sent by an executor.
> >     
> >     sounds good?
> 
> Vinod Kone wrote:
>     @Qian, any update on this? If this particular review is going to take some time,
I think it is still useful two commit the other 2 reviews in this chain. AFAICT, they are
independent of this review?
> 
> Qian Zhang wrote:
>     @Vinod, sorry for the late. I have filed a ticket (https://issues.apache.org/jira/browse/MESOS-5262)
for enhancing `slave::statusUpdate()` to always ACK the status update sent by executor.
>     
>     And can you please elaborate about the specific scenarios this executor could not
terminate forever. Originially I thought the scenario should be: executor sends a terminal
status upate to slave when the corresponding framework is in `TERMINATING` state (e.g., operator
tears down the framework), then in `Slave::statusUpdate()`, this status update will be ignored,
so the executor will not get the ACK. But after testing, I found in this case the executor
can still terminate, because the container corresponded to this executor will be destroyed
by `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so after `--executor_shutdown_grace_period`,
the executor can still terminate. So I am not in which case the executor will never terminate.
>     
>     And yes, the other 2 patches are independent of this one, I will make them not depending
on this one in the review board, thanks!
> 
> Qian Zhang wrote:
>     After more thinking, I see one scenario the executor could never terminate is: agent
is down right after it sends SHUTDOWN event to executor. In this case, when handling the SHUTDOWN
event, executor will kill the task and send TASK_KILLED status update to agent, but it will
not get ACK since agent is already down, so the executor will still run. But I think once
agent is started again, executor will receive the ACK and then terminate. I am not sure if
this behavior is OK, or we want executor to terminate once it receives the SHUTDOWN event
rather than wait for agent is started again?
> 
> Vinod Kone wrote:
>     I think it is the right thing for the executor to stay up until the agent comes back
up and sends an ACK. Otherwise, which is currently the case, the agent has no idea about the
exit status of the executor.
>     
>     So the cases where I see no ACKs from the agent are:
>     
>     1) `update` doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably shutdown
the executor instead of just ignoring.
>     
>     2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to Executor
object?
>     
>     3) Framework is terminating.
>       --> Per your observation the agent is guaranteed to destroy the container. We
need to add a comment here explainining this.
>     
>     4) Executor is unknown. While we forward the status update to the framework, the
ACK is dropped by `_statusUpdate` and `___statusUpdate()`
>        --> If the update is generated by the agent (`killTask()`, `_runTask()`) we
should be fine because there is no executor.
>        --> If the update is sent by an executor for a task belonging to another executor,
that another executor is hopefully already dead. So we should be fine.
>        --> Is it possible for the update to be sent by an executor that is not known
to the agent? Don't think we can do much here since we don't have access to the Executor object.

> I think it is the right thing for the executor to stay up until the agent comes back
up and sends an ACK. Otherwise, which is currently the case, the agent has no idea about the
exit status of the executor.

But that seems conflict with your original comment. My understanding to your original comment
is, to avoid executor not terminating forever, executor needs to enque a message to self terminate
after some timeout (I think we can do it via delay() in `reaped()` after terminal status update
is sent), if so, then when executor handles SHUTDOWN event, it will self terminate anyway
during agent down, so it is not possible to make executor stay up until agent comes back.

With this patch, I think the behavior of HTTP commmand executor and driver based command executor
will be different, let's see this scenario: a checkpoint-enabled frameworks launches a "sleep
100" task, after the task is running, agent is down, and during agent down, the task finishes,
and then agent is started again. In this case, after agent is started again, for driver-based
command executor, framework will receive a TASK_FAILED since the executor has terminated during
agent is down, but for HTTP command executor (with this patch applied), framework will receive
a TASK_FINISHED since the executer will NOT terminate until it receives the ACK for the TASK_FINISHED.

I think we need to make them have consistent behavior. Any suggestions? :-)

> 1) update doesn't have UUID.
  --> Is this possible with the HTTP executor? If not, we should probably shutdown the
executor instead of just ignoring.

This is not possible with HTTP command executor, because in `HttpCommandExecutor::update()`
we always set UUID in the status upate before sending it to agent.

> 2) Framework is unknown.
  --> Don't think we can do much here because we don't have access to Executor object?

I do not think this will happen in our case, because framework will only be removed (`Slave::removeFramework()`)
when it has no executor (`framework->executors.empty()`), but in our case, it must have
one executor (the HTTP command executor).


> 4) Executor is unknown. While we forward the status update to the framework, the ACK
is dropped by _statusUpdate and ___statusUpdate()

>   --> If the update is generated by the agent (killTask(), _runTask()) we should be
fine because there is no executor.

>   --> If the update is sent by an executor for a task belonging to another executor,
that another executor is hopefully already dead. So we should be fine.

>   --> Is it possible for the update to be sent by an executor that is not known to
the agent? Don't think we can do much here since we don't have access to the Executor object.

I do not think it is possible that an executor sends a update to agent but agent does not
know this executor.


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review128916
-----------------------------------------------------------


On April 27, 2016, 9:23 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated April 27, 2016, 9:23 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-4855
>     https://issues.apache.org/jira/browse/MESOS-4855
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 0b4136c040ec9cf585c0d38f8471495a855cd640 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message