mesos-reviews mailing list archives

From "Ben Mahler" <benjamin.mah...@gmail.com>
Subject Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.
Date Fri, 19 Jun 2015 22:43:51 GMT


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate situation
where a slave has neglected to get the dynamic reservation because it was just starting up
and then it gets the task launch it will shutdown the slave because the CHECK will fail? I
would expect the slave to simply send a TASK_LOST. Said another way, this is not an assertion
our code guarantees. If instead we were waiting for some kind of an ack from the slave that
it received the dynamic reservation before the master sends the task launch then a CHECK would make
sense.
> 
> Jie Yu wrote:
>     We don't expect this to happen because we always send a CheckpointResourcesMessage
before sending the task to the slave and TCP ensures in-order delivery (out-of-order delivery
is possible if two sockets are used; that can happen because of the way we create ephemeral connections,
but this is very unlikely to happen). Master won't send the task to the slave if the slave
hasn't registered.
>     
>     I would rather keep the CHECK here unless we found that this is a real issue (and
then we can change that to send status update).
> 
> Michael Park wrote:
>     So currently it is possible for this to happen, but only with a very small probability.
Your proposal is to keep the `CHECK` and put in the effort to eliminate the possibility once
we observe it as a real problem, correct? The part that I don't quite understand is, what's
the motivation to wait for a real problem to occur when we know it's possible to run into
this issue (even with a small probability), the effort to change the `CHECK` to sending `TASK_LOST`
seems to be small?
> 
> Jie Yu wrote:
>     Well, everything has a probability to fail, the question is how large the probability
is. Memory could have hardware errors and a bit could be flipped due to random reasons, does
that mean that we have to do parity check in every single location in our code base? I think
my point is the probability for this to fail is extremely low so that we shouldn't worry too
much.
>     
>     I am fine with sending a status update.
> 
> Alexander Rukletsov wrote:
>     I wonder, what are the cases when the task launch request may arrive before `CheckpointResourcesMessage`?
If my understanding is correct, we do not have delivery guarantee for `CheckpointResourcesMessage`,
nor do we have the same queue in Master for `CheckpointResourcesMessage` and `RunTaskMessage`
to ensure the order. My intuition is that the probability of such an event is not negligible:
a network blip can occur and `CheckpointResourcesMessage` may be lost or delayed, we can open
another socket to the slave for `RunTaskMessage`. Could you please help me understand that?

Such a situation will manifest as an `Exited` event from the socket closure. At the application
level, we want to ensure that if there are any `Exited` events, the slave (or framework) will
re-register. This is not fully implemented yet: only the master-side `Exited` is handled
(we ping the slave telling it we think it is disconnected), while the slave-side
`Exited` is a no-op.
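
To make the intended behavior concrete, here is a minimal sketch. The `Slave` and `Master` structs and their methods are hypothetical stand-ins, not the actual Mesos or libprocess classes, and re-registering from the slave-side `Exited` is an assumption about the desired behavior, not what the code does today.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the real
// Mesos or libprocess types.
struct Slave {
  bool registered = true;
  std::vector<std::string> outbox;  // Messages the slave would send.

  // Begin re-registration so that both sides resynchronize.
  void reregister() {
    registered = false;
    outbox.push_back("reregister");
  }

  // Slave-side `Exited`: a no-op today according to this thread;
  // re-registering here is the assumed desired behavior.
  void exited() { reregister(); }

  // The master's ping saying whether it believes the slave is connected.
  void pinged(bool connected) {
    if (!connected) {
      reregister();
    }
  }
};

struct Master {
  // Master-side `Exited`: ping the slave to say we think it is disconnected.
  void exited(Slave& slave) { slave.pinged(false); }
};

// Whichever side notices the broken socket, the slave ends up re-registering.
inline void sketch() {
  Slave a;
  Master master;
  master.exited(a);
  assert(!a.registered && a.outbox.size() == 1);

  Slave b;
  b.exited();
  assert(!b.registered && b.outbox.back() == "reregister");
}
```

The point of the sketch is only that both `Exited` paths converge on a re-registration; how the real messages are ordered and retried is out of scope here.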

It may become simpler with the HTTP API since we have a single duplex socket (the master does
not initiate a connection with the slave (or framework)). This means that the responsibility
of dealing with a closed socket is left to the slave (or framework) only. Off the top of my
head, I'm not sure if there are situations where only one side of the socket can be broken,
so maybe it will be just as complicated :)

Let's discuss off this thread; I do have some tickets around this stuff.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
-----------------------------------------------------------


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> -----------------------------------------------------------
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
>     https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic reservations
patches.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>

