mesos-reviews mailing list archives

From "Jie Yu" <yujie....@gmail.com>
Subject Re: Review Request 35433: CHECK that checkpointed resources exist on the slave.
Date Wed, 17 Jun 2015 00:33:06 GMT


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean that if we happen to get into the unfortunate
> > situation where a slave has failed to get the dynamic reservation because it was just
> > starting up, and it then gets the task launch, the slave will shut down because the CHECK
> > fails? I would expect the slave to simply send a TASK_LOST. Said another way, this is not
> > an assertion our code guarantees. If instead we were waiting for some kind of ack from the
> > slave that it received the dynamic reservation before the task launch was sent, then a
> > CHECK would make sense.
> 
> Jie Yu wrote:
>     We don't expect this to happen because we always send a CheckpointResourcesMessage
>     before sending the task to the slave, and TCP ensures in-order delivery. (Out-of-order
>     delivery is possible if two sockets are used, which can happen because of the way we
>     create ephemeral connections, but this is very unlikely.) The master won't send the task
>     to the slave if the slave hasn't registered.
>     
>     I would rather keep the CHECK here unless we find that this is a real issue (and
>     then we can change it to send a status update).
> 
> Michael Park wrote:
>     So currently it is possible for this to happen, but only with a very small probability.
>     Your proposal is to keep the `CHECK` and put in the effort to eliminate the possibility
>     once we observe it as a real problem, correct? The part that I don't quite understand is:
>     what's the motivation to wait for a real problem to occur when we know it's possible to
>     run into this issue (even with a small probability) and the effort to change the `CHECK`
>     to sending `TASK_LOST` seems to be small?

Well, everything has some probability of failing; the question is how large that probability
is. Memory can have hardware errors and a bit can be flipped for random reasons. Does that
mean we have to do a parity check in every single location in our code base? My point is
that the probability of this failing is so low that we shouldn't worry too much about it.

I am fine with sending a status update.
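
To make the two options concrete, here is a rough sketch of the difference. This is not the
actual code in src/slave/slave.cpp: `Resources`, `StatusUpdate`, `contains`, `launchWithCheck`
and `launchOrLose` are all hypothetical stand-ins, and `assert` stands in for glog's `CHECK`
(unlike `assert`, `CHECK` also aborts in release builds).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
using Resources = std::map<std::string, double>;  // resource name -> amount

enum class TaskState { TASK_RUNNING, TASK_LOST };

struct StatusUpdate {
  std::string taskId;
  TaskState state;
  std::string message;
};

// Returns true if every resource in `required` is present in `checkpointed`
// with at least the required amount.
bool contains(const Resources& checkpointed, const Resources& required) {
  for (const auto& entry : required) {
    auto it = checkpointed.find(entry.first);
    if (it == checkpointed.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}

// Option A: treat "checkpointed resources exist" as an invariant.
// If the invariant is violated, the whole process aborts.
StatusUpdate launchWithCheck(const std::string& taskId,
                             const Resources& checkpointed,
                             const Resources& required) {
  assert(contains(checkpointed, required));  // stand-in for CHECK(...)
  return StatusUpdate{taskId, TaskState::TASK_RUNNING, ""};
}

// Option B: treat the missing resources as a recoverable condition and
// report TASK_LOST for this task only; the process keeps running.
StatusUpdate launchOrLose(const std::string& taskId,
                          const Resources& checkpointed,
                          const Resources& required) {
  if (!contains(checkpointed, required)) {
    return StatusUpdate{taskId, TaskState::TASK_LOST,
                        "Checkpointed resources not found on the slave"};
  }
  return StatusUpdate{taskId, TaskState::TASK_RUNNING, ""};
}
```

With Option B, a task launch that races ahead of the CheckpointResourcesMessage costs one lost
task rather than the whole slave process.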


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
-----------------------------------------------------------


On June 15, 2015, 12:39 p.m., Michael Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> -----------------------------------------------------------
> 
> (Updated June 15, 2015, 12:39 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Jie Yu.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> No bug has been observed (yet), but I realized I forgot about this in the dynamic
> reservations patches.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 67732a40ef683cb0188425f0bba8fe8ab83e461c 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>

