mesos-reviews mailing list archives

From Qian Zhang <zhq527...@gmail.com>
Subject Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.
Date Fri, 15 Feb 2019 09:04:15 GMT


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> >     
> >     However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** in case of orphan containers.

> Given that state::checkpoint is atomic, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.

Agreed. I also see the comment `// This could happen if the slave died after opening the file
for writing but before it checkpointed anything.` in a couple of places in the Mesos code (e.g.,
`slave/state.cpp`, `metadata_manager.cpp`). Since that explanation no longer matches an atomic
checkpoint, I think those comments need to be updated as well.
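
To illustrate the distinction above, here is a minimal sketch of a write-to-temp-then-rename checkpoint on POSIX. It is not the Mesos `state::checkpoint` implementation; the helper name `durableCheckpoint` and the `.tmp` suffix are assumptions made for the example. The rename makes the update atomic with respect to an agent crash, while the `fsync` is what guards against the file reaching disk empty after a hard reboot of the host.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not a Mesos API): writes `data` to `path + ".tmp"`,
// flushes it to disk, then renames it over `path`. Returns true on success.
bool durableCheckpoint(const std::string& path, const std::string& data)
{
  assert(!path.empty());

  const std::string temp = path + ".tmp";

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return false;
  }

  // write(2) may write fewer bytes than requested, so loop until done.
  size_t written = 0;
  while (written < data.size()) {
    ssize_t n = ::write(fd, data.data() + written, data.size() - written);
    if (n < 0) {
      ::close(fd);
      ::unlink(temp.c_str());
      return false;
    }
    written += static_cast<size_t>(n);
  }

  // Without this fsync the data may still sit only in the page cache when
  // the rename becomes visible, which is how an empty file can survive a
  // hard reboot.
  if (::fsync(fd) != 0) {
    ::close(fd);
    ::unlink(temp.c_str());
    return false;
  }

  if (::close(fd) != 0) {
    ::unlink(temp.c_str());
    return false;
  }

  // rename(2) atomically replaces `path`, so readers see either the old
  // file or the complete new one, never a partially written file.
  if (std::rename(temp.c_str(), path.c_str()) != 0) {
    ::unlink(temp.c_str());
    return false;
  }

  return true;
}
```

For the rename itself to survive a power loss, the parent directory would also need to be fsynced; that step is left out here to keep the sketch short.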

> As we have agreed with Gilbert, we need to ignore empty files only in case of orphan containers.

Can you please elaborate a bit? Why do we want to treat orphan containers and recoverable
containers differently? How will we handle the recoverable containers in this case?


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> In both cases, during recovery of the `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8

> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

