mesos-reviews mailing list archives

From Qian Zhang <zhq527...@gmail.com>
Subject Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.
Date Mon, 18 Feb 2019 14:26:03 GMT


> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the state
> >     where the file is empty because the agent did not finish writing to it.
> >     
> >     However, an empty file might occur in case of hard reboot of the agent's host.
> >     This happens because page cache is dumped every 20 seconds by default in Linux.
> >     There is a chance that the file is created, but data has not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** in case
> >     of orphan containers.
> 
> Qian Zhang wrote:
>     > Given that state::checkpoint is atomic, we can not end up in the state where
>     > the file is empty because the agent did not finish writing to it.
>     > However, an empty file might occur in case of hard reboot of the agent's host.
>     > This happens because page cache is dumped every 20 seconds by default in Linux.
>     > There is a chance that the file is created, but data has not yet synced on disk.
>     
>     Agree. And I see the comment `// This could happen if the slave died after opening
>     the file for writing but before it checkpointed anything.` in a couple of places in
>     Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I think those comments
>     need to be updated as well.
>     
>     > As we have agreed with Gilbert, we need to ignore empty files only in case of
>     > orphan containers.
>     
>     Can you please elaborate a bit? Why do we want to treat orphan containers and
>     recoverable containers differently? How will we handle the recoverable containers in
>     this case?
> 
> Andrei Budnik wrote:
>     > I think those comments need to be updated as well.
>     
>     We can update these comments in a separate patch later.
>     
>     > Why do we want to treat orphan containers and recoverable containers differently?
>     > How will we handle the recoverable containers in this case?
>     
>     I think that `recoverable` containers could not have empty state files by construction,
>     as checkpointing state is atomic. If this invariant does not hold, then we definitely
>     have a bug in our code which we need to fix.
>     In the case of hard reboots, this invariant might be broken, but then all containers are
>     `orphan`.
>     
>     If the reasoning above looks acceptable, then we might want to recover from a broken
>     invariant for `orphan` containers, while keeping this error for `non-orphan` containers
>     as an assertion (for us, developers) that the invariant could not be broken in normal
>     circumstances.

Agree, and I posted a patch for updating comments here: https://reviews.apache.org/r/70001/


- Qian
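
For readers following the atomicity argument in the thread, here is a minimal sketch of one common way to make a checkpoint atomic: write to a temporary file, then rename it over the target. It is not the Mesos `state::checkpoint` implementation. The helper name `checkpointAtomically` and the `.tmp` suffix are assumptions made for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not the Mesos API): writes `data` to a temporary
// file next to `path`, then renames it over `path`. On POSIX systems the
// rename replaces the target in one step, so a concurrent reader or a
// crashed writer never leaves a partially written file at `path`.
void checkpointAtomically(const fs::path& path, const std::string& data)
{
  fs::path temp = path;
  temp += ".tmp"; // Assumed suffix, for illustration only.

  {
    std::ofstream out(temp, std::ios::binary | std::ios::trunc);
    if (!out) {
      throw std::runtime_error("Failed to open " + temp.string());
    }

    out << data;

    // flush() hands the bytes to the kernel, not to the disk. Without an
    // explicit fsync, a hard reboot can still lose the data, which is the
    // window described in the thread.
    out.flush();
    if (!out) {
      throw std::runtime_error("Failed to write " + temp.string());
    }
  } // The stream is closed here, before the rename.

  fs::rename(temp, path);
}
```

A process crash between the write and the rename leaves only the temporary file behind, so the target is either absent or complete. This sketch does not call `fsync`, so it does not protect against the hard-reboot case raised by Andrei.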


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8

> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

