mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jie Yu <yujie....@gmail.com>
Subject Re: Review Request 58900: Changed the default fetcher cache directory.
Date Tue, 09 May 2017 15:26:36 GMT


> On May 8, 2017, 10:35 p.m., Joseph Wu wrote:
> > src/slave/containerizer/fetcher.cpp
> > Lines 261 (patched)
> > <https://reviews.apache.org/r/58900/diff/2/?file=1710782#file1710782line295>
> >
> >     This patch is still missing cleanup for this directory.
> >     
> >     Before this patch, we would "leak" fetcher cache directories whenever the SlaveID
changed.  This change will make the fetcher leak the cache every time the agent is started.
> >     
> >     I'm considering the following options:
> >     1) Change the `--fetcher_cache_dir` to be the cache directory, rather than the
_parent_ of the cache directory.  This has some implications when running multiple agents
one a single machine (as the default value of the flag will no longer be acceptable).  But
with a single cache directory, managing leaks is as easy as clearing the directory on startup.
> >     2) Delete the fetcher cache directory in the fetcher's destructor.  This will
only really do anything when the agent is gracefully shut down.  But graceful shutdowns can
be considered somewhat rare.
> >     3) Checkpoint the PID of the current process inside the fetcher cache directory.
 Whenever the agent starts, it could scan for these PIDs and check which processes are still
running.  We can then delete the caches of dead processes.

First of all, i think making `--fetcher_cache_dir` be the cache directory makes sense to me.
I won't be too worried about the case of running multiple agents. If we want to run multiple
agents, there are many other flags that we need to make unique as well (e.g., runtime_dir,
cgroups_root, etc.)

For 3), can we avoid checkpointing. For instance, can we introduce a `trash` directory under
`fetcher_cache_dir`, and rely on `rename` to atomically move the cached artifacts to the trash
directory for removal.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58900/#review174232
-----------------------------------------------------------


On May 8, 2017, 10:20 p.m., Joseph Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58900/
> -----------------------------------------------------------
> 
> (Updated May 8, 2017, 10:20 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-7304
>     https://issues.apache.org/jira/browse/MESOS-7304
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The fetcher cache directory was historically located (by default)
> in `/tmp/mesos/fetch`.  The agent flag `--fetcher_cache_dir` could
> be used to change this value.
> 
> The fetcher would create a subdirectory underneath `/tmp/mesos/fetch`
> for each `SlaveID`.  This was done because multiple agents can run on
> the same node.  If all the agents use the same default fetcher cache
> directory, they will collide and cause unpredictable results.
> As a result, the `SlaveID` needed to be passed into the fetcher
> after the agent recovers and/or registers with the master, because
> that is when the `SlaveID` is determined.
> 
> This changes the default fetcher cache directory to
> `/tmp/mesos/fetch/XXXXXX`, where the 6 X's are replaced by `::mkdtemp`.
> The `SlaveID` subdirectory has also been removed.
> 
> This change, while techically a breaking change, is safe because of
> how the fetcher uses this directory.  Upon starting up, the fetcher
> "recovers" by clearing this directory.  By using a temporary directory,
> we similarly get an empty fetcher cache upon startup.
> 
> This change will only cause breakages if multiple agents are run
> with the same `--fetcher_cache_dir`.  In this case, each agent
> will delete the fetcher caches of all the other agents.
> 
> ---
> 
> With the removal of the `SlaveID` field in the fetcher's methods,
> it is no longer necessary to pass in the `SlaveID` or agent Flags
> at agent recovery time.  Instead, the flags can be passed in during
> the fetcher's construction.
> 
> Similarly, the fetcher's "recovery" (clearing the fetcher cache)
> can be done immediately upon construction, which simplifies the
> code slightly.
> 
> 
> Diffs
> -----
> 
>   src/local/local.cpp e47980929db2da1f31cf899a0e1fc452070e11f3 
>   src/slave/containerizer/fetcher.hpp 9e3018dc087ed55c61b2824d0105bc5339b83043 
>   src/slave/containerizer/fetcher.cpp a910fea5a5556afb376524c5bb2ff98d7d84e611 
>   src/slave/flags.hpp e5784ef81ad0720c7ec061ee0b28b8fadae77afd 
>   src/slave/flags.cpp ed99fadbf1aa91f7f0a57c9fb351c0247a40c6f4 
>   src/slave/main.cpp 507d59996a90f51c7cd1e0f836a34c2e9c484791 
>   src/slave/slave.cpp a37c6c888e7573209aadb07576cfb727fa1ec4ff 
> 
> 
> Diff: https://reviews.apache.org/r/58900/diff/2/
> 
> 
> Testing
> -------
> 
> See last patch in chain.
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message