mesos-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [mesos] cf-natali opened a new pull request #380: Fixed a bug preventing agent recovery when executor GC is interrupted.
Date Sat, 30 Jan 2021 20:04:41 GMT

cf-natali opened a new pull request #380:
URL: https://github.com/apache/mesos/pull/380


   If the agent is interrupted after garbage collecting the executor's
   latest run meta directory but before garbage collecting the top-level
   executor meta directory, the "latest" symlink is left dangling, which
   causes executor recovery to fail when the agent restarts.
   Instead of failing, recovery can simply ignore a dangling "latest"
   symlink: the symlink is always created after the run directory it
   points to and is never deleted until the top-level executor meta
   directory is garbage collected, so a dangling symlink can only mean
   that the run it pointed to has already been garbage collected.
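
The idea can be sketched outside of Mesos with `std::filesystem`. This is only an illustration under stated assumptions, not the code in this pull request: `findLatestRun` is a hypothetical helper, and the only thing taken from the description above is the layout of a `runs` directory containing a `latest` symlink. A missing or dangling `latest` is reported as "no run to recover" rather than as an error.

```cpp
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Mesos): returns the run directory that
// `<runsDir>/latest` resolves to, or std::nullopt when the symlink is
// missing or dangling, so that the caller can skip the executor instead
// of failing recovery.
std::optional<fs::path> findLatestRun(const fs::path& runsDir)
{
  const fs::path latest = runsDir / "latest";
  std::error_code ec;

  // symlink_status() inspects the link itself without following it.
  if (!fs::is_symlink(fs::symlink_status(latest, ec))) {
    return std::nullopt;
  }

  // exists() follows the link, so it is false when the target run
  // directory has already been garbage collected (dangling symlink).
  if (!fs::exists(latest, ec)) {
    return std::nullopt;
  }

  // Resolve the link to the actual run directory.
  fs::path resolved = fs::canonical(latest, ec);
  if (ec) {
    return std::nullopt;
  }

  return resolved;
}
```

A caller would treat an empty result as an executor with nothing left to recover and move on to the next one.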
   
   
   Example logs showing the problem:
   
   Agent GC'ing the directory:
   ```
   I0129 22:38:45.060012 28292 slave.cpp:7107] Executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055' of framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002 exited with status 0
   I0129 22:38:45.060871 28292 slave.cpp:7218] Cleaning up executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055' of framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002 at executor(1)@127.0.1.1:40075
   [...]
   I0129 22:38:45.061872 29250 gc.cpp:95] Scheduling '/tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab' for gc 4.938180864secs in the future
   I0129 22:38:45.061939 29250 gc.cpp:95] Scheduling '/tmp/tmp2y330b17mesos_agent_work_dir/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055' for gc 4.93812992secs in the future
   [...]
   I0129 22:38:50.019327 29251 gc.cpp:272] Deleting /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
   I0129 22:38:50.019573 29251 gc.cpp:288] Deleted '/tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab'
   ```
   
   The agent was then killed before it could GC `/tmp/tmp2y330b17mesos_agent_work_dir/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055`.
   
   Then the agent restarted:
   ```
   [...]
   E0129 22:38:54.942884 29402 slave.cpp:8355] EXIT with status 1: Failed to perform recovery: Failed to recover framework 1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002: Failed to recover executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055': Failed to find latest run of executor 'task-72954d99-5719-414f-b7d9-5f35c5d70055': No such file or directory
   ```
   
   We can see that `latest` for executor `task-72954d99-5719-414f-b7d9-5f35c5d70055` points to run `fa5986f6-777b-42fb-88b4-e4ce339c21ab`, which has already been GCed.
   ```
   cf@thinkpad:~/src/mesos$ ls -l /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/latest
   lrwxrwxrwx 1 cf cf 235 janv. 29 22:28 /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/latest -> /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
   cf@thinkpad:~/src/mesos$ ls -l /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/
   total 4
   lrwxrwxrwx 1 cf cf 235 janv. 29 22:28 latest -> /tmp/tmp2y330b17mesos_agent_work_dir/meta/slaves/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-S5/frameworks/1f2209bb-43e4-4b8b-b36c-3fa1e855a0f1-0002/executors/task-72954d99-5719-414f-b7d9-5f35c5d70055/runs/fa5986f6-777b-42fb-88b4-e4ce339c21ab
   cf@thinkpad:~/src/mesos$ 
   ```
   
   @bbannier  




