Re: [Condor-users] Leftovers of checkpointed jobs accumulate in SPOOL
- Date: Thu, 12 Apr 2012 11:21:29 +0200
- From: Michael Hanke <michael.hanke@xxxxxxxxx>
- Subject: Re: [Condor-users] Leftovers of checkpointed jobs accumulate in SPOOL
On Mon, Mar 12, 2012 at 05:42:25PM +0100, Michael Hanke wrote:
> On Mon, Mar 12, 2012 at 11:24:09AM -0500, Alan De Smet wrote:
> > Michael Hanke <michael.hanke@xxxxxxxxx> wrote:
> > > I'm testing DMTCP-based checkpointing of vanilla job in our Condor pool
> > > (all version 7.7.5). I noticed that jobs once evicted remain in SPOOL
> > > even after they got restarted on an exec node again. Checkpoint files,
> > > executable, restart script and various other files remain -- I assume
> > > that is just everything.
> >
> > I'm not clear what you're reporting. Files in SPOOL should
> > remain as long as the associated job is still in the queue. Are
> > you saying that the job in question left the queue (is no longer
> > visible in condor_q), but still has a subdirectory in SPOOL?
>
> Yes, that is the case -- sorry for having been vague. It doesn't seem to
> make a difference whether a job terminates normally, or gets
> condor_rm'ed. Whenever such a job got checkpointed once, it leaves its
> remains in SPOOL.
Here are a few more bits of information. I currently have a vanilla job (3009.9)
running that got checkpointed yesterday (by DMTCP); the result is in SPOOL:
% ls -l /var/spool/condor/3009/9/
total 12
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.swap/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.tmp/
% condor_q -l 3009.10 |grep 'Owner'
Owner = "mih"
Right now the job is still running, so I wouldn't expect Condor to
remove the checkpoints. However, the ShadowLog has this:
04/11/12 15:14:20 (3009.9) (11192): Failed to remove /var/spool/condor/3009/9/cluster3009.proc9.subproc0.swap: Permission denied (errno 13)
04/11/12 15:14:20 (3009.9) (11192): Job 3009.9 is being evicted from slot1@snake1
04/11/12 15:15:20 (3009.9) (11192): **** condor_shadow (condor_SHADOW) pid 11192 EXITING WITH STATUS 107
% ps aux |grep 3009.9
mih 7929 0.0 0.0 93700 6796 ? S Apr11 0:02 condor_shadow -f 3009.9 --schedd=<10.0.0.254:47584> --xfer-queue=limit=upload,download;addr=<10.0.0.254:42112> <10.0.0.254:42112> -
It seems like the shadow should have permission to remove this
directory. Maybe the directory was not present at that point and the shadow
simply tried to remove it unconditionally -- but then I would expect
"file not found", not "permission denied".
I tried using a simple 'sleep' test job to get a sense of what is going
on (submit file is attached).
First test: condor_submit -> condor_vacate_job -> condor_rm
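In case someone wants to reproduce this, the sequence was roughly the following
(the submit file name is simply what I call the attachment here):

% condor_submit sleep.submit      # queued as 3029.0 in this run
% condor_vacate_job 3029.0        # graceful eviction -> DMTCP checkpoint ends up in SPOOL
% # ... wait until the job gets matched and starts running again ...
% condor_rm 3029.0                # remove the job while it is running again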
# what the logs say
% grep 3029.0 *
MatchLog:04/12/12 10:52:53 Request 03029.00000:
MatchLog:04/12/12 10:52:53 Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 10:55:51 Request 03029.00000:
MatchLog:04/12/12 10:55:51 Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:52:53 Request 03029.00000:
NegotiatorLog:04/12/12 10:52:53 Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:55:51 Request 03029.00000:
NegotiatorLog:04/12/12 10:55:51 Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 10:52:53 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:52:53 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 11445)
SchedLog:04/12/12 10:54:03 (pid:29046) Shadow pid 11445 for job 3029.0 exited with status 107
SchedLog:04/12/12 10:54:03 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3029.0) deleted
SchedLog:04/12/12 10:55:51 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:55:51 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 14533)
SchedLog:04/12/12 10:58:28 (pid:29046) Shadow pid 14533 for job 3029.0 exited with status 102
SchedLog:04/12/12 10:58:28 (pid:29046) Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 10:52:53 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:52:53 (3029.0) (11445): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:53:55 (3029.0) (11445): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): **** condor_shadow (condor_SHADOW) pid 11445 EXITING WITH STATUS 107
ShadowLog:04/12/12 10:55:51 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:55:51 (3029.0) (14533): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Requesting fast removal of job.
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): **** condor_shadow (condor_SHADOW) pid 14533 EXITING WITH STATUS 102
# SPOOL content AFTER restart
% ll /var/spool/condor/3029/0
total 12K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.swap/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.tmp/
% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih 91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*
# SPOOL content AFTER condor_rm
% ll /var/spool/condor/3029/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/
% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih 0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih 91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*
And the same thing again -- the only difference being that the job exits on
its own with status 0; no condor_rm this time.
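Roughly the same commands as before, just without the final removal (assuming
the same sleep.submit as above; the 240s sleep simply runs out after the
restart):

% condor_submit sleep.submit      # queued as 3030.0 in this run
% condor_vacate_job 3030.0        # checkpoint + eviction, as before
% # ... job restarts and eventually exits with status 0 on its own ...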
# what the logs say
% grep 3030.0 *
MatchLog:04/12/12 11:01:12 Request 03030.00000:
MatchLog:04/12/12 11:01:12 Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 11:03:07 Request 03030.00000:
MatchLog:04/12/12 11:03:07 Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:01:12 Request 03030.00000:
NegotiatorLog:04/12/12 11:01:12 Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:03:07 Request 03030.00000:
NegotiatorLog:04/12/12 11:03:07 Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 11:01:12 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:01:12 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 20197)
SchedLog:04/12/12 11:02:19 (pid:29046) Shadow pid 20197 for job 3030.0 exited with status 107
SchedLog:04/12/12 11:02:19 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:03:07 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:03:07 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 22184)
SchedLog:04/12/12 11:07:11 (pid:29046) Shadow pid 22184 for job 3030.0 reports job exit reason 100.
SchedLog:04/12/12 11:07:11 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:07:11 (pid:29046) Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 11:01:12 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:01:12 (3030.0) (20197): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:02:11 (3030.0) (20197): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Job 3030.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): **** condor_shadow (condor_SHADOW) pid 20197 EXITING WITH STATUS 107
ShadowLog:04/12/12 11:03:07 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:03:07 (3030.0) (22184): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Job 3030.0 terminated: exited with status 0
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Reporting job exit reason 100 and attempting to fetch new job.
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): **** condor_shadow (condor_SHADOW) pid 22184 EXITING WITH STATUS 100
% ll /var/spool/condor/3030/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 11:02 cluster3030.proc0.subproc0/
% ll /var/spool/condor/3030/0/cluster3030.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih 0 Apr 12 11:02 3030.0.err
-rw-r--r-- 1 mih mih 0 Apr 12 11:02 3030.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 11:02 3030.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 11:02 ckpt_sleep_3030_0-1a0c4a68dcce1-13381-4f8699dc.dmtcp
-rw-r--r-- 1 mih mih 0 Apr 12 11:02 _condor_stderr
-rw-r--r-- 1 mih mih 91 Apr 12 11:02 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 11:02 dmtcp_restart_script.sh*
Conclusion: Despite the "permission denied" messages, some SPOOL data does get
removed -- both with condor_rm and at "natural" job exit. However, ALL
DMTCP-related files remain in SPOOL until condor_preen catches them.
% sudo condor_preen -remove
/var/spool/condor/3030/0/cluster3030.proc0.subproc0 - Removed
/var/spool/condor/3029/0/cluster3029.proc0.subproc0 - Removed
But even condor_preen doesn't clean everything:
% ll /var/spool/condor/3030/0
total 0
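For reference, condor_master runs condor_preen periodically on its own; if I
read the configuration documentation correctly, that is controlled by knobs
roughly like these (example values only, not what we actually run):

# condor_config snippet -- example values, just to illustrate the knobs
PREEN          = $(SBIN)/condor_preen
PREEN_ARGS     = -m -r
PREEN_INTERVAL = 3600          # seconds; the default is once per day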
I hope this helps. Thanks,
Michael
--
Michael Hanke
http://mih.voxindeserto.de
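# --- attached submit file for the 'sleep' test job mentioned above ---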
# DMTCP checkpointing related boilerplate (should be applicable in most cases)
universe = vanilla
executable = /usr/lib/condor/shim_dmtcp
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
kill_sig = 2
output = $(CLUSTER).$(PROCESS).shimout
error = $(CLUSTER).$(PROCESS).shimerr
log = $(CLUSTER).$(PROCESS).log
dmtcp_args = --log $(CLUSTER).$(PROCESS).shimlog --stdout $(CLUSTER).$(PROCESS).out --stderr $(CLUSTER).$(PROCESS).err
dmtcp_env = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)
# job-specific setup, uses arguments and environment defined above
arguments = $(dmtcp_args) sleep 240
environment = $(dmtcp_env)
queue 1