
Re: [Condor-users] Leftovers of checkpointed jobs accumulate in SPOOL



On Mon, Mar 12, 2012 at 05:42:25PM +0100, Michael Hanke wrote:
> On Mon, Mar 12, 2012 at 11:24:09AM -0500, Alan De Smet wrote:
> > Michael Hanke <michael.hanke@xxxxxxxxx> wrote:
> > > I'm testing DMTCP-based checkpointing of vanilla jobs in our Condor pool
> > > (all version 7.7.5). I noticed that jobs, once evicted, remain in SPOOL
> > > even after they have been restarted on an exec node. Checkpoint files,
> > > the executable, the restart script and various other files remain -- I
> > > assume that is just everything.
> > 
> > I'm not clear what you're reporting.  Files in SPOOL should
> > remain as long as the associated job is still in the queue.  Are
> > you saying that the job in question left the queue (is no longer
> > visible in condor_q), but still has a subdirectory in SPOOL? 
> 
> Yes, that is the case -- sorry for having been vague. It doesn't seem to
> make a difference whether a job terminates normally or gets
> condor_rm'ed: whenever such a job has been checkpointed at least once,
> it leaves its remains in SPOOL.

Here are a few more bits of information. I currently have a vanilla job
(3009.9) running that was checkpointed yesterday (by DMTCP); the result is
in SPOOL:

% ls -l /var/spool/condor/3009/9/
total 12
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.swap/
drwxr-xr-x 2 mih mih 4096 Apr 11 15:14 cluster3009.proc9.subproc0.tmp/

% condor_q -l 3009.9 |grep 'Owner'
Owner = "mih"

Right now the job is still running, so I wouldn't expect Condor to
remove the checkpoints. However, the ShadowLog has this:

04/11/12 15:14:20 (3009.9) (11192): Failed to remove /var/spool/condor/3009/9/cluster3009.proc9.subproc0.swap: Permission denied (errno 13)
04/11/12 15:14:20 (3009.9) (11192): Job 3009.9 is being evicted from slot1@snake1
04/11/12 15:15:20 (3009.9) (11192): **** condor_shadow (condor_SHADOW) pid 11192 EXITING WITH STATUS 107

% ps aux |grep 3009.9
mih       7929  0.0  0.0  93700  6796 ?        S    Apr11   0:02 condor_shadow -f 3009.9 --schedd=<10.0.0.254:47584> --xfer-queue=limit=upload,download;addr=<10.0.0.254:42112> <10.0.0.254:42112> -

It seems like the shadow should have permission to remove this
directory. Maybe it was not present at that point and the shadow just
tried to remove it unconditionally -- but then it should say "file not
found" and not "permission denied".

I tried using a simple 'sleep' test job to get a sense of what is going
on (the submit file is appended at the end of this mail).

First test: condor_submit -> condor_vacate_job -> condor_rm
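
That is roughly this sequence (the submit file name is mine; I waited
for the job to restart from its checkpoint before the condor_rm):

% condor_submit sleep.submit      # queued as 3029.0 here
% condor_vacate_job 3029.0        # evict; the DMTCP checkpoint gets spooled
% condor_rm 3029.0                # remove the restarted job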

# what the logs say
% grep 3029.0 *
MatchLog:04/12/12 10:52:53     Request 03029.00000:
MatchLog:04/12/12 10:52:53       Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 10:55:51     Request 03029.00000:
MatchLog:04/12/12 10:55:51       Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:52:53     Request 03029.00000:
NegotiatorLog:04/12/12 10:52:53       Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 10:55:51     Request 03029.00000:
NegotiatorLog:04/12/12 10:55:51       Matched 3029.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 10:52:53 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:52:53 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 11445)
SchedLog:04/12/12 10:54:03 (pid:29046) Shadow pid 11445 for job 3029.0 exited with status 107
SchedLog:04/12/12 10:54:03 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3029.0) deleted
SchedLog:04/12/12 10:55:51 (pid:29046) Starting add_shadow_birthdate(3029.0)
SchedLog:04/12/12 10:55:51 (pid:29046) Started shadow for job 3029.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 14533)
SchedLog:04/12/12 10:58:28 (pid:29046) Shadow pid 14533 for job 3029.0 exited with status 102
SchedLog:04/12/12 10:58:28 (pid:29046) Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 10:52:53 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:52:53 (3029.0) (11445): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:53:55 (3029.0) (11445): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Failed to remove /var/spool/condor/3029/0/cluster3029.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:54:03 (3029.0) (11445): **** condor_shadow (condor_SHADOW) pid 11445 EXITING WITH STATUS 107
ShadowLog:04/12/12 10:55:51 Initializing a VANILLA shadow for job 3029.0
ShadowLog:04/12/12 10:55:51 (3029.0) (14533): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Requesting fast removal of job.
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): Job 3029.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 10:58:28 (3029.0) (14533): **** condor_shadow (condor_SHADOW) pid 14533 EXITING WITH STATUS 102

# SPOOL content AFTER restart
% ll /var/spool/condor/3029/0
total 12K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.swap/
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0.tmp/

% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*

# SPOOL content AFTER condor_rm
% ll /var/spool/condor/3029/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 10:54 cluster3029.proc0.subproc0/

% ll /var/spool/condor/3029/0/cluster3029.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 3029.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 10:54 3029.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 10:54 ckpt_sleep_3029_0-1a0c4a68dcce1-13241-4f8697e8.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 10:54 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 10:54 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 10:54 dmtcp_restart_script.sh*


Second test: the same thing again -- the only difference being that the
job exits with status 0 on its own; no condor_rm:


# what the logs say
% grep 3030.0 *
MatchLog:04/12/12 11:01:12     Request 03030.00000:
MatchLog:04/12/12 11:01:12       Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
MatchLog:04/12/12 11:03:07     Request 03030.00000:
MatchLog:04/12/12 11:03:07       Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:01:12     Request 03030.00000:
NegotiatorLog:04/12/12 11:01:12       Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
NegotiatorLog:04/12/12 11:03:07     Request 03030.00000:
NegotiatorLog:04/12/12 11:03:07       Matched 3030.0 mih@xxxxxxxxxxx <10.0.0.254:47584> preempting none <10.0.0.5:37266> slot1@snake5
SchedLog:04/12/12 11:01:12 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:01:12 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 20197)
SchedLog:04/12/12 11:02:19 (pid:29046) Shadow pid 20197 for job 3030.0 exited with status 107
SchedLog:04/12/12 11:02:19 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:03:07 (pid:29046) Starting add_shadow_birthdate(3030.0)
SchedLog:04/12/12 11:03:07 (pid:29046) Started shadow for job 3030.0 on slot1@snake5 <10.0.0.5:37266> for mih, (shadow pid = 22184)
SchedLog:04/12/12 11:07:11 (pid:29046) Shadow pid 22184 for job 3030.0 reports job exit reason 100.
SchedLog:04/12/12 11:07:11 (pid:29046) Match record (slot1@snake5 <10.0.0.5:37266> for mih, 3030.0) deleted
SchedLog:04/12/12 11:07:11 (pid:29046) Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0: Directory not empty (errno 39)
ShadowLog:04/12/12 11:01:12 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:01:12 (3030.0) (20197): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:02:11 (3030.0) (20197): Got SIGTERM. Performing graceful shutdown.
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Failed to remove /var/spool/condor/3030/0/cluster3030.proc0.subproc0.swap: Permission denied (errno 13)
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): Job 3030.0 is being evicted from slot1@snake5
ShadowLog:04/12/12 11:02:19 (3030.0) (20197): **** condor_shadow (condor_SHADOW) pid 20197 EXITING WITH STATUS 107
ShadowLog:04/12/12 11:03:07 Initializing a VANILLA shadow for job 3030.0
ShadowLog:04/12/12 11:03:07 (3030.0) (22184): Request to run on slot1@snake5 <10.0.0.5:37266> was ACCEPTED
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Job 3030.0 terminated: exited with status 0
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): Reporting job exit reason 100 and attempting to fetch new job.
ShadowLog:04/12/12 11:07:11 (3030.0) (22184): **** condor_shadow (condor_SHADOW) pid 22184 EXITING WITH STATUS 100

% ll /var/spool/condor/3030/0
total 4.0K
drwxr-xr-x 2 mih mih 4.0K Apr 12 11:02 cluster3030.proc0.subproc0/

% ll /var/spool/condor/3030/0/cluster3030.proc0.subproc0
total 2.0M
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 3030.0.err
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 3030.0.out
-rw-r--r-- 1 mih mih 3.0K Apr 12 11:02 3030.0.shimlog
-rw------- 1 mih mih 2.0M Apr 12 11:02 ckpt_sleep_3030_0-1a0c4a68dcce1-13381-4f8699dc.dmtcp
-rw-r--r-- 1 mih mih    0 Apr 12 11:02 _condor_stderr
-rw-r--r-- 1 mih mih   91 Apr 12 11:02 _condor_stdout
-rwxr--r-- 1 mih mih 3.9K Apr 12 11:02 dmtcp_restart_script.sh*


Conclusion: Despite the "permission denied" messages, some SPOOL data
is removed -- both on condor_rm and at "natural" job exit. However, ALL
DMTCP-related files remain in SPOOL until condor_preen catches them.

% sudo condor_preen -remove
  /var/spool/condor/3030/0/cluster3030.proc0.subproc0 - Removed
  /var/spool/condor/3029/0/cluster3029.proc0.subproc0 - Removed
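
FWIW, rather than running it by hand, one can have condor_master run
condor_preen periodically via the standard config knobs (the values
below are just examples):

PREEN          = $(SBIN)/condor_preen
PREEN_ARGS     = -m -r
PREEN_INTERVAL = 86400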

But even condor_preen doesn't clean up everything -- the now-empty
per-job directory itself remains:

% ll /var/spool/condor/3030/0
total 0
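
As a stopgap one could prune those empty per-job directories by hand.
This is hypothetical and should be double-checked against your SPOOL
layout before deleting anything (-delete implies depth-first, so
'3030/0' would go before '3030'):

% sudo find /var/spool/condor -mindepth 1 -maxdepth 2 -type d -empty -delete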


I hope this helps. Thanks,

Michael

-- 
Michael Hanke
http://mih.voxindeserto.de
# DMTCP checkpointing related boilerplate (should be applicable in most cases)
universe = vanilla
executable = /usr/lib/condor/shim_dmtcp
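# transfer the sandbox back on eviction too -- this is what puts the
# checkpoint (and everything else) into SPOOL on the submit machine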
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
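# SIGINT instead of SIGTERM on eviction, so the shim can have DMTCP
# write a checkpoint before the job is killed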
kill_sig = 2
output = $(CLUSTER).$(PROCESS).shimout
error = $(CLUSTER).$(PROCESS).shimerr
log = $(CLUSTER).$(PROCESS).log
dmtcp_args = --log $(CLUSTER).$(PROCESS).shimlog --stdout $(CLUSTER).$(PROCESS).out --stderr $(CLUSTER).$(PROCESS).err
dmtcp_env = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# job-specific setup, uses arguments and environment defined above
arguments = $(dmtcp_args) sleep 240
environment = $(dmtcp_env)

queue 1