[Condor-users] Processes run, but quit immediately
- Date: Mon, 13 Feb 2006 09:42:52 -0000
- From: "Cranford, Ross" <R.Cranford@xxxxxxxxxxxx>
- Subject: [Condor-users] Processes run, but quit immediately
Hi there,
We've installed Condor on three of our 40 machines here: one acts as central manager and submit host, the other two as executing nodes. I had some problems initially with file permissions, but as far as I can see, these are all ironed out. I'm not using a shared file system.
However, whenever the test jobs that came with Condor are run, they are sent to the two executing nodes and are then pre-empted immediately (as far as I can see from the status changes). The shadow log on the central manager looks like this:
2/10 15:12:32 (59.1) (26137):Shadow: RSC_SOCK connected, fd = 17
2/10 15:12:32 (59.1) (26137):Shadow: CLIENT_LOG connected, fd = 18
2/10 15:12:32 (59.1) (26137):My_Filesystem_Domain = "beo"
2/10 15:12:32 (59.1) (26137):My_UID_Domain = "beo"
2/10 15:12:32 (59.1) (26137): Entering pseudo_get_file_stream
2/10 15:12:32 (59.1) (26137): file = "/home/condor/spool/cluster59.ickpt.subproc0"
2/10 15:12:32 (59.1) (26137): Weird 0xc0a801fe
2/10 15:12:32 (59.1) (26137): Weird 0xc0a801fe
2/10 15:12:32 (59.1) (26137):Reaped child status - pid 26138 exited with status 0
2/10 15:12:33 (59.1) (26137):Shadow: Job 59.1 exited, termsig = 9, coredump = 0, retcode = 129
2/10 15:12:33 (59.1) (26137):Shadow: Job was kicked off without a checkpoint
2/10 15:12:33 (59.1) (26137):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster59.proc1.subproc0.tmp'
2/10 15:12:33 (59.1) (26137):Trying to unlink /home/condor/spool/cluster59.proc1.subproc0.tmp
2/10 15:12:33 (59.1) (26137):user_time = 1 ticks
2/10 15:12:33 (59.1) (26137):sys_time = 1 ticks
2/10 15:12:33 (59.1) (26137):********** Shadow Exiting(107) **********
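The value after "Weird" in the shadow log, 0xc0a801fe, has the shape of a 32-bit IPv4 address. A minimal sketch for decoding it, assuming (this is not confirmed anywhere in the log) that it really is an address stored in network byte order; the helper name hex_to_ipv4 is made up for illustration:

```python
import ipaddress

def hex_to_ipv4(value: int) -> str:
    """Interpret a 32-bit integer as a dotted-quad IPv4 address.

    Assumes the most significant byte is the first octet
    (network byte order). Hypothetical helper, not part of Condor.
    """
    return str(ipaddress.IPv4Address(value))

# The value logged twice next to "Weird" in the shadow log.
print(hex_to_ipv4(0xC0A801FE))  # prints 192.168.1.254
```

If that reading is right, the value is an address in the private 192.168.1.x range, which might be worth comparing against the addresses of the three machines.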
The StarterLog on the machine the job was allocated to seems to receive the
files fine, but then gives this:
2/10 15:07:56 Started user job - PID = 3910
2/10 15:07:56 cmd_fp = 0x828be78
2/10 15:07:56 end
2/10 15:07:56 *FSM* Transitioning to state "SUPERVISE"
2/10 15:07:56 *FSM* Got asynchronous event "CHILD_EXIT"
2/10 15:07:56 *FSM* Executing transition function "reaper"
2/10 15:07:56 Process 3910 exited with status 129
2/10 15:07:56 EXEC of user process failed, probably insufficient swap
Does anyone have any ideas? I'll be happy to send any other details.
Many thanks.