[Condor-users] remote condor job never gets removed
- Date: Thu, 18 May 2006 17:20:45 +0900
- From: Andrew Stubbings <ajs@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] remote condor job never gets removed
A remote job submitted from a 6.7.18 machine (SuSE 9.3/x86_64) to a
6.7.19 machine (SuSE 8.2/x86) completes, but it is never removed from the
queue and its results are never returned to the submitting machine:
$ cat remote_vanilla.sub
universe = vanilla
executable = vanilla.sh
requirements = Arch == "INTEL"
output = $(Cluster).$(Process).out
error = $(Cluster).$(Process).err
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = remote_vanilla.log
notification = never
queue
$ condor_submit -remote cmhost -pool cmhost remote_vanilla.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 61.
Spooling data files for 1 jobs...
$ condor_q -pool cmhost -name cmhost
-- Schedd: cmhost.bestsystems.co.jp : <172.16.10.117:46010>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
61.0 ajs 5/18 16:55 0+00:00:06 C 0 9.8 vanilla.sh
0 jobs; 0 idle, 0 running, 0 held
$ ls -l 61*
-rw-r--r-- 1 ajs users 0 May 18 16:55 61.0.err
-rw-r--r-- 1 ajs users 0 May 18 16:55 61.0.out
$
The SchedLog shows the job completed but ends with an mrec error:
5/18 16:38:32 Job 61.0 is finished
5/18 16:38:32 Added data to SelfDrainingQueue job_is_finished_queue, now has 1 element(s)
5/18 16:38:32 Registered timer for SelfDrainingQueue job_is_finished_queue, period: 0 (id: 52)
5/18 16:38:32 Exited check_zombie( 15343, 0x0x856a504 )
5/18 16:38:32
5/18 16:38:32 ..................
5/18 16:38:32 .. Shadow Recs (0/1)
5/18 16:38:32 ..................
5/18 16:38:32 Exited delete_shadow_rec( 15343 )
5/18 16:38:32 -------- Begin starting jobs --------
5/18 16:38:32 Job 61.-1: not runnable
5/18 16:38:32 match (<172.16.10.117:46011>#1147937001#5) out of jobs (cluster id 61); relinquishing
5/18 16:38:32 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/18 16:38:32 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/18 16:38:32 Called send_vacate( <172.16.10.117:46011>, 443 )
5/18 16:38:32 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/18 16:38:32 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/18 16:38:32 Sent RELEASE_CLAIM to startd on <172.16.10.117:46011>
5/18 16:38:32 Match record (<172.16.10.117:46011>, 61, -1) deleted
5/18 16:38:32 ClaimId of deleted match: <172.16.10.117:46011>#1147937001#5
5/18 16:38:32 -------- Done starting jobs --------
5/18 16:38:32 Inside SelfDrainingQueue::timerHandler() for job_is_finished_queue
5/18 16:38:32 Job cleanup for 61.0 will block, calling jobIsFinished() in a thread
5/18 16:38:32 SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
5/18 16:38:32 Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 52)
5/18 16:38:32 DaemonCore: No more children processes to reap.
5/18 16:38:32 jobIsFinished() completed, calling DestroyProc(61.0)
5/18 16:38:32 SCHEDD_ROUND_ATTR_JobFinishedHookDone is undefined, using default value of 0
5/18 16:38:32 Got VACATE_SERVICE from <172.16.10.117:47921>
5/18 16:38:32 mrec for "<172.16.10.117:46011>#1147937001#5" not found -- match not deleted
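
One way to cross-check the log excerpt above is to see whether the claim ID in the final "mrec ... not found" message is the same one the schedd reported deleting a few lines earlier. The Python sketch below is not part of the original post and is only an illustration: the function names are hypothetical, and it assumes each log entry starts with an "M/D HH:MM:SS" timestamp, with any line lacking one treated as a wrapped continuation of the previous entry.

```python
import re

# Assumption: every log entry begins with "M/D HH:MM:SS"; lines without
# that prefix are treated as wrapped continuations of the previous entry.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}\s?")
# Claim IDs in the excerpt look like <ip:port>#number#number.
CLAIM_ID = re.compile(r"<[\d.]+:\d+>#\d+#\d+")


def join_wrapped(lines):
    """Merge continuation lines into the timestamped entry before them."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def claims_missing_after_delete(lines):
    """Return claim IDs reported 'not found' after already being deleted."""
    deleted = set()
    flagged = []
    for entry in join_wrapped(lines):
        found = CLAIM_ID.search(entry)
        if not found:
            continue
        claim = found.group(0)
        if "ClaimId of deleted match" in entry:
            deleted.add(claim)
        elif "mrec for" in entry and "not found" in entry and claim in deleted:
            flagged.append(claim)
    return flagged
```

Passing the lines of the excerpt to `claims_missing_after_delete` lists any claim that was deleted and then reported missing. If the claim in the last message appears in that list, the message may only reflect a late VACATE_SERVICE for a claim that had already been released, rather than the reason the job stays in the queue; that reading is an assumption, not something the log itself confirms.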
The submit machine and the remote schedd machine are each listed in the
other's /etc/hosts. The submit machine's condor_config has the following
authentication settings:
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE
and the remote schedd machine's condor_config has:
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
Do I have a configuration problem?
Andrew
--
Andrew Stubbings
BestSystems, Inc.
Tel: +81 29 860 7080
E-mail: ajs@xxxxxxxxxxxxxxxxx
www.bestsystems.co.jp