We saw this problem on our live system over the weekend, so I had our co-op student do a detailed analysis on our test system, and it's very reproducible: the condor_release call consistently times out on clusters with more than 1500 processes in them. Thankfully it doesn't partially release the cluster, but it does mean that if you've submitted a cluster with more than 1500 processes in it on hold, you're never going to get it to run. Is this a known issue?
- Ian
<<job.tar.gz>>
All tests were done on a dual-Xeon 1 GHz Dell 2000-series server with 2 GB of RAM running FC3 and the 6.7.20 binaries. Here is our test scenario for this problem.
The central server used for this test has virtually no load. Output from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5309 ttcbatch 17 0 7080 3180 5988 S 1.3 0.2 0:05.15 condor_master
4815 acanis 16 0 7652 2080 3504 S 0.7 0.1 1:31.03 sshd
4990 acanis 15 0 5560 3200 2112 S 0.7 0.2 1:39.93 screen
7364 acanis 16 0 3120 1080 1796 R 0.7 0.1 0:00.24 top
212 root 15 0 0 0 0 S 0.3 0.0 1:21.51 kjournald
2898 root 15 0 15428 8996 8656 S 0.3 0.4 5:22.36 X
5310 ttcbatch 16 0 7956 3892 6188 S 0.3 0.2 0:08.71 condor_collecto
1 root 16 0 3028 564 1408 S 0.0 0.0 0:01.42 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.15 migration/0
The following daemons are running on the central machine:
UID PID PPID C STIME TTY TIME CMD
ttcbatch 5309 1 0 14:04 ? 00:00:00 /opt/condor/sbin/condor_master
ttcbatch 5310 5309 0 14:04 ? 00:00:00 condor_collector -f
ttcbatch 5311 5309 1 14:04 ? 00:00:00 condor_negotiator -f
ttcbatch 5312 5309 99 14:04 ? 00:00:52 condor_schedd -f
ttcbatch 5313 5309 0 14:04 ? 00:00:00 condor_quill -f
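For reference, the daemon set above corresponds to a central-manager config along these lines (a sketch inferred from the listing; it is not a copy of our actual condor_config):

# Inferred from the daemon listing above, not copied from the real config file.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, QUILL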
Output from condor_status:
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
vm1@TTC-ACANI WINNT51 INTEL Unclaimed Idle 0.000 634 0+00:02:50
vm2@TTC-ACANI WINNT51 INTEL Unclaimed Idle 0.000 634 0+00:11:43
vm3@TTC-ACANI WINNT51 INTEL Owner Idle 0.000 388 0+06:21:15
vm4@TTC-ACANI WINNT51 INTEL Owner Idle 0.100 388 0+06:21:16
vm1@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:44
vm2@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:28
vm3@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:35
vm4@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:42
vm5@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:38
vm6@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.010 255 0+03:20:33
vm7@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:45
vm8@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:44
The system has no clusters in it; checking the queue with condor_q shows nothing queued.
The submit files for the test cluster are in the attached tarball. Submitting it:
[acanis@ttc-abcdev cluster_0]$ condor_submit condor_ticket
Submitting job(s) <snip>
Logging submit event(s) <snip>
1500 job(s) submitted to cluster 1198.
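For anyone without the tarball handy, the condor_ticket submit description is essentially of this shape (a rough sketch, not a copy; the real executable arguments and file names are in job.tar.gz):

# Hypothetical reconstruction -- see job.tar.gz for the actual condor_ticket file.
universe     = vanilla
executable   = wrapper.bat
requirements = (OpSys == "WINNT51")
hold         = true
log          = wrapper.log
queue 1500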
Now try releasing the cluster. You will get the following error:
[acanis@ttc-abcdev cluster_0]$ condor_release 1198
Couldn't find/release all jobs in cluster 1198.
Looking at the queue with condor_q, you'll see all the processes are still held:
<snip>
1198.1497 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1198.1498 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1198.1499 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1500 jobs; 0 idle, 0 running, 1500 held
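As a sanity check that the failed release didn't let anything through, a query for procs in the cluster that are not held (JobStatus 5 is the held state) should come back empty, consistent with the 1500-held total above:

condor_q -constraint 'ClusterId == 1198 && JobStatus != 5'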
Inspecting the SchedLog gives:
7/6 14:40:24 No HoldReasonCode found for job 1198.1399
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1399
7/6 14:40:24 SCHEDD_ROUND_ATTR_JobStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_ReleaseReason is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_EnteredCurrentStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_LastHoldReason is undefined, using default value of 0
7/6 14:40:24 No HoldReasonCode found for job 1198.1499
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1499
7/6 14:40:24 condor_write(): Socket closed when trying to write buffer, fd is 10
7/6 14:40:24 Buf::write(): condor_write() failed
7/6 14:40:24 actOnJobs: couldn't send results to client: aborting
7/6 14:40:24 JobsRunning = 0
7/6 14:40:24 JobsIdle = 0
7/6 14:40:24 JobsHeld = 3000
7/6 14:40:24 JobsRemoved = 0
7/6 14:40:24 LocalUniverseJobsRunning = 0
7/6 14:40:24 LocalUniverseJobsIdle = 0
7/6 14:40:24 SchedUniverseJobsRunning = 0
7/6 14:40:24 SchedUniverseJobsIdle = 0
7/6 14:40:24 N_Owners = 1
7/6 14:40:24 MaxJobsRunning = 10000
7/6 14:40:24 ENABLE_SOAP is undefined, using default value of False
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent HEART BEAT ad to 1 collectors. Number of submittors=1
7/6 14:40:24 Changed attribute: RunningJobs = 0
7/6 14:40:24 Changed attribute: IdleJobs = 0
7/6 14:40:24 Changed attribute: HeldJobs = 3000
7/6 14:40:24 Changed attribute: FlockedJobs = 0
7/6 14:40:24 Changed attribute: Name = "Priority50@xxxxxxxxxx"
7/6 14:40:24 Sent ad to central manager for Priority50@xxxxxxxxxx
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent ad to 1 collectors for Priority50@xxxxxxxxxm
Releasing just one of the processes in the cluster works.
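The SchedLog above shows actOnJobs giving up when the client side of the connection closed mid-reply, so as a stopgap we're considering driving the release one proc at a time from the shell, something like this (untested sketch):

# Untested workaround sketch: release the 1500 procs individually,
# since single-proc releases go through fine.
for p in $(seq 0 1499); do
    condor_release 1198.$p
done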
--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer
Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300