We saw this problem on our live system over the weekend, so I had our co-op student do a detailed analysis on our test system, and it's very reproducible: the condor_release call consistently times out on clusters with more than 1500 processes in them. Thankfully it doesn't partially release the cluster, but it does mean that if you've submitted a cluster with more than 1500 processes in it on hold, you're never going to get it to run. Is this a known issue?
- Ian
<<job.tar.gz>>
All tests were done on a dual-Xeon 1 GHz Dell 2000-series server with 2 GB of RAM running FC3 and the 6.7.20 binaries. Here is our test scenario for this problem.
The central server used for this test has virtually no load. Output from top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5309 ttcbatch 17 0 7080 3180 5988 S 1.3 0.2 0:05.15 condor_master
4815 acanis 16 0 7652 2080 3504 S 0.7 0.1 1:31.03 sshd
4990 acanis 15 0 5560 3200 2112 S 0.7 0.2 1:39.93 screen
7364 acanis 16 0 3120 1080 1796 R 0.7 0.1 0:00.24 top
212 root 15 0 0 0 0 S 0.3 0.0 1:21.51 kjournald
2898 root 15 0 15428 8996 8656 S 0.3 0.4 5:22.36 X
5310 ttcbatch 16 0 7956 3892 6188 S 0.3 0.2 0:08.71 condor_collecto
1 root 16 0 3028 564 1408 S 0.0 0.0 0:01.42 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.15 migration/0
The following daemons are running on the central machine:
UID PID PPID C STIME TTY TIME CMD
ttcbatch 5309 1 0 14:04 ? 00:00:00 /opt/condor/sbin/condor_master
ttcbatch 5310 5309 0 14:04 ? 00:00:00 condor_collector -f
ttcbatch 5311 5309 1 14:04 ? 00:00:00 condor_negotiator -f
ttcbatch 5312 5309 99 14:04 ? 00:00:52 condor_schedd -f
ttcbatch 5313 5309 0 14:04 ? 00:00:00 condor_quill -f
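For reference, the daemon set above corresponds to a central-manager config along these lines (a sketch inferred from the listing; it is not a copy of our actual condor_config):

# Inferred from the daemon listing above, not copied from the real config file.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, QUILL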
Output from condor_status:
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
vm1@TTC-ACANI WINNT51 INTEL Unclaimed Idle 0.000 634 0+00:02:50
vm2@TTC-ACANI WINNT51 INTEL Unclaimed Idle 0.000 634 0+00:11:43
vm3@TTC-ACANI WINNT51 INTEL Owner Idle 0.000 388 0+06:21:15
vm4@TTC-ACANI WINNT51 INTEL Owner Idle 0.100 388 0+06:21:16
vm1@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:44
vm2@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:28
vm3@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:35
vm4@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+00:03:42
vm5@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:38
vm6@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.010 255 0+03:20:33
vm7@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:45
vm8@TTC-BS866 WINNT51 INTEL Unclaimed Idle 0.000 255 0+03:20:44
The system has no clusters in it; checking the queue with condor_q shows nothing queued.
The submit files for the test cluster are in the attached tarball. Submitting it:
[acanis@ttc-abcdev cluster_0]$ condor_submit condor_ticket
Submitting job(s) <snip>
Logging submit event(s) <snip>
1500 job(s) submitted to cluster 1198.
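For anyone without the tarball handy, the condor_ticket submit description is essentially of this shape (a rough sketch, not a copy; the real executable arguments and file names are in job.tar.gz):

# Hypothetical reconstruction -- see job.tar.gz for the actual condor_ticket file.
universe     = vanilla
executable   = wrapper.bat
requirements = (OpSys == "WINNT51")
hold         = true
log          = wrapper.log
queue 1500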
Now try releasing the cluster. You will get the following error:
[acanis@ttc-abcdev cluster_0]$ condor_release 1198
Couldn't find/release all jobs in cluster 1198.
Looking at the queue with condor_q, you'll see all the processes are still held:
<snip>
1198.1497 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1198.1498 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1198.1499 acanis 7/6 14:36 0+00:00:00 H 0 253.9 wrapper.bat no_swe
1500 jobs; 0 idle, 0 running, 1500 held
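As a sanity check that the failed release didn't let anything through, a query for procs in the cluster that are not held (JobStatus 5 is the held state) should come back empty, consistent with the 1500-held total above:

condor_q -constraint 'ClusterId == 1198 && JobStatus != 5'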
Inspecting the SchedLog gives:
7/6 14:40:24 No HoldReasonCode found for job 1198.1399
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1399
7/6 14:40:24 SCHEDD_ROUND_ATTR_JobStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_ReleaseReason is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_EnteredCurrentStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_LastHoldReason is undefined, using default value of 0
7/6 14:40:24 No HoldReasonCode found for job 1198.1499
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1499
7/6 14:40:24 condor_write(): Socket closed when trying to write buffer, fd is 10
7/6 14:40:24 Buf::write(): condor_write() failed
7/6 14:40:24 actOnJobs: couldn't send results to client: aborting
7/6 14:40:24 JobsRunning = 0
7/6 14:40:24 JobsIdle = 0
7/6 14:40:24 JobsHeld = 3000
7/6 14:40:24 JobsRemoved = 0
7/6 14:40:24 LocalUniverseJobsRunning = 0
7/6 14:40:24 LocalUniverseJobsIdle = 0
7/6 14:40:24 SchedUniverseJobsRunning = 0
7/6 14:40:24 SchedUniverseJobsIdle = 0
7/6 14:40:24 N_Owners = 1
7/6 14:40:24 MaxJobsRunning = 10000
7/6 14:40:24 ENABLE_SOAP is undefined, using default value of False
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent HEART BEAT ad to 1 collectors. Number of submittors=1
7/6 14:40:24 Changed attribute: RunningJobs = 0
7/6 14:40:24 Changed attribute: IdleJobs = 0
7/6 14:40:24 Changed attribute: HeldJobs = 3000
7/6 14:40:24 Changed attribute: FlockedJobs = 0
7/6 14:40:24 Changed attribute: Name = "Priority50@xxxxxxxxxx"
7/6 14:40:24 Sent ad to central manager for Priority50@xxxxxxxxxx
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent ad to 1 collectors for Priority50@xxxxxxxxxm
Releasing just one of the processes in the cluster works.
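The SchedLog above shows actOnJobs giving up when the client side of the connection closed mid-reply, so as a stopgap we're considering driving the release one proc at a time from the shell, something like this (untested sketch):

# Untested workaround sketch: release the 1500 procs individually,
# since single-proc releases go through fine.
for p in $(seq 0 1499); do
    condor_release 1198.$p
done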
--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer
Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300