
[Condor-devel] Bug: condor_release times out releasing clusters with more than 1500 processes in them




We saw this problem on our live system over the weekend, so I had our co-op do a detailed analysis on our test system, and it's very reproducible: condor_release consistently times out on clusters with more than 1500 processes in them. Thankfully it doesn't half-release the cluster, but it still means that if you've submitted a cluster on hold with more than 1500 processes in it, you're never going to get it to run. Is this a known issue?

- Ian


All tests were done on a dual-Xeon 1 GHz Dell 2000-series server with 2 GB of RAM running FC3 and the 6.7.20 binaries. Here is our test scenario for this problem.

The central server used for this test has virtually no load. Output from top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5309 ttcbatch  17   0  7080 3180 5988 S  1.3  0.2   0:05.15 condor_master
 4815 acanis    16   0  7652 2080 3504 S  0.7  0.1   1:31.03 sshd
 4990 acanis    15   0  5560 3200 2112 S  0.7  0.2   1:39.93 screen
 7364 acanis    16   0  3120 1080 1796 R  0.7  0.1   0:00.24 top
  212 root      15   0     0    0    0 S  0.3  0.0   1:21.51 kjournald
 2898 root      15   0 15428 8996 8656 S  0.3  0.4   5:22.36 X
 5310 ttcbatch  16   0  7956 3892 6188 S  0.3  0.2   0:08.71 condor_collecto
    1 root      16   0  3028  564 1408 S  0.0  0.0   0:01.42 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.15 migration/0

The following daemons are running on the central machine:

UID        PID  PPID  C STIME TTY          TIME CMD
ttcbatch  5309     1  0 14:04 ?        00:00:00 /opt/condor/sbin/condor_master
ttcbatch  5310  5309  0 14:04 ?        00:00:00 condor_collector -f
ttcbatch  5311  5309  1 14:04 ?        00:00:00 condor_negotiator -f
ttcbatch  5312  5309 99 14:04 ?        00:00:52 condor_schedd -f
ttcbatch  5313  5309  0 14:04 ?        00:00:00 condor_quill -f

Output from condor_status:

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
vm1@TTC-ACANI WINNT51     INTEL  Unclaimed  Idle       0.000   634  0+00:02:50
vm2@TTC-ACANI WINNT51     INTEL  Unclaimed  Idle       0.000   634  0+00:11:43
vm3@TTC-ACANI WINNT51     INTEL  Owner      Idle       0.000   388  0+06:21:15
vm4@TTC-ACANI WINNT51     INTEL  Owner      Idle       0.100   388  0+06:21:16
vm1@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+00:03:44
vm2@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+00:03:28
vm3@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+00:03:35
vm4@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+00:03:42
vm5@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+03:20:38
vm6@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.010   255  0+03:20:33
vm7@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+03:20:45
vm8@TTC-BS866 WINNT51     INTEL  Unclaimed  Idle       0.000   255  0+03:20:44

The queue starts out empty: checking it with condor_q returns nothing.

The submit description for the test cluster is contained in the attached tarball (job.tar.gz). Submitting it:

[acanis@ttc-abcdev cluster_0]$ condor_submit condor_ticket
Submitting job(s) <snip>
Logging submit event(s) <snip>
1500 job(s) submitted to cluster 1198.
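For anyone reading without the attachment, the submit description is shaped roughly like this (a hypothetical sketch, not the exact file; the real one is in job.tar.gz, and the arguments are omitted since condor_q truncates them below):

        # Sketch of condor_ticket: 1500 procs submitted on hold
        executable = wrapper.bat
        hold       = true
        queue 1500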

Now try releasing the cluster. You will get the following error:

        [acanis@ttc-abcdev cluster_0]$ condor_release 1198
        Couldn't find/release all jobs in cluster 1198.

Looking at the queue with condor_q you'll see all the processes are still held:

<snip>
1198.1497 acanis          7/6  14:36   0+00:00:00 H  0   253.9 wrapper.bat no_swe
1198.1498 acanis          7/6  14:36   0+00:00:00 H  0   253.9 wrapper.bat no_swe
1198.1499 acanis          7/6  14:36   0+00:00:00 H  0   253.9 wrapper.bat no_swe

1500 jobs; 0 idle, 0 running, 1500 held
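As a sanity check you can also list the hold reasons (assuming condor_q's -hold option behaves in 6.7.20 as documented):

        [acanis@ttc-abcdev cluster_0]$ condor_q -hold 1198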

Inspecting the SchedLog gives:

7/6 14:40:24 No HoldReasonCode found for job 1198.1399
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1399
7/6 14:40:24 SCHEDD_ROUND_ATTR_JobStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_ReleaseReason is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_EnteredCurrentStatus is undefined, using default value of 0
7/6 14:40:24 SCHEDD_ROUND_ATTR_LastHoldReason is undefined, using default value of 0
7/6 14:40:24 No HoldReasonCode found for job 1198.1499
7/6 14:40:24 No HoldReasonSubCode found for job 1198.1499
7/6 14:40:24 condor_write(): Socket closed when trying to write buffer, fd is 10
7/6 14:40:24 Buf::write(): condor_write() failed
7/6 14:40:24 actOnJobs: couldn't send results to client: aborting
7/6 14:40:24 JobsRunning = 0
7/6 14:40:24 JobsIdle = 0
7/6 14:40:24 JobsHeld = 3000
7/6 14:40:24 JobsRemoved = 0
7/6 14:40:24 LocalUniverseJobsRunning = 0
7/6 14:40:24 LocalUniverseJobsIdle = 0
7/6 14:40:24 SchedUniverseJobsRunning = 0
7/6 14:40:24 SchedUniverseJobsIdle = 0
7/6 14:40:24 N_Owners = 1
7/6 14:40:24 MaxJobsRunning = 10000
7/6 14:40:24 ENABLE_SOAP is undefined, using default value of False
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent HEART BEAT ad to 1 collectors. Number of submittors=1
7/6 14:40:24 Changed attribute: RunningJobs = 0
7/6 14:40:24 Changed attribute: IdleJobs = 0
7/6 14:40:24 Changed attribute: HeldJobs = 3000
7/6 14:40:24 Changed attribute: FlockedJobs = 0
7/6 14:40:24 Changed attribute: Name = "Priority50@xxxxxxxxxx"
7/6 14:40:24 Sent ad to central manager for Priority50@xxxxxxxxxx
7/6 14:40:24 Trying to update collector <137.57.176.244:9618>
7/6 14:40:24 Attempting to send update via UDP to collector ttc-abcdev.altera.com <137.57.176.244:9618>
7/6 14:40:24 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/6 14:40:24 Sent ad to 1 collectors for Priority50@xxxxxxxxxx

Releasing just one of the processes in the cluster at a time does work. The condor_write()/actOnJobs lines above suggest the condor_release client times out and closes its socket before the schedd finishes acting on all 1500 procs, at which point the schedd aborts and none of the jobs end up released.
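As a stopgap we've been releasing the procs one at a time from the shell, since individual releases succeed; a sketch (slow for 1500 procs, but each call stays well under the client timeout):

        [acanis@ttc-abcdev cluster_0]$ for i in $(seq 0 1499); do condor_release 1198.$i; done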

--

Ian R. Chesal <ichesal@xxxxxxxxxx>

Senior Software Engineer

Altera Corporation

Toronto Technology Center

Tel: (416) 926-8300

Attachment: job.tar.gz
Description: job.tar.gz