[Condor-users] Trouble with a schedd getting out-of-sync with reality
- Date: Mon, 31 Jan 2005 15:22:37 -0500
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: [Condor-users] Trouble with a schedd getting out-of-sync with reality
(This problem relates to Condor 6.7.3 running on Windows XP.)
I'm having persistent trouble with one user's schedd daemon reporting
incorrect information about the state of running jobs from this machine. The
user has a large number of jobs scheduled (3041 jobs; 2955 idle, 86
running, 0 held). A large number of the jobs have had their requirements
set to restrict their machines to a set of 7 in our pool. All seven of
these machines are dual-processor machines with 2 VMs running on each of
them.
That means no more than 14 jobs can be running at the same time.
However, when I look at the condor_q output for this machine it's
reporting that more than 14 of the restricted jobs are running
simultaneously.
What's odd is that first thing this morning, when all these jobs were
queued up, the output from condor_q was accurate. It seems to have drifted
over time, so that more and more finished jobs still show as running in
the condor_q output.
She has 4 clusters with the following requirements set on each job in
the cluster:
((VirtualMemory >= ImageSize) && (Memory =!= UNDEFINED) && (Arch ==
"INTEL" && (OpSys == "WINNT40" || OpSys == "WINNT50" || OpSys ==
"WINNT51")) && (AlteraIsDesktop =?= FALSE) && ((AlteraMachineClass ==
866)) && ((Machine == "TTC-BS866-008.altera.com" || Machine ==
"TTC-BS866-011.altera.com" || Machine == "TTC-BS866-012.altera.com" ||
Machine == "TTC-BS866-013.altera.com" || Machine ==
"TTC-BS866-014.altera.com" || Machine == "TTC-BS866-015.altera.com" ||
Machine == "TTC-BS866-016.altera.com"))) && (Disk >= DiskUsage) &&
(HasFileTransfer)
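The capacity limit implied by this expression can be checked mechanically. The following is a minimal sketch, not part of the original report: it pulls the hostnames out of the Machine == "..." clauses and multiplies by the 2 VMs per machine stated above. The names restricted_machines and VMS_PER_MACHINE are made up for illustration, and the regex only handles the simple clause form used here.

```python
import re

# Requirements expression copied from the message (line breaks preserved).
REQUIREMENTS = '''((VirtualMemory >= ImageSize) && (Memory =!= UNDEFINED) && (Arch ==
"INTEL" && (OpSys == "WINNT40" || OpSys == "WINNT50" || OpSys ==
"WINNT51")) && (AlteraIsDesktop =?= FALSE) && ((AlteraMachineClass ==
866)) && ((Machine == "TTC-BS866-008.altera.com" || Machine ==
"TTC-BS866-011.altera.com" || Machine == "TTC-BS866-012.altera.com" ||
Machine == "TTC-BS866-013.altera.com" || Machine ==
"TTC-BS866-014.altera.com" || Machine == "TTC-BS866-015.altera.com" ||
Machine == "TTC-BS866-016.altera.com"))) && (Disk >= DiskUsage) &&
(HasFileTransfer)'''

# Assumption taken from the message text: each machine runs 2 VMs.
VMS_PER_MACHINE = 2


def restricted_machines(requirements):
    """Return the distinct hostnames named in Machine == "..." clauses."""
    # \b keeps "AlteraMachineClass" from matching; \s* tolerates the
    # line breaks introduced by mail wrapping.
    hosts = re.findall(r'\bMachine\s*==\s*"([^"]+)"', requirements)
    return sorted(set(hosts))


machines = restricted_machines(REQUIREMENTS)
capacity = len(machines) * VMS_PER_MACHINE
print(f"{len(machines)} machines, at most {capacity} concurrent jobs")
# prints: 7 machines, at most 14 concurrent jobs
```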
If I query those four clusters for their running jobs I get more than 14
jobs returned:
[0] > condor_q -name ttc-bchan2.altera.priv.altera.com -const 'JobStatus==2' 134 142 135 123
-- Schedd: TTC-BCHAN2.altera.priv.altera.com : <137.57.142.165:1045>
 ID      OWNER  SUBMITTED    RUN_TIME    ST PRI SIZE    CMD
 134.67  bchan  1/28 18:13   0+03:49:53  R  20  573.4   wrapper.bat /exper
 134.86  bchan  1/28 18:13   0+02:42:44  R  20  1207.8  wrapper.bat /exper
 135.1   bchan  1/28 18:14   0+02:41:24  R  19  548.1   wrapper.bat /exper
 135.17  bchan  1/28 18:14   0+01:51:46  R  19  241.6   wrapper.bat /exper
 135.18  bchan  1/28 18:14   0+01:51:54  R  19  500.0   wrapper.bat /exper
 135.21  bchan  1/28 18:14   0+01:32:34  R  19  500.0   wrapper.bat /exper
 135.35  bchan  1/28 18:14   0+01:02:32  R  19  704.8   wrapper.bat /exper
 135.55  bchan  1/28 18:14   0+00:18:48  R  19  500.0   wrapper.bat /exper
 135.57  bchan  1/28 18:14   0+00:18:07  R  19  500.0   wrapper.bat /exper
 135.59  bchan  1/28 18:14   0+00:35:10  R  19  489.1   wrapper.bat /exper
 135.62  bchan  1/28 18:14   0+00:07:15  R  19  500.0   wrapper.bat /exper
 135.66  bchan  1/28 18:14   0+00:01:08  R  19  500.0   wrapper.bat /exper
 135.67  bchan  1/28 18:14   0+00:01:06  R  19  500.0   wrapper.bat /exper
 135.69  bchan  1/28 18:14   0+00:00:00  R  19  500.0   wrapper.bat /exper
 135.71  bchan  1/28 18:14   0+00:15:00  R  19  500.0   wrapper.bat /exper
 142.0   bchan  1/31 11:41   0+00:16:57  R  20  900.0   wrapper.bat /exper
 142.1   bchan  1/31 11:41   0+00:08:38  R  20  700.0   wrapper.bat /exper
That is 17 jobs, which is wrong. There are only 14 VMs available across
those 7 machines:
[0] > condor_status -const 'Machine=="TTC-BS866-008.altera.com"||Machine=="TTC-BS866-011.altera.com"||Machine=="TTC-BS866-012.altera.com"||Machine=="TTC-BS866-013.altera.com"||Machine=="TTC-BS866-014.altera.com"||Machine=="TTC-BS866-015.altera.com"||Machine=="TTC-BS866-016.altera.com"'
Name           OpSys    Arch   State    Activity  LoadAv  Mem   ActvtyTime
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  0.980   1023  [?????]
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  1.020   1023  [?????]
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  1.660   1023  [?????]
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  0.370   1023  [?????]
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Busy      0.830   1023  [?????]
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Busy      1.190   1023  0+00:13:42
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  1.570   1023  0+00:14:33
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  1.520   1023  0+00:17:17
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Busy      1.060   1023  0+00:14:14
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Busy      1.050   1023  [?????]
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Busy      1.010   1023  [?????]
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Retiring  1.040   1023  0+00:15:52
vm1@TTC-BS866  WINNT51  INTEL  Claimed  Busy      0.500   1023  0+00:12:07
vm2@TTC-BS866  WINNT51  INTEL  Claimed  Idle      0.240   1023  [?????]

               Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/WINNT51        14      0       14          0        0           0
        Total        14      0       14          0        0           0
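The comparison between the two listings can also be done by script rather than by eye. The following is a rough sketch that only parses pasted text output; it does not query Condor itself and it does not fix the stale state. The function names, the regexes, and the abbreviated sample strings are assumptions made for illustration, based solely on the column layout shown in this message.

```python
import re

# Job rows: ID, OWNER, SUBMITTED (date + time), RUN_TIME, ST.
# Matches both mail-wrapped rows and single-line rows.
JOB_ROW = re.compile(
    r'^\s*(\d+\.\d+)\s+\S+\s+\d+/\d+\s+\d+:\d+\s+\d+\+\d{2}:\d{2}:\d{2}\s+(\S)\s'
)
# Slot rows: Name, OpSys, Arch, State, Activity.
SLOT_ROW = re.compile(r'^\s*(vm\d+@\S+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)')


def running_jobs(condor_q_text):
    """Return the IDs of jobs whose ST column is R."""
    ids = []
    for line in condor_q_text.splitlines():
        m = JOB_ROW.match(line)
        if m and m.group(2) == "R":
            ids.append(m.group(1))
    return ids


def claimed_slots(condor_status_text):
    """Count VM rows whose State column is Claimed."""
    count = 0
    for line in condor_status_text.splitlines():
        m = SLOT_ROW.match(line)
        if m and m.group(2) == "Claimed":
            count += 1
    return count


def excess_running(condor_q_text, condor_status_text):
    """Number of 'running' jobs that cannot be backed by a claimed slot."""
    return max(0, len(running_jobs(condor_q_text)) - claimed_slots(condor_status_text))


# Abbreviated samples in the wrapped form shown in this message.
Q_SAMPLE = """\
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
134.67 bchan 1/28 18:13 0+03:49:53 R 20 573.4
wrapper.bat /exper
135.1 bchan 1/28 18:14 0+02:41:24 R 19 548.1
wrapper.bat /exper
142.0 bchan 1/31 11:41 0+00:16:57 R 20 900.0
wrapper.bat /exper
"""
STATUS_SAMPLE = """\
vm1@TTC-BS866 WINNT51 INTEL Claimed Retiring 0.980
1023[?????]
vm2@TTC-BS866 WINNT51 INTEL Claimed Busy 1.190 1023
0+00:13:42
"""

print(running_jobs(Q_SAMPLE), claimed_slots(STATUS_SAMPLE))
print(excess_running(Q_SAMPLE, STATUS_SAMPLE))
```

Fed the complete listings from this message, the same comparison is 17 running jobs against 14 claimed VMs, so at least 3 of the "running" entries must be stale.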
How can this stale information be corrected? How did her schedd get into
such an inconsistent state? This really has my users freaked out.
- Ian