Carl,

I just checked. No CORE files of any sort on any of the affected machines.

-Bryan

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of carl langlois

Hi Bryan,

On Fri, Mar 28, 2008 at 11:02 AM, Bryan S. Maher <Bryan.Maher@xxxxxxxxxx> wrote:

Hi All:

I have a new Condor pool
uniformly running v7.0.1 on Windows. After a day or two the slot1
resources fail to show up when issuing a condor_status command. Here is
sample output:

Name                OpSys    Arch   State      Activity  LoadAv  Mem   ActvtyTime

slot1@xxxxxxxxxxx.  WINNT51  INTEL  Owner      Idle      0.030   1023  0+04:32:59
slot2@xxxxxxxxxxx.  WINNT51  INTEL  Owner      Idle      0.000   1023  0+04:33:00
slot2@xxxxxxxxxxxx  WINNT51  INTEL  Owner      Idle      0.000   1534  0+04:35:05
slot2@xxxxxxxxxxxx  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  5+14:26:38
slot2@xxxxxxxxxxxx  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07
slot2@xxxxxxxxxxxx  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
slot2@xxxxxxxxxxxx  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
slot2@xxxxxxxxxxxx  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07

               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill

INTEL/WINNT51      3      3        0          0        0           0         0
INTEL/WINNT52      5      0        0          5        0           0         0

        Total      8      3        0          5        0           0         0

As you can see, even the totals
fail to count the slot1 resources. A condor_reconfig is sufficient to
bring slot1 back to life. The StartLog on an affected machine looks
like:

3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18    C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
.
slot2 continues to run benchmarks, slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
.
reconfig sent, slot1 begins to run benchmarks in lieu of slot2
.
slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP. Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
.
slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle
<end>

Any ideas?

-Bryan