Hi All:
I have a new Condor pool uniformly running v7.0.1 on Windows. After a day or two, the slot1 resources stop showing up in the output of condor_status. Here is sample output:
Name               OpSys    Arch   State      Activity  LoadAv  Mem   ActvtyTime
slot1@xxxxxxxxxxx. WINNT51  INTEL  Owner      Idle      0.030   1023  0+04:32:59
slot2@xxxxxxxxxxx. WINNT51  INTEL  Owner      Idle      0.000   1023  0+04:33:00
slot2@xxxxxxxxxxxx WINNT51  INTEL  Owner      Idle      0.000   1534  0+04:35:05
slot2@xxxxxxxxxxxx WINNT52  INTEL  Unclaimed  Idle      0.000   1006  5+14:26:38
slot2@xxxxxxxxxxxx WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07
slot2@xxxxxxxxxxxx WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07

               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51      3      3        0          0        0           0         0
INTEL/WINNT52      5      0        0          5        0           0         0
        Total      8      3        0          5        0           0         0
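A minimal sketch, assuming the condor_status text has been saved or captured as a string, of how the missing slots could be spotted automatically. The function name `missing_slots` and the default expectation of two slots per machine are illustrative assumptions, not part of Condor. A machine that reports no slots at all would not be detected by this approach.

```python
import re
from collections import defaultdict

# Matches the "slotN@hostname" token at the start of a condor_status row.
SLOT_RE = re.compile(r"^(slot\d+)@(\S+)")

def missing_slots(status_text, expected=("slot1", "slot2")):
    """Return {host: [missing slot names]} for hosts lacking an expected slot.

    Only hosts that appear at least once in the output can be checked.
    """
    seen = defaultdict(set)
    for line in status_text.splitlines():
        match = SLOT_RE.match(line.strip())
        if match:
            slot, host = match.groups()
            seen[host].add(slot)
    wanted = set(expected)
    return {
        host: sorted(wanted - slots)
        for host, slots in seen.items()
        if wanted - slots
    }
```

Any host listed in the result would be a candidate for a condor_reconfig.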
As you can see, even the totals fail to count the slot1 resources. A condor_reconfig is sufficient to bring slot1 back to life. The StartLog on an affected machine looks like:
3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18 C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
. slot2 continues to run benchmarks, slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
. reconfig sent, slot1 begins to run benchmarks in lieu of slot2
. slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP. Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
. slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle
<end>
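The pattern in the log above (only one slot runs the periodic benchmarks, and the reconfig switches which one) can be tallied with a short script. This is a rough sketch that assumes the StartLog line format shown in this message; the function name `benchmarks_by_segment` is made up for illustration.

```python
import re
from collections import Counter

# A slot entering the Benchmarking activity, in the format seen in the StartLog above.
BENCH_RE = re.compile(
    r"^\d+/\d+ \d+:\d+:\d+ (slot\d+): Changing activity: Idle -> Benchmarking"
)
# Text logged by the startd when it receives a reconfig.
RECONFIG_MARK = "Got SIGHUP. Re-reading config files."

def benchmarks_by_segment(log_text):
    """Count benchmark runs per slot, starting a new tally after each reconfig."""
    segments = [Counter()]
    for line in log_text.splitlines():
        line = line.strip()
        if RECONFIG_MARK in line:
            segments.append(Counter())
            continue
        match = BENCH_RE.match(line)
        if match:
            segments[-1][match.group(1)] += 1
    return segments
```

On the excerpt above, the first segment would contain only slot2 runs and the segment after the reconfig only slot1 runs.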
Any ideas?
-Bryan