I saw in the posts from March that Ian Chesal had this same
problem. Does anyone know if it was resolved off the user list? I
had every machine matched, but then they would drop back to the unclaimed state.
Then I changed the match timeout from the default of 300 to 600 (the exact
change is sketched below), and suddenly they all became claimed, but idle.
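Concretely, the change was in condor_config on the execute nodes; assuming I
have the knob name right, it was:

    # seconds a matched startd will wait for the claim to arrive
    # before giving up and returning to the unclaimed state
    MATCH_TIMEOUT = 600

followed by a condor_reconfig so the startds would pick it up.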
I'm still stuck on this problem. Only a select few machines ever go to Busy and
run jobs; they are usually the faster CPU machines, and it usually happens when
there are fewer machines in the pool (if there are only 8 vm's in the pool,
they almost always just run with no problem; does this mean the submitter is
being overloaded?). condor_q says as many jobs are running as there
are machines claimed, but doing a condor_status -l on any idle
machine shows something like:

M:\>condor_status -l bsanchez
MyType = "Machine"
TargetType = "Job"
Name = "vm2@xxxxxxxxxxxxxxxxx"
Machine = "BSANCHEZ.BHI.CORP"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "a-abq-lic.bhi.corp"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 2
ImageSize = 1
ExecutableSize = 1
JobUniverse = 5
NiceUser = FALSE
VirtualMemory = 1186088
Disk = 22151548
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 40204
ConsoleIdle = 40204
Memory = 511
Cpus = 1
StartdIpAddr = "<192.168.100.190:2394>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "bhi.corp"
FileSystemDomain = "bhi.corp"
Subnet = "192.168.100"
HasIOProxy = TRUE
TotalVirtualMemory = 2372176
TotalDisk = 44303096
KFlops = 879778
Mips = 2804
LastBenchmark = 1111828400
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 803
ClockDay = 6
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1111829916
Activity = "Idle"
EnteredCurrentActivity = 1111866159
Start = KeyboardIdle > 5 * 60
Requirements = START
CurrentRank = 0.000000
RemoteUser = "jnipper@xxxxxxxx"
RemoteOwner = "jnipper@xxxxxxxx"
ClientMachine = "cy2-conferece"
DaemonStartTime = 1111828391
UpdateSequenceNumber = 149
MyAddress = "<192.168.100.190:2394>"
LastHeardFrom = 1111868605
UpdatesTotal = 148
UpdatesSequenced = 148
UpdatesLost = 6
UpdatesHistory = "0x005000a0006000000000000000000000"
I'm running everything on Windows XP, a mix of SP1 and SP2. I changed
condor_config on the submitting machine so it can run up to 2000 jobs (the
setting is sketched below), and I changed the value in the registry to 1280 as
suggested under "Windows specific issues" in the manual. The submitter has a
gigabit Ethernet card and never goes over about 10% utilization. The worker
nodes are 100Base-T, the jobs are only about 50 MB, and nobody is on these
machines at night, so I don't think network bandwidth is the problem.
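For reference, the submit-side change amounts to one knob; MAX_JOBS_RUNNING is
my best recollection of the name, so treat this as a sketch of what I set
rather than a quote from my config:

    # cap on simultaneously running jobs (i.e. shadow processes)
    # started by this schedd
    MAX_JOBS_RUNNING = 2000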
A fetchlog on the above machine for STARTD will typically look like this:

3/26 12:08:05 DaemonCore: Command received via UDP from host <192.168.100.190:3600>
3/26 12:08:05 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:08:05 Starter pid 1020 exited with status 0
3/26 12:08:05 vm1: State change: starter exited
3/26 12:08:05 vm1: Changing activity: Busy -> Idle
3/26 12:38:12 DaemonCore: Command received via TCP from host <192.168.101.116:4695>
3/26 12:38:12 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/26 12:38:12 vm2: Got activate_claim request from shadow (<192.168.101.116:4695>)
3/26 12:38:12 vm2: Remote job ID is 5772.0
3/26 12:38:12 vm2: Got universe "VANILLA" (5) from request classad
3/26 12:38:12 vm2: State change: claim-activation protocol successful
3/26 12:38:12 vm2: Changing activity: Idle -> Busy
3/26 12:42:39 DaemonCore: Command received via TCP from host <192.168.101.116:4893>
3/26 12:42:39 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/26 12:42:39 vm2: Called deactivate_claim_forcibly()
3/26 12:42:39 DaemonCore: Command received via UDP from host <192.168.100.190:3675>
3/26 12:42:39 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:42:39 Starter pid 4052 exited with status 0
3/26 12:42:39 vm2: State change: starter exited
3/26 12:42:39 vm2: Changing activity: Busy -> Idle
3/26 12:55:18 DaemonCore: Command received via TCP from host <192.168.101.116:1543>
3/26 12:55:18 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/26 12:55:18 vm1: Got activate_claim request from shadow (<192.168.101.116:1543>)
3/26 12:55:18 vm1: Remote job ID is 5807.0
3/26 12:55:18 vm1: Got universe "VANILLA" (5) from request classad
3/26 12:55:18 vm1: State change: claim-activation protocol successful
3/26 12:55:18 vm1: Changing activity: Idle -> Busy
3/26 12:59:17 DaemonCore: Command received via TCP from host <192.168.101.116:1608>
3/26 12:59:17 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/26 12:59:17 vm1: Called deactivate_claim_forcibly()
3/26 12:59:17 DaemonCore: Command received via UDP from host <192.168.100.190:3725>
3/26 12:59:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:59:17 Starter pid 3164 exited with status 0
3/26 12:59:17 vm1: State change: starter exited
3/26 12:59:17 vm1: Changing activity: Busy -> Idle
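For anyone who wants to pull the same logs from their own pool: I fetched
these remotely, and assuming your installation ships the tool, the commands
were along the lines of:

    condor_fetchlog bsanchez STARTD
    condor_fetchlog cy2-conf SCHEDD

(the second one produced the schedd log below).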
And a fetchlog on the submitting machine for SCHEDD will look like:

3/26 13:35:45 Started shadow for job 5870.0 on "<192.168.100.190:2394>", (shadow pid = 1276)
3/26 13:35:45 DaemonCore: Command received via UDP from host <192.168.101.116:3154>
3/26 13:35:45 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

The machine 192.168.100.190 is bsanchez, so I just included
the relevant part of the log for that machine. I don't know whether
bsanchez isn't waiting long enough to start the job (or how to change that
setting if so), or whether the submitter, cy2-conf, is timing out before
sending the job out, but it seems to be a timing/load issue, since if there
are only 2 machines in the pool with 4 processors each, they usually run fine.

Zachary L. Stauber
Systems Analyst, Spatial Data
Bohannan▲Huston
Courtyard One, 7500
Albuquerque, New Mexico 87109
Office: 505-823-1000
Direct: 505-798-7970
Fax: 505-798-7932
Email: zstauber@xxxxxxxxx