Hi,
I'm trying to test a newly configured Condor pool (Condor version
6.6.7, all machines running Fedora Core 1) using a few binaries in
the standard universe. When I submit jobs, however, only the jobs
that run on the submitting machine execute successfully -- all other
jobs are matched to idle nodes, begin to execute, and are
immediately vacated from those nodes.
When I examine the logs of these machines, I always see the
following lines in the StarterLog file:
> Process XXXXX exited with status 129
> EXEC of user process failed, probably with insufficient swap
They always occur within 1 second of the execve call.
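
A minimal Python sketch for pulling these failures out of a StarterLog on each node, in case it helps anyone reproduce this. It matches only the two message fragments quoted above and assumes nothing about the rest of the log line format; the function name find_exec_failures and the idea that the two lines appear back to back are assumptions, not anything Condor documents.

```python
import re

# Message fragments copied from the StarterLog excerpt above. Anything else
# on the log line (timestamps, prefixes) is deliberately not assumed.
EXIT_RE = re.compile(r"Process (\d+) exited with status (\d+)")
EXEC_FAILED = "EXEC of user process failed"


def find_exec_failures(lines):
    """Return (pid, status) for each exit line directly followed by the
    EXEC failure message. Hypothetical helper, for illustration only."""
    failures = []
    last_exit = None
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            # Remember the most recent exit until we see what follows it.
            last_exit = (int(match.group(1)), int(match.group(2)))
        elif EXEC_FAILED in line and last_exit is not None:
            failures.append(last_exit)
            last_exit = None
        else:
            # Any unrelated line breaks the pairing.
            last_exit = None
    return failures
```

It could be run over a node's log with something like find_exec_failures(open(path)), where path points at that node's StarterLog (the location depends on the local configuration).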
I found this thread in the mailing list archives, dealing with a
similar problem:
http://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg00253.shtml
But (wouldn't you know it), the thread goes dead before any useful
information is given about the problem. Sigh.
What could be going on here? It isn't related to the binaries, as
far as I can tell (I can log into the nodes and run the programs
without condor), so I'm at a loss.
Thanks in advance,
Tim
PS: If it helps anyone, I've copied the result of running
condor_status -l on one of the nodes below.
------------------------------------ condor_status -l :
MyType = "Machine"
TargetType = "Job"
Name = "baloo1.bagley069.varanilab"
Machine = "baloo1.bagley069.varanilab"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "agni.bagley069.varanilab"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 1831400
Disk = 33307820
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 2899
ConsoleIdle = 41895557
Memory = 945
Cpus = 1
StartdIpAddr = "<xxxxxxx:32798>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "localdomain"
FileSystemDomain = "localdomain"
Subnet = "10.0.1"
HasIOProxy = TRUE
TotalVirtualMemory = 1831400
TotalDisk = 33307820
KFlops = 723715
Mips = 2587
LastBenchmark = 1105527071
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 200
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1105528090
Activity = "Idle"
EnteredCurrentActivity = 1105528090
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1104733528
UpdateSequenceNumber = 2657
MyAddress = "<xxxxxxxx:32798>"
LastHeardFrom = 1105531215
UpdatesTotal = 2237
UpdatesSequenced = 2233
UpdatesLost = 36
UpdatesHistory = "0x00000000008808000000000100000000"
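
For comparing this output between the submitting machine and a failing node, a rough Python sketch along these lines might be useful. It treats each "Name = value" line as one attribute and glues mail-wrapped continuation lines back onto the previous value by plain concatenation; that joining rule and the helper names parse_ad and differing are assumptions made for illustration, not part of Condor.

```python
import re

# An attribute line looks like "Name = value"; the value may be empty when
# the mail client wrapped it onto the following lines.
ATTR_RE = re.compile(r"^(\w+)\s*=\s*(.*)$")


def parse_ad(text):
    """Parse `condor_status -l` style text into a dict of raw strings."""
    ad = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ATTR_RE.match(line)
        if match:
            current = match.group(1)
            ad[current] = match.group(2)
        elif current is not None:
            # Assumed to be a wrapped continuation of the previous value.
            ad[current] += line
    return ad


def differing(ad_a, ad_b, names):
    """Return {name: (value_a, value_b)} for the attributes that differ."""
    return {
        name: (ad_a.get(name), ad_b.get(name))
        for name in names
        if ad_a.get(name) != ad_b.get(name)
    }
```

Calling differing(submit_ad, node_ad, ["UidDomain", "FileSystemDomain", "VirtualMemory", "CondorPlatform"]) would list just those attributes whose values differ between the two machines.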
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users