[Condor-users] Jobs immediately evicted with code 129
- Date: Wed, 12 Jan 2005 03:19:26 -0800
- From: Tim Robertson <timr@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Jobs immediately evicted with code 129
Hi,
I'm trying to test a newly-configured condor pool (condor version  
6.6.7, all machines use Fedora Core 1) using a few binaries in standard  
universe.  When I submit jobs, however, only the submitting machine can  
execute -- all other jobs are matched to idle nodes, begin to execute,  
and are immediately vacated from the nodes.
When I examine the logs of these machines, I always see the following  
lines in the StarterLog file:
> Process XXXXX exited with status 129
> EXEC of user process failed, probably with insufficient swap
They always occur within 1 second of the execve call.
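To check how consistently this status appears across nodes, the exit lines can be pulled out of each StarterLog with a short script. This is only a minimal sketch: the regular expression is based on the single log line quoted above, the function name `find_exit_statuses` is made up for illustration, and any timestamp or prefix in the real log is assumed to be harmless to the match.

```python
import re

# Pattern copied from the log line quoted above; the rest of the
# StarterLog line format (timestamps, prefixes) is an assumption.
EXIT_RE = re.compile(r"Process (\S+) exited with status (\d+)")


def find_exit_statuses(log_text):
    """Return (process id, exit status) pairs for every matching log line."""
    results = []
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


if __name__ == "__main__":
    # Hypothetical sample text; in practice, read the StarterLog file instead.
    sample = (
        "Process 12345 exited with status 129\n"
        "EXEC of user process failed, probably with insufficient swap\n"
    )
    print(find_exit_statuses(sample))
```

Running it over the logs from several nodes would show whether every eviction reports the same status or whether it varies by machine.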
I found this thread in the mailing list archives, dealing with a  
similar problem:
http://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg00253.shtml
But (wouldn't you know it), the thread goes dead before any useful  
information is given about the problem.  Sigh.
What could be going on here?  It isn't related to the binaries, as far  
as I can tell (I can log into the nodes and run the programs without  
condor), so I'm at a loss.
Thanks in advance,
Tim
PS: If it helps anyone, I've copied the result of running condor_status  
-l on one of the nodes below.
------------------------------------ condor_status -l :
MyType = "Machine"
TargetType = "Job"
Name = "baloo1.bagley069.varanilab"
Machine = "baloo1.bagley069.varanilab"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "agni.bagley069.varanilab"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 1831400
Disk = 33307820
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 2899
ConsoleIdle = 41895557
Memory = 945
Cpus = 1
StartdIpAddr = "<xxxxxxx:32798>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "localdomain"
FileSystemDomain = "localdomain"
Subnet = "10.0.1"
HasIOProxy = TRUE
TotalVirtualMemory = 1831400
TotalDisk = 33307820
KFlops = 723715
Mips = 2587
LastBenchmark = 1105527071
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 200
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1105528090
Activity = "Idle"
EnteredCurrentActivity = 1105528090
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1104733528
UpdateSequenceNumber = 2657
MyAddress = "<xxxxxxxx:32798>"
LastHeardFrom = 1105531215
UpdatesTotal = 2237
UpdatesSequenced = 2233
UpdatesLost = 36
UpdatesHistory = "0x00000000008808000000000100000000"
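
One possible way to use a dump like the one above is to compare it with the same output from the submitting machine, where jobs do run, and look at which attributes differ. The following is a rough sketch, not a Condor tool: `parse_classad` and `diff_classads` are hypothetical helper names, and the parser assumes each attribute sits on a single `Name = Value` line (wrapped continuation lines are skipped).

```python
def parse_classad(text):
    """Parse `condor_status -l` style text into a dict of raw strings.

    Assumes one `Name = Value` pair per line; lines without " = "
    (such as wrapped continuations or separators) are ignored.
    """
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue
        attrs[key.strip()] = value.strip()
    return attrs


def diff_classads(a, b):
    """Return {name: (value_in_a, value_in_b)} for attributes that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical inputs: two saved `condor_status -l` outputs.
    working = parse_classad('Memory = 945\nArch = "INTEL"\n')
    failing = parse_classad('Memory = 512\nArch = "INTEL"\n')
    for name, (left, right) in diff_classads(working, failing).items():
        print(name, left, right)
```

Attributes such as timestamps and update counters will always differ, so only the remaining differences are likely to be of interest.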