The subject tells it all: I can run the job from the command
line and it will go to completion (about 100 hrs), but when I submit it under
Condor, it starts and runs for 40 mins (while it is mostly reading in data).
Condor then gets a "SIGQUIT" and thinks it's done. I suspect it is running out of memory under Condor, but it works
from the command line because all the memory is available. I've tried reconfiguring
the VMs, the available RAM, etc. We even bumped one box up to 10 GB of RAM, so that
each VM had 5 GB! No luck. The boxes are dual-processor, 64-bit AMDs with 4 GB of RAM,
running RHEL 4 and Condor 6.6.10. The job, however, was compiled on a 32-bit box,
since the compiler is currently only available in 32-bit. Is there some reason Condor won't let the executable
use the entire advertised memory? Submitting 60 jobs from the command line every
few days isn't fun, and it keeps us from efficiently using the farm. Any ideas where else to troubleshoot?

Thanks,
Jim

James A. Cox
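
One way to test the out-of-memory suspicion would be to run a tiny probe both from the command line and as a Condor job, then compare the resource limits each environment reports. The sketch below is a hypothetical diagnostic, not part of the original job: the names `LIMITS`, `format_limit`, and `collect_limits` are made up for illustration, and it assumes a Unix host where Python and its standard `resource` module are available on the execute machines.

```python
import resource

# Limits most likely to matter for a memory-hungry job.
# (Which limit, if any, is involved here is an assumption.)
LIMITS = {
    "address space (RLIMIT_AS)": resource.RLIMIT_AS,
    "data segment (RLIMIT_DATA)": resource.RLIMIT_DATA,
    "stack (RLIMIT_STACK)": resource.RLIMIT_STACK,
    "core file (RLIMIT_CORE)": resource.RLIMIT_CORE,
}


def format_limit(value):
    """Render one limit value as text, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "%d bytes" % value


def collect_limits():
    """Return {label: (soft, hard)} for the limits listed in LIMITS."""
    report = {}
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        report[label] = (format_limit(soft), format_limit(hard))
    return report


if __name__ == "__main__":
    # Print one line per limit so the two runs can be compared with diff.
    for label, (soft, hard) in sorted(collect_limits().items()):
        print("%-28s soft=%s hard=%s" % (label, soft, hard))
```

If the limits printed under Condor are lower than those printed from an interactive shell, that would support the memory hypothesis. If they match, the cause is probably elsewhere.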