[Condor-users] condor: More jobs running than nodes available
- Date: Thu, 15 May 2008 11:56:38 +0200
- From: Arnau Bria <arnau@xxxxxxxxxxxxx>
- Subject: [Condor-users] condor: More jobs running than nodes available
Hi,
After running condor_q I get:
539 jobs; 20 idle, 190 running, 329 held
but condor_status shows:
             Total Owner Claimed Unclaimed Matched Preempting Backfill
 INTEL/LINUX   104     0      87        17       0          0        0
       Total   104     0      87        17       0          0        0
Looking at the shadow logs, I see many errors like:
5/15 11:52:03 (239919.0) (15089): Job 239919.0 going into Hold state (code 13,2): Error from starter on vm4@host: STARTER failed to receive file(s) from <Master_IP:33088>; SHADOW at Master_IP failed to send file(s) to <host:42514>: error reading from /cafdata/cafIn/submit_mmp_long_260583_14146/stage/__job_in__.tgz: (errno 2) No such file or directory
Obviously, that file does not exist:
[cdfcaf@ log]$ ls -lsa /cafdata/cafIn/submit_mmp_long_260583_14146/
total 820
16 drwxrwxrwx 2 cdfcaf cdfcaf 16384 May 14 23:06 .
24 drwxrwxr-x 345 cdfcaf cdfcaf 24576 May 15 09:16 ..
0 -rw-r--r-- 1 cdfcaf cdfcaf 0 May 14 22:54 dprintf_failure.DAGMAN
4 -rw-r--r-- 1 cdfcaf cdfcaf 7 May 12 11:19 job.ClusterId
16 -rw-r--r-- 1 cdfcaf cdfcaf 14476 May 12 11:19 job.dag
4 -rw-r--r-- 1 cdfcaf cdfcaf 571 May 12 11:19 job.dagman.ClassAd
36 -rw-r--r-- 1 cdfcaf cdfcaf 36864 May 14 22:54 job.dagman.dagman.out
0 -rw-rw-r-- 1 cdfcaf cdfcaf 0 May 14 22:54 job.dagman.lib.out
0 -rw-r--r-- 1 cdfcaf cdfcaf 0 May 12 11:23 job.dagman.lock
4 -rw-r--r-- 1 cdfcaf cdfcaf 452 May 12 11:19 job.Descript
4 -rw-r--r-- 1 cdfcaf cdfcaf 17 May 12 11:09 job.email
452 -rw------- 1 cdfcaf cdfcaf 458302 May 15 11:53 job.log
20 -rw-rw-rw- 1 cdfcaf cdfcaf 20030 May 12 12:04 job.log.01.dmpi
4 -rw-r--r-- 1 cdfcaf cdfcaf 98 May 12 11:09 job.outurl
4 -rwxr--r-- 1 cdfcaf cdfcaf 101 May 12 11:09 mark_removed.sh
4 -rwxr--r-- 1 cdfcaf cdfcaf 19 May 12 11:09 return_OK.sh
220 -rw-rw-rw- 1 cdfcaf cdfcaf 217888 May 14 23:06 sections.ClassAd.zip
8 -rw-rw-rw- 1 cdfcaf cdfcaf 4424 May 14 23:06 sections.log.tgz
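To check this for every held job rather than a single one, the shadow log could be scanned for the same message. Below is a rough Python sketch, assuming all such entries follow the format of the one line quoted above; the regular expression and the missing_inputs helper are my own illustration, not part of Condor.

```python
import os
import re

# Pattern for the hold message quoted above. The log format is an
# assumption based on that single line, not on Condor documentation.
HOLD_RE = re.compile(
    r"Job (?P<job>\d+\.\d+) going into Hold state "
    r"\(code (?P<code>\d+),(?P<subcode>\d+)\)"
    r".*error reading from (?P<path>\S+): \(errno (?P<errno>\d+)\)"
)


def missing_inputs(log_lines):
    """Return (job_id, path) pairs for held jobs whose input file is absent."""
    found = []
    for line in log_lines:
        m = HOLD_RE.search(line)
        # Keep only entries whose input file is still missing on disk.
        if m and not os.path.exists(m.group("path")):
            found.append((m.group("job"), m.group("path")))
    return found


# The log line from this message, used as sample input.
sample = (
    "5/15 11:52:03 (239919.0) (15089): Job 239919.0 going into Hold state "
    "(code 13,2): Error from starter on vm4@host: STARTER failed to receive "
    "file(s) from <Master_IP:33088>; SHADOW at Master_IP failed to send "
    "file(s) to <host:42514>: error reading from "
    "/cafdata/cafIn/submit_mmp_long_260583_14146/stage/__job_in__.tgz: "
    "(errno 2) No such file or directory"
)

for job_id, path in missing_inputs([sample]):
    print(job_id, path)
```

On the real system the list of lines would come from the shadow log file itself (its location is site-specific), for example by passing an open file object to missing_inputs.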
And the jobs go into the HOLD state.
So I think I must resubmit this job, right? Yesterday Condor worked
fine, but the disk filled up. I freed some space and restarted Condor,
and now nothing works...
What is the correct procedure when the disk is full? Why are all the
jobs corrupted now?
TIA,
Arnau