Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] condor: More jobs running than nodes available

Date: Thu, 15 May 2008 11:56:38 +0200
From: Arnau Bria <arnau@xxxxxxxxxxxxx>
Subject: [Condor-users] condor: More jobs running than nodes available

Hi,

After running condor_q I get:


539 jobs; 20 idle, 190 running, 329 held

but condor_status shows:

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX   104     0      87        17       0          0        0

               Total   104     0      87        17       0          0        0


Looking at the shadow logs, I see many errors like:


5/15 11:52:03 (239919.0) (15089): Job 239919.0 going into Hold state (code 13,2): Error from starter on vm4@host: STARTER failed to receive file(s) from <Master_IP:33088>; SHADOW at Master_IP failed to send file(s) to <host:42514>: error reading from /cafdata/cafIn/submit_mmp_long_260583_14146/stage/__job_in__.tgz: (errno 2) No such file or directory

Obviously, that file does not exist:

[cdfcaf@ log]$ ls -lsa /cafdata/cafIn/submit_mmp_long_260583_14146/
total 820
  16 drwxrwxrwx    2 cdfcaf   cdfcaf      16384 May 14 23:06 .
  24 drwxrwxr-x  345 cdfcaf   cdfcaf      24576 May 15 09:16 ..
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 May 14 22:54 dprintf_failure.DAGMAN
   4 -rw-r--r--    1 cdfcaf   cdfcaf          7 May 12 11:19 job.ClusterId
  16 -rw-r--r--    1 cdfcaf   cdfcaf      14476 May 12 11:19 job.dag
   4 -rw-r--r--    1 cdfcaf   cdfcaf        571 May 12 11:19 job.dagman.ClassAd
  36 -rw-r--r--    1 cdfcaf   cdfcaf      36864 May 14 22:54 job.dagman.dagman.out
   0 -rw-rw-r--    1 cdfcaf   cdfcaf          0 May 14 22:54 job.dagman.lib.out
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 May 12 11:23 job.dagman.lock
   4 -rw-r--r--    1 cdfcaf   cdfcaf        452 May 12 11:19 job.Descript
   4 -rw-r--r--    1 cdfcaf   cdfcaf         17 May 12 11:09 job.email
 452 -rw-------    1 cdfcaf   cdfcaf     458302 May 15 11:53 job.log
  20 -rw-rw-rw-    1 cdfcaf   cdfcaf      20030 May 12 12:04 job.log.01.dmpi
   4 -rw-r--r--    1 cdfcaf   cdfcaf         98 May 12 11:09 job.outurl
   4 -rwxr--r--    1 cdfcaf   cdfcaf        101 May 12 11:09 mark_removed.sh
   4 -rwxr--r--    1 cdfcaf   cdfcaf         19 May 12 11:09 return_OK.sh
 220 -rw-rw-rw-    1 cdfcaf   cdfcaf     217888 May 14 23:06 sections.ClassAd.zip
   8 -rw-rw-rw-    1 cdfcaf   cdfcaf       4424 May 14 23:06 sections.log.tgz


And the jobs go into the Hold state.
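
A minimal sketch, assuming the shadow log lines follow the format quoted above, of how the held jobs and their missing input files could be listed in one pass. The function name `missing_inputs` and the regular expression are illustrative assumptions, not part of Condor.

```python
import os
import re

# Pattern for the hold message quoted above. This is an assumption based on
# that single log line, not an official Condor log grammar.
HOLD_RE = re.compile(
    r"Job (?P<job>\d+\.\d+) going into Hold state.*?"
    r"error reading from (?P<path>\S+): \(errno (?P<errno>\d+)\)"
)


def missing_inputs(log_lines):
    """Return (job_id, path) pairs for held jobs whose input file is absent."""
    found = []
    for line in log_lines:
        match = HOLD_RE.search(line)
        # Report only files that are really gone from disk right now.
        if match and not os.path.exists(match.group("path")):
            found.append((match.group("job"), match.group("path")))
    return found
```

It could be used along these lines, where the log path is a placeholder for wherever the shadow log lives on the submit host:

```python
with open("/path/to/ShadowLog") as log:  # placeholder path
    for job_id, path in missing_inputs(log):
        print(job_id, path)
```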


So I think I must resubmit this job, is that right? Yesterday Condor
worked fine, but the disk filled up. I freed some space and restarted
Condor, and now nothing works...

What is the correct procedure when the disk is full? Why are all the
jobs corrupted now?


TIA,
Arnau
