Re: [Condor-users] Jobs getting held for no obvious reason
- Date: Fri, 28 Nov 2008 10:59:19 -0000
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Jobs getting held for no obvious reason
Sorry, I've only just had a chance to look at this again.
It looks like the "Permission denied" error might have been correct. I stupidly
checked the permissions on the directory in question only as root, and when I
now try to create a file there >as the user<, guess what: it doesn't work,
because he doesn't have read/execute permission further up the path. DOH!
I've fixed that and kicked the jobs off again with D_FULLDEBUG in place on the
schedd, so we'll see what happens.
The strange thing is that other users have clocked up tens of thousands of
hours without a problem.
-ian.
PS Yes they are vanilla jobs and SYSTEM_PERIODIC_RELEASE is set correctly.
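
A note on the diagnosis above: creating a file needs search (execute)
permission on every directory above it, so looking only at the final directory,
and as root, can hide the problem. A minimal Python sketch of that check
follows. The function names are hypothetical, only the classic
owner/group/other mode bits are considered (ACLs are ignored), and the uid and
group ids passed in are assumed to be those of the job's user.

```python
import os
import stat
from pathlib import Path


def lacks_search(st, uid, gids):
    """Return True if the mode bits deny directory search to uid/gids."""
    if uid == 0:
        return False  # root bypasses mode-bit checks, hence the false "OK"
    if st.st_uid == uid:
        return not (st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        return not (st.st_mode & stat.S_IXGRP)
    return not (st.st_mode & stat.S_IXOTH)


def blocking_ancestors(path, uid, gids):
    """List every directory above `path` that the given user cannot traverse."""
    target = Path(path).resolve()
    blocked = []
    # Walk from the filesystem root down towards the target.
    for ancestor in list(target.parents)[::-1]:
        if lacks_search(os.stat(ancestor), uid, gids):
            blocked.append(str(ancestor))
    return blocked
```

Calling blocking_ancestors() with the path from the hold reason and the
submitting user's ids should list any directory "further up" that stops the
write; an empty list means the mode bits on the ancestors are not the cause.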
Ian,
I'm afraid I don't have any ideas about what could be causing "Permission
denied" in the transfer of output files to the spool directory. If you hit a
dead end in tracking that down, it may be necessary to add more information to
the shadow debug log when it hits this problem.
Just to be clear: how were these jobs submitted? Are these vanilla jobs
submitted with condor_submit -s? Or SOAP?
When you look at the files in the job's spool directory, what ownership do you
see? While the job is running, I would expect the files to be owned by the user.
At other times, I would expect to see the files owned by condor.
The apparent failure of SYSTEM_PERIODIC_RELEASE is also mysterious. Things to
try:
1. Confirm that the schedd is using the setting you expect:
   condor_config_val -schedd SYSTEM_PERIODIC_RELEASE
2. Add D_FULLDEBUG to SCHEDD_DEBUG and check for messages like this:
   Evaluated periodic expressions in 1.3s, scheduling next run in 60s
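
For the second check, the log can be scanned mechanically rather than by eye.
The following is a minimal Python sketch that assumes the message has exactly
the wording quoted above; the function name is hypothetical and the location of
the schedd log depends on the local configuration.

```python
import re

# Pattern built from the example message quoted in the list above.
PERIODIC_RE = re.compile(
    r"Evaluated periodic expressions in (?P<took>[\d.]+)s, "
    r"scheduling next run in (?P<next>[\d.]+)s"
)


def periodic_evaluations(lines):
    """Yield (seconds_taken, seconds_until_next_run) for each matching line."""
    for line in lines:
        match = PERIODIC_RE.search(line)
        if match:
            yield float(match.group("took")), float(match.group("next"))
```

Passing an open log file to periodic_evaluations() and getting nothing back
would suggest that the periodic expressions are not being evaluated at all, or
that D_FULLDEBUG has not taken effect.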
--Dan
Smith, Ian wrote:
    Hi,
    I've noticed that a lot of jobs on our pool are being held for no obvious
    reason. It seems to happen to the longer running jobs ( > 1 day )
    but there's no apparent pattern. The hold reason is typically given as:
    HoldReason = "Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
    138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
    138.253.100.27 failed to write to file
    /opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
    Permission denied"
    and the job log file shows:
            Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
    138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
    138.253.100.27 failed to write to file
    /opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
    Permission denied
            58428532  -  Run Bytes Sent By Job
            178835  -  Run Bytes Received By Job
    ...
    012 (9648.000.000) 11/07 17:08:37 Job was held.
            Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
    138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
    138.253.100.27 failed to write to file
    /opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
    Permission denied
            Code 12 Subcode 13
    But the directory in question is there and the permissions are OK. I'm
    running the central manager/submit host on a Sun V440 with Solaris 10.
    Execute hosts are all Win XP SP2 and everything is Condor 7.0.2.
    As a workaround I placed this in the config file:
    #ICS workaround for "failed to write to file ... permission denied" problem
    #ICS release the job up to 10 times if on hold for over 10 minutes
    SYSTEM_PERIODIC_RELEASE = (JobRunCount < 10 && \
                              CurrentTime - EnteredCurrentStatus > 600) && \
                              (HoldReasonCode == 12 || HoldReasonSubCode == 13)
    but as far as I can see the jobs aren't getting released automatically.
    Any help would be most appreciated -  this has me baffled.
    -ian.
    --------------------------------------------
    Dr Ian C. Smith,
    e-Science Team,
    The University Of Liverpool,
    Computing Services Department,
--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University Of Liverpool,
Computing Services Department,
Room 4.09,
Chadwick Tower.
tel: +44 (0)151 794 3545
int: 43745