Dear Condor experts,

On svr019 (CentOS 6.6, kernel 2.6.32-504.16.2.el6.x86_64) we are running ARC-CE 5.0.0 with condor-8.2.2-265643.x86_64 as the backend. Occasionally some jobs get held with a 'permission denied' error:

  svr019:/home/atlas/atlas003# condor_q -analyze 3155
  svr019.gla.scotgrid.ac.uk : <130.209.239.19:56581> : svr019.gla.scotgrid.ac.uk
  ---
  3155.000: Request is held.
  Hold reason: Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied

This only happens to a handful of jobs among hundreds, so it does not look like a general security misconfiguration. In the job event log I can see:

  012 (3155.000.000) 06/04 02:55:16 Job was held.
      Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied
      Code 12 Subcode 13

The JDL prepared by ARC for this job is:

  # HTCondor job description built by grid-manager
  Executable = condorjob.sh
  Input = /dev/null
  Log = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/log
  Output = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment
  Error = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment
  +NordugridQueue = condor_q2d
  Description = arc_pilot
  GetEnv = True
  Universe = vanilla
  Notification = Never
  Requirements = (OpSys == "LINUX")
  Priority = 0
  x509userproxy = /var/spool/arc/jobstatus/job.dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.proxy
  request_cpus = 1
  +JobTimeLimit = 172800
  request_memory = 4000
  +JobMemoryLimit = 8192000
  should_transfer_files = YES
  When_to_transfer_output = ON_EXIT_OR_EVICT
  Transfer_input_files = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/.gahp_complete, /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/runpilot3-wrapper.sh
  Periodic_remove = FALSE || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit
  Queue

Any idea where the problem might be?

Cheers,
Gang
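
P.S. In case it is useful, this is roughly how I have been checking ownership and modes on the shadow side. The session directory path is copied from the hold message above; that the job is mapped to the local account atlas003 is my assumption from the prompt in the first excerpt:

  # Session directory created by ARC for this job (path from the hold message)
  SESSION=/var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon

  # Owner, group and mode of the directory the shadow must write into
  stat -c '%U:%G %a %n' "$SESSION"

  # Owners and modes of the files already staged there, including any
  # partially written _condor_stderr.* target
  ls -la "$SESSION"

  # Which local account the shadow runs under for this job
  condor_q -l 3155 | grep -i '^Owner'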
On svr019 (centos 6.6, 2.6.32-504.16.2.el6.x86_64 ) we are running ARC-CE 5.0.0 with condor-8.2.2-265643.x86_64 as backend , sometimes some jobs could get hold due to 'permission denied' problem: svr019:/home/atlas/atlas003# condor_q -analyze 3155 svr019.gla.scotgrid.ac.uk : <130.209.239.19:56581> : svr019.gla.scotgrid.ac.uk --- 3155.000: Request is held. Hold reason: Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied This only happens to several jobs among hundreds, seems not to be a general security issue. In the log I can see: 012 (3155.000.000) 06/04 02:55:16 Job was held. Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:57731>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/_condor_stderr.aipanda063.cern.ch_15422080.0_1433368150: (errno 13) Permission denied Code 12 Subcode 13 The jdl for this job prepared by ARC is: # HTCondor job description built by grid-manager Executable = condorjob.sh Input = /dev/null Log = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/log Output = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment Error = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.comment +NordugridQueue = condor_q2d Description = arc_pilot GetEnv = True Universe = vanilla Notification = Never Requirements = (OpSys == "LINUX") Priority = 0 x509userproxy = /var/spool/arc/jobstatus/job.dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon.proxy request_cpus = 1 +JobTimeLimit = 172800 request_memory=4000 +JobMemoryLimit = 8192000 should_transfer_files = YES When_to_transfer_output = ON_EXIT_OR_EVICT Transfer_input_files = /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/.gahp_complete, /var/spool/arc/grid/dfgMDmjkVKmnbbfC3pqhhxZmABFKDmABFKDmZnFKDmABFKDm7g3Yon/runpilot3-wrapper.sh Periodic_remove = FALSE || RemoteWallClockTime > JobTimeLimit || ResidentSetSize > JobMemoryLimit Queue Any idea where the problem might be? Cheers,Gang |