Hi all, we are currently wondering about an odd behaviour of schedds on ARC grid CEs job shadows seem to not able to update their local status anymore with a symptome, that they cannot write to the ShadowLog anymore [1]. Such a shadow process tries to open the ShadowLog but seem to fail with a 'Operation not permitted' [2]. However, all file system operations work for us when trying them manually; the VM admins insist, that there are no problems with the underlying HVs; and also there are no CPU steals are significant I/O waits. So it is not clear, why the Shadows really fail. The job itself seem to run happy on its node [3]. The odd thing is, that the Shadow does not know about is proper sub-slot but just knows about the partitionable parent slot [4]. Also the remote walltime etc. are 0.0 - which does not look healthy [5] as on the job has been running for ~3h on the node already. Best guess is at the moment, that the Shadows cannot update their states on the schedds. As the schedd machines themselves look good (Condor versions [6]), it is currently a mystery for us, what upsets the shadows?? The schedd VMs are on different HVs and have SSD backends, so we have two different machines showing the same behaviour?? Maybe somebody has seen something before and has an idea for us? Cheers and thanks, Thomas btw: attached are the local starter log excerpt for the job pid and the jobs local class ads files. [1] 02/04/20 16:20:27 (22252523.0) (1701263): WriteUserLog checking for event log rotation, but no lock 02/04/20 16:20:27 (22252523.0) (1701263): WriteUserLog checking for event log rotation, but no lock [2] root@grid-arcce0: [/var/log/condor] strace -p 1701263 strace: Process 1701263 attached select(21, [1 6 20], [], [], {3, 241992}) = 0 (Timeout) clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 974658071}) = 0 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV PROF RTMIN RT_1], NULL, 8) = 0 clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 974887253}) = 0 read(1, 0x7ffe3a5d1b60, 8) = -1 EAGAIN (Resource temporarily unavailable) clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975123273}) = 0 chmod("/var/log/condor/ShadowLog", 0644) = -1 EPERM (Operation not permitted) clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975352330}) = 0 clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975439287}) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 select(21, [1 6 20], [], [], {2, 0}) = 0 (Timeout) clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 981660291}) = 0 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV PROF RTMIN RT_1], NULL, 8) = 0 clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 981850254}) = 0 read(1, 0x7ffe3a5d1b60, 8) = -1 EAGAIN (Resource temporarily unavailable) clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 982017237}) = 0 clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 982136875}) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 select(21, [1 6 20], [], [], {13, 0}^Cstrace: Process 1701263 detached <detached ...> [3] [root@batch0819 ~]# condor_who | grep 22251616.0 belleprd004@xxxxxxx grid-arcce0.desy.de 1_20 22251616.0 0+03:02:45 7203 /var/lib/condor/execute/dir_6929/condor_exec.exe [4] > condor_q -l 22251616.0 | grep -i host CurrentHosts = 1 MaxHosts = 1 MinHosts = 1 OrigMaxHosts = 1 RemoteHost = "slot1@xxxxxxxxxxxxxxxxx" > condor_q -l 22251616.0 | grep -i remote CumulativeRemoteSysCpu = 287.0 CumulativeRemoteUserCpu = 9136.0 PeriodicRemove = false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit RemoteHost = "slot1@xxxxxxxxxxxxxxxxx" RemoteSlotID = 1 RemoteSysCpu = 287.0 RemoteUserCpu = 9136.0 RemoteWallClockTime = 0.0 WantRemoteIO = true WantRemoteSyscalls = false [5] condor_q -l 22251616.0 | grep -i time AutoClusterAttrs = "_cp_orig_RequestCpus,_cp_orig_RequestDisk,_cp_orig_RequestMemory,JobUniverse,LastCheckpointPlatform,MachineLastMatchTime,NumCkpts,RequestCpus,RequestDisk,RequestMemory,ConcurrencyLimits,FlockTo,NiceUser,Rank,Requirements,NODE_IS_HEALTHY,DESYAcctGroup,DESYAcctSubGroup,DiskUsage,Owner,x509UserProxyVOName" CommittedSlotTime = 0 CommittedSuspensionTime = 0 CommittedTime = 0 CumulativeSlotTime = 0 CumulativeSuspensionTime = 0 JobTimeLimit = 172800 LastMatchTime = 1580819160 LastRejMatchTime = 1580819160 LastSuspensionTime = 0 PeriodicRemove = false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit RecentStatsLifetimeStarter = 1200 RemoteWallClockTime = 0.0 ServerTime = 1580830525 StatsLifetimeStarter = 10528 [6] condor-external-libs-8.9.5-1.el7.x86_64 condor-8.9.5-1.el7.x86_64 python2-condor-8.9.5-1.el7.x86_64 condor-classads-8.9.5-1.el7.x86_64 condor-procd-8.9.5-1.el7.x86_64
Attachment:
6929.tar.xz
Description: application/xz
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature