Hi again, maybe related (?) - I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs [1]. Since the restart was more or less instantaneous, I would have expected the Schedd to pick up its jobs again.
Cheers,
  Thomas

[1]
03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.
[2]
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopped Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has finished shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Starting Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun starting up.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

On 05/03/2021 10.26, Thomas Hartmann wrote:
Hi Brian,

yes, we have periodic removes [1]. But 'in principle' these should mostly only act on longer time scales ~O(days) - except for the JobRunCount hedge. The idea behind `JobRunCount > 1` is to avoid automatic reruns of jobs, so as not to clash with the VO factories: if those resent a job after a failure, we would end up with two instances of the same job.

But the problem with the missing out/err also affected CLUSTERID.0 jobs, which should be the initial iteration and thus not fall under `JobRunCount > 1`, or?

Cheers,
  Thomas

[1]
> grep -v "#" /etc/condor/config.d/90_21_condor_cleanup.conf
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 60 * 60 * 24 * 2) )
RemoveMultipleRunJobs = ( JobRunCount > 1 )
RemoveDefaultJobWallTime = ( RemoteWallClockTime > 4 * 24 * 60 * 60 )
RemoveAllJobsOlderThan2Weeks = (( CurrentTime - QDate > 60 * 60 * 24 * 14))
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs) || \
  $(RemoveMultipleRunJobs) || \
  $(RemoveDefaultJobWallTime) || \
  $(RemoveAllJobsOlderThan2Weeks)

On 04/03/2021 17.55, Brian Lin wrote:

Hi Thomas,

Jaime reminded me of another common cause of this issue: the routed job is removed from under the CE, so when the CE tries to transfer files back out to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?

Thanks,
Brian

On 3/4/21 10:07 AM, Thomas Hartmann wrote:

Hi Brian,

unfortunately, I have not found a smoking gun yet :-/ The CE is currently on [1]. SELinux is disabled by default, and on a quick check of the permissions I did not notice anything suspicious [2]. The files have the correct mapped user - including the CLUSTERID.log. Also the possible open file handles should be sufficient. On the fs side it is ext4 - so nothing fancy. And I do not see much I/O wait or the like that might point to an underlying issue with the hypervisor (HV).

I noticed several stack dumps on the CE, but AFAIS there has been no overlap between the affected PIDs/IDs and these jobs.

Cheers,
  Thomas

[1]
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64
CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64

[2]
root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar 3 06:51 .
drwxr-xr-x 4 condor      condor    4096 Mar 3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd  1028 Mar 3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar 3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar 3 06:51 tmpBU9zHQ

> sestatus
SELinux status: disabled

> cat /proc/sys/fs/file-max
1552725
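
Coming back to the restart question at the top: if the schedd really dropped those jobs instead of recovering them from its job queue, they should show up in the history with a removal reason. A minimal sketch, assuming stock condor_history options, the standard JobStatus/RemoveReason attributes, and the default RPM log location (437906 is just the cluster from the SchedLog warning above):

# did the vanished jobs land in the history, and with which removal reason?
> condor_history 437906 -limit 5 -af:j JobStatus RemoveReason

# anything else the schedd logged about that cluster around the restart?
> grep 437906 /var/log/condor/SchedLog

If the recorded RemoveReason points at the periodic remove rather than at the shutdown, the restart would only be coincidental.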
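Regarding the SYSTEM_PERIODIC_REMOVE expressions quoted above: the individual sub-expressions can be dry-run as constraints against the live queue and the history to see which jobs they would currently catch. A rough sketch, again assuming the usual condor_q/condor_history autoformat options:

# jobs the JobRunCount rule would remove right now
> condor_q -allusers -constraint 'JobRunCount > 1' -af:j JobRunCount

# jobs the walltime rule would remove right now
> condor_q -allusers -constraint 'RemoteWallClockTime > 4*24*60*60' -af:j RemoteWallClockTime

# recently removed jobs and the reason the schedd recorded for them
> condor_history -constraint 'JobStatus == 3' -limit 20 -af:j RemoveReason

Note that ProcId 0 alone does not guarantee a first run: JobRunCount counts how often that proc was started, so a CLUSTERID.0 job that was evicted or restarted once (e.g. after a shadow failure) can still match `JobRunCount > 1`.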
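For Brian's point about the routed job being removed from under the CE, one could cross-check a concrete case by following the job-router link between the CE job and the local batch job. A sketch only - the RoutedToJobId/RoutedFromJobId attribute names are from memory, so treat them as an assumption (406446.0 is just the CE job from the spool listing above):

# which local job did the router create for the CE job?
> condor_ce_q 406446.0 -af:j RoutedToJobId

# is that routed job still alive in the local schedd, or only in its history?
> condor_q -allusers -constraint 'RoutedFromJobId == "406446.0"' -af:j JobStatus
> condor_history -constraint 'RoutedFromJobId == "406446.0"' -limit 5 -af:j JobStatus RemoveReason

If the routed job only shows up in the history with a periodic-remove reason while the CE job was still trying to stage files back, that would match the missing out/err symptom.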