[HTCondor-users] HTCondor-CE spool removes bl_home directories
- Date: Thu, 30 Jul 2020 10:18:14 +0000
- From: "Birngruber,Erich" <erich.birngruber@xxxxxxxxxxxxxx>
- Subject: [HTCondor-users] HTCondor-CE spool removes bl_home directories
Dear all,
We've upgraded an HTCondor-CE from 3.2.1 (UMD4 repositories) to 3.4.2 (https://research.cs.wisc.edu/htcondor/yum/stable/8.8/rhel7).
The update was done as we wanted to benefit from the improved APEL integration.
The setup is an HTCondor-CE that submits to a SLURM instance.
Since the update, grid jobs have been failing; it looks like some cleanup is happening too early.
What we can see is that the spool directory is created, e.g.
/users/condor/spool/8429/0/cluster8429.proc0.subproc0
Jobs are correctly routed (as previously) to the slurm instance, and slurm jobs are started.
The grid pilot jobs start executing. A few seconds, up to ca. 2 minutes later, the bl_home directories in there get removed
i.e. from Condor stderr of the job:
_condor_stderr:mkdir: cannot create directory '/users/condor/spool/8429/0/cluster8429.proc0.subproc0/home_bl_7e83452f3d9a/.alien': No such file or directory
The "home_bl_7e83452f3d9a" subdirectory of the grid job has been removed. We have this same pattern happening for all jobs from multiple grid VOs.
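One way to narrow down the timing would be to poll a job's spool directory and log the moment a home_bl_* subdirectory appears or disappears, so the timestamp can be matched against the CE and batch system logs. The following is only a minimal sketch in Python: the function names (snapshot, diff_snapshots, watch) and the one-second poll interval are assumptions made for illustration, not part of HTCondor-CE, and it only reports when a directory vanished, not which process removed it.

```python
import glob
import os
import time
from datetime import datetime


def snapshot(spool_dir):
    """Return the set of home_bl_* directories currently under spool_dir."""
    pattern = os.path.join(spool_dir, "home_bl_*")
    return {p for p in glob.glob(pattern) if os.path.isdir(p)}


def diff_snapshots(before, after):
    """Return (created, removed) as sorted lists of paths."""
    return sorted(after - before), sorted(before - after)


def watch(spool_dir, interval=1.0, max_polls=None):
    """Poll spool_dir and print a timestamped line for every change.

    interval is the hypothetical poll period in seconds; max_polls=None
    means poll until interrupted.
    """
    seen = snapshot(spool_dir)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(spool_dir)
        created, removed = diff_snapshots(seen, current)
        now = datetime.now().isoformat(timespec="seconds")
        for path in created:
            print(f"{now} created {path}")
        for path in removed:
            print(f"{now} removed {path}")
        seen = current
        polls += 1
```

Calling watch("/users/condor/spool/8429/0/cluster8429.proc0.subproc0") on the CE host while a pilot is running would print one line per change; polling can miss a directory that is created and removed within a single interval.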
I've also enabled debugging in the condor config:
ALL_DEBUG = D_ALWAYS:2 D_CAT D_SECURITY
But still I have not been able to find out what's going wrong.
Tbh, I'm not sure this is the right place for that kind of question, but any help / pointers are really appreciated.
I'm quite new to HTCondor-CE, so at this point I'm not even sure what to look for in the logs.
Best,
Erich