Hi Zach,
At one point we had this issue because the starter was writing the files to the job's scratch directory prior to mounting the ephemeral filesystem/logical volume. I believe what is happening here is that we change the owner of the mount before writing the Machine
and Job Ad files as user condor, which fails due to a permissions issue.
I am working on confirming my theory and fixing the issue if I am correct.
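
To illustrate the theory (not the actual starter code), here is a minimal Python sketch of the simplified POSIX rule for creating a file in a directory. The 0700 mode matches the scratch directory in your ls output; the uid/gid numbers and the can_create_file helper are hypothetical, and the ordering it models is still my unconfirmed guess.

```python
import stat

# Hypothetical numeric ids, for illustration only.
CONDOR_UID, CONDOR_GID = 64, 64
JOB_UID, JOB_GID = 1000, 1000


def can_create_file(dir_mode, dir_uid, dir_gid, proc_uid, proc_gids):
    """Simplified POSIX check: creating a file in a directory requires
    write and search (w+x) permission on that directory.
    Ignores ACLs, capabilities, and other security modules."""
    if proc_uid == 0:
        return True  # root bypasses the permission bits
    if proc_uid == dir_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in proc_gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & need) == need


# Scratch directory still owned by condor: writing the ad files as condor works.
before_chown = can_create_file(0o700, CONDOR_UID, CONDOR_GID,
                               CONDOR_UID, {CONDOR_GID})

# Mount already chowned to the job user (mode 700): condor gets EACCES (errno 13).
after_chown = can_create_file(0o700, JOB_UID, JOB_GID,
                              CONDOR_UID, {CONDOR_GID})

print(before_chown, after_chown)  # True False
```

If the theory holds, the fix would be a matter of ordering: write the ad files before the ownership change, or write them with an identity that can still access the directory.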
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Zach McGrew <mcgrewz@xxxxxxx>
Sent: Wednesday, June 25, 2025 6:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] STARTD_ENFORCE_DISK_LIMITS and .job.ad / .machine.ad?

Hi All,

I noticed that with STARTD_ENFORCE_DISK_LIMITS enabled that the .job.ad and .machine.ad don't get created in the usual place. I couldn't find this mentioned in the documentation or code, so I figured I'd ask here.

In the StarterLog for the slot I see:

    Failed to open "/var/lib/condor/execute/dir_303915/.job.ad" for to write job ad: Permission denied (errno 13)
    Failed to open "/var/lib/condor/execute/dir_303915/.machine.ad" for to write machine ad: Permission denied (errno 13)

$_CONDOR_JOB_AD and $_CONDOR_MACHINE_AD are still set in the environment:

    $ printenv | grep -E 'CONDOR_.*_AD'
    _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_303915/.machine.ad
    _CONDOR_JOB_AD=/var/lib/condor/execute/dir_303915/.job.ad

They just point to files that don't exist:

    $ cat $_CONDOR_JOB_AD
    cat: /var/lib/condor/execute/dir_303915/.job.ad: No such file or directory
    $ cat $_CONDOR_MACHINE_AD
    cat: /var/lib/condor/execute/dir_303915/.machine.ad: No such file or directory
    $ ls -al $_CONDOR_SCRATCH_DIR
    total 9
    drwx------ 5 mcgrewz mcgrewz 1024 Jun 25 15:40 .
    drwxr-xr-x 3 condor  condor  4096 Jun 25 15:40 ..
    -rwx------ 1 mcgrewz mcgrewz   48 Jun 25 15:40 .chirp.config
    drwxr-xr-x 2 mcgrewz mcgrewz 1024 Jun 25 15:40 .condor_ssh_to_job_1
    drwx------ 2 mcgrewz mcgrewz 1024 Jun 25 15:40 tmp
    drwx------ 3 mcgrewz mcgrewz 1024 Jun 25 15:40 var

When I disable STARTD_ENFORCE_DISK_LIMITS (and restart condor), the two files get created as the user:group condor:condor with permissions of 644. This might be relevant because the other files in the directory are all my user:group that submitted the job, including the .chirp.config.

Extra info that might be relevant:

Tested HTCondor 24.0.7, 24.0.8, and 24.8.1.

Also tried 24.8.1 with the cool new STARTER_NESTED_SCRATCH that got mentioned at Throughput Computing Week. Similar results, but it looks like it's trying to write the ad files to scratch/ and not htcondor/ where the .update.ad moved. The htcondor/ directory is condor:condor so I'm guessing it wouldn't fail if it tried to write there instead?

I'm using LVM_BACKING_FILE to create the loopback file. This feature rocks.

LVM_HIDE_MOUNT is unset. On 24.0.8 condor_config_val says the default is false, on 24.8.1 it became auto; Though this didn't seem to matter.

JOB_EXECDIR_PERMISSIONS is unset, according to condor_config_val this defaults to user, making the dir_### permissions 700.

EP is running Debian 12 (Bookworm), with the latest updates. HTCondor packages from the research.cs.wisc.edu repository.

Thanks,
-Zach

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/