[HTCondor-users] STARTD_ENFORCE_DISK_LIMITS and .job.ad / .machine.ad?



Hi All,

I noticed that with STARTD_ENFORCE_DISK_LIMITS enabled, the .job.ad and .machine.ad files don't get created in the usual place. I couldn't find this mentioned in the documentation or code, so I figured I'd ask here.

In the StarterLog for the slot I see:

Failed to open "/var/lib/condor/execute/dir_303915/.job.ad" for to write job ad: Permission denied (errno 13)
Failed to open "/var/lib/condor/execute/dir_303915/.machine.ad" for to write machine ad: Permission denied (errno 13)

$_CONDOR_JOB_AD and $_CONDOR_MACHINE_AD are still set in the environment:

$ printenv | grep -E 'CONDOR_.*_AD'
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_303915/.machine.ad
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_303915/.job.ad

They just point to files that don't exist:

$ cat $_CONDOR_JOB_AD 
cat: /var/lib/condor/execute/dir_303915/.job.ad: No such file or directory
$ cat $_CONDOR_MACHINE_AD 
cat: /var/lib/condor/execute/dir_303915/.machine.ad: No such file or directory

$ ls -al $_CONDOR_SCRATCH_DIR 
total 9
drwx------ 5 mcgrewz mcgrewz 1024 Jun 25 15:40 .
drwxr-xr-x 3 condor  condor  4096 Jun 25 15:40 ..
-rwx------ 1 mcgrewz mcgrewz   48 Jun 25 15:40 .chirp.config
drwxr-xr-x 2 mcgrewz mcgrewz 1024 Jun 25 15:40 .condor_ssh_to_job_1
drwx------ 2 mcgrewz mcgrewz 1024 Jun 25 15:40 tmp
drwx------ 3 mcgrewz mcgrewz 1024 Jun 25 15:40 var
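
For anyone who wants to run the same check from inside a job, here is a minimal sketch, assuming Python 3 is available on the execute point. The function name check_ad_files and the status labels are made up for illustration and are not part of HTCondor; the only things taken from the output above are the two environment variable names.

```python
import os

# Environment variables that normally point at the ad files in the scratch dir.
AD_VARS = ("_CONDOR_JOB_AD", "_CONDOR_MACHINE_AD")


def check_ad_files(env=None):
    """Return {variable: (path, status)} for each ad variable.

    status is one of "unset", "missing", "unreadable", or "ok".
    (Hypothetical helper; the labels are illustrative only.)
    """
    env = os.environ if env is None else env
    report = {}
    for var in AD_VARS:
        path = env.get(var)
        if not path:
            status = "unset"        # variable absent or empty
        elif not os.path.exists(path):
            status = "missing"      # variable set, but no file behind it
        elif not os.access(path, os.R_OK):
            status = "unreadable"   # file exists but this user cannot read it
        else:
            status = "ok"
        report[var] = (path, status)
    return report


if __name__ == "__main__":
    for var, (path, status) in check_ad_files().items():
        print(f"{var}={path}: {status}")
```

A job that depends on the ads could call this first and fall back or exit with a clear message when it sees "missing", instead of failing later on an open().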

When I disable STARTD_ENFORCE_DISK_LIMITS (and restart condor), the two files get created as user:group condor:condor with permissions 644. This might be relevant because everything else in the directory, including the .chirp.config, is owned by my user:group, i.e. the one that submitted the job.
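
To compare ownership between the two configurations without eyeballing ls output, something like the following could be used. This is only a sketch, assuming Python 3 on a Unix-like EP; describe and describe_dir are hypothetical helper names, not anything shipped with HTCondor.

```python
import grp
import os
import pwd
import stat


def describe(path):
    """Return (owner, group, octal_mode) for path without following symlinks."""
    st = os.lstat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)   # uid with no passwd entry: report the number
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)   # gid with no group entry: report the number
    # S_IMODE keeps only the permission bits, e.g. 0o644 -> "644"
    return owner, group, format(stat.S_IMODE(st.st_mode), "03o")


def describe_dir(directory):
    """Map the directory itself (".") and each entry in it to describe() output."""
    result = {".": describe(directory)}
    for name in sorted(os.listdir(directory)):
        result[name] = describe(os.path.join(directory, name))
    return result


if __name__ == "__main__":
    scratch = os.environ.get("_CONDOR_SCRATCH_DIR", ".")
    for name, (owner, group, mode) in describe_dir(scratch).items():
        print(f"{owner}:{group} {mode} {name}")
```

Run once with STARTD_ENFORCE_DISK_LIMITS enabled and once with it disabled, the two outputs can be diffed to see exactly which entries change owner or mode.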


Extra info that might be relevant:

Tested HTCondor 24.0.7, 24.0.8, and 24.8.1. I also tried 24.8.1 with the cool new STARTER_NESTED_SCRATCH that got mentioned at Throughput Computing Week. The results were similar, but it looks like it's trying to write the ad files to scratch/ rather than to htcondor/, where the .update.ad moved. The htcondor/ directory is condor:condor, so I'm guessing it wouldn't fail if it tried to write there instead?

I'm using LVM_BACKING_FILE to create the loopback file. This feature rocks.

LVM_HIDE_MOUNT is unset. On 24.0.8 condor_config_val says the default is false; on 24.8.1 it became auto, though this didn't seem to matter.

JOB_EXECDIR_PERMISSIONS is unset; according to condor_config_val it defaults to user, which makes the dir_### permissions 700.

The EP is running Debian 12 (Bookworm) with the latest updates, using HTCondor packages from the research.cs.wisc.edu repository.


Thanks,
-Zach