Hi,
I have set up HTCondor on a Linux cluster. I installed it from the yum repo on CentOS 7.8. The central manager (CM) is dual-NIC and all exec nodes are on a private LAN. I plan to use the file transfer mechanism rather than a shared filesystem. When I submit jobs, slots on the exec node are allotted, but the jobs fail because of a file transfer failure. Below is a clipping from the job log:
007 (024.009.000) 06/12 01:40:08 Shadow exception!
        Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
Secondly, I notice an anomaly regarding SEC_PASSWORD_FILE. The
security config file contains the following line:
SEC_PASSWORD_FILE = /etc/condor/password.d/POOL
However, in the StarterLog of that particular slot on the exec
node, the directory appears as "passwords.d". I cannot figure out
where the directory gets set to "passwords.d" instead of
"password.d". I grepped through the config files and found nothing.
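In case it helps anyone repeat the search, here is a minimal sketch of that kind of scan in Python. It assumes the configuration lives under /etc/condor; `find_setting` is a hypothetical helper written for this post, not part of HTCondor. A value that comes from a compiled-in default rather than from a file would not show up in a search like this.

```python
import os


def find_setting(root, needle):
    """Return (path, line_number, line) for every line under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going past non-UTF-8 bytes
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it
                continue
    return hits


if __name__ == "__main__":
    # /etc/condor is an assumed location; adjust to the local install
    for path, number, line in find_setting("/etc/condor", "passwords.d"):
        print(f"{path}:{number}: {line}")
```

Run as root on the exec node so that files readable only by root are included.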
Below are more lines from the StarterLog of the slot (on the exec
node):
06/12/20 02:43:29 (pid:39209) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime ads to /var/lib/condor/execute/dir_39209/.job.ad and /var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/box_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/gdrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER: "/usr/libexec/condor/onedrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed (status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer files" at line 2533 in file /var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket closed when trying to write 222 bytes to <192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write(): condor_write() failed
Where could it be picking up a different setting from what is
in the file in config.d? Or is there some other error?
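On the "other error" side, the two socket messages in the StarterLog name different addresses: 158.144.55.71:9618 and 192.168.55.71:4652. Since the CM is dual-NIC, this may or may not be relevant. Below is a minimal sketch for listing every `<ip:port>` that appears in a log and flagging which ones fall in private ranges. `extract_addresses` is a hypothetical helper of my own, not an HTCondor tool, and it only handles the IPv4 form seen in the excerpt.

```python
import ipaddress
import re

# Matches the "<ip:port>" form seen in the StarterLog excerpt (IPv4 only)
ADDRESS = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def extract_addresses(log_text):
    """Return sorted (ip, port, is_private) tuples for each distinct <ip:port>."""
    found = set()
    for ip, port in ADDRESS.findall(log_text):
        try:
            private = ipaddress.ip_address(ip).is_private
        except ValueError:
            # Looked like an address but is not valid IPv4 (e.g. an octet > 255)
            continue
        found.add((ip, int(port), private))
    return sorted(found)


if __name__ == "__main__":
    # The two socket messages from the StarterLog excerpt in this post
    sample = (
        "condor_read(): Socket closed abnormally when trying to read 5 bytes "
        "from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer\n"
        "condor_write(): Socket closed when trying to write 222 bytes "
        "to <192.168.55.71:4652>, fd is 8\n"
    )
    for ip, port, private in extract_addresses(sample):
        print(ip, port, "private" if private else "public")
```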
Thanks for helping out!
Nagaraj