Hi,
I have set up HTCondor on a Linux cluster. I installed it from
the yum repo on CentOS 7.8. The CM is dual-NIC and all exec
nodes are on a private LAN. I plan to use the file transfer
method rather than a shared filesystem. When I submit jobs,
slots on the exec node are allotted, but the jobs fail because
of a file transfer failure. Below is a clipping from the job
log:
007 (024.009.000) 06/12 01:40:08 Shadow exception!
        Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
Secondly, I notice an anomaly regarding SEC_PASSWORD_FILE. The
security config file contains the following line:
SEC_PASSWORD_FILE = /etc/condor/password.d/POOL
However, in the StarterLog of that slot on the exec node, the
directory appears as "passwords.d". I am unable to figure out
where the directory is set to "passwords.d" instead of
"password.d". I grepped through the config files but could
not find it.
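A search of this kind can be sketched in Python as below. This is only an illustration, not the exact command that was run: the function name find_setting is hypothetical, and the root directory /etc/condor is an assumption based on the path in the config line above.

```python
import os


def find_setting(root, needle):
    """Return (path, line_number, line) for each line under root containing needle."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on non-UTF-8 files
                with open(path, "r", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, lineno, line.rstrip("\n")))
            except OSError:
                # unreadable file (permissions, broken symlink): skip it
                continue
    return hits


if __name__ == "__main__":
    # Assumed config root; adjust to wherever the pool's config actually lives.
    for path, lineno, line in find_setting("/etc/condor", "passwords.d"):
        print(f"{path}:{lineno}: {line}")
```

If this prints nothing, the string does not appear in any file under that directory, which would suggest the path comes from somewhere other than the config files, for example a built-in default. Running condor_config_val -v SEC_PASSWORD_FILE on the exec node should also report where that knob is defined.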
Below are more lines from the StarterLog of the slot (on
the exec node):
06/12/20 02:43:29 (pid:39209) Can't
open directory "/etc/condor/passwords.d" as PRIV_ROOT,
errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name
in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd
in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO
false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource
limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime
ads to /var/lib/condor/execute/dir_39209/.job.ad and
/var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER:
"/usr/libexec/condor/box_plugin.py -classad" did not
produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER:
"/usr/libexec/condor/gdrive_plugin.py -classad" did not
produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER:
"/usr/libexec/condor/onedrive_plugin.py -classad" did
not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket
closed abnormally when trying to read 5 bytes from
daemon at <158.144.55.71:9618>,
errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed
(status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer
files" at line 2533 in file
/var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket
closed when trying to write 222 bytes to
<192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write():
condor_write() failed
Where could it be picking up a setting different from the
one in the file in config.d? Or is there some other error?
Thanks for helping out!
Nagaraj