Hi,
I have set up HTCondor on a Linux cluster, installed from the yum repo on CentOS 7.8. The CM is dual-NIC and all exec nodes are on a private LAN. I plan to use the file transfer method rather than a shared filesystem. When I submit jobs, slots on the exec node are allotted, but the job fails because of a file transfer failure. Below is a clipping from the job log:
007 (024.009.000) 06/12 01:40:08 Shadow exception!
Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
Secondly, I noticed an anomaly about SEC_PASSWORD_FILE. In the security config file, the line is:
SEC_PASSWORD_FILE = /etc/condor/password.d/POOL
However, in the StarterLog of that particular slot on the exec node, the directory appears as "passwords.d". I am unable to figure out where the directory is being set to "passwords.d" instead of "password.d". I grepped through the config files but could not find it.
Below are more lines from the StarterLog of the slot (on the exec node):
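For anyone wanting to repeat the search described above, here is a minimal Python sketch of it. The function name `find_mentions` and the default pattern are illustrative choices, not part of HTCondor, and `/etc/condor` is assumed to be the config root because that is where SEC_PASSWORD_FILE points. If nothing turns up, the path may be coming from a built-in default rather than from a file; `condor_config_val -dump` lists the effective values, including defaults.

```python
import os
import re


def find_mentions(root, pattern=r"passwords?\.d"):
    """Return (path, line_number, line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Files that cannot be opened (permissions, sockets) are skipped.
                continue
    return hits


if __name__ == "__main__":
    # Assumed config root; adjust if the local install keeps config elsewhere.
    for path, number, line in find_mentions("/etc/condor"):
        print(f"{path}:{number}: {line}")
```

The default pattern matches both "password.d" and "passwords.d", so a single run shows every file that mentions either spelling.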
06/12/20 02:43:29 (pid:39209) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 2 (No such file or directory)
06/12/20 02:43:29 (pid:39209) setting the orig job name in starter
06/12/20 02:43:29 (pid:39209) setting the orig job iwd in starter
06/12/20 02:43:29 (pid:39209) Chirp config summary: IO false, Updates false, Delayed updates true.
06/12/20 02:43:29 (pid:39209) Initialized IO Proxy.
06/12/20 02:43:29 (pid:39209) Done setting resource limits
06/12/20 02:43:29 (pid:39209) Set filetransfer runtime ads to /var/lib/condor/execute/dir_39209/.job.ad and /var/lib/condor/execute/dir_39209/.machine.ad.
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/box_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:29 (pid:39209) FILETRANSFER: "/usr/libexec/condor/gdrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39209) FILETRANSFER: "/usr/libexec/condor/onedrive_plugin.py -classad" did not produce any output, ignoring
06/12/20 02:43:30 (pid:39334) condor_read(): Socket closed abnormally when trying to read 5 bytes from daemon at <158.144.55.71:9618>, errno=104 Connection reset by peer
06/12/20 02:43:30 (pid:39209) File transfer failed (status=0).
06/12/20 02:43:30 (pid:39209) ERROR "Failed to transfer files" at line 2533 in file /var/lib/condor/execute/slot3/dir_3977/userdir/.tmpEsbepJ/BUILD/condor-8.9.7/src/condor_starter.V6.1/jic_shadow.cpp
06/12/20 02:43:30 (pid:39209) ShutdownFast all jobs.
06/12/20 02:43:30 (pid:39209) condor_write(): Socket closed when trying to write 222 bytes to <192.168.55.71:4652>, fd is 8
06/12/20 02:43:30 (pid:39209) Buf::write(): condor_write() failed
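The excerpt above mixes routine messages with the actual failures, and it mentions two different peer addresses (158.144.55.71:9618 and 192.168.55.71:4652). The following Python sketch pulls out the problem lines and the addresses so they are easier to compare. The regular expressions are assumptions based only on the line format shown in this excerpt, and `summarize` is a hypothetical helper, not an HTCondor tool.

```python
import re
import sys

# Line format as it appears in the excerpt: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
# Peer addresses are written as <a.b.c.d:port> in these messages.
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")
# Keywords chosen by eye from the excerpt; extend as needed.
PROBLEM_RE = re.compile(r"ERROR|failed|Can't|errno|Socket closed", re.IGNORECASE)


def summarize(lines):
    """Return (problem_lines, peer_addresses) found in StarterLog-style lines."""
    problems = []
    peers = set()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # not a timestamped log line
        stamp, pid, message = match.groups()
        for host, port in ADDR_RE.findall(message):
            peers.add((host, int(port)))
        if PROBLEM_RE.search(message):
            problems.append((stamp, int(pid), message))
    return problems, sorted(peers)


if __name__ == "__main__":
    # Usage: python summarize_starterlog.py <path to a StarterLog file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        found, addresses = summarize(handle)
    for stamp, pid, message in found:
        print(f"{stamp} pid={pid} {message}")
    for host, port in addresses:
        print(f"peer {host}:{port}")
```

On the lines quoted here, this would report one address on the public network and one on the private LAN, which may be worth checking against the dual-NIC setup of the CM.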
Where could it be picking up a setting different from the one in the file in config.d? Or is there some other error?
Thanks for helping out!
Nagaraj
HTCondor-users mailing list