Hi Todd,

The re-run was with all 5,000 jobs, and no errors occur if encrypt_execute_directory is false.

I think some sort of race condition is likely, as it seems to be worse on nodes with more cores/slots. I re-ran just 50 jobs, and targeted (via the requirements statement) a single Windows execute node that has 36 cores/slots (2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz - 18 cores/cpu). At one stage there were 24/50 jobs on hold:

120822.0 na-hit023 9/15 10:50 Error from slot1@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56553>: error reading from C:\PROGRA~1\condor\execute\dir_16464\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63593>
120822.1 na-hit023 9/15 10:51 Error from slot2@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56582>: error reading from C:\PROGRA~1\condor\execute\dir_14952\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63687>
120822.2 na-hit023 9/15 10:51 Error from slot3@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56581>: error reading from C:\PROGRA~1\condor\execute\dir_18480\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63699>
120822.3 na-hit023 9/15 10:52 Error from slot4@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56583>: error reading from C:\PROGRA~1\condor\execute\dir_32332\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63757>
120822.4 na-hit023 9/15 10:51 Error from slot5@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56584>: error reading from C:\PROGRA~1\condor\execute\dir_33716\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63751>
120822.11 na-hit023 9/15 10:52 Error from slot13@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56645>: error reading from C:\PROGRA~1\condor\execute\dir_6656\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63756>
120822.13 na-hit023 9/15 10:50 Error from slot15@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_13636\condor_exec.exe: (errno 13) Permission denied
120822.14 na-hit023 9/15 10:50 Error from slot16@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_29624\_condor_stdout' as standard output: Permission denied (errno 13)
120822.15 na-hit023 9/15 10:50 Error from slot17@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_23092\condor_exec.exe: (errno 13) Permission denied
120822.16 na-hit023 9/15 10:50 Error from slot18@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_13068\_condor_stdout' as standard output: Permission denied (errno 13)
120822.17 na-hit023 9/15 10:50 Error from slot19@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_11092\_condor_stdout' as standard output: Permission denied (errno 13)
120822.18 na-hit023 9/15 10:50 Error from slot20@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_22992\_condor_stdout' as standard output: Permission denied (errno 13)
120822.20 na-hit023 9/15 10:50 Error from slot22@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_11184\condor_exec.exe: (errno 13) Permission denied
120822.21 na-hit023 9/15 10:50 Error from slot23@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_19296\condor_exec.exe: (errno 13) Permission denied
120822.22 na-hit023 9/15 10:50 Error from slot24@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_31936\condor_exec.exe: (errno 13) Permission denied
120822.25 na-hit023 9/15 10:50 Error from slot27@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_10332\condor_exec.exe: (errno 13) Permission denied
120822.26 na-hit023 9/15 10:51 Error from slot28@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56725>: error reading from C:\PROGRA~1\condor\execute\dir_29008\_condor_stderr: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63725>
120822.27 na-hit023 9/15 10:50 Error from slot29@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_21172\condor_exec.exe: (errno 13) Permission denied
120822.29 na-hit023 9/15 10:50 Error from slot31@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_28180\condor_exec.exe: (errno 13) Permission denied
120822.30 na-hit023 9/15 10:50 Error from slot32@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_33184\condor_exec.exe: (errno 13) Permission denied
120822.32 na-hit023 9/15 10:50 Error from slot34@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_22980\condor_exec.exe: (errno 13) Permission denied
120822.33 na-hit023 9/15 10:50 Error from slot35@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_17172\condor_exec.exe: (errno 13) Permission denied
120822.34 na-hit023 9/15 10:50 Error from slot36@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_17596\condor_exec.exe: (errno 13) Permission denied
120822.47 na-hit023 9/15 10:51 Error from slot17@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_30664\condor_exec.exe: (errno 13) Permission denied

but 9 jobs still ran to completion OK, before I killed the rest of the jobs.
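
For anyone wanting to break hold reasons like the ones above down by failure type, here is a minimal Python sketch. It assumes the hold listing has been saved as text with one job per line; the category labels and the function names (classify, tally) are made up for illustration and are not HTCondor terminology. The patterns are taken only from the three message shapes that appear in the listing.

```python
import re
from collections import Counter

# Patterns copied from the hold messages quoted above.
# The labels on the left are illustrative names, not HTCondor terms.
CATEGORIES = [
    ("read_output", re.compile(r"failed to send file\(s\).*error reading from")),
    ("write_executable", re.compile(r"failed to write to file")),
    ("open_stdout", re.compile(r"Failed to open '.*' as standard output")),
]


def classify(reason):
    """Return the label of the first pattern found in a hold line."""
    for label, pattern in CATEGORIES:
        if pattern.search(reason):
            return label
    return "other"


def tally(lines):
    """Count non-empty hold lines per category."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[classify(line)] += 1
    return counts


# Three lines from the listing above, one of each shape.
sample = [
    r"120822.0 na-hit023 9/15 10:50 Error from slot1@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to send file(s) to <aaa.bbb.106.167:56553>: error reading from C:\PROGRA~1\condor\execute\dir_16464\_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <xxx.yyy.160.92:63593>",
    r"120822.13 na-hit023 9/15 10:50 Error from slot15@xxxxxxxxxxxxxxx: STARTER at xxx.yyy.160.92 failed to write to file C:\PROGRA~1\condor\execute\dir_13636\condor_exec.exe: (errno 13) Permission denied",
    r"120822.14 na-hit023 9/15 10:50 Error from slot16@xxxxxxxxxxxxxxx: Failed to open 'C:\PROGRA~1\condor\execute\dir_29624\_condor_stdout' as standard output: Permission denied (errno 13)",
]

for label, count in sorted(tally(sample).items()):
    print(label, count)
```

Splitting the counts this way separates failures at job start (writing condor_exec.exe, opening _condor_stdout) from failures after the job has finished (reading _condor_stdout or _condor_stderr for transfer back), which may help when comparing runs.
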

I ran this with full debug (ALL_DEBUG = D_FULLDEBUG) on both submit and execute nodes, but there didn't seem to be any extra info in the logs that explained what was happening. I can send you the logs offline if you think that may help.

Meanwhile I'll try the output remap as another way of getting the output file onto the fileserver, although that is a separate issue from the above errors.

Thanks

Cheers

Greg

P.S. I ran the 50 jobs twice more, on the one execute node, each time with periodic_release set to true. These jobs just chew CPU for 5 mins, plus file download/upload times. I have attached a ganglia graph of the jobs' progress for each run.

Run 1 - encrypt_execute_directory = true
50 jobs took 21 mins total throughput time. 30 jobs were put on "hold" at some stage: 18 once, 10 twice, 2 three times. All eventually ran to completion.

Run 2 - encrypt_execute_directory = false
50 jobs took 13 mins total throughput time. No jobs were put on hold.

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd L Miller
Sent: Wednesday, 15 September 2021 2:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] EncryptExecuteDirectory issues on Windows execute nodes without run_as_owner

> One kludge around is to use the "cipher" command to decrypt the file
> before uploading it, e.g.

You could also potentially use HTCondor's file-transfer mechanism, although it will end up being a little less efficient in this case: if the submit node can mount \\fileserver, your jobs could terminate after creating outputfile.dat but specify

	transfer_output_files = outputfile.dat
	transfer_output_remaps = outputfile.dat=\\fileserver\user\output

HTCondor will read outputfile.dat as the condor-slot user and transfer it to a daemon running on the submit node as the owner of the job, which (should) allow that daemon to write to \\fileserver\user\output.

> So that's the FYI bit, and once users can run_as_owner I don't think
> this shouldn't be a problem?

Indeed.

> These must be related to the encrypt_execute_directory stuff because we
> can re-run the jobs with NO execute directory encryption enabled and do
> not get these errors.

Do you re-run all 5,000 jobs and get no failures, or just the failed 150?

> So I guess the question is does anyone have any ideas as to why these
> errors are occurring? And only when encryptexecutedirectory is set to
> true?

I'm a little more worried by failing to read from the standard error log after the job has finished than the two errors failing to create the log files. Failing to write to the log after creating it is also very strange. It makes me wonder if there's a clean-up process going astray somewhere, possibly because of a race condition made worse by encrypting the execute directory.

- ToddM

Attachment: jobs.JPG