Hi Jaime,
Added the additional information from starter log [1] and shadow log [2].
While I was unable to recover the original files of the test, I've copied other grid job files for testing (proxy + .gahp_complete).
Local submission worked again (even as a grid user).
With network debug enabled, I get
04/17/20 09:56:05 condor_read(fd=15 <137.222.79.6:22455>,,size=5,timeout=1,flags=2,non_blocking=0)
04/17/20 09:56:05 condor_read(fd=15 <137.222.79.6:22455>,,size=5,timeout=0,flags=0,non_blocking=1)
04/17/20 09:56:05 condor_read(fd=15 <137.222.79.6:22455>,,size=733,timeout=0,flags=0,non_blocking=1)
04/17/20 09:56:05 condor_read(fd=15 <137.222.79.6:22455>,,size=21,timeout=20,flags=0,non_blocking=0)
04/17/20 09:56:05 condor_read(fd=15 <137.222.79.6:22455>,,size=191,timeout=20,flags=0,non_blocking=0)
04/17/20 09:56:05 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#79#...)
for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
04/17/20 09:56:05 condor_write(fd=15 <137.222.79.6:22455>,,size=29,timeout=20,flags=0,non_blocking=0)
04/17/20 09:56:05 condor_write(): socket 15 is readable
04/17/20 09:56:05 condor_write(): Socket closed when trying to write 29 bytes to <137.222.79.6:22455>, fd is 15
04/17/20 09:56:05 Buf::write(): condor_write() failed
04/17/20 09:56:05 CLOSE TCP <10.129.5.223:9618> fd=15
Should I be concerned about "Error: can't find resource with ClaimId" or "socket 15 is readable" (should it be writable)?
Cheers,
Luke
[1]
04/15/20 16:15:50 (pid:119763) Submitting machine is "lcgce01.phy.bris.ac.uk"
04/15/20 16:15:50 (pid:119763) setting the orig job name in starter
04/15/20 16:15:50 (pid:119763) setting the orig job iwd in starter
04/15/20 16:15:50 (pid:119763) Chirp config summary: IO false, Updates false, Delayed updates true.
04/15/20 16:15:50 (pid:119763) Initialized IO Proxy.
04/15/20 16:15:50 (pid:119763) Done setting resource limits
04/15/20 16:15:50 (pid:119763) Got SIGTERM. Performing graceful shutdown.
04/15/20 16:15:50 (pid:119763) ShutdownGraceful all jobs.
04/15/20 16:15:50 (pid:119763) Failed to open '.update.ad' to read update ad: No such file or directory (2).
04/15/20 16:15:50 (pid:119763) All jobs have exited... starter exiting
04/15/20 16:15:50 (pid:119763) After chmod(), still can't remove "/condor/condor/dir_119763" as directory owner, giving up!
04/15/20 16:15:50 (pid:119763) **** condor_starter (condor_STARTER) pid 119763 EXITING WITH STATUS 0
04/15/20 16:15:50 (pid:119763) ERROR "Assertion ERROR on (daemonCore)" at line 4032 in file /var/lib/condor/execute/slot2/dir_13658/userdir/.tmpgZgx4N/BUILD/condor-8.8.8/src/condor_utils/file_transfer.cpp
04/15/20 16:15:51 (pid:119772) ******************************************************
04/15/20 16:15:51 (pid:119772) ** condor_starter (CONDOR_STARTER) STARTING UP
04/15/20 16:15:51 (pid:119772) ** /usr/sbin/condor_starter
04/15/20 16:15:51 (pid:119772) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/15/20 16:15:51 (pid:119772) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/15/20 16:15:51 (pid:119772) ** $CondorVersion: 8.8.8 Mar 19 2020 BuildID: 498525 PackageID: 8.8.8-1 $
04/15/20 16:15:51 (pid:119772) ** $CondorPlatform: x86_64_CentOS7 $
04/15/20 16:15:51 (pid:119772) ** PID = 119772
04/15/20 16:15:51 (pid:119772) ** Log last touched 4/15 16:15:50
04/15/20 16:15:51 (pid:119772) ******************************************************
04/15/20 16:15:51 (pid:119772) Using config source: /etc/condor/condor_config
04/15/20 16:15:51 (pid:119772) Using local config sources:
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/00_config_local.config
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/10_security.config
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/20_workernode.config
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/50_singularity.config
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/60_docker.config
04/15/20 16:15:51 (pid:119772) /etc/condor/config.d/888_ipv6_off.config
04/15/20 16:15:51 (pid:119772) /etc/condor/condor_config.local
04/15/20 16:15:51 (pid:119772) config Macros = 162, Sorted = 161, StringBytes = 6393, TablesBytes = 5920
04/15/20 16:15:51 (pid:119772) CLASSAD_CACHING is OFF
04/15/20 16:15:51 (pid:119772) Daemon Log is logging: D_ALWAYS D_ERROR
04/15/20 16:15:51 (pid:119772) SharedPortEndpoint: waiting for connections to named socket 114634_c5ef_13
04/15/20 16:15:51 (pid:119772) DaemonCore: command socket at <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=114634_c5ef_13>
04/15/20 16:15:51 (pid:119772) DaemonCore: private command socket at <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=114634_c5ef_13>
04/15/20 16:15:51 (pid:119772) Communicating with shadow <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 (pid:119772) Submitting machine is "lcgce01.phy.bris.ac.uk"
04/15/20 16:15:51 (pid:119772) setting the orig job name in starter
04/15/20 16:15:51 (pid:119772) setting the orig job iwd in starter
04/15/20 16:15:51 (pid:119772) Chirp config summary: IO false, Updates false, Delayed updates true.
04/15/20 16:15:51 (pid:119772) Initialized IO Proxy.
04/15/20 16:15:51 (pid:119772) Done setting resource limits
04/15/20 16:15:51 (pid:119772) Got SIGTERM. Performing graceful shutdown.
04/15/20 16:15:51 (pid:119772) ShutdownGraceful all jobs.
04/15/20 16:15:51 (pid:119772) Failed to open '.update.ad' to read update ad: No such file or directory (2).
04/15/20 16:15:51 (pid:119772) All jobs have exited... starter exiting
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/.gahp_complete, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119775) get_file(): consumed 0 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/.gahp_complete: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/glidein_startup.sh, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119772) **** condor_starter (condor_STARTER) pid 119772 EXITING WITH STATUS 0
04/15/20 16:15:51 (pid:119772) ERROR "Assertion ERROR on (daemonCore)" at line 4032 in file /var/lib/condor/execute/slot2/dir_13658/userdir/.tmpgZgx4N/BUILD/condor-8.8.8/src/condor_utils/file_transfer.cpp
04/15/20 16:15:51 (pid:119775) get_file(): consumed 69546 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/glidein_startup.sh: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/condor_exec.exe, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119775) get_file(): consumed 13373 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/condor_exec.exe: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) Failed to set execute bit on /condor/condor/dir_119772/condor_exec.exe, errno=2 (No such file or directory)
[2]
04/15/20 16:15:51 config Macros = 142, Sorted = 142, StringBytes = 6011, TablesBytes = 2352
04/15/20 16:15:51 CLASSAD_CACHING is OFF
04/15/20 16:15:51 Daemon Log is logging: D_ALWAYS D_ERROR
04/15/20 16:15:51 SharedPortEndpoint: waiting for connections to named socket 2337399_cbe3_600
04/15/20 16:15:51 DaemonCore: command socket at <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 DaemonCore: private command socket at <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 Initializing a VANILLA shadow for job 3712168.0
04/15/20 16:15:51 (3712168.0) (2359091): Request to run on slot1_1@xxxxxxxxxxxxxx <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3> was ACCEPTED
04/15/20 16:15:51 (3712168.0) (2359091): Job 3712168.0 going into Hold state (code 5,0): The job attribute PeriodicRemove _expression_ 'false || ResidentSetSize > JobMemoryLimit' evaluated to UNDEFINED
04/15/20 16:15:51 (3712168.0) (2359091): Job 3712168.0 going into Hold state (code 5,0): The job attribute PeriodicRemove _expression_ 'false || ResidentSetSize > JobMemoryLimit' evaluated to UNDEFINED
04/15/20 16:15:51 (3712168.0) (2359091): DoUpload: (Condor error code 12, subcode 2) SHADOW at 137.222.79.6 failed to send file(s) to <10.129.5.223:38734>; STARTER at 10.129.5.223 failed to write to file /condor/condor/dir_119772/condor_exec.exe: (errno 2) No such file or directory
04/15/20 16:15:51 (3712168.0) (2359091): File transfer failed (status=0).
04/15/20 16:15:52 (3712168.0) (2359091): Job 3712168.0 is being evicted from slot1_1@xxxxxxxxxxxxxx
04/15/20 16:15:52 (3712168.0) (2359091): logEvictEvent with unknown reason (112), not logging.
04/15/20 16:15:52 (3712168.0) (2359091): **** condor_shadow (condor_SHADOW) pid 2359091 EXITING WITH STATUS 112
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: 16 April 2020 03:26 To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx> Subject: Re: [HTCondor-users] Jobs failing to transfer files "condor_write(): Socket closed when trying to write" The starter log on the execute machine and the shadow log on the submit machine will have more details on the file transfers.
In attempting to reproduce the failure with a local submission, have you tried replicating the exact set of input files to be transferred?
- Jaime
|