Hi Steffen,

just guessing, but have you checked the number of open file handles?
(There's a rough sketch of what I mean at the very bottom of this mail.)

Cheers,
 Thomas

On 03/09/2020 09.38, Steffen Grunewald wrote:
> Hi Mark, all,
>
> please find my comments below...
>
> On Wed, 2020-09-02 at 17:05:52 -0500, Mark Coatsworth wrote:
>> Hi Steffen, a few things to think about here.
>>
>> Since your condor_starter is able to create the
>> /var/lib/condor/execute/dir_34972 directory, this implies it's not a
>> higher-level permission or write-access problem.
>
> Indeed - and since "all users are equal" with respect to ownership/permission
> settings, I ran a test to verify that there would be no "black hole" nodes -
> and found none.
>
>> Are these jobs failing consistently in your pool? Or does the problem
>> seem isolated to a subset of misbehaving nodes?
>
> There is/was only a small subset of nodes providing enough resources,
> so the effect looked isolated, but it isn't - at least not in terms of
> Condor or OS setup.
>
> What I found is that all affected jobs are DAG nodes that had been running
> before, and it rather looks like their shadows have a problem.
> In the worst case I will have to tell the (single) affected user to condor_rm
> the held jobs and go for rescue DAGs. Since this involves extra work for him,
> I'd like to find out more, and whether I can still do something.
>
>> You mentioned the execute nodes have other
>> /var/lib/condor/execute/dir_NNNNN folders. Can you let us know what
>> the ownership and permissions look like on these folders, and the
>> files inside them?
>
> NNNNN doesn't seem to be the PID of the starter, nor related to the job id.
> Is there a translation table somewhere? NNNNN doesn't seem to stay
> constant over multiple restart attempts, so I would not expect any
> collisions to persist while jobs change compute nodes.
>
>> I'm wondering if it's possible your target execute directory and files
>> already exist, and they have ownership or permissions that do not
>> allow us to overwrite files. Can you verify the folders mentioned in
>> the error messages do not exist?
>
> That would not explain why the effect travels across the cluster.
> No other jobs have shown such behaviour since; it's just a set of 17 jobs
> that got harmed by a network failure (a switch disconnected a whole rack
> from the pool).
>
>> Is /var mounted from the local disk, or from a shared file system?
>> (I'm assuming not a shared file system! But it's worth making sure)
>
> /var is local everywhere. I've been thinking about mounting /var/lib/condor(/spool)
> from a file server for the headnodes only, to more easily preserve their
> histories, but didn't get that far yet.
>
>> Lastly, could you show us the job submit file you're using? This might
>> have some clues.
>
> Since there are multiple DAGs involved, I've got to ask the user for them...
>
> - S
>
>> Mark
>>
>> On Wed, Sep 2, 2020 at 2:56 AM Steffen Grunewald
>> <steffen.grunewald@xxxxxxxxxx> wrote:
>>>
>>> Good morning/afternoon/whatever,
>>>
>>> starting two days ago, I'm getting reports of failed job starts.
>>> Affected jobs get held with the following reason:
>>>
>>> (12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
>>>
>>> on multiple nodes. Removing those nodes from the pool causes the disease
>>> to spread to other nodes as well.
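>>>
>>> (To collect the hold reasons in one go I've been using roughly this -
>>> just a sketch; -held restricts condor_q to jobs in the held state:)
>>>
>>>   condor_q -held -af ClusterId ProcId LastRemoteHost HoldReason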
>>>
>>> Checking the starter logs on the node, I see
>>>
>>> StarterLog.slot1_4:20-09-02_09:39:36 ******************************************************
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** condor_starter (CONDOR_STARTER) STARTING UP
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** /usr/sbin/condor_starter
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** $CondorPlatform: X86_64-Debian_9 $
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** PID = 34972
>>> StarterLog.slot1_4:20-09-02_09:39:36 ** Log last touched 9/1 13:46:10
>>> StarterLog.slot1_4:20-09-02_09:39:36 ******************************************************
>>> StarterLog.slot1_4:20-09-02_09:39:36 Using config source: /etc/condor/condor_config
>>> StarterLog.slot1_4:20-09-02_09:39:36 Using local config sources:
>>> StarterLog.slot1_4:20-09-02_09:39:36    /etc/default/condor_config|
>>> StarterLog.slot1_4:20-09-02_09:39:36 config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
>>> StarterLog.slot1_4:20-09-02_09:39:36 CLASSAD_CACHING is OFF
>>> StarterLog.slot1_4:20-09-02_09:39:36 Daemon Log is logging: D_ALWAYS D_ERROR
>>> StarterLog.slot1_4:20-09-02_09:39:36 Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
>>> StarterLog.slot1_4:20-09-02_09:39:36 DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
>>> StarterLog.slot1_4:20-09-02_09:39:36 DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
>>> StarterLog.slot1_4:20-09-02_09:39:36 Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
>>> StarterLog.slot1_4:20-09-02_09:39:36 Submitting machine is "hypatia2.hypatia.local"
>>> StarterLog.slot1_4:20-09-02_09:39:36 setting the orig job name in starter
>>> StarterLog.slot1_4:20-09-02_09:39:36 setting the orig job iwd in starter
>>> StarterLog.slot1_4:20-09-02_09:39:36 Chirp config summary: IO false, Updates false, Delayed updates true.
>>> StarterLog.slot1_4:20-09-02_09:39:36 Initialized IO Proxy.
>>> StarterLog.slot1_4:20-09-02_09:39:36 Done setting resource limits
>>> StarterLog.slot1_4:20-09-02_09:39:37 get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
>>> StarterLog.slot1_4:20-09-02_09:39:37 get_file(): consumed 7358 bytes of file transmission
>>> StarterLog.slot1_4:20-09-02_09:39:37 DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
>>> StarterLog.slot1_4:20-09-02_09:39:37 get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
>>> StarterLog.slot1_4:20-09-02_09:39:37 get_file(): consumed 7788 bytes of file transmission
>>> StarterLog.slot1_4:20-09-02_09:39:37 DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
>>> StarterLog.slot1_4:20-09-02_09:39:37 File transfer failed (status=0).
>>> StarterLog.slot1_4:20-09-02_09:39:37 ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
>>> StarterLog.slot1_4:20-09-02_09:39:37 ShutdownFast all jobs.
>>> StarterLog.slot1_4:20-09-02_09:39:37 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
>>> StarterLog.slot1_4:20-09-02_09:39:37 IO: Failed to read packet header
>>> StarterLog.slot1_4:20-09-02_09:39:37 Lost connection to shadow, waiting 2400 secs for reconnect
>>> StarterLog.slot1_4:20-09-02_09:39:37 All jobs have exited... starter exiting
>>> StarterLog.slot1_4:20-09-02_09:39:37 **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0
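>>>
>>> (For the record, I inspected the execute directories on the node roughly
>>> like this - a plain shell sketch, run as root on the execute node:)
>>>
>>>   ls -ld /var/lib/condor/execute
>>>   stat -c '%U:%G %a %n' /var/lib/condor/execute/dir_*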
>>>
>>> /var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
>>> it contains other active dir_* entries.
>>> /var is mounted read-only and almost empty.
>>> On the submit node, /var/lib/condor/execute is empty, and apparently always has been.
>>>
>>> Any suggestions on how to debug this further?
>>>
>>> Thanks,
>>>  Steffen
>>>
>>> --
>>> Steffen Grunewald, Cluster Administrator
>>> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
>>> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
>>> ~~~
>>> Fon: +49-331-567 7274
>>> Mail: steffen.grunewald(at)aei.mpg.de
>>> ~~~
>>
>> --
>> Mark Coatsworth
>> Systems Programmer
>> Center for High Throughput Computing
>> Department of Computer Sciences
>> University of Wisconsin-Madison
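P.S. The file-handle check I had in mind, as a rough sketch (assumes Linux
/proc; run on an affected execute node):

  # open file descriptors per condor process, busiest first
  for pid in $(pgrep condor); do
      echo "$(ls /proc/$pid/fd 2>/dev/null | wc -l) $pid $(cat /proc/$pid/comm)"
  done | sort -rn | head
  # system-wide: allocated, free, and maximum file handles
  cat /proc/sys/fs/file-nr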