
Re: [HTCondor-users] Recurring problem with job starts



Hi Steffen,

just guessing, but have you checked the number of open file handles?
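
For example, on an affected execute node (a rough check, assuming a root
shell and a single condor_startd; exact limits depend on your setup):

  # system-wide handle usage: allocated, free, maximum
  cat /proc/sys/fs/file-nr
  # per-process limit and current usage of the startd
  grep 'Max open files' /proc/$(pidof condor_startd)/limits
  ls /proc/$(pidof condor_startd)/fd | wc -l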

Cheers,
  Thomas


On 03/09/2020 09.38, Steffen Grunewald wrote:
> Hi Mark, all,
> 
> please find my comments below...
> 
> On Wed, 2020-09-02 at 17:05:52 -0500, Mark Coatsworth wrote:
>> Hi Steffen, a few things to think about here.
>>
>> Since your condor_starter is able to create the
>> /var/lib/condor/execute/dir_34972 directory, this implies it's not a
>> higher level permission or write access problem.
> 
> Indeed - and since "all users are equal" with respect to ownership/permission
> settings, I ran a test to verify that there would be no "black hole" nodes -
> and found none.
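> 
> For reference, such a probe can be as simple as a trivial job pinned to
> one node at a time (the machine name below is just a placeholder):
> 
>   universe     = vanilla
>   executable   = /bin/true
>   requirements = (Machine == "nodeNN.hypatia.local")
>   log          = probe.log
>   queue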
> 
>> Are these jobs failing consistently in your pool? Or does the problem
>> seem isolated to a subset of misbehaving nodes?
> 
> There is/was only a small subset of nodes providing enough resources,
> so the effect looked isolated, but it isn't - at least not in terms of
> the Condor or OS setup.
> 
> What I found is that all affected jobs are DAG nodes that had been running
> before, and it rather looks like their shadows have a problem.
> In the worst case I will have to tell the (single) affected user to condor_rm
> the held jobs and go for rescue DAGs. Since this involves extra work for him,
> I'd like to find out more first, and whether I can still do something.
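> 
> If it comes to that, the cleanup would be roughly the following, with
> "username" and "workflow.dag" as placeholders:
> 
>   # remove the user's held jobs (JobStatus == 5 means Held)
>   condor_rm -constraint 'Owner == "username" && JobStatus == 5'
>   # resubmitting the DAG normally picks up the newest rescue DAG automatically
>   condor_submit_dag workflow.dag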
> 
>> You mentioned the execute nodes have other
>> /var/lib/condor/execute/dir_NNNNN folders. Can you let us know what
>> the ownership and permissions look like on these folders, and the
>> files inside them?
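>>
>> For example, something like this on one of the affected nodes should show
>> everything we need (dir_NNNNN being any of the existing directories):
>>
>>   ls -ldn /var/lib/condor/execute /var/lib/condor/execute/dir_*
>>   ls -lan /var/lib/condor/execute/dir_NNNNN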
> 
> NNNNN doesn't seem to be the PID of the starter, nor related to the job id.
> Is there a translation table somewhere? NNNNN doesn't seem to stay
> constant over multiple restart attempts, so I would not expect any
> collisions to persist while changing compute nodes.
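> 
> For the directories that do get populated, the owning job can be identified
> on the execute node with something like this (dir_NNNNN is a placeholder):
> 
>   condor_who
>   grep -E 'ClusterId|ProcId|GlobalJobId' /var/lib/condor/execute/dir_NNNNN/.job.ad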
> 
>> I'm wondering if it's possible your target execute directory and files
>> already exist, and they have ownership or permissions that do not
>> allow us to overwrite files. Can you verify the folders mentioned in
>> the error messages do not exist?
> 
> That would not explain why the effect travels across the cluster.
> No other jobs have shown such behaviour since; it's just a set of 17 jobs
> that got harmed by a network failure (a switch disconnected a whole rack
> from the pool).
> 
>> Is /var mounting from the local disk, or from a shared file system?
>> (I'm assuming not a shared file system! But it's worth making sure)
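>>
>> For example, on an execute node something like
>>
>>   findmnt -T /var     # or: df -hT /var
>>
>> will show the source device and filesystem type.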
> 
> /var is local everywhere. I've been thinking about mounting /var/lib/condor(/spool)
> from a file server for the headnodes only, to more easily preserve their
> histories, but didn't get that far yet.
> 
>> Lastly, could you show us the job submit file you're using? This might
>> have some clues.
> 
> Since there are multiple DAGs involved I've got to ask for them...
> 
> - S
> 
>>
>> Mark
>>
>>
>> On Wed, Sep 2, 2020 at 2:56 AM Steffen Grunewald
>> <steffen.grunewald@xxxxxxxxxx> wrote:
>>>
>>> Good morning/afternoon/whatever,
>>>
>>> starting two days ago, I'm getting reports of failed job starts. Jobs affected
>>> get held with the following reason:
>>>
>>> (12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
>>>
>>> on multiple nodes. Removing them from the pool causes the disease to spread to other nodes as well.
>>>
>>> Checking the starter logs on the node, I see
>>>
>>> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** condor_starter (CONDOR_STARTER) STARTING UP
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** /usr/sbin/condor_starter
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorPlatform: X86_64-Debian_9 $
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** PID = 34972
>>> StarterLog.slot1_4:20-09-02_09:39:36  ** Log last touched 9/1 13:46:10
>>> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
>>> StarterLog.slot1_4:20-09-02_09:39:36  Using config source: /etc/condor/condor_config
>>> StarterLog.slot1_4:20-09-02_09:39:36  Using local config sources:
>>> StarterLog.slot1_4:20-09-02_09:39:36     /etc/default/condor_config|
>>> StarterLog.slot1_4:20-09-02_09:39:36  config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
>>> StarterLog.slot1_4:20-09-02_09:39:36  CLASSAD_CACHING is OFF
>>> StarterLog.slot1_4:20-09-02_09:39:36  Daemon Log is logging: D_ALWAYS D_ERROR
>>> StarterLog.slot1_4:20-09-02_09:39:36  Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
>>> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
>>> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
>>> StarterLog.slot1_4:20-09-02_09:39:36  Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
>>> StarterLog.slot1_4:20-09-02_09:39:36  Submitting machine is "hypatia2.hypatia.local"
>>> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job name in starter
>>> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job iwd in starter
>>> StarterLog.slot1_4:20-09-02_09:39:36  Chirp config summary: IO false, Updates false, Delayed updates true.
>>> StarterLog.slot1_4:20-09-02_09:39:36  Initialized IO Proxy.
>>> StarterLog.slot1_4:20-09-02_09:39:36  Done setting resource limits
>>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
>>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7358 bytes of file transmission
>>> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
>>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
>>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7788 bytes of file transmission
>>> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
>>> StarterLog.slot1_4:20-09-02_09:39:37  File transfer failed (status=0).
>>> StarterLog.slot1_4:20-09-02_09:39:37  ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
>>> StarterLog.slot1_4:20-09-02_09:39:37  ShutdownFast all jobs.
>>> StarterLog.slot1_4:20-09-02_09:39:37  condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
>>> StarterLog.slot1_4:20-09-02_09:39:37  IO: Failed to read packet header
>>> StarterLog.slot1_4:20-09-02_09:39:37  Lost connection to shadow, waiting 2400 secs for reconnect
>>> StarterLog.slot1_4:20-09-02_09:39:37  All jobs have exited... starter exiting
>>> StarterLog.slot1_4:20-09-02_09:39:37  **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0
>>>
>>> /var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
>>> it contains other active dir_* entries.
>>> /var is mounted read-only and almost empty.
>>> On the submit node, /var/lib/condor/execute is empty, and apparently always has been.
>>>
>>> Any suggestion how to debug this further?
>>>
>>> Thanks,
>>>  Steffen
>>>
>>> --
>>> Steffen Grunewald, Cluster Administrator
>>> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
>>> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
>>> ~~~
>>> Fon: +49-331-567 7274
>>> Mail: steffen.grunewald(at)aei.mpg.de
>>> ~~~
>>
>>
>>
>> --
>> Mark Coatsworth
>> Systems Programmer
>> Center for High Throughput Computing
>> Department of Computer Sciences
>> University of Wisconsin-Madison
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
