
[Condor-users] dead jobs (even remove doesn't work)

Date: Wed, 08 Dec 2004 12:56:46 +0100
From: Thomas Lisson <lisson@xxxxxxxxxxxxxxxxx>
Subject: [Condor-users] dead jobs (even remove doesn't work)

Hello,

I submitted a DAGMan job with about 1000 nodes. All nodes were queued properly, but 2 of the jobs hang. If I hold and then release them, they stay idle forever. If I remove them, they stay in the queue marked as removed. Even a condor_restart doesn't help; only a reboot of that machine does.

My ShadowLog has some authentication errors:

12/7 19:47:37 ******************************************************
12/7 19:47:37 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/7 19:47:37 ** $CondorVersion: 6.6.5 May 3 2004 $
12/7 19:47:37 ** $CondorPlatform: I386-LINUX-RH9 $
12/7 19:47:37 ** PID = 10166
12/7 19:47:37 ******************************************************
12/7 19:47:37 Using config file: /opt/condor//condor_config
12/7 19:47:37 Using local config files: /opt/condor/etc/condor_config.local
12/7 19:47:37 DaemonCore: Command Socket at <134.130.4.77:9688>
12/7 19:47:38 (2689.0) (9690): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:38 Initializing a VANILLA shadow
12/7 19:47:38 (2700.0) (10166): Request to run on <137.226.70.92:9615> was ACCEPTED
12/7 19:47:40 (2660.0) (9097): condor_write(): Socket closed when trying to write buffer
12/7 19:47:40 (2660.0) (9097): Buf::write(): condor_write() failed
12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
12/7 19:47:40 (2660.0) (9097): Authentication Error AUTHENTICATE:1002:Failure performing handshake
12/7 19:47:40 (2660.0) (9097): Failed to update job queue!
12/7 19:47:40 (2660.0) (9097): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:41 (2675.0) (9596): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

ScheddLog:

12/7 19:47:40 DC_AUTHENTICATE: attempt to open invalid session condor1:2339:1102444850:4316, failing.
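
With around 1000 nodes in the queue, it may help to group the ShadowLog entries by job ID to see which jobs hit the handshake failure. The following is only a minimal sketch in Python, not a Condor tool. It assumes one log entry per line in the shape "date time (cluster.proc) (pid): message", as in the job-tagged ShadowLog lines in this post. The function name `jobs_with_failed_handshake` and the pattern `LINE_RE` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed entry shape: "MM/DD HH:MM:SS (cluster.proc) (pid): message".
# Entries without a job ID (startup banner, config lines) do not match.
LINE_RE = re.compile(
    r"^(?P<ts>\d+/\d+ \d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)


def jobs_with_failed_handshake(lines):
    """Return {job ID: [messages]} for jobs that logged a handshake failure."""
    by_job = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            by_job[match.group("job")].append(match.group("msg"))
    return {
        job: msgs
        for job, msgs in by_job.items()
        if any("handshake failed" in msg for msg in msgs)
    }


# A few entries copied from the ShadowLog excerpt in this post.
sample = [
    "12/7 19:47:38 (2689.0) (9690): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100",
    "12/7 19:47:38 Initializing a VANILLA shadow",
    "12/7 19:47:40 (2660.0) (9097): condor_write(): Socket closed when trying to write buffer",
    "12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!",
    "12/7 19:47:40 (2660.0) (9097): Failed to update job queue!",
]

failed = jobs_with_failed_handshake(sample)
for job, msgs in failed.items():
    print(job, len(msgs))
```

On a real log, the same function could be fed the open file object for the ShadowLog, since iterating over a file yields one line at a time.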

What can I do to ensure that all jobs are executed, or that jobs that seem to hang are restarted? Each job takes about 20-40 minutes.

In my pool, one machine acts as both submitter and master; all the other machines are execute-only.

Thanks in Advance
Thomas Lisson
RWTH-Grid
