[Condor-users] dead jobs (even remove doesn't work)
- Date: Wed, 08 Dec 2004 12:56:46 +0100
- From: Thomas Lisson <lisson@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] dead jobs (even remove doesn't work)
Hello,
I submitted a DAGMan job with about 1000 nodes. All of them were queued
properly, but two of those jobs hang. If I hold them and then release them,
they stay idle forever. If I remove them, they stay marked as removed in the
queue. Even a condor_restart doesn't help; only a reboot of that machine does.
My ShadowLog has some authentication errors:
12/7 19:47:37 ******************************************************
12/7 19:47:37 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/7 19:47:37 ** $CondorVersion: 6.6.5 May 3 2004 $
12/7 19:47:37 ** $CondorPlatform: I386-LINUX-RH9 $
12/7 19:47:37 ** PID = 10166
12/7 19:47:37 ******************************************************
12/7 19:47:37 Using config file: /opt/condor//condor_config
12/7 19:47:37 Using local config files: /opt/condor/etc/condor_config.local
12/7 19:47:37 DaemonCore: Command Socket at <134.130.4.77:9688>
12/7 19:47:38 (2689.0) (9690): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:38 Initializing a VANILLA shadow
12/7 19:47:38 (2700.0) (10166): Request to run on <137.226.70.92:9615> was ACCEPTED
12/7 19:47:40 (2660.0) (9097): condor_write(): Socket closed when trying to write buffer
12/7 19:47:40 (2660.0) (9097): Buf::write(): condor_write() failed
12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
12/7 19:47:40 (2660.0) (9097): Authentication Error AUTHENTICATE:1002:Failure performing handshake
12/7 19:47:40 (2660.0) (9097): Failed to update job queue!
12/7 19:47:40 (2660.0) (9097): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
12/7 19:47:41 (2675.0) (9596): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
SchedLog:
12/7 19:47:40 DC_AUTHENTICATE: attempt to open invalid session condor1:2339:1102444850:4316, failing.
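To see which jobs are affected, the per-job ShadowLog lines can be filtered by job ID. The Python sketch below is illustrative only and is not part of Condor. It assumes the line layout seen in the excerpt (timestamp, job ID in parentheses, shadow PID in parentheses, then the message). The names failing_jobs, LINE_RE and FAILURE_MARKERS are made up for this example, and the marker strings are simply copied from the log lines shown.

```python
import re
from collections import defaultdict

# Assumed layout of a per-job shadow line, based on the excerpt:
#   12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
LINE_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)

# Substrings copied from the log excerpt; extend as needed.
FAILURE_MARKERS = ("handshake failed", "Failed to update job queue")


def failing_jobs(lines):
    """Return {job_id: [(timestamp, message), ...]} for lines with a failure marker."""
    hits = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and any(marker in m["msg"] for marker in FAILURE_MARKERS):
            hits[m["job"]].append((m["stamp"], m["msg"]))
    return dict(hits)


# A few lines taken from the ShadowLog excerpt, used as sample input.
SAMPLE = """\
12/7 19:47:38 (2700.0) (10166): Request to run on <137.226.70.92:9615> was ACCEPTED
12/7 19:47:40 (2660.0) (9097): Buf::write(): condor_write() failed
12/7 19:47:40 (2660.0) (9097): AUTHENTICATE: handshake failed!
12/7 19:47:40 (2660.0) (9097): Failed to update job queue!
""".splitlines()

if __name__ == "__main__":
    for job, events in failing_jobs(SAMPLE).items():
        print(job, len(events))
```

On a real log, the same function could be called with an open file object instead of SAMPLE. Lines without a job ID prefix, such as the startup banner, are skipped.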
What can I do to ensure that all jobs are executed, or that jobs that
seem to hang are restarted? Every job takes about 20-40 minutes.
In my pool there is one machine that is both submitter and master; all the
other machines are execute-only.
Thanks in advance,
Thomas Lisson
RWTH-Grid