Re: [Condor-users] jobs removed automatically by dags?
- Date: Wed, 18 Mar 2009 10:28:20 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs removed automatically by dags?
On Wed, 18 Mar 2009, Carsten Aulbert wrote:
Things are getting more mysterious. A set of my jobs (the set of DAGs
from the previous email) was hindered by a flaky schedd:
007 (10293330.000.000) 03/17 23:27:45 Shadow exception!
Failed to connect to schedd!
1161 - Run Bytes Sent By Job
6592404 - Run Bytes Received By Job
...
007 (10293318.000.000) 03/17 23:27:45 Shadow exception!
Failed to connect to schedd!
1161 - Run Bytes Sent By Job
6592404 - Run Bytes Received By Job
...
These I understand, and they will probably be restartable via the rescue
DAGs. However, a few minutes later, when I was not near the machines (and
neither was the other admin who has the rights), this happened:
009 (10293330.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293324.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293318.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293342.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293336.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293348.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
009 (10293055.000.000) 03/17 23:41:07 Job was aborted by the user.
via condor_rm (by user carsten)
...
Will DAGMan condor_rm jobs on its own?
Under certain circumstances, yes. If you condor_rm the DAGMan job, it
will condor_rm its node jobs. I wonder if it's possible that the schedd
problems caused the DAGMan job to get condor_rm'ed. You can find out by
looking at the dagman.out file: if DAGMan was condor_rm'ed, you should see
something like this:
3/18 10:26:53 Received SIGUSR1
3/18 10:26:53 Aborting DAG...
3/18 10:26:53 Writing Rescue DAG to dag_files/diamond.dag.rescue001...
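If you have many dagman.out files to check, the search can be scripted. The following is only a minimal sketch: the marker strings are copied from the sample lines above and may be worded differently in other Condor versions, and the function name find_removal_lines is made up for illustration.

```python
# Marker strings copied from the dagman.out sample above. Assumption:
# other Condor versions may word these messages differently.
REMOVAL_MARKERS = ("Received SIGUSR1", "Aborting DAG")


def find_removal_lines(dagman_out_text):
    """Return the lines of a dagman.out text that contain a removal marker."""
    return [
        line.strip()
        for line in dagman_out_text.splitlines()
        if any(marker in line for marker in REMOVAL_MARKERS)
    ]


# Sample input: the three lines quoted above.
sample = """3/18 10:26:53 Received SIGUSR1
3/18 10:26:53 Aborting DAG...
3/18 10:26:53 Writing Rescue DAG to dag_files/diamond.dag.rescue001...
"""

hits = find_removal_lines(sample)
for line in hits:
    print(line)
```

For a real file, read its contents with open(path).read() and pass the result to the function. An empty result only means these particular strings were not found, not that DAGMan was definitely not removed.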
You could also look at the DAGMan job's .dagman.log file and see what it
says.
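To see at a glance which jobs a log records as aborted, and when, something along these lines might help. It is a rough sketch that assumes the log uses the same event layout as the excerpts quoted earlier in this message (a three-digit event code, the job ID in parentheses, then date and time), and that 009 marks the "Job was aborted by the user" event as it does there. The names EVENT_HEADER and abort_events are hypothetical.

```python
import re

# Event header layout inferred from the quoted log excerpts, e.g.
#   009 (10293330.000.000) 03/17 23:41:07 Job was aborted by the user.
# Assumption: the log being examined uses this same layout.
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def abort_events(log_text):
    """Return (cluster, date, time) for each 009 event header in the text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if match and match.group("code") == "009":
            events.append(
                (int(match.group("cluster")), match.group("date"), match.group("time"))
            )
    return events


# Sample input built from two of the events quoted earlier.
sample_log = """007 (10293330.000.000) 03/17 23:27:45 Shadow exception!
    Failed to connect to schedd!
...
009 (10293330.000.000) 03/17 23:41:07 Job was aborted by the user.
    via condor_rm (by user carsten)
...
"""

aborted = abort_events(sample_log)
for cluster, date, time in aborted:
    print(cluster, date, time)
```

Comparing these timestamps with the time of any "Aborting DAG" line in dagman.out should show whether the node jobs were removed as a consequence of DAGMan itself being removed.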
Kent Wenger
Condor Team