Re: [Condor-users] Inspiral dags die in morgane for unexplained reasons
- Date: Tue, 1 Apr 2008 19:06:53 +0200 (CEST)
- From: Lucia Santamaria <lucia.santamaria@xxxxxxxxxx>
- Subject: Re: [Condor-users] Inspiral dags die in morgane for unexplained reasons
Thank you so much for your prompt reply, Kent,
Does this mean that the node job user logs *are* on NFS? If that's the
case, is it possible to move them to a local file system? I'm not *sure*
I will ask Steffen about this possibility.
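One way to check whether a given log file lives on NFS is to look up its path in the mount table. The following is only a minimal sketch, assuming a Linux submit machine where the mount table is readable at /proc/mounts. The function names and the sample mount table are hypothetical, and the path must already have any symlinks or automounter indirection resolved.

```python
import os

def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    Symlinks are NOT resolved here; pass an already-resolved path.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Match on whole path components so /home does not match /homework.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best

def is_on_nfs(path, mounts_text):
    """True if the file system holding path has a type starting with 'nfs'."""
    match = fs_type_for(path, mounts_text)
    return match is not None and match[1].startswith("nfs")

# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """\
rootfs / ext3 rw 0 0
fileserver:/export/home /home nfs rw 0 0
/dev/sda2 /local ext3 rw 0 0
"""

print(is_on_nfs("/home/someuser/run/job.log", SAMPLE_MOUNTS))
print(is_on_nfs("/local/someuser/run/job.log", SAMPLE_MOUNTS))
```

On a real machine, the text read from /proc/mounts would replace SAMPLE_MOUNTS, and the path would be that of the node job user log named in the submit files.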
Hmm -- are you running more than one instance of the same DAG at a time?
That will almost certainly cause problems. Also, even if you are not
I'm not sure, I hope there's some ihope expert that can answer this.
but I'd tend to think so...
Could you send the dagman.out file corresponding to this run? That is
generally the first place to look when DAGMan has a problem.
Yes, you can have a look at it here:
http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out
If you can also send the DAG file itself, and the entire user log file for
the node jobs, that would help diagnose things.
Yes, sorry, I tried to attach them, but they were too big for the mailing
list.
* The dag:
http://pandora.aei.mpg.de/~lucia/ihope.dag
* The log file
http://pandora.aei.mpg.de/~lucia/error_log
Also: LATEST NEWS:
The rescue dag that I get now dies with this signal:
----------------
This is an automated email from the Condor system
on machine "deepthought.merlin2.aei.mpg.de". Do not reply.
Your condor job was killed by signal 11.
Job: /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.rescue.lock
-Condorlog /.auto/home/lucia/playground_20080314/857232370-859651570/playground/inspiral_hipe_playground.PLAYGROUND.dag.dagman.log
-Dag ihope.dag.rescue -Rescue ihope.dag.rescue.rescue
------------------
This is _not_ new behaviour: I moved from morgane to the deepthought
node because _every_ dag submitted from morgane would get killed with this
signal 11. We thought that submitting from deepthought solved this problem
since it has more RAM than morgane (and sig 11 is known to have something
to do with memory requirements).
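On Linux, signal 11 is SIGSEGV, a segmentation fault: the process touched memory it was not allowed to access. That can follow from running out of memory, but it can equally come from a bug, so more RAM does not necessarily cure it. A minimal sketch for looking up a signal number, assuming Python 3 on a Linux machine (the helper name is hypothetical):

```python
import signal

def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return "unknown signal %d" % number

print(describe_signal(11))
```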
But this is the first time that I have got a sig 11 from deepthought,
and I have already submitted this dag with small variations ~5 times.
Note that I never got sig 11 from deepthought when I used 'standard'
universe. This issue I'm showing now is in vanilla.
After I get 5 such sig 11 emails from Condor master, the queue looks like
this:
-----------------------
lucia@deepthought:~/playground_20080314/857232370-859651570$ condor_q lucia
-- Submitter: deepthought.merlin2.aei.mpg.de : <10.100.200.92:60979> : deepthought.merlin2.aei.mpg.de
 ID        OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE   CMD
107014.0   lucia  4/1 18:25   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107015.0   lucia  4/1 18:25   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107016.0   lucia  4/1 18:25   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107017.0   lucia  4/1 18:25   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107018.0   lucia  4/1 18:25   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107021.0   lucia  4/1 18:26   0+00:36:09   R   0    317.4  lalapps_tmpltbank
107024.0   lucia  4/1 18:26   0+00:35:47   R   0    317.4  lalapps_tmpltbank
(...)
56 jobs; 0 idle, 56 running, 0 held
-----------------------
I have never seen this before: there is no "condor_dagman -f ..." job at
the beginning of the queue...
I'm sorry this is getting more complicated by the minute.
Thanks again for any help,
Lucia
Any insight in what might be causing this problem is much appreciated.
If I can get a look at the dagman.out file, that should help a lot.
Kent Wenger
Condor Team
--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------