[Condor-users] Hawkeye module and condor_q problems in condor-6.6.6
- Date: Mon, 06 Jun 2005 15:13:50 -0500 (CDT)
- From: Chris Green <greenc@xxxxxxxx>
- Subject: [Condor-users] Hawkeye module and condor_q problems in condor-6.6.6
Hi,
I'm having problems with Hawkeye modules under Condor v6.6.6: sometimes I
get continuous 'Cron: Job 'blah' is still running!' messages, even though
I can't find the processes in the process list any more. Is there any way
to fix this short of bouncing the startd?
Second: we deal with some very large-footprint condor jobs in the vanilla 
universe. Most of it (static FORTRAN array space) gets swapped out, but in 
the event that a job gets killed on a machine, it will then never run on 
another machine because its ImageSize is greater than the (per-vm) memory 
available on the machine. I have been running a command:
condor_qedit -name lawrence -constraint \
'JobStatus == 1 && ImageSize > 0.0' \
ImageSize 0.0
which works, but condor_q then says:
-- Schedd: rockwell.fnal.gov : <131.225.52.131:32774>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 --- ???? ---
3034.0   jocelyn         6/6  11:23   0+03:45:52 R  0   1389.5 AnalysisFramework_
 --- ???? ---
 --- ???? ---
3037.0   jocelyn         6/6  11:23   0+03:45:29 R  0   1389.5 AnalysisFramework_
3040.0   jocelyn         6/6  11:23   0+03:45:22 R  0   1385.5 AnalysisFramework_
3041.0   jocelyn         6/6  11:23   0+03:45:18 R  0   1106.5 AnalysisFramework_
 --- ???? ---
3043.0   jocelyn         6/6  11:24   0+03:44:51 R  0   1108.5 AnalysisFramework_
3044.0   jocelyn         6/6  11:24   0+03:44:51 R  0   1106.5 AnalysisFramework_
 --- ???? ---
3048.0   jocelyn         6/6  11:24   0+03:45:03 R  0   1396.0 AnalysisFramework_
where the " --- ???? --- " lines represent the jobs that were edited (job
numbers 3033.0, 3035.0, 3036.0, 3042.0 and 3047.0 here). Is this a bug, or
did I do something wrong? Regardless, how do I fix or work around the problem?
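
For checking how many rows are affected after an edit, a minimal sketch along these lines could scan a saved copy of the condor_q listing. It assumes the row layout shown above (a "cluster.proc" job ID at the start of each readable row, and " --- ???? --- " for each unreadable one). The function name summarize_listing and the shortened SAMPLE text are illustrative only and are not part of Condor.

```python
import re

# A readable row starts with a "cluster.proc" job ID, e.g. "3034.0".
JOB_ROW = re.compile(r"^\s*(\d+)\.(\d+)\s")
# A row condor_q could not display, e.g. " --- ???? --- ".
UNREADABLE_ROW = re.compile(r"^\s*-+\s*\?+\s*-+\s*$")


def summarize_listing(text):
    """Return (readable_job_ids, unreadable_row_count) for condor_q output."""
    readable = []
    unreadable = 0
    for line in text.splitlines():
        match = JOB_ROW.match(line)
        if match:
            readable.append(f"{match.group(1)}.{match.group(2)}")
        elif UNREADABLE_ROW.match(line):
            unreadable += 1
    return readable, unreadable


# Shortened, hypothetical excerpt in the same layout as the listing above.
SAMPLE = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 --- ???? ---
3034.0   jocelyn         6/6  11:23   0+03:45:52 R  0   1389.5 AnalysisFramework_
 --- ???? ---
 --- ???? ---
3037.0   jocelyn         6/6  11:23   0+03:45:29 R  0   1389.5 AnalysisFramework_
"""

if __name__ == "__main__":
    job_ids, unreadable_rows = summarize_listing(SAMPLE)
    print("readable jobs:", job_ids)
    print("unreadable rows:", unreadable_rows)
```

Comparing the unreadable-row count with the number of jobs touched by condor_qedit gives a quick check that only the edited jobs are the ones being displayed as " --- ???? --- ".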
Thanks,
Chris.
--
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867