[Condor-users] ' --- ???? ---' in v6.7.8 WAS Hawkeye module and condor_q problems in condor-6.6.6
- Date: Tue, 21 Jun 2005 09:07:50 -0500 (CDT)
- From: Chris Green <greenc@xxxxxxxx>
- Subject: [Condor-users] ' --- ???? ---' in v6.7.8 WAS Hawkeye module and condor_q problems in condor-6.6.6
Hi,
I'd be grateful if someone could respond further on the two issues I
raised in the parent thread:
1) The "hawkeye modules stuck running" problem:
Where is the lock file or accounting information that registers a
hawkeye module as running even when, according to "ps", it is not? How can I
re-synchronize this information with reality short of restarting
condor_startd?
2) The "condor_q shows garbage" problem:
Yesterday I upgraded our system to condor v6.7.8, and this problem still
manifests on a linux-glibc2.3 system. We have a queue with
~6400 jobs on it, all marked simply with
--- ???? ---
according to condor_q (no job information is shown, although condor_q -long
works just fine). According to my users this has been an
intermittent problem for some time, but it appears to have become worse since I
started running the attached hawkeye module. In addition, in the current
case the jobs will not start at all, for no reason that I can find.
"condor_hold -all" and similar commands fail with errors like:
Could not hold all jobs.
Pointers would be appreciated. At least one member of the condor team (Peter
Couvares) has login privileges to our machines (e.g. maxwell.fnal.gov, our
queue manager) if a first-hand look would be helpful.
Thanks for any help,
Chris.
On Mon, 6 Jun 2005, Chris Green wrote:
Hi,
The problem is that I can't find any evidence outside of this message (with
ps, for example) that this process really is running! So, my job is not being
run every five minutes like it's supposed to and the values it publishes are
never being updated. What I need is a way to "unstick" the startd so that its
idea of whether a hawkeye module is still running re-aligns itself with
reality.
Thanks,
Chris.
Hope this helps. :-)
-Nick
--
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867
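#!/usr/bin/perl -w
use strict;

# Obtain the directory containing this script and add it to the module
# search path, so the Hawkeye libraries next to it can be found.
BEGIN
{
    my $Dir = $0;
    if ( $Dir =~ /(.*)\/.*/ )
    {
        push @INC, $1;
    }
}

# Include Hawkeye support libraries
use HawkeyePublish;
use HawkeyeLib;

# Select idle jobs (JobStatus == 1) that report a non-zero ImageSize.
my $command_options = "-constraint 'JobStatus == 1 && ImageSize > 0.0'";

# \$BIN is left unexpanded here so that the shell substitutes the BIN
# environment variable when the command runs.
my @jobs_to_fix = `\$BIN/condor_q $command_options 2>/dev/null`;

# Keep only the leading cluster.proc job ID from each matching output line.
@jobs_to_fix = map { /^\s*(\d+\.\d+)/ ? $1 : () } @jobs_to_fix;

if (@jobs_to_fix) {
    print STDERR "Fixing jobs: ", join(", ", @jobs_to_fix), "\n";
    # Reset ImageSize to 0.0 on every job matching the constraint.
    system("\$BIN/condor_qedit $command_options ImageSize 0.0");
}

# Return true
1;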