Re: [Condor-users] Why does machine reject job for unknown reasons
- Date: Tue, 15 May 2007 15:16:55 +0100
- From: "Kewley, J (John)" <j.kewley@xxxxxxxx>
- Subject: Re: [Condor-users] Why does machine reject job for unknown reasons

I would suggest looking at the log files on the submission node and the central manager (Condor gurus will be more specific about exactly where to look).

My (automatic these days) first response is to ensure that there are no firewalls between the submission node and any of the prospective execute nodes. If there are, check that the appropriate fixed and ephemeral ports are open for both UDP and TCP.

This scenario, where jobs match to a machine and then never get there, can also be caused by NATs, which cause similar connection problems. Both of the above would cause "evidence" to appear in the log files.

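A quick way to rule the firewall case in or out is to try a plain TCP connection from the submission node to each execute node. The following is only a minimal sketch: the host name and the port number 9618 in the usage comment are placeholders I have assumed, not values from this thread, and it does not test UDP or the ephemeral port range.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to True (reachable) or False (not reachable)."""
    return {(host, port): tcp_port_open(host, port, timeout) for host, port in targets}


# Example usage (hypothetical host and port, substitute your own):
#   results = check_targets([("execute-node.example.org", 9618)])
#   for (host, port), ok in results.items():
#       print(host, port, "open" if ok else "blocked or closed")
```

A result of False only says that the connection failed from where the script was run; the log files are still the place to find out why.
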
Another problem might be that the job cannot start on the machine because of file transfer or remote filestore issues (although I can't recall whether the symptoms would be the same). Again, the log files would give useful hints as to what was happening.

Do you have other jobs running OK in the pool? If so, what is different about this one? If not, then I'd suggest running a more trivial job (like /bin/hostname or equivalent).

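For the trivial test job, something along these lines would do. It is a sketch that writes a minimal vanilla-universe submit description for /bin/hostname; the file names (hostname_test.*) are made up for illustration, and your pool may need extra settings such as file-transfer options.

```python
def make_submit_description(executable="/bin/hostname", basename="hostname_test"):
    """Build a minimal Condor submit description for a trivial test job."""
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        f"output     = {basename}.out",  # stdout of the job
        f"error      = {basename}.err",  # stderr of the job
        f"log        = {basename}.log",  # user log with job events
        "queue",
    ]
    return "\n".join(lines) + "\n"


def write_submit_file(path="hostname_test.sub", **kwargs):
    """Write the submit description to path and return the path."""
    with open(path, "w") as handle:
        handle.write(make_submit_description(**kwargs))
    return path


# Example usage (hypothetical file name):
#   write_submit_file("hostname_test.sub")
#   then, on the submission node:  condor_submit hostname_test.sub
```

If this job runs and the real one does not, the difference between the two submit files is the first place to look.
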
BTW, this group is for users, so we don't always have time to respond to queries. While it is often the Condor team themselves who answer, quite often it is fellow users.

Cheers

JK

Hi,

sorry to bother you again with my question, but this problem still persists. So far I have received no ideas on how to find out why Condor jobs are rejected ...

Cheers
Alex

On 5/14/07, Alexander Dietz <Alexander.Dietz@xxxxxxxxxxxxxx> wrote:

Hi,

thanks for this suggestion, but the output really does not help me further (see below). It looks like 150 machines are good to run the jobs on, but still they are rejected for unknown reasons! I need them to start immediately because of a time-limited online demonstration of the work I am doing.

Any other suggestions?

Cheers
Alex

> condor_q -better-analyze 1082109.0

1082109.000:  Run analysis summary.  Of 152 machines,
      2 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
    150 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )

    Condition                                                                        Machines Matched  Suggestion
    ---------                                                                        ----------------  ----------
1   ( target.Disk >= 10000 )                                                         150
2   ( target.Arch == "X86_64" )                                                      152
3   ( target.OpSys == "LINUX" )                                                      152
4   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )       152
5   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )    152
6   ( ( 1024 * target.Memory ) >= 10000 )                                            152
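
When the same analysis has to be repeated for many jobs, the summary counts can be tallied with a small script. This is a rough sketch that assumes the summary lines look exactly like the ones pasted above (a count followed by text starting with "are", "match" or "reject"); the function name is my own, and other Condor versions may print the summary differently.

```python
import re

SAMPLE_SUMMARY = """\
1082109.000:  Run analysis summary.  Of 152 machines,
      2 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
    150 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_analysis_summary(text):
    """Return (total_machines, {reason: count}) parsed from a summary block."""
    total_match = re.search(r"Of\s+(\d+)\s+machines", text)
    total = int(total_match.group(1)) if total_match else None

    counts = {}
    for line in text.splitlines():
        # Only summary lines: "<count> are ...", "<count> match ...", "<count> reject ..."
        m = re.match(r"\s*(\d+)\s+((?:are|match|reject)\b.*\S)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts


if __name__ == "__main__":
    total, counts = parse_analysis_summary(SAMPLE_SUMMARY)
    for reason, count in counts.items():
        print(f"{count:5d}  {reason}")
    print("total machines:", total, "accounted for:", sum(counts.values()))
```

In the output above the six categories add up to the 152 machines in the pool, so every machine falls into exactly one of them, and 150 of them sit in the "unknown reasons" bucket.
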