[Condor-users] Condor job submission delayed
- Date: Tue, 31 Aug 2004 16:26:24 +0200
- From: Marc Saric <marc.saric@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor job submission delayed
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi all,
I am experimenting with a small Condor cluster (Condor 6.6.6, mostly on
Windows boxes, unfortunately), as you can probably tell from my various
beginner's mails popping up in the forum.
I have set up a bunch of Windows machines (Win2k SP6 and WinXP Pro SP1)
and a central Linux master server.
Job submission works in principle (I tested it with the hello-world
examples from http://www.liv.ac.uk/e-science/condor/hello.html), but
sometimes I observe strange behaviour: certain jobs take a very long
time before they are executed.
This happens while most of the machines are not busy and are listed as
available (no user activity for 15 minutes and low CPU utilization).
"condor_status" gives something like:
saric@u-191-srv2:~/tmp> condor_status
Name           OpSys    Arch   State     Activity  LoadAv   Mem  ActvtyTime
u-191-srv2.pr  LINUX    INTEL  Unclaimed Idle       0.010  1004  0+01:52:13
u-099-cpc-esi  WINNT50  INTEL  Owner     Idle       0.240   512  0+01:16:34
vm1@u-099-csr  WINNT50  INTEL  Claimed   Busy       0.000  1024  0+00:10:56
vm2@u-099-csr  WINNT50  INTEL  Unclaimed Idle       0.000  1024  0+01:43:03
u-099-cbb1     WINNT51  INTEL  Unclaimed Idle       0.000   511  0+01:46:27
u-099-cnb2     WINNT51  INTEL  Owner     Idle       0.020   511  0+04:31:59
u-099-cpc-sek  WINNT51  INTEL  Owner     Idle       0.040   512  0+00:10:14
u-099-cpc1     WINNT51  INTEL  Owner     Idle       0.000   512  0+00:06:20
u-099-cpc2     WINNT51  INTEL  Owner     Idle       0.030   512  0+00:01:20
u-099-cpc3     WINNT51  INTEL  Unclaimed Idle       0.000   512  0+00:06:21
u-099-cpc4     WINNT51  INTEL  Owner     Idle      -0.010   512  0+04:57:30
u-099-cpc5     WINNT51  INTEL  Unclaimed Idle       0.000   512  0+00:31:21
So there are at least four unclaimed Windows machines in the pool that
should match the requirements ((OpSys == "WINNT50") || (OpSys == "WINNT51")).
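To make the job side of the match concrete, here is a minimal Python sketch that evaluates the job's Requirements expression against the OpSys and State columns copied from the condor_status output above. It is a plain-Python stand-in for ClassAd evaluation, not Condor's matchmaker, and the helper names `job_requirements` and `unclaimed_matches` are hypothetical.

```python
# Minimal sketch: evaluate the job's Requirements expression,
# ((OpSys == "WINNT50") || (OpSys == "WINNT51")), against the machines
# listed by condor_status. Plain Python, NOT Condor's ClassAd engine.

# (name, OpSys, State) copied from the condor_status output
MACHINES = [
    ("u-191-srv2.pr", "LINUX", "Unclaimed"),
    ("u-099-cpc-esi", "WINNT50", "Owner"),
    ("vm1@u-099-csr", "WINNT50", "Claimed"),
    ("vm2@u-099-csr", "WINNT50", "Unclaimed"),
    ("u-099-cbb1", "WINNT51", "Unclaimed"),
    ("u-099-cnb2", "WINNT51", "Owner"),
    ("u-099-cpc-sek", "WINNT51", "Owner"),
    ("u-099-cpc1", "WINNT51", "Owner"),
    ("u-099-cpc2", "WINNT51", "Owner"),
    ("u-099-cpc3", "WINNT51", "Unclaimed"),
    ("u-099-cpc4", "WINNT51", "Owner"),
    ("u-099-cpc5", "WINNT51", "Unclaimed"),
]


def job_requirements(opsys: str) -> bool:
    """Hypothetical stand-in for the job-side Requirements expression."""
    return opsys == "WINNT50" or opsys == "WINNT51"


def unclaimed_matches(machines):
    """Names of Unclaimed machines whose OpSys satisfies the job's requirements."""
    return [name for name, opsys, state in machines
            if state == "Unclaimed" and job_requirements(opsys)]


if __name__ == "__main__":
    hits = unclaimed_matches(MACHINES)
    print(len(hits), hits)
```

Only the job's side is modelled here; each machine also applies its own requirements (for example its START policy), which this sketch ignores.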
Running "condor_q -analyze" takes quite a long time and returns
something like:
045.000:  Run analysis summary.  Of 12 machines,
      1 are rejected by your job's requirements
      6 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      4 match, match, but reject the job for unknown reasons
      1 match, but will not currently preempt their existing job
      0 are available to run your job
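As a rough cross-check, the sketch below sorts the twelve machines into analysis buckets using only the condor_status columns. The mapping is an assumption and not Condor's actual logic: Owner state stands in for "reject your job because of their own requirements", and Claimed stands in for "will not currently preempt their existing job". The helper `classify` is hypothetical. Under these assumptions the counts come out as 1, 6, 1 and 4, which suggests (but does not prove) that the four "unknown reasons" machines are the four unclaimed Windows machines.

```python
from collections import Counter

# (OpSys, State) pairs copied from the condor_status output, in listed order
MACHINES = [
    ("LINUX", "Unclaimed"), ("WINNT50", "Owner"), ("WINNT50", "Claimed"),
    ("WINNT50", "Unclaimed"), ("WINNT51", "Unclaimed"), ("WINNT51", "Owner"),
    ("WINNT51", "Owner"), ("WINNT51", "Owner"), ("WINNT51", "Owner"),
    ("WINNT51", "Unclaimed"), ("WINNT51", "Owner"), ("WINNT51", "Unclaimed"),
]


def classify(opsys: str, state: str) -> str:
    """Assign a machine to one simplified condor_q -analyze bucket.

    Assumed mapping (hypothetical, not Condor's real analysis):
    Owner -> machine's own requirements reject the job,
    Claimed -> will not preempt its existing job.
    """
    if opsys not in ("WINNT50", "WINNT51"):
        return "rejected by job requirements"
    if state == "Owner":
        return "reject because of own requirements"
    if state == "Claimed":
        return "will not preempt existing job"
    return "match, but reported as unknown reasons"


if __name__ == "__main__":
    tally = Counter(classify(opsys, state) for opsys, state in MACHINES)
    for bucket, count in tally.items():
        print(count, bucket)
```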
I can't see why those four machines should reject the job for unknown
reasons. Is there anywhere I could look to find out what these reasons
are (system log, local Condor logs on the machines?)
Thanks in advance!
- --
Bye,
Marc Saric
Dr. Marc Saric, Bioinformatik, Proteom Centrum Tübingen,
Auf der Morgenstelle 15, D-72076 Tübingen, Germany,
Tel: +49 (0)7071 29 70557, marc.saric@xxxxxxxxxxxxxxxx
http://www.proteom-centrum-tuebingen.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFBNIqQBLD6PjSWyL4RAlKLAJ4l64RE870+vfqESQJL5Cz5oMSGjQCbBmA6
WLrzxNGTr1sGB3oJv4bDW48=
=nKWt
-----END PGP SIGNATURE-----