
RE: [Condor-users] Condor job submission delayed



I have one client (Windows XP) for running jobs, one master (Linux, RH9) for control, and one machine (Windows XP) for submitting jobs, and I'm seeing this long delay as well, so I don't think it's a load issue. Hopefully one of the Condor team members has some insight into the issue...

Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of John Wheez
Sent: September 1, 2004 2:04 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condor job submission delayed


Hi,

 Funny, I was just going to post about this same problem. I have the same 
setup as Marc Saric described and I have the same problem as he 
described. Mainly, sometimes it takes a long time for the pool controller 
to send out a new job to a CPU that has recently finished. I see this 
often in my dual processor machines as well.

I have Windows processing nodes and a Linux master controller. I do have 
lots of cluster based node submissions all being submitted by my pool 
master. I wonder if this is the problem. The fact that the pool master 
is also the one that submits the jobs might be overloading the pool 
master somehow, since it is also keeping track of all the jobs? I wonder 
if making my client machines submit the jobs will help clear up the 
problem. I'll try this today...

JW



Ian Chesal wrote:

>I never saw an answer to this question. Did one get proffered off the 
>list? Could you please cross post it if that is the case. I too am 
>curious about this delay as I'm seeing this in my flock of Windows XP 
>machines.
>
>
>Thanks!
>Ian
>
>-----Original Message-----
>From: condor-users-bounces@xxxxxxxxxxx 
>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Marc Saric
>Sent: August 31, 2004 10:26 AM
>To: condor-users@xxxxxxxxxxx
>Subject: [Condor-users] Condor job submission delayed
>
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Hi all,
>
>I am experimenting with a small Condor cluster (Condor 6.6.6, mostly on 
>Windows-boxes unfortunately) as you can see from my various beginner's 
>mails popping up in the forum.
>
>I have set up a bunch of Windows-machines (Win2k SP6 and WinXP Pro SP1) 
>and a central Linux-Master-Server.
>
>Submission of jobs works in principle (tested it with the 
>hello-world-examples from http://www.liv.ac.uk/e-science/condor/hello.html),
>but sometimes I observe a strange behaviour in that certain jobs need a very long time until they are being executed.
>
>This happens while most of the machines are not busy and are listed as 
>available (15 min no user + low CPU-utilization).
>
>"condor_status" gives something like:
>
>saric@u-191-srv2:~/tmp> condor_status
>
>Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
>
>u-191-srv2.pr LINUX       INTEL  Unclaimed  Idle       0.010  1004  0+01:52:13
>u-099-cpc-esi WINNT50     INTEL  Owner      Idle       0.240   512  0+01:16:34
>vm1@u-099-csr WINNT50     INTEL  Claimed    Busy       0.000  1024  0+00:10:56
>vm2@u-099-csr WINNT50     INTEL  Unclaimed  Idle       0.000  1024  0+01:43:03
>u-099-cbb1    WINNT51     INTEL  Unclaimed  Idle       0.000   511  0+01:46:27
>u-099-cnb2    WINNT51     INTEL  Owner      Idle       0.020   511  0+04:31:59
>u-099-cpc-sek WINNT51     INTEL  Owner      Idle       0.040   512  0+00:10:14
>u-099-cpc1    WINNT51     INTEL  Owner      Idle       0.000   512  0+00:06:20
>u-099-cpc2    WINNT51     INTEL  Owner      Idle       0.030   512  0+00:01:20
>u-099-cpc3    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:06:21
>u-099-cpc4    WINNT51     INTEL  Owner      Idle      -0.010   512  0+04:57:30
>u-099-cpc5    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:31:21
>
>so there are at least 4 unclaimed machines in the pool which should 
>match requirements ((OpSys == "WINNT50") || (OpSys == "WINNT51")).
>
>The result of a "condor_q -analyze" takes quite a long time and gives 
>back something like:
>
>045.000:  Run analysis summary.  Of 12 machines,
>~      1 are rejected by your job's requirements
>~      6 reject your job because of their own requirements
>~      0 match, but are serving users with a better priority in the pool
>~      4 match, match, but reject the job for unknown reasons
>~      1 match, but will not currently preempt their existing job
>~      0 are available to run your job
>
>I can't see why the 4 should reject for unknown reasons. Is there any 
>place I could look to find out these unknown reasons 
>(system log, local Condor log on the machines?).
>
>Thanks in advance!
>
>- --
>Bye,
>Marc Saric
>
>Dr. Marc Saric, Bioinformatik, Proteom Centrum Tübingen,
>Auf der Morgenstelle 15, D-72076 Tübingen, Germany,
>Tel: +49 (0)7071 29 70557, marc.saric@xxxxxxxxxxxxxxxx 
>http://www.proteom-centrum-tuebingen.de
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.2.4 (GNU/Linux)
>Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
>iD8DBQFBNIqQBLD6PjSWyL4RAlKLAJ4l64RE870+vfqESQJL5Cz5oMSGjQCbBmA6
>WLrzxNGTr1sGB3oJv4bDW48=
>=nKWt
>-----END PGP SIGNATURE----- 
>_______________________________________________
>Condor-users mailing list
>Condor-users@xxxxxxxxxxx http://lists.cs.wisc.edu/mailman/listinfo/condor-users

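As a cross-check of the numbers in Marc's message, the condor_status listing can be tallied against the "condor_q -analyze" summary. The sketch below is only an illustration: the rows are transcribed by hand from the listing in the post, the function and key names are made up for this example, and the mapping of the Owner and Claimed states onto the lines of the analysis summary is an assumption, not something the thread or the Condor output confirms.

```python
# Rows transcribed from the condor_status listing in the quoted post:
# (name, OpSys, State). Only the columns needed for the tally are kept.
SLOTS = [
    ("u-191-srv2.pr", "LINUX", "Unclaimed"),
    ("u-099-cpc-esi", "WINNT50", "Owner"),
    ("vm1@u-099-csr", "WINNT50", "Claimed"),
    ("vm2@u-099-csr", "WINNT50", "Unclaimed"),
    ("u-099-cbb1", "WINNT51", "Unclaimed"),
    ("u-099-cnb2", "WINNT51", "Owner"),
    ("u-099-cpc-sek", "WINNT51", "Owner"),
    ("u-099-cpc1", "WINNT51", "Owner"),
    ("u-099-cpc2", "WINNT51", "Owner"),
    ("u-099-cpc3", "WINNT51", "Unclaimed"),
    ("u-099-cpc4", "WINNT51", "Owner"),
    ("u-099-cpc5", "WINNT51", "Unclaimed"),
]


def job_requirements(opsys):
    """The job's requirements as stated in the post: WINNT50 or WINNT51."""
    return opsys in ("WINNT50", "WINNT51")


def tally(slots):
    """Group slots by OpSys mismatch first, then by machine state.

    The bucket names are hypothetical labels for this sketch only.
    """
    counts = {"wrong_opsys": 0, "owner": 0, "claimed": 0, "unclaimed_match": 0}
    for _name, opsys, state in slots:
        if not job_requirements(opsys):
            counts["wrong_opsys"] += 1
        elif state == "Owner":
            counts["owner"] += 1
        elif state == "Claimed":
            counts["claimed"] += 1
        elif state == "Unclaimed":
            counts["unclaimed_match"] += 1
    return counts


if __name__ == "__main__":
    print(tally(SLOTS))
```

Under these assumptions the tally is 1 / 6 / 1 / 4, which lines up with the summary's 1 machine rejected by the job's requirements, 6 rejecting the job because of their own requirements, 1 that will not preempt its existing job, and 4 that match but reject for unknown reasons. So the 4 unexplained rejections appear to be exactly the 4 unclaimed Windows machines; this does not explain why they reject the job.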
