Hi,

I am using Condor to submit a large number of jobs to a set of CREAM and LCG CE execution hosts. With a freshly started Condor everything works fine, but after a few days of sustained submission, held jobs begin to pile up with the HoldReason given in the subject. Looking at the Globus logs on one execution machine, we found that the GRAM two-phase submission for these jobs never completes, so the Globus state file is removed. Still, Condor seems to keep asking for the status of these jobs even though they were never actually submitted, and Globus answers with "error 121".
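To see how the held jobs split across hold reasons, something like the following rough sketch could be used. It assumes the text it is given looks like the output of `condor_q -long` (one `Attribute = value` pair per line, with a blank line between jobs) and relies on JobStatus 5 meaning "held". The function names and the sample ads are made up for illustration and are not taken from a real queue.

```python
from collections import Counter

HELD = 5  # JobStatus value that Condor uses for held jobs


def parse_classads(text):
    """Split `condor_q -long` style text into one dict per job ad."""
    ads = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current ad, if any.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            # Drop the surrounding quotes of string attributes.
            current[name] = value.strip('"')
    if current:
        ads.append(current)
    return ads


def count_hold_reasons(text):
    """Count held jobs per HoldReason string."""
    counts = Counter()
    for ad in parse_classads(text):
        if ad.get("JobStatus") == str(HELD):
            counts[ad.get("HoldReason", "(no HoldReason)")] += 1
    return counts


# Hypothetical sample ads, only to show the expected input shape.
SAMPLE = """\
ClusterId = 101
JobStatus = 5
HoldReason = "hypothetical reason A"

ClusterId = 102
JobStatus = 2

ClusterId = 103
JobStatus = 5
HoldReason = "hypothetical reason A"
"""

if __name__ == "__main__":
    for reason, n in count_hold_reasons(SAMPLE).most_common():
        print(n, reason)
```

On a real submit host the input would come from saving the `condor_q -long` output to a file and passing its contents to `count_hold_reasons`. Running this every few hours should show whether one hold reason grows steadily over time.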
The machine where Condor runs is a virtual machine with one CPU and 2 GB of RAM. The load average is always above 3, and most of the CPU is taken by the processes below (the percentages change, but these are always the top 4 processes):
top - 11:15:20 up 5 days, 22:06, 1 user, load average: 3.77, 3.36, 3.13
Tasks: 106 total, 4 running, 101 sleeping, 0 stopped, 1 zombie
Cpu(s): 55.5%us, 44.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2058304k total, 1945344k used, 112960k free, 92836k buffers
Swap: 2000368k total, 64k used, 2000304k free, 434464k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
21591 rebatto  17   0 52384  15m 3244 R 80.4  0.8 638:38.04 gahp_server
21571 rebatto  18   0  154m 126m 3696 R 14.0  6.3 538:42.25 condor_gridmana
21607 rebatto  18   0  244m  63m 2928 S  5.0  3.2  26:26.13 cream_gahp
21600 rebatto  15   0 54400  17m 3248 S  0.3  0.9 786:30.39 gahp_server
[...]

The average number of jobs managed by Condor is ~5000.

My only guess at the moment is that the gahp_server (or the grid_manager) cannot cope with all the submissions, because of either CPU or network limitations. Still, I'd like to have an opinion from more experienced users before asking the system managers for a bigger machine...
Thanks for any hint you can give me.

--
David Rebatto
I.N.F.N. - Sezione di Milano
Via Celoria, 16 - 20133 Milano ITALY
tel: +39 02503.17623
e-mail: David.Rebatto@xxxxxxxxxx
URL: http://www.mi.infn.it/~rebatto

"There are 10 kinds of people in the world: those who understand binary and those who don't..."