Subject: [Condor-users] Condor performance problem
Hello everyone:
I have noticed what might be a common performance problem in Condor. For example, when I submit 10 jobs to a Condor pool, not all 10 jobs start immediately, even though there are more than enough idle nodes at that moment (50 nodes). Only some of them start right away, while the others start later. Here is one log:
***************************************************
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.001.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.002.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.003.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.005.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.007.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.008.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.009.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
001 (126.001.000) 01/12 16:20:52 Job executing on host: <128.16.13.27:33551>
...
001 (126.002.000) 01/12 16:20:54 Job executing on host: <128.16.13.42:33469>
...
001 (126.003.000) 01/12 16:20:56 Job executing on host: <128.16.9.23:33266>
...
005 (126.000.000) 01/12 16:20:58 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:02, Sys 0 00:00:05 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:02, Sys 0 00:00:05 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.006.000) 01/12 16:21:00 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
001 (126.005.000) 01/12 16:21:00 Job executing on host: <128.16.13.38:34509>
...
001 (126.007.000) 01/12 16:21:03 Job executing on host: <128.16.13.34:33987>
...
005 (126.001.000) 01/12 16:21:03 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
001 (126.008.000) 01/12 16:21:04 Job executing on host: <128.16.13.37:33208>
...
005 (126.002.000) 01/12 16:21:04 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.003.000) 01/12 16:21:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:02, Sys 0 00:00:05 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:02, Sys 0 00:00:05 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
001 (126.009.000) 01/12 16:21:07 Job executing on host: <128.16.13.28:34609>
...
001 (126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
...
005 (126.005.000) 01/12 16:21:10 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.007.000) 01/12 16:21:13 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.008.000) 01/12 16:21:14 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.009.000) 01/12 16:21:17 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:04 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:04 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
...
005 (126.004.000) 01/12 16:21:18 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:02, Sys 0 00:00:05 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:02, Sys 0 00:00:05 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    51 - Run Bytes Sent By Job
    5419 - Run Bytes Received By Job
    51 - Total Bytes Sent By Job
    5419 - Total Bytes Received By Job
******************************************************
This is not a big problem if all the jobs are relatively long, but if the jobs are very short compared with the delay and there are huge numbers of them, it becomes an obvious problem.
So, could anybody tell me why this happens? Is it because the matchmaking process can only handle one job at a time?
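To put numbers on the delay, here is a minimal Python sketch that computes, for each job, the time between its submit event (000) and its execute event (001). It is not part of Condor: the function name `start_delays`, the regular expression, and the placeholder year are my own assumptions, and the sample text is only an excerpt of the log above.

```python
import re
from datetime import datetime

# Matches the header of a user log event, for example:
#   000 (126.000.000) 01/12 16:20:43 Job submitted from host: <...>
#   001 (126.000.000) 01/12 16:20:48 Job executing on host: <...>
EVENT = re.compile(
    r"\b(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d)"
)

def start_delays(log_text):
    """Return {proc: seconds between the 000 and 001 events of that job}."""
    submitted, delays = {}, {}
    for code, _cluster, proc, stamp in EVENT.findall(log_text):
        # The log has no year; a placeholder year is assumed because only
        # the difference between two timestamps matters here.
        when = datetime.strptime("2000/" + stamp, "%Y/%m/%d %H:%M:%S")
        proc = int(proc)
        if code == "000":
            submitted[proc] = when
        elif code == "001" and proc in submitted:
            delays[proc] = int((when - submitted[proc]).total_seconds())
    return delays

# Excerpt of the log above (jobs 126.0, 126.4 and 126.6 only).
sample = """
000 (126.000.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.004.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
000 (126.006.000) 01/12 16:20:43 Job submitted from host: <128.16.3.68:58385>
...
001 (126.000.000) 01/12 16:20:48 Job executing on host: <128.16.9.11:33303>
...
001 (126.006.000) 01/12 16:20:50 Job executing on host: <128.16.13.22:32975>
...
001 (126.004.000) 01/12 16:21:08 Job executing on host: <128.16.9.11:33303>
...
"""

for proc, seconds in sorted(start_delays(sample).items()):
    print(f"job 126.{proc}: started {seconds} s after submission")
```

Over the full log, the start delays range from 5 seconds (job 126.0) to 25 seconds (job 126.4), while each job reports only about 5 seconds of remote CPU usage.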