Hello,
I've been having some trouble with Condor while experimenting with fault tolerance; I've already posted to the list about that and I'm still investigating. Since I couldn't find a solution, I turned my attention to other tests. The test I was working on was enabling a new execute machine while some jobs were executing (and more were queued), i.e. dynamically adding nodes to the pool.

My test is simple: the executable is a small C program that sleeps for 5 minutes and then prints the UID it runs under (which, since I have a common UID domain, is the submitter's UID). If I run it without adding nodes, it works flawlessly. However, if I add a new node while some jobs are executing (with condor_on node-2, for example), the output files don't get returned to me. Is this a Condor bug? Here are some outputs:
Executable source code:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int num;

    sleep(300); /* Sleep 5 minutes */
    num = getuid();
    printf("UID: %d", num);
    return 0;
}
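For reference, my submit description is along these lines (a sketch reconstructed from the output and log file names above; the executable name "sleeper" and the exact transfer settings are placeholders, not copied from my actual file):

```
universe   = vanilla
executable = sleeper
output     = out.$(Process)
error      = err.$(Process)
log        = log.$(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 24
```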
Normal output (no new execute machines added to condor pool):
UID: 500
Erroneous output (new execute machines added to the Condor pool) is an empty file. To see this, check the file sizes:
$ ls -sh out.*
0 out.0 4.0K out.11 4.0K out.14 4.0K out.17 4.0K out.2 4.0K out.22 4.0K out.4 0 out.7
0 out.1 4.0K out.12 4.0K out.15 4.0K out.18 4.0K out.20 0 out.23 4.0K out.5 0 out.8
4.0K out.10 4.0K out.13 4.0K out.16 4.0K out.19 4.0K out.21 4.0K out.3 0 out.6 0 out.9
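A quick way to pick out just the affected jobs is to list the zero-length output files (a generic shell one-liner, nothing Condor-specific):

```shell
# Print the names of the empty output files, i.e. the jobs
# whose stdout was never transferred back to the submit machine.
find . -maxdepth 1 -name 'out.*' -size 0c
```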
Here you can clearly see my testing method:
- Start with one execute machine with two CPUs. Submit the jobs; two begin executing.
- condor_on the second execute machine (also two CPUs) before those jobs are finished (the machine enters the Owner state).
- When the first two jobs complete, no output is transferred; the second machine leaves the Owner state and begins executing jobs too.
- After these four jobs complete (output is transferred), condor_on the third machine (also two CPUs), which enters the Owner state.
- Six jobs finish with no output transferred; the new machine leaves the Owner state and executes jobs too.
- All remaining jobs finish normally and have their output transferred.
Finally, here are two log files: one from a job whose output was transferred (note the 8 "Run Bytes Sent By Job", exactly the length of the string "UID: 500") and one whose output was not (0 bytes sent):
$ cat log.2
000 (263.002.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.002.000) 08/15 15:06:58 Job executing on host: <10.255.255.252:9670>
...
005 (263.002.000) 08/15 15:11:58 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
8 - Run Bytes Sent By Job
7075 - Run Bytes Received By Job
8 - Total Bytes Sent By Job
7075 - Total Bytes Received By Job
...
$ cat log.0
000 (263.000.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.000.000) 08/15 15:01:55 Job executing on host: <10.255.255.252:9670>
...
005 (263.000.000) 08/15 15:06:55 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
7075 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
7075 - Total Bytes Received By Job
...
If this happened through some fault of mine, can someone help me fix it? If it isn't my fault, I hope this report helps. Thank you,
JVFF