Hello,

I've been having some trouble with Condor while experimenting with fault tolerance. I've posted to the list about those problems and am still investigating; since I couldn't find a solution, I turned my attention to some other tests.

The test I was working on enables a new execute machine while some jobs are executing (and more are queued), i.e. dynamically adding nodes. My test is simple: the executable is a small C program that sleeps for 5 minutes and then prints the UID it runs under (which, since I have a common UID domain, is the submitter's UID). If I run it without adding nodes, it works flawlessly. However, if I add a new node while some jobs are executing (with condor_on node-2, for example), the output files of the jobs that were running don't get returned to me. Is this a Condor bug? Here are some outputs.

Executable source code:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int num;

    sleep(300); /* Sleep 5 minutes */
    num = getuid();
    printf("UID: %d", num);
    return 0;
}

Normal output (no new execute machines added to the Condor pool):

UID: 500

Erroneous output (new execute machines added to the pool) is an empty file. To see this, check the file sizes:

$ ls -sh out.*
   0 out.0    4.0K out.11  4.0K out.14  4.0K out.17  4.0K out.2   4.0K out.22  4.0K out.4     0 out.7
   0 out.1    4.0K out.12  4.0K out.15  4.0K out.18  4.0K out.20     0 out.23  4.0K out.5     0 out.8
4.0K out.10  4.0K out.13  4.0K out.16  4.0K out.19  4.0K out.21  4.0K out.3      0 out.6     0 out.9

Here you can clearly see my testing method:

- Start with one execute machine with two CPUs. Submit the jobs; two begin executing.
- condor_on the second execute machine (also two CPUs) before the jobs are finished (the machine enters the Owner state).
- When the jobs complete (no output is transferred), the second machine leaves the Owner state and begins executing jobs too.
- After these four jobs are completed (output is transferred), condor_on the third machine (also two CPUs), which becomes Owner.
- Six jobs finish, no output is transferred. The new machine leaves the Owner state and executes jobs too.
- All other jobs finish normally and have their output transferred.

Finally, here are two log files (one from a job with transferred output and one without):

$ cat log.2
000 (263.002.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.002.000) 08/15 15:06:58 Job executing on host: <10.255.255.252:9670>
...
005 (263.002.000) 08/15 15:11:58 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        8  -  Run Bytes Sent By Job
        7075  -  Run Bytes Received By Job
        8  -  Total Bytes Sent By Job
        7075  -  Total Bytes Received By Job
...

$ cat log.0
000 (263.000.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.000.000) 08/15 15:01:55 Job executing on host: <10.255.255.252:9670>
...
005 (263.000.000) 08/15 15:06:55 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        7075  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        7075  -  Total Bytes Received By Job
...

If it is my fault this happened, can someone help me fix it? If it isn't, I hope this helps.

Thank you,
JVFF
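In case anyone wants to reproduce this, a submit description along these lines matches the test above. This is a reconstruction, not my exact file: the executable name "sleeper" and the vanilla universe are assumptions; the out.*/log.* naming and the job count come from the listing.

```
# Hypothetical submit file reconstructing the test; "sleeper" is assumed.
universe                = vanilla
executable              = sleeper
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = out.$(Process)
error                   = err.$(Process)
log                     = log.$(Process)
queue 24
```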
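For quickly spotting the affected jobs, here is the kind of shell sketch I use: it flags empty output files and pulls the "Total Bytes Sent By Job" line from the matching user log. The out.N/log.N naming is from my run above; the script itself is just an illustration.

```shell
#!/bin/sh
# For each empty (or missing) output file, report it and show how many
# bytes the corresponding user log says the job actually sent back.
for f in out.*; do
    if [ ! -s "$f" ]; then
        n=${f#out.}                 # job process number, e.g. out.7 -> 7
        echo "empty: $f"
        grep "Total Bytes Sent By Job" "log.$n" 2>/dev/null
    fi
done
```

A job that ran but whose sandbox was never transferred shows up here with "0 - Total Bytes Sent By Job", matching log.0 above.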