I just rebuilt my test bed grid with Condor 6.7.14 and ran a small C++
test program that searches for prime numbers. For some reason the program
keeps getting evicted from the nodes. It eventually completed without any
errors, but it took a very long time because of all the evictions.

Is there something simple I can change in the configuration to stop these
evictions?

Here is part of the log file; the error file was empty.

Example from Log:
001 (010.000.000) 01/28 14:54:41 Job executing on host: <192.168.0.2:32773>
...
004 (010.000.000) 01/28 14:54:41 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    224  -  Run Bytes Sent By Job
    13518504  -  Run Bytes Received By Job

After being evicted multiple times, it finally ran almost half an hour later:

001 (010.000.000) 01/28 15:23:43 Job executing on host: <192.168.0.1:32775>
...
005 (010.000.000) 01/28 15:23:55 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:03, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:03, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1096  -  Run Bytes Sent By Job
    13520753  -  Run Bytes Received By Job
    2440  -  Total Bytes Sent By Job
    94631776  -  Total Bytes Received By Job

I submitted 28 copies of this job in one submission, and every one of them
had this problem with evictions. There were no other jobs in the queue. All
28 jobs eventually completed without errors.

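In case it helps anyone tally how often each job was evicted, here is a minimal Python sketch that counts event codes per job in a user log. It assumes every event header has the same shape as the ones in the excerpt above (a three-digit event code, then the job ID in parentheses, then date and time), and it relies only on the codes seen in that excerpt: 001 for execute, 004 for evicted, 005 for terminated. The names count_events, EVENT_RE and SAMPLE are made up for illustration and are not part of Condor.

```python
import re
from collections import Counter

# Assumed header shape, taken from the log excerpt in this post:
#   <3-digit code> (<cluster>.<proc>.<subproc>) <MM/DD> <HH:MM:SS> <text>
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)


def count_events(log_text):
    """Return {(cluster, proc): Counter mapping event code -> occurrences}."""
    counts = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match is None:
            # Detail lines and "..." separators are not event headers.
            continue
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        counts.setdefault(job, Counter())[code] += 1
    return counts


# Event headers copied from the excerpt in this post (detail lines omitted).
SAMPLE = """\
001 (010.000.000) 01/28 14:54:41 Job executing on host: <192.168.0.2:32773>
...
004 (010.000.000) 01/28 14:54:41 Job was evicted.
...
001 (010.000.000) 01/28 15:23:43 Job executing on host: <192.168.0.1:32775>
...
005 (010.000.000) 01/28 15:23:55 Job terminated.
"""

if __name__ == "__main__":
    for job, codes in sorted(count_events(SAMPLE).items()):
        print(job, "executed:", codes["001"],
              "evicted:", codes["004"],
              "terminated:", codes["005"])
```

To run it against a real log, read the file named in the submit description's log setting and pass its contents to count_events in place of SAMPLE.
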
Thanks,
Steve Broughton
University of Idaho