Johnson koil Raj wrote:
Hi, I have submitted a job to my Condor pool. It is executing on system X. I have now disabled the network on system X, but the shadow on the submit machine does not detect this and still shows the job as running. Can we make the shadow stop waiting for the job in this situation and change its status back to idle?
By default, Condor will detect the network failure and change the job status back to idle. Unfortunately, it could take up to two hours to do so. The reason is that in this circumstance Condor relies on TCP/IP keep-alive packets, and the standard for TCP/IP keep-alives allows up to two hours to pass between keep-alive pings.
So if it is "good enough" for Condor to take up to two hours to detect the network failure and mark the job as idle, then you are done, since this is what will happen by default.
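For anyone curious what TCP keep-alive looks like at the socket level, here is a minimal Python sketch. It is not Condor code and not a Condor configuration; the function name `enable_keepalive` and its default values are made up for illustration. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are not available on every platform, so the sketch only sets them where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Enable TCP keep-alive on sock and shorten its timers where possible.

    Hypothetical helper for illustration only; this is not how Condor is
    configured.
    """
    # Ask the OS to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timer options are platform-specific, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Number of unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With the illustrative values above, a dead peer would be noticed after roughly idle_s + interval_s * probes seconds, i.e. about 10 minutes, on platforms that honor all three options.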
If it is not good enough, you can configure Condor to send its own keep-alive packets at a time interval that you specify. So you could configure Condor to detect the above network failure within 10 minutes, for instance, at the cost of additional network traffic (for the pings) and overhead on the submit machine. If you need to know how to do this, just ask...
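The general idea of application-level keep-alives can be sketched in a few lines of Python. This is only an illustration of the technique, not Condor's implementation; the class `HeartbeatMonitor` and its parameters are hypothetical names. The peer is declared dead once no heartbeat has arrived for a set number of intervals.

```python
import time


class HeartbeatMonitor:
    """Track heartbeats from a peer and decide when it has gone silent.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, interval_s, max_missed=3, clock=time.monotonic):
        self.interval_s = interval_s    # expected seconds between heartbeats
        self.max_missed = max_missed    # missed intervals tolerated
        self._clock = clock             # injectable clock, handy for testing
        self._last_seen = clock()

    def record_heartbeat(self):
        # Call this whenever a keep-alive message arrives from the peer.
        self._last_seen = self._clock()

    def peer_is_alive(self):
        # The peer counts as alive until max_missed intervals pass in silence.
        silence = self._clock() - self._last_seen
        return silence <= self.interval_s * self.max_missed
```

For example, `HeartbeatMonitor(interval_s=200, max_missed=3)` would flag the peer as dead after 10 minutes of silence. A shorter interval detects failures sooner but sends more pings, which is the trade-off described above.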
Note we are talking here only about the situation of the execute machine falling off the network; e.g. a power failure of the execute node, or network failure, etc. If the execute machine is shut down, or rebooted, or Condor processes are killed, etc., then the shadow notices right away and the job switches back to idle.
regards,
Todd

--
Todd Tannenbaum
Condor Project Research
tannenba@xxxxxxxxxxx

University of Wisconsin-Madison
Department of Computer Sciences
1210 W. Dayton St. Rm #4257