Hi John,

John Horne wrote:

> Hello,
>
> I installed Condor 6.7.6 and am currently trying to run the example
> programs using just one remote client. I am not sure how long they are
> supposed to take, but the first job seems to have run for about half an
> hour or so and is still going (I think).
>
> I noticed the following in the workstation's 'log' directory on the
> Condor master server, in the StartLog file:
>
>   3/29 13:17:46 StatInfo::fstat64(/dev/stdin) failed, errno: 9 = Bad file descriptor
>   3/29 13:17:46 StatInfo::fstat64(/dev/stdout) failed, errno: 9 = Bad file descriptor
>   3/29 13:17:46 StatInfo::fstat64(/dev/stderr) failed, errno: 9 = Bad file descriptor
>
> The workstation is running the Linux Terminal Server Project (LTSP)
> version 4.1. I can see no obvious problem with /dev/stdin or the others;
> they are soft links pointing eventually to (for stdin) /dev/vc/1, which
> has the attributes:
>
>   crw-------  1 root  root  4, 1  Mar 29 13:25  /dev/vc/1
>
> Does anyone have any ideas as to the problem with the file descriptors?
>
> Thanks,
> John.

We see the following error messages in our cluster as well. However, we
are able to submit and run our jobs successfully, so it is unlikely that
this is what is preventing your job from terminating:

  3/29 13:17:46 StatInfo::fstat64(/dev/stdin) failed, errno: 9 = Bad file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stdout) failed, errno: 9 = Bad file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stderr) failed, errno: 9 = Bad file descriptor

Here are a few things that you could try:

1. Run a /bin/sleep job with arguments = 60 (one minute). I've attached a
   job description file for such a job:

     InitialDir = /tmp
     executable = /bin/sleep
     Universe   = Vanilla
     output     = /tmp/test.out
     error      = /tmp/test.err
     log        = /tmp/test.log
     arguments  = 60
     queue

2. Check the /tmp/test.log file to see whether the job ran and terminated
   properly. You should see something like this:

     000 (1358.000.000) 03/24 15:20:28 Job submitted from host: <192.168.25.208:53222>
     001 (1358.000.000) 03/24 15:25:15 Job executing on host: <192.168.25.195:41686>
     005 (1358.000.000) 03/24 15:25:36 Job terminated.
             (1) Normal termination (return value 0)
                     Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                     Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                     Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                     Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
             0  -  Run Bytes Sent By Job
             0  -  Run Bytes Received By Job
             0  -  Total Bytes Sent By Job
             0  -  Total Bytes Received By Job
     ...

3. If you can successfully run a sleep job, try executing your original
   job by hand at the command prompt to see whether it runs. How long does
   it take: is the execution time bounded or completely non-deterministic?
   Have you checked the status of the job using condor_q? If the job is
   not running, what does condor_q -analyze report? Is it possible that
   the job starts running and then gets preempted because of the policy
   you've configured? If so, it should be reflected in the log file.
   (See the command sketch at the end of this message.)

Let me know how it goes,

--
Rajesh Rajamani
Senior Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.650.218.9131
raj@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
www.optena.com

This electronic transmission (and any attached documents) contains information from Optena Corporation and is for the sole use of the individual or entity it is addressed to. If you receive this message in error, please notify me and destroy the attached message (and all attached documents) immediately.
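P.S. Here is a rough sketch of how I drive such a test from the shell.
The file name /tmp/test.sub is just an example (it is not from your
setup); adjust the paths to suit yours.

  # Save the submit description from step 1 as /tmp/test.sub, then:
  condor_submit /tmp/test.sub    # submit the sleep job to the local schedd

  condor_q                       # watch the job move from idle to running
  condor_q -analyze              # if it stays idle, ask why it is not matching

  cat /tmp/test.log              # once it finishes, look for the
                                 # "Job terminated" event shown in step 2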