I’m having a hard time troubleshooting a problem we’re seeing with our HTCondor jobs and was hoping to get some help from the community. I looked at a lot of the online documentation and even consulted with our internal subject matter experts, and we still can’t figure out what’s going on.

Basically, we’re submitting 121 test programs to be run on an HTCondor cluster of 20 machines with 12 CPU cores each. Each test has a “launcher” script that we create and that gets passed to the executing host. These test programs run under valgrind, and a few of them run for several hours. After submitting the jobs, we have a process that waits up to 12 hours for the tests to complete. If not all jobs have completed after 12 hours, it assumes there is a problem and removes the remaining jobs.

All but about 10 of the tests run successfully as HTCondor jobs. For those 10 or so, something strange happens. The programs run for a while, and then we get messages that the job disconnected and that job reconnection failed, so the job gets restarted. This pattern usually repeats a few times. In one of these repeated executions, the test program completes successfully, and the STDOUT and STDERR get transferred back to the submitting host. However, condor_q still shows the job as running. So the launcher script is completing successfully, but HTCondor still thinks the job is running. What’s puzzling is that the .sub file specifies that files be transferred back to the submitting host ON_EXIT, and we see the STDOUT and STDERR indicating that the job completed, yet HTCondor still reports the job as running. How is this possible? We are sure that the output files aren’t stale because all files are removed before the build and test. We also checked network statistics, and nothing unusual appears to be going on that would cause the socket between the submitting host and the executing host to close.

Does HTCondor have a bug in how it handles jobs that are restarted? Also, does HTCondor try to detect whether a job is hung? We think this might be what’s going on because these are some of the longer-running tests.

I have included part of one test’s log file below. In this specific instance, the test succeeded and the STDOUT/STDERR was successfully transferred
back to the submitting host at 11:46 AM. Any help would be greatly appreciated, as we are really clueless as to what’s going on.

Kris Wempa

000 (4323.000.000) 09/28 01:26:29 Job submitted from host: <10.11.129.10:39405>
...
001 (4323.000.000) 09/28 01:26:43 Job executing on host: <10.11.132.73:37914?sock=9516_67e9_3>
006 (4323.000.000) 09/28 01:26:52 Image size of job updated: 325032
    78  -  MemoryUsage of job (MB)
    79152  -  ResidentSetSize of job (KB)
...
006 (4323.000.000) 09/28 01:31:53 Image size of job updated: 2152844
    1664  -  MemoryUsage of job (MB)
    1703752  -  ResidentSetSize of job (KB)

{ Several more heart beat messages }

022 (4323.000.000) 09/28 03:29:24 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.73:37914?sock=9516_67e9_3>
...
024 (4323.000.000) 09/28 03:29:24 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 03:29:54 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 05:24:06 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>
...
024 (4323.000.000) 09/28 05:24:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 05:24:41 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 07:05:35 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>
...
024 (4323.000.000) 09/28 07:05:35 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 07:06:14 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 09:01:37 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>
...
024 (4323.000.000) 09/28 09:01:45 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 09:02:15 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 10:40:54 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>
...
024 (4323.000.000) 09/28 10:40:54 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 10:41:19 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 12:45:39 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.11.132.72:51529?sock=27166_f857_3>
...
024 (4323.000.000) 09/28 12:45:39 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (4323.000.000) 09/28 12:45:57 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
004 (4323.000.000) 09/28 13:31:14 Job was evicted.
    (0) Job was not checkpointed.
    Usr 0 00:41:38, Sys 0 00:02:18  -  Run Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    2013  -  Run Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :      150      150   6747722
       Memory (MB)          :     1709     1709      1709
...
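
In case it helps anyone look at the pattern in the log above, here is a minimal Python sketch that pulls the event headers out of a job log and measures how long each execution attempt ran before the next disconnect or eviction. It is only an illustration, not an HTCondor tool: the function names (parse_events, attempt_durations) are made up for this example, the year is an arbitrary placeholder because the log does not record one, and the event codes (001 executing, 004 evicted, 022 disconnected) are simply the ones that appear in the log above.

```python
import re
from collections import Counter
from datetime import datetime

# Matches each event header: code, (cluster.proc.subproc), MM/DD HH:MM:SS.
# It is not anchored to line starts, so it also works on a log whose
# line breaks were lost in transit.
EVENT_RE = re.compile(r"\b(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d)")

# Event codes as they appear in the log above.
EXECUTE, EVICTED, DISCONNECTED = "001", "004", "022"


def parse_events(text, year=2000):
    """Return (code, job_id, timestamp) tuples in log order.

    The log has no year, so `year` is a placeholder; only the
    differences between timestamps are meaningful.
    """
    events = []
    for code, job_id, stamp in EVENT_RE.findall(text):
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((code, job_id, when))
    return events


def attempt_durations(events):
    """Pair each 'executing' event with the next disconnect or eviction."""
    durations = []
    started = None
    for code, _job_id, when in events:
        if code == EXECUTE:
            started = when
        elif code in (DISCONNECTED, EVICTED) and started is not None:
            durations.append((code, when - started))
            started = None
    return durations


# A short excerpt of the log above, used as sample input.
sample = """\
000 (4323.000.000) 09/28 01:26:29 Job submitted from host: <10.11.129.10:39405>
...
001 (4323.000.000) 09/28 01:26:43 Job executing on host: <10.11.132.73:37914?sock=9516_67e9_3>
...
022 (4323.000.000) 09/28 03:29:24 Job disconnected, attempting to reconnect
...
024 (4323.000.000) 09/28 03:29:24 Job reconnection failed
...
001 (4323.000.000) 09/28 03:29:54 Job executing on host: <10.11.132.72:51529?sock=27166_f857_3>
...
022 (4323.000.000) 09/28 05:24:06 Job disconnected, attempting to reconnect
...
"""

if __name__ == "__main__":
    events = parse_events(sample)
    print(Counter(code for code, _, _ in events))
    for code, elapsed in attempt_durations(events):
        print(f"attempt ended with event {code} after {elapsed}")
```

Running the same functions over the full log would show whether every attempt is cut off after roughly the same amount of time, which would point at a timeout rather than a random network drop.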