Hi,

I have the following job, which should run for 2-3 days. Instead it seems to finish and then restarts, without returning any results:

_____
Universe = vanilla
Executable = closest_condor.sh
output = closest.out
error = closest.err
Log = closest.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (machine == "atlas.galaxy.ibpc.fr")
notify_user = user_email@xxxxxxx
notification = always
queue
_________

I attach several files (sorry for flooding your mailbox, but I think the answer is somewhere in here):

-the log file (.out and .err are empty)
-the SchedLog file
-the StartLog file of the target machine

If you could explain to me what is happening to my job (id n° 254), I would be very grateful.

Nicolas

-----------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE
Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
------------------------------------------------
Attachment: Atlas-StartLog (binary data)
Attachment: Fab-SchedLog (binary data)
001 (077.000.000) 04/04 17:43:52 Job executing on host: <193.49.27.66:32772>
...
006 (077.000.000) 04/04 18:04:00 Image size of job updated: 58352
...
001 (077.000.000) 04/04 18:23:56 Job executing on host: <193.49.27.56:32772>
...
006 (077.000.000) 04/04 18:44:05 Image size of job updated: 58352
...
006 (077.000.000) 04/05 05:24:20 Image size of job updated: 58840
...
010 (077.000.000) 04/05 06:28:36 Job was suspended.
    Number of processes actually suspended: 3
...
011 (077.000.000) 04/05 06:30:42 Job was unsuspended.
...
004 (077.000.000) 04/05 11:28:59 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 16:03:45, Sys 0 00:02:39 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    0 - Run Bytes Sent By Job
    1346 - Run Bytes Received By Job
...
009 (077.000.000) 04/05 11:28:59 Job was aborted by the user.
    via condor_rm (by user cailliez)
...
000 (254.000.000) 04/05 11:53:59 Job submitted from host: <193.49.27.73:32772>
...
001 (254.000.000) 04/05 11:54:07 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/05 11:54:15 Image size of job updated: 31140
...
006 (254.000.000) 04/05 12:14:15 Image size of job updated: 58352
...
006 (254.000.000) 04/05 22:34:30 Image size of job updated: 58608
...
010 (254.000.000) 04/06 06:28:29 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/06 06:30:49 Job was unsuspended.
...
010 (254.000.000) 04/07 06:28:32 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/07 06:38:23 Job was unsuspended.
...
004 (254.000.000) 04/07 06:38:23 Job was evicted.
    (0) Job was not checkpointed.
        Usr 1 18:13:21, Sys 0 00:07:01 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    0 - Run Bytes Sent By Job
    1346 - Run Bytes Received By Job
...
001 (254.000.000) 04/07 14:46:15 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/07 15:06:24 Image size of job updated: 58352
...
006 (254.000.000) 04/07 16:06:25 Image size of job updated: 58608
...
010 (254.000.000) 04/08 06:28:41 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/08 06:30:46 Job was unsuspended.
...
010 (254.000.000) 04/08 12:04:06 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/08 12:06:51 Job was unsuspended.
...
001 (254.000.000) 04/10 11:52:24 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/10 12:12:32 Image size of job updated: 58120
...
006 (254.000.000) 04/10 12:32:33 Image size of job updated: 58352
...
006 (254.000.000) 04/10 16:12:38 Image size of job updated: 58608
...
010 (254.000.000) 04/11 06:28:42 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/11 06:30:37 Job was unsuspended.
...
006 (254.000.000) 04/11 20:12:38 Image size of job updated: 58840
...
010 (254.000.000) 04/12 06:28:27 Job was suspended.
    Number of processes actually suspended: 3
...
011 (254.000.000) 04/12 06:30:28 Job was unsuspended.
...
001 (254.000.000) 04/13 09:47:31 Job executing on host: <193.49.27.56:32772>
...
006 (254.000.000) 04/13 10:07:40 Image size of job updated: 58352
...
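
To follow what happens to one job in a user log like the one above, it may help to pull out only that job's events. The following is a minimal Python sketch, not part of Condor: the function name `summarize` and the short labels in `EVENT_NAMES` are hypothetical, chosen here to mirror the wording the log prints for each event code, and the regular expression assumes event headers have exactly the `NNN (cluster.proc.subproc) MM/DD HH:MM:SS` shape seen in this log.

```python
import re
from collections import Counter

# Event header as it appears in the log above:
# three-digit code, (cluster.proc.subproc), date, time, then the message.
EVENT_RE = re.compile(
    r"(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) "
)

# Labels are illustrative; they cover only the codes present in this log.
EVENT_NAMES = {
    "000": "submitted",
    "001": "executing",
    "004": "evicted",
    "006": "image size updated",
    "009": "aborted",
    "010": "suspended",
    "011": "unsuspended",
}


def summarize(log_text, cluster):
    """Return (counts, timeline) for the events of one cluster id.

    counts   maps an event label to how often it occurs.
    timeline lists (date, time, label) in log order, leaving out the
             frequent image-size updates so state changes stand out.
    """
    counts = Counter()
    timeline = []
    for match in EVENT_RE.finditer(log_text):
        code, job_cluster, _proc, _subproc, date, time = match.groups()
        if int(job_cluster) != cluster:
            continue
        label = EVENT_NAMES.get(code, "event " + code)
        counts[label] += 1
        if code != "006":
            timeline.append((date, time, label))
    return counts, timeline


if __name__ == "__main__":
    # A few events copied from the log above; pass the whole file in practice,
    # e.g. summarize(open("closest.log").read(), 254).
    sample = (
        "000 (254.000.000) 04/05 11:53:59 Job submitted from host: <193.49.27.73:32772> ... "
        "001 (254.000.000) 04/05 11:54:07 Job executing on host: <193.49.27.56:32772> ... "
        "010 (254.000.000) 04/06 06:28:29 Job was suspended. ... "
        "011 (254.000.000) 04/06 06:30:49 Job was unsuspended. ... "
    )
    counts, timeline = summarize(sample, 254)
    for date, time, label in timeline:
        print(date, time, label)
    print(dict(counts))
```

Because the pattern is searched across the whole text rather than line by line, the sketch works whether the events sit on separate lines or have been run together on one line. Comparing each "executing" entry with what follows it shows at a glance which runs end in an eviction and which have no recorded ending in the log.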