Hey all,

I’ve got two machines configured in what I sure thought was an identical way, but one has started misbehaving. The misbehaving one now writes logfiles (the per-submission user logs, not the daemon logs) that are missing the “000 Job submitted” entries. It didn’t do this a week or two ago: I can still see some correct submission logfiles on the misbehaving box.

Having the “Job submitted” entries go missing breaks the Condor Perl module. That module counts jobs up as they’re submitted and counts them back down as they retire. If there aren’t any submission entries, the retires immediately push the count negative, so it never returns to zero and the Monitor never terminates. (The missing entries aren’t exclusive to Perl; I can reproduce the problem when invoking condor_submit manually.)

Both machines are Windows Server 2012 R2, running

$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
$CondorPlatform: x86_64_Windows8 $

When I run a hello world submission:

#---------- Condor Variables -----------------------------------------
universe = vanilla
priority = 0
output = $(Cluster).$(Process).out
error = $(Cluster).$(Process).err
log = $(Cluster).log
#-------------------------------------------------------------------------
executable = CondorTestJob.bat
should_transfer_files = YES
transfer_input_files = CondorTestJob.pl
when_to_transfer_output = ON_EXIT
notification = never
run_as_owner = TRUE
#==============================================
# Machines
#==============================================
requirements = ((Arch == "Intel") || (Arch == "X86_64")) && ((OpSys == "WINNT51") || (OpSys == "WINNT52") || (OpSys == "WINNT61") || (OpSys == "WINDOWS")) && ((LocalCredd =?= "mylocalcredd") || (LocalCredd =?= "mylocalcredd:9620"))
queue

A good logfile looks like this:

000 (001.000.000) 11/04 11:05:47 Job submitted from host: <myip:57120?addrs=myip-57120>
...
001 (001.000.000) 11/04 11:06:06 Job executing on host: <otherip:58825?addrs=otherip-58825>
...
006 (001.000.000) 11/04 11:06:06 Image size of job updated: 1
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (001.000.000) 11/04 11:06:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1145  -  Run Bytes Sent By Job
    84  -  Run Bytes Received By Job
    1145  -  Total Bytes Sent By Job
    84  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       11        2  97755693
       Memory (MB)          :        0        1      8189
...

While a bad one is missing the all-important first message:

001 (012.000.000) 11/19 17:15:08 Job executing on host: <other2ip:59153?addrs=other2ip-59153>
...
006 (012.000.000) 11/19 17:15:08 Image size of job updated: 1
    0  -  MemoryUsage of job (MB)
    0  -  ResidentSetSize of job (KB)
...
005 (012.000.000) 11/19 17:15:08 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1153  -  Run Bytes Sent By Job
    84  -  Run Bytes Received By Job
    1153  -  Total Bytes Sent By Job
    84  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       11        2  91585753
       Memory (MB)          :        0        1      8189
...

The only other piece of weirdness I can find is that the misbehaving box won’t let me open a logfile by double-clicking it in Windows Explorer, but right-clicking and using Open With… (anything) works fine.
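
The counting behavior described above can be sketched roughly as follows. This is a minimal Python illustration, not the Condor Perl module's actual code: the names scan_user_log and EVENT_RE are hypothetical, and treating events 005 (terminated) and 009 (aborted) as the ones that "retire" a job is an assumption. It also lists jobs that retire without ever having logged a 000 entry, which may help when checking a suspect logfile.

```python
import re

# Event header lines look like: "005 (012.000.000) 11/19 17:15:08 Job terminated."
# Indented detail lines and "..." separators do not match and are skipped.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")

SUBMIT = "000"
# Assumption: 005 (terminated) and 009 (aborted) are the events that retire a job.
RETIRE = {"005", "009"}


def scan_user_log(text):
    """Return (final_count, lowest_count, jobs_missing_submit) for a user log.

    The counter goes up on each submit event and down on each retire event,
    mirroring the submit/retire bookkeeping described in the post.
    """
    count = 0
    lowest = 0
    submitted = set()
    missing = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))  # (cluster, proc)
        if code == SUBMIT:
            submitted.add(job)
            count += 1
        elif code in RETIRE:
            if job not in submitted:
                missing.append(job)  # retired without a 000 entry
            count -= 1
            lowest = min(lowest, count)
    return count, lowest, missing
```

Run against a log shaped like the bad one, the counter drops below zero on the first 005 event and the job shows up in the third element of the result; on a log shaped like the good one, the counter ends at zero and that list is empty.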

I’ve diffed the daemon logs against a box that’s still working correctly, and nothing jumps out. Any ideas where I might look next?

Thanks!

------------------------------------------------------------------------
Jason Ross
Intel Corporation Graphics Architect
FM5-64 VPG Architecture
1900 Prairie City Road (916) 356-8964
Folsom, CA 95630