On 12/21/2017 2:36 PM, Lee Mitchell wrote:
Hello All,
I am upgrading from
$CondorVersion: 7.8.1 Jun 08 2012 $
$CondorPlatform: x86_64_rhap_6.2 $
to
$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
$CondorPlatform: x86_64_RedHat7 $
I set all SEC_* knobs to NEVER; I have been relaxing security trying to get my test job (a shell script that calls sleep) to run.
I finally have it negotiating and matching, but when the job begins to start, a core.STARTER file is created in my log dir, with the below in my StarterLog.slot1.
By the way, previously this same job would run successfully after it sat in the queue for some time (20 minutes?), following a message in the SchedLog saying something like "Have not heard from Negotiator for a while, running local jobs..."
Any advice is greatly appreciated. Thx, Lee
Some quick thoughts (hunches?), roughly in the order they occurred to me:
1. You upgraded your worker node (condor_startd/condor_starter) to v8.6.8. What version is your submit node (condor_schedd) running? Is it still running v7.8? We try to maintain compatibility across HTCondor versions, but there is a very big gap between v7.8 and v8.6, so it would not surprise me if problems appear. Try your tests using a submit node also running v8.6.
2. I would not advise setting SEC_DEFAULT_NEGOTIATION=NEVER. Try leaving that one at the default, or set SEC_DEFAULT_NEGOTIATION=REQUIRED.
3. You are using binaries compiled for RHEL7... this is indeed running on a RHEL7 or CentOS 7 system, right?
4. If the above doesn't help, try putting STARTER_DEBUG = D_ALL in the condor_config on your worker node and run again. This time your StarterLog should contain a lot more messages which could help make it more obvious where things are going wrong.
5. If things aren't clarified by #4 above, you could make the core.STARTER file available on the internet and send a message to the developers at htcondor-admin@xxxxxxxxxxx (or at htcondor-support@xxxxxxxxxxx if you have a support contract) telling them how to pick it up.
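As a rough sketch of suggestions 2 and 4, the worker node's configuration might gain lines like the ones below. Only the two knob names and values come from the points above; putting them in /opt/condor/local/condor_config.local (the local config source named in the StarterLog) is an assumption, and any config file the starter reads should do.

```
# Suggestion 2: do not disable security negotiation outright.
# Either remove the NEVER setting so the default applies, or require it:
SEC_DEFAULT_NEGOTIATION = REQUIRED

# Suggestion 4: make the starter log everything, so the StarterLog shows
# what happened just before the crash.
STARTER_DEBUG = D_ALL
```

D_ALL is very verbose, so it is probably worth removing that line again once the problem has been tracked down.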
best regards
Todd
12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) ** condor_starter (CONDOR_STARTER) STARTING UP
12/21/17 15:22:39 (pid:22864) ** /opt/condor/sbin/condor_starter
12/21/17 15:22:39 (pid:22864) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/21/17 15:22:39 (pid:22864) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/21/17 15:22:39 (pid:22864) ** $CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
12/21/17 15:22:39 (pid:22864) ** $CondorPlatform: x86_64_RedHat7 $
12/21/17 15:22:39 (pid:22864) ** PID = 22864
12/21/17 15:22:39 (pid:22864) ** Log last touched 12/21 15:22:30
12/21/17 15:22:39 (pid:22864) ******************************************************
12/21/17 15:22:39 (pid:22864) Using config source: /opt/condor/etc/condor_config
12/21/17 15:22:39 (pid:22864) Using local config sources:
12/21/17 15:22:39 (pid:22864)    /opt/condor/local/condor_config.local
12/21/17 15:22:39 (pid:22864) config Macros = 165, Sorted = 164, StringBytes = 5335, TablesBytes = 5988
12/21/17 15:22:39 (pid:22864) CLASSAD_CACHING is OFF
12/21/17 15:22:39 (pid:22864) Daemon Log is logging: D_ALWAYS D_ERROR
12/21/17 15:22:39 (pid:22864) SharedPortEndpoint: waiting for connections to named socket 19607_a7b6_4
12/21/17 15:22:39 (pid:22864) DaemonCore: command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) DaemonCore: private command socket at <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19607_a7b6_4>
12/21/17 15:22:39 (pid:22864) Communicating with shadow <10.245.9.29:9618?addrs=10.245.9.29-9618+[--1]-9618&noUDP&sock=19606_f356_4>
12/21/17 15:22:39 (pid:22864) Submitting machine is "njrarltapp001a8.mgmt.ams1907.com"
12/21/17 15:22:39 (pid:22864) setting the orig job name in starter
12/21/17 15:22:39 (pid:22864) setting the orig job iwd in starter
12/21/17 15:22:39 (pid:22864) Chirp config summary: IO false, Updates false, Delayed updates true.
12/21/17 15:22:39 (pid:22864) Initialized IO Proxy.
12/21/17 15:22:39 (pid:22864) Done setting resource limits
12/21/17 15:22:39 (pid:22864) File transfer completed successfully.
Stack dump for process 22864 at timestamp 1513887760 (14 frames)
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(dprintf_dump_stack+0x72)[0x7fcd88c3cea2]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z18linux_sig_coredumpi+0x24)[0x7fcd88dc74a4]
/lib64/libpthread.so.0(+0xf5e0)[0x7fcd873165e0]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZNK17CondorVersionInfo19built_since_versionEiii+0x10)[0x7fcd88c94dd0]
condor_starter(REMOTE_CONDOR_dprintf_stats+0x39)[0x440629]
condor_starter(_ZN9JICShadow17transferCompletedEP12FileTransfer+0x13b)[0x4295fb]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN12FileTransfer6ReaperEP7Serviceii+0x1b8)[0x7fcd88c70a38]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore10CallReaperEiPKcii+0x12d)[0x7fcd88da73bd]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore17HandleProcessExitEii+0x1b9)[0x7fcd88da9ca9]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore24HandleDC_SERVICEWAITPIDSEi+0x7c)[0x7fcd88da9e6c]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_ZN10DaemonCore6DriverEv+0x6b2)[0x7fcd88daa552]
/opt/condor/sbin/../lib/libcondor_utils_8_6_8.so(_Z7dc_mainiPPc+0x13a4)[0x7fcd88dcab04]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fcd86f65c05]
condor_starter[0x422840]
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685