Re: [Condor-users] exec failure
- Date: Fri, 24 Feb 2006 08:43:30 -0700
- From: "Todd Applewhite (Boulder)" <t.apple@xxxxxxxxxxx>
- Subject: Re: [Condor-users] exec failure
StarterLog with error:
'EXEC of user process failed, probably insufficient swap'
On Thu, 2006-02-23 at 09:11 -0700, Todd Applewhite (Boulder) wrote:
> I've seen this failure mentioned before, but
> haven't been able to resolve it.
>
> I have two RHE4 machines with a shared file system
> containing the condor binaries and libraries.
>
> Each machine has its own condor user, group, and
> home directory.
>
> Install and startup are fine. I'm able to submit standalone
> jobs on each of the machines, but when I submit a job from
> one machine to another, it fails with an entry in the StarterLog.
> condor_status behaves as expected on each machine.
>
> EXEC of user process failed, probably insufficient swap
>
> RESERVED_SWAP is set to 0 in all config files. Machine 1
> has 512M of swap; machine 2 has over 3G.
> ------------------------------------------------------------------
> [condor@geronimo log]$ free
>              total       used       free     shared    buffers     cached
> Mem:        256060     227312      28748          0      41984     138356
> -/+ buffers/cache:      46972     209088
> Swap:       514040        144     513896
> ----------------------------------------------------------------
> [condor@chinle test]$ free
>              total       used       free     shared    buffers     cached
> Mem:       1555884     875876     680008          0      71016     502888
> -/+ buffers/cache:     301972    1253912
> Swap:      3068372          0    3068372
>
> I'm trying to submit the following job from machine 2 to
> machine 1 (built with condor_compile gcc -o tester test.o). I've also
> tried submitting vanilla jobs on the NFS mount with the same
> result. The job runs fine on each machine as a standalone job.
>
> Executable = tester
> Universe = standard
> Log = tester.log
> output = tester.out
> error = tester.error
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
> Requirements = machine == "geronimo.localdomain"
>
> machine 1 "central manager"
> ------------------------------
> MyType = "Scheduler"
> TargetType = ""
> CondorVersion = "$CondorVersion: 6.7.16 Feb 2 2006 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> Machine = "geronimo.localdomain"
> QuillEnabled = FALSE
> ScheddIpAddr = "<192.168.1.132:32871>"
> NumUsers = 0
> MaxJobsRunning = 200
> StartLocalUniverse = TRUE
> StartSchedulerUniverse = TRUE
> Name = "geronimo.localdomain"
> VirtualMemory = 2147483647
> TotalIdleJobs = 0
> TotalRunningJobs = 0
> TotalJobAds = 0
> TotalHeldJobs = 0
> TotalFlockedJobs = 0
> TotalRemovedJobs = 0
> MonitorSelfTime = 1140705379
> MonitorSelfCPUUsage = 0.004182
> MonitorSelfImageSize = 7992.000000
> MonitorSelfResidentSetSize = 3812
> MonitorSelfAge = 47761
> WantResAd = TRUE
> DaemonStartTime = 1140705434
> UpdateSequenceNumber = 0
> MyAddress = "<192.168.1.132:32871>"
> ServerTime = 1140705434
> LastHeardFrom = 1140705434
> UpdatesTotal = 569
> UpdatesSequenced = 568
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
>
> machine 2 dedicated node
> ------------------------
> MyType = "Scheduler"
> TargetType = ""
> CondorVersion = "$CondorVersion: 6.7.16 Feb 2 2006 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> Machine = "chinle.localdomain"
> QuillEnabled = FALSE
> ScheddIpAddr = "<192.168.1.130:34439>"
> MyAddress = "<192.168.1.130:34439>"
> NumUsers = 1
> MaxJobsRunning = 200
> StartLocalUniverse = TRUE
> StartSchedulerUniverse = TRUE
> Name = "chinle.localdomain"
> VirtualMemory = 2147483647
> TotalIdleJobs = 1
> TotalRunningJobs = 0
> TotalJobAds = 1
> TotalHeldJobs = 0
> TotalFlockedJobs = 0
> TotalRemovedJobs = 0
> MonitorSelfTime = 1140705915
> MonitorSelfCPUUsage = 0.008333
> MonitorSelfImageSize = 7992.000000
> MonitorSelfResidentSetSize = 3784
> MonitorSelfAge = 960
> WantResAd = TRUE
> DaemonStartTime = 1140704955
> UpdateSequenceNumber = 7
> ServerTime = 1140706088
> LastHeardFrom = 1140706088
> UpdatesTotal = 335
> UpdatesSequenced = 334
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
>
> Thanks,
> Todd Applewhite
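A rough arithmetic check on the swap figures quoted above, written as a minimal Python sketch. The helper name parse_free and the constant names are hypothetical; the numbers are copied from the `free` output for geronimo (the machine with less swap) and from the Get_file() line in the StarterLog below. The executable's size on disk is only a proxy for what the process needs at exec time, so this suggests, rather than proves, that swap space is not the limiting factor.

```python
def parse_free(text):
    """Parse `free` output (values in kB) into {row_label: [int, ...]}."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the header line has no "label:" prefix
        rows[label.strip()] = [int(tok) for tok in rest.split()]
    return rows


# `free` output for geronimo, as quoted in the message above.
GERONIMO_FREE = """\
             total       used       free     shared    buffers     cached
Mem:        256060     227312      28748          0      41984     138356
-/+ buffers/cache:      46972     209088
Swap:       514040        144     513896
"""

# Size reported by Get_file() in the StarterLog.
EXECUTABLE_BYTES = 13488873

swap_total_kb, swap_used_kb, swap_free_kb = parse_free(GERONIMO_FREE)["Swap"]
executable_kb = EXECUTABLE_BYTES / 1024

print(f"free swap: {swap_free_kb} kB, executable: {executable_kb:.0f} kB")
print(f"free swap / executable size: {swap_free_kb / executable_kb:.1f}")
```

On these numbers the free swap is many times larger than the transferred executable, which is why the "probably insufficient swap" wording in the log looks more like a generic guess than a diagnosis.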
2/24 08:28:10 NET_REMAP_ENABLE is undefined, using default value of False
2/24 08:28:10 NET_REMAP_ENABLE is undefined, using default value of False
2/24 08:28:10 PASSWD_CACHE_REFRESH is undefined, using default value of 300
2/24 08:28:10 ********** STARTER starting up ***********
2/24 08:28:10 ** $CondorVersion: 6.7.16 Feb 2 2006 $
2/24 08:28:10 ** $CondorPlatform: I386-LINUX_RH9 $
2/24 08:28:10 ******************************************
2/24 08:28:10 Submitting machine is "geronimo.localdomain"
2/24 08:28:10 EventHandler {
2/24 08:28:10 func = 0x80cced2
2/24 08:28:10 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
2/24 08:28:10 }
2/24 08:28:10 EventHandler::install() {
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGALRM, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGHUP, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGINT, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGUSR1, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGUSR2, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGCHLD, flags = 0x1
2/24 08:28:10 *FSM* Installed handler 0x80cced2 for signal SIGTSTP, flags = 0x1
2/24 08:28:10 }
2/24 08:28:10 Done moving to directory "/home/condor/execute"
2/24 08:28:10 1877624 kbytes available for "."
2/24 08:28:10 Looking up RESERVED_DISK parameter
2/24 08:28:10 Reserving 5120 kbytes for file system
2/24 08:28:10 Done setting resource limits
2/24 08:28:10 Done closing file descriptors
2/24 08:28:10 *FSM* Transitioning to state "GET_PROC"
2/24 08:28:10 *FSM* Executing state func "get_proc()" [ ]
2/24 08:28:10 Entering get_proc()
2/24 08:28:10 Entering get_job_info()
2/24 08:28:10 Startup Info:
2/24 08:28:10 Version Number: 1
2/24 08:28:10 Id: 26.0
2/24 08:28:10 JobClass: STANDARD
2/24 08:28:10 Uid: 501
2/24 08:28:10 Gid: 502
2/24 08:28:10 VirtPid: -1
2/24 08:28:10 SoftKillSignal: 20
2/24 08:28:10 Cmd: "/home/condor/dev/test/tester"
2/24 08:28:10 Args: ""
2/24 08:28:10 Env: ""
2/24 08:28:10 Iwd: "/home/condor/dev/test"
2/24 08:28:10 Ckpt Wanted: TRUE
2/24 08:28:10 Is Restart: FALSE
2/24 08:28:10 Core Limit Valid: TRUE
2/24 08:28:10 Coredump Limit 0
2/24 08:28:10 User uid set to 99
2/24 08:28:10 User uid set to 99
2/24 08:28:10 BIND_ALL_INTERFACES is undefined, using default value of False
2/24 08:28:10 User Process 26.0 {
2/24 08:28:10 cmd = /home/condor/dev/test/tester
2/24 08:28:10 args =
2/24 08:28:10 env = CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE CONDOR_SCRATCH_DIR=/home/condor/execute/dir_5435
2/24 08:28:10 local_dir = dir_5435
2/24 08:28:10 cur_ckpt = dir_5435/condor_exec.26.0
2/24 08:28:10 core_name = (either 'core' or 'core.<pid>')
2/24 08:28:10 uid = 99, gid = 99
2/24 08:28:10 v_pid = -1
2/24 08:28:10 pid = (NOT CURRENTLY EXECUTING)
2/24 08:28:10 exit_status_valid = FALSE
2/24 08:28:10 exit_status = (NEVER BEEN EXECUTED)
2/24 08:28:10 ckpt_wanted = TRUE
2/24 08:28:10 coredump_limit_exists = TRUE
2/24 08:28:10 coredump_limit = 0
2/24 08:28:10 soft_kill_sig = 20
2/24 08:28:10 job_class = STANDARD
2/24 08:28:10 state = NEW
2/24 08:28:10 new_ckpt_created = FALSE
2/24 08:28:10 ckpt_transferred = FALSE
2/24 08:28:10 core_created = FALSE
2/24 08:28:10 core_transferred = FALSE
2/24 08:28:10 exit_requested = FALSE
2/24 08:28:10 image_size = -1 blocks
2/24 08:28:10 user_time = 0
2/24 08:28:10 sys_time = 0
2/24 08:28:10 guaranteed_user_time = 0
2/24 08:28:10 guaranteed_sys_time = 0
2/24 08:28:10 }
2/24 08:28:10 *FSM* Transitioning to state "GET_EXEC"
2/24 08:28:10 *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE ]
2/24 08:28:10 Entering get_exec()
2/24 08:28:10 Executable is located on submitting host
2/24 08:28:10 Expanded executable name is "/home/condor/spool/cluster26.ickpt.subproc0"
2/24 08:28:10 Going to try 3 attempts at getting the initial executable
2/24 08:28:10 Entering get_file( /home/condor/spool/cluster26.ickpt.subproc0, dir_5435/condor_exec.26.0, 0755 )
2/24 08:28:10 Generated a data socket - fd = 0
2/24 08:28:10 Internet address structure set up
2/24 08:28:10 Connection completed - returning fd 0
2/24 08:28:10 Opened "/home/condor/spool/cluster26.ickpt.subproc0" via file stream
2/24 08:28:11 Get_file() transferred 13488873 bytes, 8649563 bytes/second
2/24 08:28:11 Fetched orig ckpt file "/home/condor/spool/cluster26.ickpt.subproc0" into "dir_5435/condor_exec.26.0" with 1 attempt
2/24 08:28:11 Executable 'dir_5435/condor_exec.26.0' is linked with "$CondorVersion: 6.7.16 Feb 2 2006 $" on a "$CondorPlatform: I386-LINUX_RH9 $"
2/24 08:28:11 Done verifying executable file
2/24 08:28:11 *FSM* Executing transition function "spawn_all"
2/24 08:28:11 Pipe built
2/24 08:28:11 New pipe_fds[14,1]
2/24 08:28:11 cmd_fd = 14
2/24 08:28:11 Calling execve( "/home/condor/execute/dir_5435/condor_exec.26.0", "condor_exec.26.0", "-_condor_cmd_fd", "14", 0, "CONDOR_VM=vm1", "_condor_BIND_ALL_INTERFACES=FALSE", "CONDOR_SCRATCH_DIR=/home/condor/execute/dir_5435", 0 )
2/24 08:28:11 Started user job - PID = 5436
2/24 08:28:11 cmd_fp = 0x831fe30
2/24 08:28:11 end
2/24 08:28:11 *FSM* Transitioning to state "SUPERVISE"
2/24 08:28:11 *FSM* Got asynchronous event "CHILD_EXIT"
2/24 08:28:11 *FSM* Executing transition function "reaper"
2/24 08:28:11 Canceled alarm
2/24 08:28:11 Process 5436 exited, searching process list...
2/24 08:28:11 Found object for process 5436
2/24 08:28:11 Process 5436 exited with status 110
2/24 08:28:11 EXEC of user process failed, probably insufficient swap
2/24 08:28:11 No core file was created
2/24 08:28:11 *FSM* Transitioning to state "PROC_EXIT"
2/24 08:28:11 *FSM* Executing state func "proc_exit()" [ DIE ]
2/24 08:28:11 *FSM* Executing transition function "dispose_one"
2/24 08:28:11 Sending final status for process 26.0
2/24 08:28:11 STATUS encoded as CKPT, *NOT* TRANSFERRED
2/24 08:28:11 User time = 0.000000 seconds
2/24 08:28:11 System time = 0.000000 seconds
2/24 08:28:11 Done sending final status for process 26.0
2/24 08:28:11 Unlinked "dir_5435/condor_exec.26.0"
2/24 08:28:11 Removed directory "dir_5435"
2/24 08:28:11 *FSM* Transitioning to state "SUPERVISE"
2/24 08:28:11 *FSM* Got asynchronous event "DIE"
2/24 08:28:11 *FSM* Executing transition function "req_die"
2/24 08:28:11 Canceled alarm
2/24 08:28:11 *FSM* Transitioning to state "TERMINATE"
2/24 08:28:11 *FSM* Executing state func "terminate_all()" [ ]
2/24 08:28:11 Canceled alarm
2/24 08:28:11 *FSM* Transitioning to state "SEND_STATUS_ALL"
2/24 08:28:11 *FSM* Executing state func "dispose_all()" [ ]
2/24 08:28:11 *FSM* Reached state "END"
2/24 08:28:11 EventHandler::de_install() {
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGALRM
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGHUP
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGINT
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGUSR1
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGUSR2
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGCHLD
2/24 08:28:11 *FSM* Installed handler (nil) for signal SIGTSTP
2/24 08:28:11 }
2/24 08:28:11 ********* STARTER terminating normally **********
2/24 08:28:30 NET_REMAP_ENABLE is undefined, using default value of False
2/24 08:28:30 NET_REMAP_ENABLE is undefined, using default value of False
2/24 08:28:30 PASSWD_CACHE_REFRESH is undefined, using default value of 300
2/24 08:28:30 ********** STARTER starting up ***********
2/24 08:28:30 ** $CondorVersion: 6.7.16 Feb 2 2006 $
2/24 08:28:30 ** $CondorPlatform: I386-LINUX_RH9 $
2/24 08:28:30 ******************************************
2/24 08:28:30 Submitting machine is "geronimo.localdomain"
2/24 08:28:30 EventHandler {
2/24 08:28:30 func = 0x80cced2
2/24 08:28:30 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
2/24 08:28:30 }
2/24 08:28:30 EventHandler::install() {
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGALRM, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGHUP, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGINT, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGUSR1, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGUSR2, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGCHLD, flags = 0x1
2/24 08:28:30 *FSM* Installed handler 0x80cced2 for signal SIGTSTP, flags = 0x1
2/24 08:28:30 }
2/24 08:28:30 Done moving to directory "/home/condor/execute"
2/24 08:28:30 1877624 kbytes available for "."
2/24 08:28:30 Looking up RESERVED_DISK parameter
2/24 08:28:30 Reserving 5120 kbytes for file system
2/24 08:28:30 Done setting resource limits
2/24 08:28:30 Done closing file descriptors
2/24 08:28:30 *FSM* Transitioning to state "GET_PROC"
2/24 08:28:30 *FSM* Executing state func "get_proc()" [ ]
2/24 08:28:30 Entering get_proc()
2/24 08:28:30 Entering get_job_info()
2/24 08:28:30 Startup Info:
2/24 08:28:30 Version Number: 1
2/24 08:28:30 Id: 26.0
2/24 08:28:30 JobClass: STANDARD
2/24 08:28:30 Uid: 501
2/24 08:28:30 Gid: 502
2/24 08:28:30 VirtPid: -1
2/24 08:28:30 SoftKillSignal: 20
2/24 08:28:30 Cmd: "/home/condor/dev/test/tester"
2/24 08:28:30 Args: ""
2/24 08:28:30 Env: ""
2/24 08:28:30 Iwd: "/home/condor/dev/test"
2/24 08:28:30 Ckpt Wanted: TRUE
2/24 08:28:30 Is Restart: FALSE
2/24 08:28:30 Core Limit Valid: TRUE
2/24 08:28:30 Coredump Limit 0
2/24 08:28:30 User uid set to 99
2/24 08:28:30 User uid set to 99
2/24 08:28:30 BIND_ALL_INTERFACES is undefined, using default value of False
2/24 08:28:30 User Process 26.0 {
2/24 08:28:30 cmd = /home/condor/dev/test/tester
2/24 08:28:30 args =
2/24 08:28:30 env = CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE CONDOR_SCRATCH_DIR=/home/condor/execute/dir_5444
2/24 08:28:30 local_dir = dir_5444
2/24 08:28:30 cur_ckpt = dir_5444/condor_exec.26.0
2/24 08:28:30 core_name = (either 'core' or 'core.<pid>')
2/24 08:28:30 uid = 99, gid = 99
2/24 08:28:30 v_pid = -1
2/24 08:28:30 pid = (NOT CURRENTLY EXECUTING)
2/24 08:28:30 exit_status_valid = FALSE
2/24 08:28:30 exit_status = (NEVER BEEN EXECUTED)
2/24 08:28:30 ckpt_wanted = TRUE
2/24 08:28:30 coredump_limit_exists = TRUE
2/24 08:28:30 coredump_limit = 0
2/24 08:28:30 soft_kill_sig = 20
2/24 08:28:30 job_class = STANDARD
2/24 08:28:30 state = NEW
2/24 08:28:30 new_ckpt_created = FALSE
2/24 08:28:30 ckpt_transferred = FALSE
2/24 08:28:30 core_created = FALSE
2/24 08:28:30 core_transferred = FALSE
2/24 08:28:30 exit_requested = FALSE
2/24 08:28:30 image_size = -1 blocks
2/24 08:28:30 user_time = 0
2/24 08:28:30 sys_time = 0
2/24 08:28:30 guaranteed_user_time = 0
2/24 08:28:30 guaranteed_sys_time = 0
2/24 08:28:30 }
2/24 08:28:30 *FSM* Transitioning to state "GET_EXEC"
2/24 08:28:30 *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE ]
2/24 08:28:30 Entering get_exec()
2/24 08:28:30 Executable is located on submitting host
2/24 08:28:30 Expanded executable name is "/home/condor/spool/cluster26.ickpt.subproc0"
2/24 08:28:30 Going to try 3 attempts at getting the initial executable
2/24 08:28:30 Entering get_file( /home/condor/spool/cluster26.ickpt.subproc0, dir_5444/condor_exec.26.0, 0755 )
2/24 08:28:30 Generated a data socket - fd = 0
2/24 08:28:30 Internet address structure set up
2/24 08:28:30 Connection completed - returning fd 0
2/24 08:28:30 Opened "/home/condor/spool/cluster26.ickpt.subproc0" via file stream
2/24 08:28:32 Get_file() transferred 13488873 bytes, 11725661 bytes/second
2/24 08:28:32 Fetched orig ckpt file "/home/condor/spool/cluster26.ickpt.subproc0" into "dir_5444/condor_exec.26.0" with 1 attempt
2/24 08:28:32 Executable 'dir_5444/condor_exec.26.0' is linked with "$CondorVersion: 6.7.16 Feb 2 2006 $" on a "$CondorPlatform: I386-LINUX_RH9 $"
2/24 08:28:32 Done verifying executable file
2/24 08:28:32 *FSM* Executing transition function "spawn_all"
2/24 08:28:32 Pipe built
2/24 08:28:32 New pipe_fds[14,1]
2/24 08:28:32 cmd_fd = 14
2/24 08:28:32 Calling execve( "/home/condor/execute/dir_5444/condor_exec.26.0", "condor_exec.26.0", "-_condor_cmd_fd", "14", 0, "CONDOR_VM=vm1", "_condor_BIND_ALL_INTERFACES=FALSE", "CONDOR_SCRATCH_DIR=/home/condor/execute/dir_5444", 0 )
2/24 08:28:32 Started user job - PID = 5445
2/24 08:28:32 cmd_fp = 0x831fe30
2/24 08:28:32 end
2/24 08:28:32 *FSM* Transitioning to state "SUPERVISE"
2/24 08:28:32 *FSM* Got asynchronous event "CHILD_EXIT"
2/24 08:28:32 *FSM* Executing transition function "reaper"
2/24 08:28:32 Canceled alarm
2/24 08:28:32 Process 5445 exited, searching process list...
2/24 08:28:32 Found object for process 5445
2/24 08:28:32 Process 5445 exited with status 110
2/24 08:28:32 EXEC of user process failed, probably insufficient swap
2/24 08:28:32 No core file was created
2/24 08:28:32 *FSM* Transitioning to state "PROC_EXIT"
2/24 08:28:32 *FSM* Executing state func "proc_exit()" [ DIE ]
2/24 08:28:32 *FSM* Executing transition function "dispose_one"
2/24 08:28:32 Sending final status for process 26.0
2/24 08:28:32 STATUS encoded as CKPT, *NOT* TRANSFERRED
2/24 08:28:32 User time = 0.000000 seconds
2/24 08:28:32 System time = 0.000000 seconds
2/24 08:28:32 Done sending final status for process 26.0
2/24 08:28:32 Unlinked "dir_5444/condor_exec.26.0"
2/24 08:28:32 Removed directory "dir_5444"
2/24 08:28:32 *FSM* Transitioning to state "SUPERVISE"
2/24 08:28:32 *FSM* Got asynchronous event "DIE"
2/24 08:28:32 *FSM* Executing transition function "req_die"
2/24 08:28:32 Canceled alarm
2/24 08:28:32 *FSM* Transitioning to state "TERMINATE"
2/24 08:28:32 *FSM* Executing state func "terminate_all()" [ ]
2/24 08:28:32 Canceled alarm
2/24 08:28:32 *FSM* Transitioning to state "SEND_STATUS_ALL"
2/24 08:28:32 *FSM* Executing state func "dispose_all()" [ ]
2/24 08:28:32 *FSM* Reached state "END"
2/24 08:28:32 EventHandler::de_install() {
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGALRM
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGHUP
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGINT
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGUSR1
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGUSR2
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGCHLD
2/24 08:28:32 *FSM* Installed handler (nil) for signal SIGTSTP
2/24 08:28:32 }
2/24 08:28:32 ********* STARTER terminating normally **********
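
Both starter runs above follow the same path, so a short script can pull out the few lines that matter. This is a minimal Python sketch, not a Condor tool: summarize_starter_log and the record field names are hypothetical, and the patterns only match the message wording that appears in this particular log. It records, per run, the Uid from the startup info, the uid the starter switched to, and the child's PID and exit status.

```python
import re

# Markers and patterns taken from the wording of the StarterLog above.
START_MARK = "STARTER starting up"
EXEC_FAIL_MARK = "EXEC of user process failed"
JOB_UID_RE = re.compile(r"\bUid: (\d+)")
RUN_UID_RE = re.compile(r"User uid set to (\d+)")
EXIT_RE = re.compile(r"Process (\d+) exited with status (\d+)")


def summarize_starter_log(lines):
    """Return one summary dict per starter run found in `lines`."""
    runs = []
    for line in lines:
        if START_MARK in line:
            # A new starter instance begins; open a fresh record.
            runs.append({"job_uid": None, "run_uid": None,
                         "pid": None, "status": None, "exec_failed": False})
            continue
        if not runs:
            continue  # ignore anything before the first startup banner
        run = runs[-1]
        m = JOB_UID_RE.search(line)
        if m:
            run["job_uid"] = int(m.group(1))
            continue
        m = RUN_UID_RE.search(line)
        if m:
            run["run_uid"] = int(m.group(1))
            continue
        m = EXIT_RE.search(line)
        if m:
            run["pid"], run["status"] = int(m.group(1)), int(m.group(2))
            continue
        if EXEC_FAIL_MARK in line:
            run["exec_failed"] = True
    return runs


# A few lines copied from the two runs in the log above.
SAMPLE = """\
2/24 08:28:10 ********** STARTER starting up ***********
2/24 08:28:10 Uid: 501
2/24 08:28:10 User uid set to 99
2/24 08:28:11 Process 5436 exited with status 110
2/24 08:28:11 EXEC of user process failed, probably insufficient swap
2/24 08:28:30 ********** STARTER starting up ***********
2/24 08:28:30 Uid: 501
2/24 08:28:30 User uid set to 99
2/24 08:28:32 Process 5445 exited with status 110
2/24 08:28:32 EXEC of user process failed, probably insufficient swap
""".splitlines()

for run in summarize_starter_log(SAMPLE):
    print(run)
```

To run it against a real file, pass an open file object instead of SAMPLE. The summary makes it easy to see that both attempts exit with status 110 right after execve(), and that the job's Uid (501) differs from the uid the starter actually ran it as (99). Whether that difference is related to the failure is not something the log itself states.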