Here is the clue about why the job is not running for long:

11/22 20:49:03 vm1: Changing activity: Idle -> Busy
11/22 20:49:08 vm1: State change: PREEMPT is TRUE

What is your PREEMPT expression?

condor_config_val PREEMPT
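
To see how quickly the startd gives up on the job, a minimal Python sketch along these lines may help. It is not part of Condor: the function name, the regular expression, and the default year (the log lines carry no year) are assumptions made for illustration. It reads StartLog lines and reports how many seconds pass between a slot going Busy and PREEMPT becoming TRUE.

```python
import re
from datetime import datetime

# Matches StartLog lines such as
# "11/22 20:49:03 vm1: Changing activity: Idle -> Busy"
LINE_RE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) (vm\d+): (.*)$")


def seconds_until_preempt(lines, year=2007):
    """Return (slot, seconds) pairs, one per Busy -> PREEMPT sequence.

    `year` is an assumption: the log timestamps omit it.
    """
    busy_since = {}  # slot name -> time the slot last became Busy
    results = []
    for line in lines:
        # Drop mail-quote markers ("> ") and surrounding whitespace.
        m = LINE_RE.match(line.lstrip("> ").strip())
        if not m:
            continue
        stamp = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
        slot, msg = m.group(2), m.group(3)
        if msg == "Changing activity: Idle -> Busy":
            busy_since[slot] = stamp
        elif msg == "State change: PREEMPT is TRUE" and slot in busy_since:
            delta = stamp - busy_since.pop(slot)
            results.append((slot, int(delta.total_seconds())))
    return results


# Lines copied from the StartLog excerpt quoted in this thread.
sample = [
    "11/22 20:49:03 vm1: Changing activity: Idle -> Busy",
    "11/22 20:49:08 vm1: State change: PREEMPT is TRUE",
    "11/22 20:59:03 vm1: Changing activity: Idle -> Busy",
    "11/22 20:59:09 vm1: State change: PREEMPT is TRUE",
]
print(seconds_until_preempt(sample))
```

On the excerpt quoted below this gives 5 and 6 seconds for vm1, which is while the starter is still in get_exec().
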
--Dan
Nitin Gavhane wrote:
> Hello Dan,
> The following are snapshots of the log files; please look at them.
>
> *Shadow.log*
> =================================================================================
> 11/22 20:47:29 (4.0) (4473):My_UID_Domain = "niting-w2p.corp.cdac.in"
> 11/22 20:47:35 (4.0) (4473):Shadow: Job 4.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:47:35 ( 4.0) (4473):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:47:35 (4.0) (4473):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp'
> 11/22 20:47:36 ( 4.0) (4473):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster4.proc0.subproc0.tmp
> 11/22 20:47:36 (4.0) (4473):user_time = 1 ticks
> 11/22 20:47:36 (4.0) (4473):sys_time = 0 ticks
> 11/22 20:47:36 (4.0) (4473):Asked to write event of number 1.
> 11/22 20:47:36 (4.0) (4473):Asked to write event of number 4.
> 11/22 20:47:36 (4.0) (4473):********** Shadow Exiting(107) **********
> 11/22 20:49:02 (?.?) (4574):******* Standard Shadow starting up *******
> 11/22 20:49:02 (?.?) (4574):** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:49:02 (?.?) (4574):** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:49:02 (?.?) (4574):*******************************************
> 11/22 20:49:02 (?.?) (4574):uid=0, euid=900, gid=0, egid=900
> 11/22 20:49:02 (?.?) (4574):Hostname = "<192.168.7.221:57320>", Job = 5.0
> 11/22 20:49:02 (5.0) (4574):Requesting Primary Starter
> 11/22 20:49:02 (5.0) (4574):Shadow: Request to run a job was ACCEPTED
> 11/22 20:49:02 (5.0) (4574):Shadow: RSC_SOCK connected, fd = 17
> 11/22 20:49:03 (5.0) (4574):Shadow: CLIENT_LOG connected, fd = 18
> 11/22 20:49:03 (5.0) (4574):My_Filesystem_Domain = "
> 11/22 20:49:03 (5.0) (4574):My_UID_Domain = "niting-w2p.corp.cdac.in"
> 11/22 20:49:10 (5.0) (4574):Shadow: Job 5.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:49:10 (5.0) (4574):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:49:10 (5.0) (4574):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
> 11/22 20:49:10 (5.0) (4574):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
> 11/22 20:49:10 (5.0) (4574):user_time = 0 ticks
> 11/22 20:49:10 (5.0) (4574):sys_time = 0 ticks
> 11/22 20:49:10 ( 5.0) (4574):Asked to write event of number 1.
> 11/22 20:49:10 (5.0) (4574):Asked to write event of number 4.
> 11/22 20:49:10 (5.0) (4574):********** Shadow Exiting(107) **********
> 11/22 20:59:02 (?.?) (4621):******* Standard Shadow starting up *******
> 11/22 20:59:02 (?.?) (4621):** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:59:02 (?.?) (4621):** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:59:02 (?.?) (4621):*******************************************
> 11/22 20:59:03 (?.?) (4621):uid=0, euid=900, gid=0, egid=900
> 11/22 20:59:03 (?.?) (4621):Hostname = "<192.168.7.221:57320>", Job = 5.0
> 11/22 20:59:03 (5.0) (4621):Requesting Primary Starter
> 11/22 20:59:03 (5.0) (4621):Shadow: Request to run a job was ACCEPTED
> 11/22 20:59:03 (5.0) (4621):Shadow: RSC_SOCK connected, fd = 17
> 11/22 20:59:03 (5.0) (4621):Shadow: CLIENT_LOG connected, fd = 18
> 11/22 20:59:03 (5.0) (4621):My_Filesystem_Domain = "
> 11/22 20:59:03 (5.0) (4621):My_UID_Domain = "niting-w2p.corp.cdac.in"
> 11/22 20:59:10 (5.0) (4621):Shadow: Job 5.0 exited, termsig = 9,
> coredump = 0, retcode = 0
> 11/22 20:59:10 (5.0) (4621):Shadow: Job was kicked off without a
> checkpoint
> 11/22 20:59:10 (5.0) (4621):Shadow: DoCleanup: unlinking TmpCkpt
> '/home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp'
> 11/22 20:59:10 (5.0) (4621):Trying to unlink
> /home/condor/hosts/niting-w2p/spool/cluster5.proc0.subproc0.tmp
> 11/22 20:59:10 (5.0) (4621):user_time = 1 ticks
> 11/22 20:59:11 (5.0) (4621):sys_time = 0 ticks
> 11/22 20:59:11 ( 5.0) (4621):Asked to write event of number 1.
> 11/22 20:59:11 (5.0) (4621):Asked to write event of number 4.
> 11/22 20:59:11 (5.0) (4621):********** Shadow Exiting(107) **********
> =====================================================================================
>
> *startd.log*
> ==================================
> 11/22 20:47:35 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
> 11/22 20:47:36 Starter pid 4474 exited with status 0
> 11/22 20:47:36 vm1: State change: starter exited
> 11/22 20:47:36 vm1: State change: No preempting claim, returning to owner
> 11/22 20:47:36 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:47:36 vm1: State change: IS_OWNER is false
> 11/22 20:47:36 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:47:37 DaemonCore: Command received via UDP from host
> <192.168.7.221:32863>
> 11/22 20:47:37 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:47:37 Warning: can't find resource with ClaimId
> (<192.168.7.221:57320>#1195742198#13#...)
> 11/22 20:48:56 DaemonCore: Command received via UDP from host
> <192.168.7.127:32845>
> 11/22 20:48:56 DaemonCore: received command 440 (MATCH_INFO), calling
> handler (command_match_info)
> 11/22 20:48:56 vm1: match_info called
> 11/22 20:48:57 vm1: Received match <192.168.7.221:57320>#1195742198#16#...
> 11/22 20:48:57 vm1: State change: match notification protocol successful
> 11/22 20:48:57 vm1: Changing state: Unclaimed -> Matched
> 11/22 20:48:57 DaemonCore: Command received via TCP from host
> <192.168.7.221:38154>
> 11/22 20:48:57 DaemonCore: received command 442 (REQUEST_CLAIM),
> calling handler (command_request_claim)
> 11/22 20:48:57 vm1: Request accepted.
> 11/22 20:48:57 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
> 11/22 20:48:57 vm1: State change: claiming protocol successful
> 11/22 20:48:57 vm1: Changing state: Matched -> Claimed
> 11/22 20:49:02 DaemonCore: Command received via TCP from host
> <192.168.7.221:38436>
> 11/22 20:49:02 DaemonCore: received command 444 (ACTIVATE_CLAIM),
> calling handler (command_activate_claim)
> 11/22 20:49:02 vm1: Got activate_claim request from shadow
> (<192.168.7.221:38436>)
> 11/22 20:49:02 vm1: Remote job ID is 5.0
> 11/22 20:49:02 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4575
> 11/22 20:49:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std,
> "condor_starter", niting-w2p.corp.cdac.in, 0)
> 11/22 20:49:03 vm1: Got universe "STANDARD" (1) from request classad
> 11/22 20:49:03 vm1: State change: claim-activation protocol successful
> 11/22 20:49:03 vm1: Changing activity: Idle -> Busy
> 11/22 20:49:08 vm1: State change: PREEMPT is TRUE
> 11/22 20:49:08 vm1: Changing activity: Busy -> Retiring
> 11/22 20:49:08 vm1: State change: retirement ended/expired
> 11/22 20:49:08 vm1: State change: WANT_VACATE is FALSE
> 11/22 20:49:08 vm1: Changing state and activity: Claimed/Retiring ->
> Preempting/Killing
> 11/22 20:49:10 DaemonCore: Command received via TCP from host
> <192.168.7.221:37844>
> 11/22 20:49:10 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 11/22 20:49:10 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
> 11/22 20:49:10 Starter pid 4575 exited with status 0
> 11/22 20:49:10 vm1: State change: starter exited
> 11/22 20:49:10 vm1: State change: No preempting claim, returning to owner
> 11/22 20:49:10 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:49:11 vm1: State change: IS_OWNER is false
> 11/22 20:49:11 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:49:11 DaemonCore: Command received via UDP from host
> <192.168.7.221:32878>
> 11/22 20:49:12 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:49:12 Warning: can't find resource with ClaimId
> (<192.168.7.221:57320>#1195742198#16#...)
> 11/22 20:58:57 DaemonCore: Command received via UDP from host
> <192.168.7.127:32861>
> 11/22 20:58:57 DaemonCore: received command 440 (MATCH_INFO), calling
> handler (command_match_info)
> 11/22 20:58:57 vm1: match_info called
> 11/22 20:58:57 vm1: Received match <192.168.7.221:57320>#1195742198#18#...
> 11/22 20:58:57 vm1: State change: match notification protocol successful
> 11/22 20:58:57 vm1: Changing state: Unclaimed -> Matched
> 11/22 20:58:57 DaemonCore: Command received via TCP from host
> <192.168.7.221:40060>
> 11/22 20:58:58 DaemonCore: received command 442 (REQUEST_CLAIM),
> calling handler (command_request_claim)
> 11/22 20:58:58 vm1: Request accepted.
> 11/22 20:58:58 vm1: Remote owner is psegrid@xxxxxxxxxxxxxxxxxxxxxxx
> 11/22 20:58:58 vm1: State change: claiming protocol successful
> 11/22 20:58:58 vm1: Changing state: Matched -> Claimed
> 11/22 20:59:03 DaemonCore: Command received via TCP from host
> <192.168.7.221:56177>
> 11/22 20:59:03 DaemonCore: received command 444 (ACTIVATE_CLAIM),
> calling handler (command_activate_claim)
> 11/22 20:59:03 vm1: Got activate_claim request from shadow
> (<192.168.7.221:56177>)
> 11/22 20:59:03 vm1: Remote job ID is 5.0
> 11/22 20:59:03 vm1: exec_starter( niting-w2p.corp.cdac.in, 10, 11 ) : pid 4622
> 11/22 20:59:03 vm1: execl(/usr/local/condor/sbin/condor_starter.std,
> "condor_starter", niting-w2p.corp.cdac.in, 0)
> 11/22 20:59:03 vm1: Got universe "STANDARD" (1) from request classad
> 11/22 20:59:03 vm1: State change: claim-activation protocol successful
> 11/22 20:59:03 vm1: Changing activity: Idle -> Busy
> 11/22 20:59:09 vm1: State change: PREEMPT is TRUE
> 11/22 20:59:09 vm1: Changing activity: Busy -> Retiring
> 11/22 20:59:09 vm1: State change: retirement ended/expired
> 11/22 20:59:09 vm1: State change: WANT_VACATE is FALSE
> 11/22 20:59:09 vm1: Changing state and activity: Claimed/Retiring ->
> Preempting/Killing
> 11/22 20:59:10 DaemonCore: Command received via TCP from host
> <192.168.7.221:39386>
> 11/22 20:59:10 DaemonCore: received command 404
> (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
> 11/22 20:59:10 vm1: Got KILL_FRGN_JOB while in Preempting state,
> ignoring.
> 11/22 20:59:11 Starter pid 4622 exited with status 0
> 11/22 20:59:11 vm1: State change: starter exited
> 11/22 20:59:11 vm1: State change: No preempting claim, returning to owner
> 11/22 20:59:11 vm1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 11/22 20:59:11 vm1: State change: IS_OWNER is false
> 11/22 20:59:11 vm1: Changing state: Owner -> Unclaimed
> 11/22 20:59:12 DaemonCore: Command received via UDP from host
> <192.168.7.221:32895>
> 11/22 20:59:12 DaemonCore: received command 443 (RELEASE_CLAIM),
> calling handler (command_release_claim)
> 11/22 20:59:12 Warning: can't find resource with ClaimId
> (<192.168.7.221:57320>#1195742198#18#...)
> ===================================================================================
> starter.vm1
> ======================================================
> 11/22 20:47:35 *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:47:35 *FSM* Executing state func "dispose_all()" [ ]
> 11/22 20:47:35 Sending final status for process 4.0
> 11/22 20:47:35 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:47:35 User time = 0.000000 seconds
> 11/22 20:47:35 System time = 0.000000 seconds
> 11/22 20:47:35 Can't unlink "dir_4474/condor_exec.4.0" - errno = 2
> 11/22 20:47:35 Removed directory "dir_4474"
> 11/22 20:47:36 *FSM* Reached state "END"
> 11/22 20:47:36 ********* STARTER terminating normally **********
> 11/22 20:49:03 ********** STARTER starting up ***********
> 11/22 20:49:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:49:03 ** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:49:03 ******************************************
> 11/22 20:49:03 Submitting machine is "niting-w2p.corp.cdac.in"
> 11/22 20:49:03 EventHandler {
> 11/22 20:49:03 func = 0x80e3bde
> 11/22 20:49:03 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2
> SIGCHLD SIGTSTP
> 11/22 20:49:04 }
> 11/22 20:49:04 Done setting resource limits
> 11/22 20:49:04 *FSM* Transitioning to state "GET_PROC"
> 11/22 20:49:04 *FSM* Executing state func "get_proc()" [ ]
> 11/22 20:49:04 Entering get_proc()
> 11/22 20:49:04 Entering get_job_info()
> 11/22 20:49:04 Startup Info:
> 11/22 20:49:04 Version Number: 1
> 11/22 20:49:05 Id: 5.0
> 11/22 20:49:05 JobClass: STANDARD
> 11/22 20:49:05 Uid: 503
> 11/22 20:49:05 Gid: 503
> 11/22 20:49:05 VirtPid: -1
> 11/22 20:49:05 SoftKillSignal: 20
> 11/22 20:49:05 Cmd: "/home/psegrid/NIP/nip"
> 11/22 20:49:05 Args: ""
> 11/22 20:49:05 Env:
> "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH="
> 11/22 20:49:05 Iwd: "/home/psegrid"
> 11/22 20:49:05 Ckpt Wanted: TRUE
> 11/22 20:49:05 Is Restart: FALSE
> 11/22 20:49:05 Core Limit Valid: TRUE
> 11/22 20:49:05 Coredump Limit 0
> 11/22 20:49:06 User uid set to 503
> 11/22 20:49:06 User uid set to 503
> 11/22 20:49:06 User Process 5.0 {
> 11/22 20:49:06 cmd = /home/psegrid/NIP/nip
> 11/22 20:49:06 args =
> 11/22 20:49:06 env = GLOBUS_LOCATION=/usr/local/globus-4.0.5/
> X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY=
> X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid
> SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch
> JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878
> LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE
> CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4575
> 11/22 20:49:06 local_dir = dir_4575
> 11/22 20:49:06 cur_ckpt = dir_4575/condor_exec.5.0
> 11/22 20:49:06 core_name = (either 'core' or 'core.<pid>')
> 11/22 20:49:06 uid = 503, gid = 503
> 11/22 20:49:06 v_pid = -1
> 11/22 20:49:06 pid = (NOT CURRENTLY EXECUTING)
> 11/22 20:49:06 exit_status_valid = FALSE
> 11/22 20:49:07 exit_status = (NEVER BEEN EXECUTED)
> 11/22 20:49:07 ckpt_wanted = TRUE
> 11/22 20:49:07 coredump_limit_exists = TRUE
> 11/22 20:49:07 coredump_limit = 0
> 11/22 20:49:07 soft_kill_sig = 20
> 11/22 20:49:07 job_class = STANDARD
> 11/22 20:49:07 state = NEW
> 11/22 20:49:07 new_ckpt_created = FALSE
> 11/22 20:49:07 ckpt_transferred = FALSE
> 11/22 20:49:07 core_created = FALSE
> 11/22 20:49:07 core_transferred = FALSE
> 11/22 20:49:07 exit_requested = FALSE
> 11/22 20:49:07 image_size = -1 blocks
> 11/22 20:49:08 user_time = 0
> 11/22 20:49:08 sys_time = 0
> 11/22 20:49:08 guaranteed_user_time = 0
> 11/22 20:49:08 guaranteed_sys_time = 0
> 11/22 20:49:08 }
> 11/22 20:49:08 *FSM* Transitioning to state "GET_EXEC"
> 11/22 20:49:08 *FSM* Executing state func "get_exec()" [ SUSPEND
> VACATE DIE ]
> 11/22 20:49:08 Entering get_exec()
> 11/22 20:49:08 Executable is located on submitting host
> 11/22 20:49:08 *FSM* Got asynchronous event "DIE"
> 11/22 20:49:09 *FSM* Executing transition function "req_die"
> 11/22 20:49:09 req_exit_all: Proc -1 in state NEW
> 11/22 20:49:09 *FSM* Transitioning to state "TERMINATE"
> 11/22 20:49:09 *FSM* Executing state func "terminate_all()" [ ]
> 11/22 20:49:09 *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:49:09 *FSM* Executing state func "dispose_all()" [ ]
> 11/22 20:49:09 Sending final status for process 5.0
> 11/22 20:49:09 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:49:09 User time = 0.000000 seconds
> 11/22 20:49:09 System time = 0.000000 seconds
> 11/22 20:49:10 Can't unlink "dir_4575/condor_exec.5.0" - errno = 2
> 11/22 20:49:10 Removed directory "dir_4575"
> 11/22 20:49:10 *FSM* Reached state "END"
> 11/22 20:49:10 ********* STARTER terminating normally **********
> 11/22 20:59:03 ********** STARTER starting up ***********
> 11/22 20:59:03 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/22 20:59:03 ** $CondorPlatform: I386-LINUX_RH9 $
> 11/22 20:59:03 ******************************************
> 11/22 20:59:03 Submitting machine is "niting-w2p.corp.cdac.in"
> 11/22 20:59:04 EventHandler {
> 11/22 20:59:04 func = 0x80e3bde
> 11/22 20:59:04 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2
> SIGCHLD SIGTSTP
> 11/22 20:59:04 }
> 11/22 20:59:04 Done setting resource limits
> 11/22 20:59:05 *FSM* Transitioning to state "GET_PROC"
> 11/22 20:59:05 *FSM* Executing state func "get_proc()" [ ]
> 11/22 20:59:05 Entering get_proc()
> 11/22 20:59:05 Entering get_job_info()
> 11/22 20:59:05 Startup Info:
> 11/22 20:59:05 Version Number: 1
> 11/22 20:59:05 Id: 5.0
> 11/22 20:59:05 JobClass: STANDARD
> 11/22 20:59:05 Uid: 503
> 11/22 20:59:05 Gid: 503
> 11/22 20:59:05 VirtPid: -1
> 11/22 20:59:05 SoftKillSignal: 20
> 11/22 20:59:06 Cmd: "/home/psegrid/NIP/nip"
> 11/22 20:59:06 Args: ""
> 11/22 20:59:06 Env:
> "GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878;LD_LIBRARY_PATH="
> 11/22 20:59:06 Iwd: "/home/psegrid"
> 11/22 20:59:06 Ckpt Wanted: TRUE
> 11/22 20:59:06 Is Restart: FALSE
> 11/22 20:59:06 Core Limit Valid: TRUE
> 11/22 20:59:06 Coredump Limit 0
> 11/22 20:59:06 User uid set to 503
> 11/22 20:59:06 User uid set to 503
> 11/22 20:59:06 User Process 5.0 {
> 11/22 20:59:06 cmd = /home/psegrid/NIP/nip
> 11/22 20:59:06 args =
> 11/22 20:59:06 env = GLOBUS_LOCATION=/usr/local/globus-4.0.5/
> X509_CERT_DIR=/etc/grid-security/certificates X509_USER_PROXY=
> X509_USER_CERT= X509_USER_KEY= HOME=/home/psegrid LOGNAME=psegrid
> SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch
> JAVA_HOME=/usr/java/jdk1.6.0_03/jre GLOBUS_GRAM_JOB_HANDLE=
> https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?3880a8a0-990e-11dc-814c-f74218502878
> LD_LIBRARY_PATH= CONDOR_VM=vm1 _condor_BIND_ALL_INTERFACES=FALSE
> CONDOR_SCRATCH_DIR=/home/condor/hosts/niting-w2p/execute/dir_4622
> 11/22 20:59:07 local_dir = dir_4622
> 11/22 20:59:07 cur_ckpt = dir_4622/condor_exec.5.0
> 11/22 20:59:07 core_name = (either 'core' or 'core.<pid>')
> 11/22 20:59:07 uid = 503, gid = 503
> 11/22 20:59:07 v_pid = -1
> 11/22 20:59:07 pid = (NOT CURRENTLY EXECUTING)
> 11/22 20:59:07 exit_status_valid = FALSE
> 11/22 20:59:07 exit_status = (NEVER BEEN EXECUTED)
> 11/22 20:59:07 ckpt_wanted = TRUE
> 11/22 20:59:07 coredump_limit_exists = TRUE
> 11/22 20:59:07 coredump_limit = 0
> 11/22 20:59:07 soft_kill_sig = 20
> 11/22 20:59:07 job_class = STANDARD
> 11/22 20:59:08 state = NEW
> 11/22 20:59:08 new_ckpt_created = FALSE
> 11/22 20:59:08 ckpt_transferred = FALSE
> 11/22 20:59:08 core_created = FALSE
> 11/22 20:59:08 core_transferred = FALSE
> 11/22 20:59:08 exit_requested = FALSE
> 11/22 20:59:08 image_size = -1 blocks
> 11/22 20:59:08 user_time = 0
> 11/22 20:59:08 sys_time = 0
> 11/22 20:59:08 guaranteed_user_time = 0
> 11/22 20:59:08 guaranteed_sys_time = 0
> 11/22 20:59:08 }
> 11/22 20:59:08 *FSM* Transitioning to state "GET_EXEC"
> 11/22 20:59:09 *FSM* Executing state func "get_exec()" [ SUSPEND
> VACATE DIE ]
> 11/22 20:59:09 Entering get_exec()
> 11/22 20:59:09 *FSM* Got asynchronous event "DIE"
> 11/22 20:59:09 *FSM* Executing transition function "req_die"
> 11/22 20:59:09 req_exit_all: Proc -1 in state NEW
> 11/22 20:59:09 *FSM* Transitioning to state "TERMINATE"
> 11/22 20:59:09 *FSM* Executing state func "terminate_all()" [ ]
> 11/22 20:59:09 *FSM* Transitioning to state "SEND_STATUS_ALL"
> 11/22 20:59:10 *FSM* Executing state func "dispose_all()" [ ]
> 11/22 20:59:10 Sending final status for process 5.0
> 11/22 20:59:10 STATUS encoded as CKPT, *NOT* TRANSFERRED
> 11/22 20:59:10 User time = 0.000000 seconds
> 11/22 20:59:10 System time = 0.000000 seconds
> 11/22 20:59:10 Can't unlink "dir_4622/condor_exec.5.0" - errno = 2
> 11/22 20:59:10 Removed directory "dir_4622"
> 11/22 20:59:10 *FSM* Reached state "END"
> 11/22 20:59:10 ********* STARTER terminating normally **********
> =====================================================================
> *globus-condor.log*
> ==============================================================
> <c>
> <a n="MyType"><s>JobAbortedEvent</s></a>
> <a n="EventTypeNumber"><i>9</i></a>
> <a n="MyType"><s>JobAbortedEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:48:10</s></a>
> <a n="Cluster"><i>4</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="Reason"><s>via condor_rm (by user psegrid)</s></a>
> </c>
> <c>
> <a n="MyType"><s>SubmitEvent</s></a>
> <a n="EventTypeNumber"><i>0</i></a>
> <a n="MyType"><s>SubmitEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:48:55</s></a>
> <a n="Cluster"><i>5</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="SubmitHost"><s><192.168.7.221:42898></s></a>
> </c>
> <c>
> <a n="MyType"><s>ExecuteEvent</s></a>
> <a n="EventTypeNumber"><i>1</i></a>
> <a n="MyType"><s>ExecuteEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
> <a n="Cluster"><i>5</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="ExecuteHost"><s><192.168.7.221:57320></s></a>
> </c>
> <c>
> <a n="MyType"><s>JobEvictedEvent</s></a>
> <a n="EventTypeNumber"><i>4</i></a>
> <a n="MyType"><s>JobEvictedEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:49:10</s></a>
> <a n="Cluster"><i>5</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="Checkpointed"><b v="f"/></a>
> <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
> <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
> <a n="SentBytes"><r>2.570000000000000E+02</r></a>
> <a n="ReceivedBytes"><r> 6.650000000000000E+02</r></a>
> <a n="TerminatedAndRequeued"><b v="f"/></a>
> <a n="TerminatedNormally"><b v="f"/></a>
> </c>
> <c>
> <a n="MyType"><s>ExecuteEvent</s></a>
> <a n="EventTypeNumber"><i>1</i></a>
> <a n="MyType"><s>ExecuteEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
> <a n="Cluster"><i>5</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="ExecuteHost"><s><192.168.7.221:57320></s></a>
> </c>
> <c>
> <a n="MyType"><s>JobEvictedEvent</s></a>
> <a n="EventTypeNumber"><i>4</i></a>
> <a n="MyType"><s>JobEvictedEvent</s></a>
> <a n="EventTime"><s>2007-11-22T20:59:11</s></a>
> <a n="Cluster"><i>5</i></a>
> <a n="Proc"><i>0</i></a>
> <a n="Subproc"><i>0</i></a>
> <a n="Checkpointed"><b v="f"/></a>
> <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
> <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
> <a n="SentBytes"><r> 2.490000000000000E+02</r></a>
> <a n="ReceivedBytes"><r> 5.970000000000000E+02</r></a>
> <a n="TerminatedAndRequeued"><b v="f"/></a>
> <a n="TerminatedNormally"><b v="f"/></a>
> </c>
> ====================================================================
>
> Nitin
>
> On Nov 20, 2007 9:24 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
>
>
> > Last successful match: Tue Nov 20 22:36:21 2007
>
>
> This indicates that the job is successfully getting matched to a
> machine. Something must be going wrong when Condor tries to run the
> job on that machine. Look for clues about what is going wrong here:
>
> The "user log": /usr/local/globus-4.0.5//var/globus-condor.log
> The ShadowLog (condor_config_val SHADOW_LOG)
> The StartLog (condor_config_val STARTD_LOG)
> The StarterLog (condor_config_val STARTER_LOG)
>
> I hope that helps!
>
> --Dan
>
> Nitin Gavhane wrote:
>
> > Hello all,
> > I am submitting a job through Globus to Condor, but the job stays
> > in the idle state. The job details are as follows.
> > ================================================
> > *The Job Description Generated by GRAM is as follows *
> >
> > [condor@niting-w2p etc]$ cat /tmp/condor_job_description
> > #
> > # description file for condor submission
> > #
> > Universe = standard
> > Notification = Never
> > Executable = /home/psegrid/NIP/nip
> > Requirements = OpSys == "LINUX" && Arch == "INTEL"
> > Environment =
> > GLOBUS_LOCATION=/usr/local/globus-4.0.5/;X509_CERT_DIR=/etc/grid-security/certificates;X509_USER_PROXY=;X509_USER_CERT=;X509_USER_KEY=;HOME=/home/psegrid;LOGNAME=psegrid;SCRATCH_DIRECTORY=/home/psegrid/.globus/scratch;JAVA_HOME=/usr/java/jdk1.6.0_03/jre;GLOBUS_GRAM_JOB_HANDLE=
> > https://192.168.7.221:8443/wsrf/services/ManagedExecutableJobService?7f408200-9789-11dc-9f1a-b41f06e1e2ea;LD_LIBRARY_PATH=
> > Arguments =
> > InitialDir = /home/psegrid
> > Input = /dev/null
> > Log = /usr/local/globus-4.0.5//var/globus-condor.log
> > log_xml = True
> > #Extra attributes specified by client
> >
> > Output = /home/psegrid/stdout
> > Error = /home/psegrid/stderr
> > queue 1
> >
> >
> =======================================================================
> > *[psegrid@niting-w2p NIP]$ condor_q -better-analyze*
> >
> >
> > -- Submitter: niting-w2p.corp.cdac.in : <192.168.7.221:42993> :
> > niting-w2p.corp.cdac.in
> > ---
> > 005.000: Run analysis summary. Of 7 machines,
> > 4 are rejected by your job's requirements
> > 0 reject your job because of their own requirements
> > 0 match but are serving users with a better priority in the pool
> > 3 match but reject the job for unknown reasons
> > 0 match but will not currently preempt their existing job
> > 0 are available to run your job
> > Last successful match: Tue Nov 20 22:36:21 2007
> >
> > The Requirements expression for your job is:
> >
> > ( target.OpSys == "LINUX" && target.Arch == "INTEL" ) &&
> > ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
> > ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
> > ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
> >
> >     Condition                                                                      Machines Matched  Suggestion
> >     ---------                                                                      ----------------  ----------
> > 1   target.Arch == "INTEL"                                                         3
> > 2   target.OpSys == "LINUX"                                                        7
> > 3   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )     7
> > 4   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )  7
> > 5   ( target.Disk >= 20000 )                                                       7
> > 6   ( ( 1024 * target.Memory ) >= 20000 )                                          7
> >
> >
> >
> >
> > ==========================================================
> > *[psegrid@niting-w2p NIP]$ condor_status*
> >
> > Name          OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
> >
> > vm1@niting-w2 LINUX  INTEL   Unclaimed  Idle      0.000   469  0+00:05:26
> > vm2@niting-w2 LINUX  INTEL   Unclaimed  Idle      0.140   469  0+00:26:42
> > sskadam-w2p.c LINUX  INTEL   Unclaimed  Idle      0.000   248  0+00:44:38
> > vm1@psewebs-w LINUX  X86_64  Unclaimed  Idle      0.400   753  0+00:30:04
> > vm2@psewebs-w LINUX  X86_64  Unclaimed  Idle      0.000   753  0+00:30:05
> > vm3@psewebs-w LINUX  X86_64  Unclaimed  Idle      0.000   753  0+00:30:06
> > vm4@psewebs-w LINUX  X86_64  Unclaimed  Idle      0.000   753  0+00:30:27
> >
> >               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
> >
> >  INTEL/LINUX      3      0        0          3        0           0         0
> > X86_64/LINUX      4      0        0          4        0           0         0
> >
> >        Total      7      0        0          7        0           0         0
> > ==============================================================
> > *The DAEMON details for all three machines are as follows *
> >
> > [condor@niting-w2p etc]$ ./test.sh
> > current file: condor_config
> > ## checkpoint server isn't available or USE_CKPT_SERVER is set to
> > USE_CKPT_SERVER = True
> > CKPT_SERVER_HOST = psewebs-w2p.corp.cdac.in
> > ## checkpoint server? If False, the CKPT_SERVER_HOST set on
> > ## the submit machine is used. Otherwise, the CKPT_SERVER_HOST set
> > STARTER_CHOOSES_CKPT_SERVER = True
> > #WALL_CLOCK_CKPT_INTERVAL = 3600
> > ## setting is only used if USE_CKPT_SERVER (from above) is True.
> > #COMPRESS_PERIODIC_CKPT = False
> > #COMPRESS_VACATE_CKPT = False
> > #SLOW_CKPT_SPEED = 0
> > DAEMON_LIST = MASTER, STARTD, SCHEDD
> > #DC_DAEMON_LIST = \
> > =============
> > current file: psewebs-w2p.local
> > USE_CKPT_SERVER = True
> > CKPT_SERVER_HOST = psewebs-w2p.corp.cdac.in
> > DAEMON_LIST = MASTER, STARTD, SCHEDD
> > DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
> > =============
> > current file: niting-w2p.local
> > USE_CKPT_SERVER = True
> > CKPT_SERVER_HOST = psewebs-w2p.corp.cdac.in
> > DAEMON_LIST = MASTER, STARTD, SCHEDD
> > =============
> > current file: sskadam-w2p.local
> > USE_CKPT_SERVER = True
> > CKPT_SERVER_HOST = psewebs-w2p.corp.cdac.in
> > DAEMON_LIST = MASTER, STARTD, SCHEDD
> > ===============================
> >
> > Please tell me what is wrong with the job submission.
> > Thank you.
> > --
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Nitin M. Gavhane
> > MS in Advanced Software Technologies
> > International Institute of Information Technology
> > P-14,Hinjewadi,Pune, India.
> >
> ---------------------------------------------------------------------------------------------------------------------------
>
> >
> >
> >------------------------------------------------------------------------
> >
> >_______________________________________________
> >Condor-users mailing list
> >To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >subject: Unsubscribe
> >You can also unsubscribe by visiting
> >https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> >The archives can be found at:
> >https://lists.cs.wisc.edu/archive/condor-users/
> >
> >
>
>
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Nitin M. Gavhane
> MS in Advanced Software Technologies
> International Institute of Information Technology
> P-14,Hinjewadi,Pune, India.
> ---------------------------------------------------------------------------------------------------------------------------