Re: [Condor-users] gridmonitor
- Date: Tue, 3 Jan 2006 11:56:38 -0600 (CST)
- From: Marco Mambelli <marco@xxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] gridmonitor
Hi Erik,
Below are the details of the submission, which all seem OK.
But I found these errors in GridmanagerLog.marco (they seem to be the
same on all servers):
1/3 11:44:47 [2463] Deleting job 587.0 from schedd
1/3 11:44:47 [2463] Schedd connection error! Will retry
1/3 11:44:47 [2463] leaving doContactSchedd()
1/3 11:44:47 [2463] (297.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (225.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (261.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (230.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (302.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (258.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (294.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (222.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (313.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (250.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (241.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (214.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (286.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (277.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:52 [2463] in doContactSchedd()
1/3 11:44:52 [2463] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
1/3 11:44:52 [2463] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
1/3 11:44:52 [2463] AUTHENTICATE_FS: used file /tmp/qmgr_yGRIAx, status: 1
1/3 11:44:52 [2463] querying for removed/held jobs
1/3 11:44:52 [2463] Using constraint ((Owner=?="marco"&&JobUniverse==9)) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= TRUE))
1/3 11:44:52 [2463] Fetched 0 job ads from schedd
On the submit host the grid monitor seems enabled (and streaming is off):
$ condor_submit csub.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2959.
$ condor_config_val ENABLE_GRID_MONITOR
TRUE
The job log file seems OK:
017 (2963.000.000) 01/03 11:42:21 Job submitted to Globus
RM-Contact: tier2-osg.uchicago.edu/jobmanager-condor
JM-Contact: https://tier2-osg.uchicago.edu:40023/32064/1136310137/
Can-Restart-JM: 1
On the server there are plenty of globus-job-manager processes (they are
mine; the start times correspond):
[marco@tier2-02 grid]$ ps --forest -fu usatlas1
UID PID PPID C STIME TTY TIME CMD
usatlas1 32064 1 0 11:42 ? 00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31911 1 0 11:42 ? 00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31909 1 0 11:42 ? 00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31908 1 0 11:42 ? 00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 25517 1 0 11:29 ? 00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
Any idea what happened?
Thanks,
Marco
On Fri, 30 Dec 2005, Erik Paulson wrote:
On Fri, Dec 30, 2005 at 12:35:04PM -0600, Marco Mambelli wrote:
Hi all,
in Panda we have problems with the Condor-G grid monitor.
Jobs seem not to use it.
The Condor submit files are like the one at the end of this email:
streaming is turned off.
The sites have the grid monitor enabled, since other jobs use it.
Something in the Condor jobs submitted by Panda makes them fall back
to the Globus job manager.
Any idea?
Use of the grid monitor is controlled by the config file of the
submitting Condor-G, not by the submit file (i.e., every job uses it
or none do).
Check
condor_config_val ENABLE_GRID_MONITOR
at the Condor-G where Panda submits jobs from.
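For reference, a sketch of what enabling it on the submitting Condor-G looks like (the config file name and the GRID_MONITOR value below are assumptions; check the layout of your local install):

```
# In the Condor-G local config file (e.g. condor_config.local -- the
# exact path is an assumption; check LOCAL_CONFIG_FILE on your install):
ENABLE_GRID_MONITOR = TRUE
# GRID_MONITOR must point at the grid_monitor script shipped with
# Condor; $(SBIN)/grid_monitor.sh is the usual location, verify locally.
GRID_MONITOR = $(SBIN)/grid_monitor.sh
```

After editing, run condor_reconfig so the daemons pick up the change, and confirm with condor_config_val ENABLE_GRID_MONITOR.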
-Erik
Thanks,
Marco
######################################################################
# Submit file template, from GriPhyN submit file
######################################################################
universe = globus
globusscheduler = tier2-osg.uchicago.edu/jobmanager-condor
stream_output = false
stream_error = false
transfer_output = true
transfer_error = true
output = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.out
error = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.err
log = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files =
transfer_input_files = DQ2ProdClient.py,storage_access_info.py
executable = pilot.py
transfer_executable = true
globusrsl = (jobtype=single)(maxWallTime=4000)
environment = APP=/share/app;GTAG=job36;QUIET_ASSERT=i;
arguments = -a /share/app -d /scratch -l /share/data -q http://tier2-01.uchicago.edu:8000/dq2/ -p 25443 -s UC_ATLAS_MWT2 -u user -w https://gridui01.usatlas.bnl.gov
copy_to_spool = false
notification = NEVER
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3)
#remote_initialdir = /share/tmp
submit_event_user_notes = pool:UC_ATLAS_MWT2
queue
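As an aside on the template above: the periodic_release/periodic_remove pair partitions held jobs by NumSystemHolds, so any held job is either released (up to 3 system holds) or removed. A small Python sketch of the two ClassAd expressions (the helper names and hold counts are illustrative, not Condor API):

```python
# Sketch of the hold-handling pair from the submit file above.
# These mirror the ClassAd expressions; they are not Condor APIs.

def periodic_release(num_system_holds: int) -> bool:
    # Mirrors: periodic_release = (NumSystemHolds <= 3)
    return num_system_holds <= 3

def periodic_remove(num_system_holds: int) -> bool:
    # Mirrors: periodic_remove = (NumSystemHolds > 3)
    return num_system_holds > 3

# For any hold count exactly one expression is true, so a held job
# is always either released for another attempt or removed for good.
for n in range(10):
    assert periodic_release(n) != periodic_remove(n)
```

The two expressions are deliberate complements: without the remove rule, a job that keeps getting held would be released forever.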
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users