
Re: [Condor-users] gridmonitor



Hi Erik,
Below are the details of the submission, which all seem OK.
But I found these errors in GridmanagerLog.marco (they seem to be the same for all servers):
1/3 11:44:47 [2463] Deleting job 587.0 from schedd
1/3 11:44:47 [2463] Schedd connection error! Will retry
1/3 11:44:47 [2463] leaving doContactSchedd()
1/3 11:44:47 [2463] (297.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (225.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (261.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (230.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (302.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (258.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:48 [2463] (294.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (222.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (313.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (250.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (241.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (214.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (286.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:49 [2463] (277.0) doEvaluateState called: gmState GM_DELETE, globusState 32
1/3 11:44:52 [2463] in doContactSchedd()
1/3 11:44:52 [2463] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
1/3 11:44:52 [2463] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
1/3 11:44:52 [2463] AUTHENTICATE_FS: used file /tmp/qmgr_yGRIAx, status: 1
1/3 11:44:52 [2463] querying for removed/held jobs
1/3 11:44:52 [2463] Using constraint ((Owner=?="marco"&&JobUniverse==9)) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= TRUE))
1/3 11:44:52 [2463] Fetched 0 job ads from schedd


On the submit host the gridmonitor seems enabled (and streaming is off):
$ condor_submit csub.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2959.
$ condor_config_val ENABLE_GRID_MONITOR
TRUE
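
As a further check, the related grid monitor settings can be queried the same way on the submit host (the macro names below, other than ENABLE_GRID_MONITOR, are assumed from a typical Condor-G configuration):
$ condor_config_val GRID_MONITOR
$ condor_config_val GRID_MONITOR_HEARTBEAT_TIMEOUT
$ condor_config_val GRID_MONITOR_RETRY_DURATION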

The job log file seems OK:
017 (2963.000.000) 01/03 11:42:21 Job submitted to Globus
    RM-Contact: tier2-osg.uchicago.edu/jobmanager-condor
    JM-Contact: https://tier2-osg.uchicago.edu:40023/32064/1136310137/
    Can-Restart-JM: 1



On the server there are plenty of globus-job-manager processes (they are mine; the start times correspond):
[marco@tier2-02 grid]$ ps --forest -fu usatlas1
UID        PID  PPID  C STIME TTY          TIME CMD
usatlas1 32064     1  0 11:42 ?        00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31911     1  0 11:42 ?        00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31909     1  0 11:42 ?        00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 31908     1  0 11:42 ?        00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
usatlas1 25517     1  0 11:29 ?        00:00:00 globus-job-manager -conf /usr/local/grid/globus/etc/globus-job-manager.conf
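
If the grid monitor were actually in use, I'd expect the jobmanagers to exit and a grid monitor helper to be running under the same account. A quick check for that (the process name grid_monitor.sh is an assumption about a typical Condor-G/Globus install):
[marco@tier2-02 grid]$ ps -fu usatlas1 | grep grid_monitor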

Any idea what happened?
Thanks,
Marco


On Fri, 30 Dec 2005, Erik Paulson wrote:

On Fri, Dec 30, 2005 at 12:35:04PM -0600, Marco Mambelli wrote:
Hi all,
in Panda we have problems with the Condor-G GridMonitor.
Jobs don't seem to use it.
The Condor submit files look like the one at the end of this email:
streaming is turned off.
The sites have the grid monitor enabled, since other jobs use it.
Something in the Condor jobs submitted by Panda makes them
fall back to the Globus job manager.
Any idea?


Use of the grid monitor is controlled by the config file of the
submitting Condor-G, not by the submit file (i.e., every job uses it
or none do).

Check
condor_config_val ENABLE_GRID_MONITOR

at the Condor-G where Panda submits jobs from.

-Erik
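
(For reference, a minimal sketch of what that looks like in the condor_config of the submitting Condor-G; only ENABLE_GRID_MONITOR is confirmed above, the GRID_MONITOR line and its path are assumptions about a typical install:

# Let the gridmanager use the grid monitor instead of keeping one
# globus-job-manager alive per job on the gatekeeper
ENABLE_GRID_MONITOR = TRUE
# Path to the grid monitor script shipped with Condor-G (assumed location)
GRID_MONITOR = $(SBIN)/grid_monitor.sh
)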

Thanks,
Marco


######################################################################
# Submit file template, from GriPhyN submit file
######################################################################
universe = globus
globusscheduler = tier2-osg.uchicago.edu/jobmanager-condor
stream_output = false
stream_error  = false
transfer_output = true
transfer_error = true
output = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.out
error = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.err
log = /grid/data4a/users/marco/panda3/cvs/offline/Production/panda/jobscheduler/myjobs/UC_ATLAS_MWT2-2005-12-29-13-15-11-918695/pilot.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files =
transfer_input_files = DQ2ProdClient.py,storage_access_info.py
executable = pilot.py
transfer_executable = true
globusrsl = (jobtype=single)(maxWallTime=4000)
environment = APP=/share/app;GTAG=job36;QUIET_ASSERT=i;
arguments = -a /share/app -d /scratch -l /share/data -q http://tier2-01.uchicago.edu:8000/dq2/ -p 25443 -s UC_ATLAS_MWT2 -u user -w https://gridui01.usatlas.bnl.gov
copy_to_spool = false
notification = NEVER
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3)
#remote_initialdir = /share/tmp
submit_event_user_notes = pool:UC_ATLAS_MWT2
queue

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users