[Condor-devel] Condor sending email -- it's complicated
- Date: Tue, 16 Oct 2012 13:10:30 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
I'm looking for places where Condor sends notifications about jobs and
problematic events, in the hopes that I could extend Condor to send more
than just email messages.
I quickly discovered that the email-sending code is split between two
interfaces (email_*open and an Email class). Responsibility for
sending email notifications is not handled consistently, e.g.
periodic_hold results in an email only if the job is already running.
The messages sent are sometimes wrong or contain inconsistent information.
I even had a removed job finish and generate a completion email.
I only walked through this for notification=always. notification=error
is known to be problematic and comes up periodically on condor-users.
Attached are two spreadsheets covering which daemons send which messages
and what messages get sent for jobs in different universes, on the off
chance someone wants to look deeper, or clean things up.
Best,
matt
Sender,Receiver,Backup receiver,Why,Notes
condor_master,DAEMON_ADMIN_EMAIL,CONDOR_ADMIN,Daemon obituary,
condor_preen,PREEN_ADMIN,CONDOR_ADMIN,Any output (files removed/errors),
DaemonCore,CONDOR_ADMIN,,Long locking delays from child reported via DC_CHILDALIVE,
condor_schedd,CONDOR_ADMIN,,condor_gridmanager failed to write its log file,
,NotifyUser | CONDOR_ADMIN,Owner,job hold,"Only on error, not condor_hold"
,NotifyUser | CONDOR_ADMIN,Owner,job release,Unreachable code
,CONDOR_ADMIN,,failure to start condor_shadow,
,CONDOR_ADMIN,,job in runnable table w/ existing shadow record,
,NotifyUser,Owner,scheduler universe job leaving queue (job exit),
,NotifyUser,Owner,failure to expand $$ (job held),
,CONDOR_ADMIN,,file transfer took too long,
,CONDOR_ADMIN,,AppendHistory failed,
condor_starter,CONDOR_ADMIN,,AppendHistory failed,
,NotifyUser,Owner,local universe job hold,
,NotifyUser,Owner,local universe job remove,
,NotifyUser,Owner,local universe job terminate,
condor_shadow,NotifyUser,Owner,job hold,Periodic policy and errors
,NotifyUser,Owner,job remove,Periodic policy
,NotifyUser,Owner,job terminate,
,NotifyUser,Owner,parallel universe job terminate,
condor_gridmanager,NotifyUser,Owner,job terminate,
condor_job_router,NotifyUser,Owner,job terminate,
,,,,
condor_shadow.std,Not investigated,Not investigated,Not investigated,Not investigated
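In the daemon spreadsheet above, a blank Sender cell means "same daemon as the previous row". For anyone who wants to look deeper, the following is a minimal Python sketch of reading such a table and grouping senders by reason. The helper names (load_rows, senders_by_reason) and the embedded SAMPLE string are made up for illustration; SAMPLE holds only a few rows copied from the table, not the whole attachment.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the daemon spreadsheet; not the full attachment.
SAMPLE = """\
Sender,Receiver,Backup receiver,Why,Notes
condor_master,DAEMON_ADMIN_EMAIL,CONDOR_ADMIN,Daemon obituary,
condor_schedd,CONDOR_ADMIN,,condor_gridmanager failed to write its log file,
,NotifyUser | CONDOR_ADMIN,Owner,job hold,"Only on error, not condor_hold"
,NotifyUser,Owner,failure to expand $$ (job held),
condor_shadow,NotifyUser,Owner,job hold,Periodic policy and errors
,NotifyUser,Owner,job terminate,
"""


def load_rows(text, fill_columns):
    """Parse CSV text, carrying the last non-empty value down each fill column."""
    last = {}
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if not any(row.values()):
            continue  # skip all-empty separator rows such as ",,,,"
        for col in fill_columns:
            if row[col]:
                last[col] = row[col]
            else:
                row[col] = last.get(col, "")
        rows.append(row)
    return rows


def senders_by_reason(rows):
    """Map each 'Why' value to the list of daemons that send mail for it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Why"]].append(row["Sender"])
    return dict(grouped)


rows = load_rows(SAMPLE, ["Sender"])
by_reason = senders_by_reason(rows)

# Reasons handled by more than one daemon are candidates for inconsistency.
for why, senders in by_reason.items():
    if len(senders) > 1:
        print(why, "->", ", ".join(senders))
```

On these sample rows the only reason with more than one sender is "job hold", sent by both condor_schedd and condor_shadow, which matches the split described in the message.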
universe,status,feature,outcome,notes,cmd
local,running,periodic_hold,removed email (from starter) *,wrong message,-a notification=always -a requirements=true -a periodic_hold=jobstatus==2
,,periodic_remove,removed email (from starter),,-a notification=always -a requirements=true -a periodic_remove=jobstatus==2
,,condor_hold,removed email (from starter) *,wrong message,-a notification=always -a requirements=true
,,condor_rm,removed email (from starter),,-a notification=always -a requirements=true
,,job terminate,exited normally email,,-a notification=always -a requirements=true
,idle,periodic_hold,no email,,-a notification=always -a requirements=false -a periodic_hold=true
,,periodic_remove,no email,,-a notification=always -a requirements=false -a periodic_remove=true
,,condor_hold,no email,,-a notification=always -a requirements=false
,,condor_rm,no email,,-a notification=always -a requirements=false
,,,,,
vanilla,running,periodic_hold,held email (from shadow) *,>1 email,
,,periodic_remove,removed email (from shadow) *,>1 email,
,,condor_hold,no email,,
,,condor_rm,no email,,
,,job terminate,exited normally email,,
,idle,periodic_hold,no email,,
,,periodic_remove,no email,,
,,condor_hold,no email,,
,,condor_rm,no email,,
,,,,,
scheduler,running,periodic_hold,no email,,
,,periodic_remove,no email,,
,,condor_hold,no email,,
,,condor_rm,no email,,
,,job terminate,exited normally email,ImageSize wrong,
,idle,periodic_hold,no email,,
,,periodic_remove,no email,,
,,condor_hold,no email,,
,,condor_rm,no email,,
,,,,,
parallel,running,periodic_hold,no email,,-a notification=always -a machine_count=1 -a requirements=true -a periodic_hold=jobstatus==2
,,periodic_remove,no email,,-a notification=always -a machine_count=1 -a requirements=true -a periodic_remove=jobstatus==2
,,condor_hold,no email,,-a notification=always -a machine_count=1 -a requirements=true
,,condor_rm,no email,,-a notification=always -a machine_count=1 -a requirements=true
,,job terminate,completed email (from shadow),,-a notification=always -a machine_count=1 -a requirements=true
,idle,periodic_hold,no email,,-a notification=always -a machine_count=1 -a requirements=false -a periodic_hold=true
,,periodic_remove,no email,,-a notification=always -a machine_count=1 -a requirements=false -a periodic_remove=true
,,condor_hold,no email,,-a notification=always -a machine_count=1 -a requirements=false
,,condor_hold,no email,,-a notification=always -a machine_count=1 -a requirements=true
,,condor_rm,no email,,-a notification=always -a machine_count=1 -a requirements=false
,,condor_rm,completed email (from shadow),X state,-a notification=always -a machine_count=1 -a requirements=true
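The job spreadsheet above leaves the universe and status cells blank when they repeat, and its cmd column holds the -a arguments given at submit time. A rough Python sketch of filling those cells in, listing the rows that carry a note, and turning a cmd cell into attribute pairs could look like this. The function names and the SAMPLE string are hypothetical, and SAMPLE contains only a handful of rows copied from the table.

```python
import csv
import io
import shlex

# A handful of rows copied from the job spreadsheet; not the full attachment.
SAMPLE = """\
universe,status,feature,outcome,notes,cmd
local,running,periodic_hold,removed email (from starter) *,wrong message,-a notification=always -a requirements=true -a periodic_hold=jobstatus==2
,,condor_rm,removed email (from starter),,-a notification=always -a requirements=true
,idle,periodic_hold,no email,,-a notification=always -a requirements=false -a periodic_hold=true
,,,,,
vanilla,running,periodic_hold,held email (from shadow) *,>1 email,
,,condor_hold,no email,,
scheduler,running,periodic_hold,no email,,
,,job terminate,exited normally email,ImageSize wrong,
"""


def load_matrix(text):
    """Parse the table, carrying universe and status down over blank cells."""
    carried = {"universe": "", "status": ""}
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if not any(row.values()):
            continue  # all-empty rows only separate the universes
        for col in carried:
            if row[col]:
                carried[col] = row[col]
            row[col] = carried[col]
        rows.append(row)
    return rows


def submit_attrs(cmd):
    """Turn '-a key=value -a key=value ...' into a dict of submit attributes."""
    tokens = shlex.split(cmd)
    attrs = {}
    for flag, setting in zip(tokens[0::2], tokens[1::2]):
        if flag != "-a":
            raise ValueError("unexpected token: " + flag)
        # Split on the first '=' only, so 'jobstatus==2' stays intact.
        key, _, value = setting.partition("=")
        attrs[key] = value
    return attrs


def anomalies(rows):
    """Return (universe, status, feature, notes) for every row with a note."""
    return [
        (r["universe"], r["status"], r["feature"], r["notes"])
        for r in rows
        if r["notes"]
    ]


matrix = load_matrix(SAMPLE)
flagged = anomalies(matrix)
first_attrs = submit_attrs(matrix[0]["cmd"])

for entry in flagged:
    print(entry)
```

Running the same helpers over the complete table would give one line per noted problem, which may be a convenient starting list for cleanup.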