HTCondor Project List Archives




[Condor-devel] RFC: tracking the birthdate of the job queue



In quill, the primary key for job data is <cluster, proc>. It is assumed
that when the state of the schedd is destroyed (i.e., job_queue.log is
deleted and <cluster, proc> starts over at <1,0>), the database associated
with that schedd is also destroyed. This is usually true, because most
quill databases run on the same machine as the schedd.

In Quill++ (and possibly the manure spreader), there is one database 
for multiple schedds - the primary key for the job becomes
<schedd_name, cluster, proc>. However, the database is likely to be
centrally run and backed up - possibly more so than the schedds (certainly
this will be true at UW Computer Sciences). This means when the schedd
resets the job queue and <cluster, proc> rolls back to (1,0), new jobs
will conflict with old jobs in the database. 
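
To make the conflict concrete, here is a minimal sketch in which an
in-memory map stands in for the central database. JobTable,
insertAfterQueueReset, and the schedd name are hypothetical and exist only
for illustration; none of this is Quill++ code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// <schedd_name, cluster, proc>, the three-part key described above.
using JobKey = std::tuple<std::string, int, int>;

// Hypothetical stand-in for a central jobs table.
struct JobTable {
    std::map<JobKey, std::string> rows;

    // Returns false when the key already exists, much as a database
    // would reject an INSERT that violates the primary key.
    bool insert(const std::string &schedd, int cluster, int proc,
                const std::string &owner) {
        return rows.emplace(JobKey(schedd, cluster, proc), owner).second;
    }
};

// Simulates a schedd whose job queue is wiped while the database survives.
bool insertAfterQueueReset() {
    JobTable table;
    bool first = table.insert("schedd.example", 1, 0, "alice");
    assert(first);
    (void)first;
    // job_queue.log is deleted here; numbering restarts at <1,0>.
    return table.insert("schedd.example", 1, 0, "bob");
}
```

insertAfterQueueReset() returns false: the second job gets the same key as
the first, even though it belongs to a different incarnation of the queue.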

I propose changing the HistoricalSequenceNumber record in the job queue
to include the timestamp of when the first record in job_queue.log
was written. This will function as the JobQueueBirthdate. This number
will be preserved through each rotation of the job queue. Happily, the
existing log record already carries a timestamp (107 is the historical
sequence number record type):

107 5 CreationTimestamp 1163801657
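
For anyone who wants to pull the value out of a log by hand, a rough sketch
of parsing that line follows. It assumes the four whitespace-separated
fields are the record type, the historical sequence number, the literal
label, and the timestamp; that reading of the second field is my
assumption from the sample, and HistoricalSeqRecord and
parseHistoricalSeqRecord are made-up names, not Condor functions.

```cpp
#include <ctime>
#include <sstream>
#include <string>

// Hypothetical holder for the fields of a type-107 record.
struct HistoricalSeqRecord {
    unsigned long sequence_number;
    std::time_t timestamp;
};

// Parses a line such as "107 5 CreationTimestamp 1163801657".
// Returns true and fills 'out' only if the line has the assumed layout.
bool parseHistoricalSeqRecord(const std::string &line,
                              HistoricalSeqRecord &out) {
    std::istringstream in(line);
    int op_type = 0;
    unsigned long seq = 0;
    std::string label;
    long long ts = 0;

    if (!(in >> op_type >> seq >> label >> ts)) {
        return false;  // too few fields, or a non-numeric field
    }
    if (op_type != 107 || label != "CreationTimestamp") {
        return false;  // not a historical sequence number record
    }
    out.sequence_number = seq;
    out.timestamp = static_cast<std::time_t>(ts);
    return true;
}
```

Given the sample line above, this fills in a sequence number of 5 and a
timestamp of 1163801657.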

Currently, CreationTimestamp means "when was this file created", and
is always just set to time(NULL). Nothing in Condor currently looks at it.

The patch propagates the original through log rotations, and inserts the
value into the schedd ad. This way, tools like condor_q can construct
appropriate queries to find the current job queue from a database that
may have multiple instances of the job queue. The new Quill++ primary
key would be <schedd_name, job_queue_birthdate, cluster, proc>
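
As an illustration of the intended behaviour, here is a toy model of a log
header that either restamps itself on rotation (the current behaviour) or
carries the original birthdate forward (the proposed behaviour), plus the
four-part key built from it. ToyJobQueueLog, QuillKey, and makeKey are
hypothetical names; this is not the ClassAdLog code from the patch.

```cpp
#include <ctime>
#include <string>
#include <tuple>

// Toy model of the job queue log header; not the real ClassAdLog.
class ToyJobQueueLog {
public:
    // A brand-new log is born "now".
    explicit ToyJobQueueLog(std::time_t now)
        : birthdate_(now), header_timestamp_(now) {}

    // Rotation rewrites the header record. With preserve_birthdate the
    // original birthdate is written again; without it the header is
    // stamped with the rotation time, as happens today.
    void rotate(std::time_t now, bool preserve_birthdate) {
        header_timestamp_ = preserve_birthdate ? birthdate_ : now;
    }

    std::time_t headerTimestamp() const { return header_timestamp_; }

private:
    std::time_t birthdate_;
    std::time_t header_timestamp_;
};

// <schedd_name, job_queue_birthdate, cluster, proc>
using QuillKey = std::tuple<std::string, std::time_t, int, int>;

QuillKey makeKey(const std::string &schedd, const ToyJobQueueLog &log,
                 int cluster, int proc) {
    return QuillKey(schedd, log.headerTimestamp(), cluster, proc);
}
```

With preservation on, a queue born at 1163801657 still reports 1163801657
after any number of rotations, so its keys stay stable. A queue recreated
later gets a different birthdate, so its <1,0> no longer collides with the
old one.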

I had hoped to get to this before Dan left for vacation, but classes
got in the way. 

This would go into 6.8 - it's safe, since nothing was using the number,
and the sooner the existing job_queues get changed the better.

You can see it in context in /p/condor/workspaces/epaulson.1/src_trees/V68

I am open to better names for the attribute in the schedd ad; I will
switch to a real ATTR_ constant once we pick one.

-Erik

Index: condor_c++_util/classad_collection.h
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_c++_util/classad_collection.h,v
retrieving revision 1.20
diff -r1.20 classad_collection.h
129a130,131
>   time_t GetOrigLogBirthdate() { return ClassAdLog::GetOrigLogBirthdate(); }
> 
Index: condor_c++_util/classad_log.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_c++_util/classad_log.C,v
retrieving revision 1.31.2.1
diff -r1.31.2.1 classad_log.C
69a70
> 	m_original_log_birthdate = time(NULL);
112a114
> 			m_original_log_birthdate = ((LogHistoricalSequenceNumber *)log_rec)->get_timestamp();
129c131
< 		log_rec = new LogHistoricalSequenceNumber( historical_sequence_number, time(NULL) );
---
> 		log_rec = new LogHistoricalSequenceNumber( historical_sequence_number, m_original_log_birthdate );
560c562
< 	log = new LogHistoricalSequenceNumber( historical_sequence_number, time(NULL) );
---
> 	log = new LogHistoricalSequenceNumber( historical_sequence_number, m_original_log_birthdate );
Index: condor_c++_util/classad_log.h
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_c++_util/classad_log.h,v
retrieving revision 1.18
diff -r1.18 classad_log.h
95a96,97
> 	time_t GetOrigLogBirthdate() {return m_original_log_birthdate;}
> 
127a130
> 	time_t m_original_log_birthdate;
145c148,149
< 	time_t timestamp; //time is logged for purely informational purposes
---
> 	time_t timestamp; //when was the first record originally written,
> 					  // regardless of how many times the log has rotated
Index: condor_schedd.V6/qmgmt.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_schedd.V6/qmgmt.C,v
retrieving revision 1.105.2.4
diff -r1.105.2.4 qmgmt.C
672a673,679
> time_t
> GetOriginalJobQueueBirthdate()
> {
> 	return JobQueue->GetOrigLogBirthdate();
> }
> 
> 
Index: condor_schedd.V6/qmgmt.h
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_schedd.V6/qmgmt.h,v
retrieving revision 1.17
diff -r1.17 qmgmt.h
86a87
> time_t GetOriginalJobQueueBirthdate();
Index: condor_schedd.V6/schedd.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_schedd.V6/schedd.C,v
retrieving revision 1.242.2.9
diff -r1.242.2.9 schedd.C
740a741,743
> 	sprintf(tmp, "%s = %ld", "JobQueueBirthdate",
> 			 (long)GetOriginalJobQueueBirthdate());
> 	ad->Insert (tmp);