+------------------------------+
|  OGF : PGI Job State Model   |
+------------------------------+

 1) User Job submission
 2) Submitted
 3) Pre-processing
 4) Pre-processing-Hold
 5) Delegated
 6) Delegated-Hold
 7) Post-processing
 8) Post-processing-Hold
 9) Finished
10) Failed-Cancelled
11) Purged

1) User Job submission
======================
This initial state requires that the Job Submitter (which can be a human User or a software client) has prepared a Job description (the standard format is JSDL).

The Job description must contain a request to run a script or an executable.

The Job description can contain :
- Requests for automatic staging (performed by the Execution Service without action from the Submitter),
- Requests for manual staging (where the Execution Service notifies the Submitter of the location of the file(s) to be staged).

2) Submitted
============
The Job goes into the 'Submitted' state when the Submitter has submitted a readable Job description.

The Execution Service immediately allocates a Jobid (or an EPR) to the Job and sends it back to the Submitter.
In case of a Collection Job, the Execution Service immediately allocates one Jobid (or EPR) to each Sub-Job and sends all Jobids (or EPRs) back to the Submitter.
That permits the Submitter, at any time, to request a Job cancellation, which is handled as a Job failure, and which sends the Job to the 'Failed-Cancelled' state.
In case of a 'Parameter Sweep Job', immediate allocation of one Jobid (or EPR) to each Sub-Job is impossible, so this case is NOT covered by this document.

If the Execution Service finds that the Job can NOT be executed at all (for example: bad Job description, Requirements exceeding quotas), then it triggers the 'Submitted Failure' transition which sends the Job to the 'Failed-Cancelled' state.

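The submission behaviour described above can be sketched as follows. This is only an illustration, NOT part of the model : the JobDescription class, the submit() function and the use of random UUIDs as Jobids are assumptions made for the example, and the only validity check shown is the presence of an executable.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    """Hypothetical stand-in for a parsed Job description (for example JSDL)."""
    executable: str
    sub_jobs: list["JobDescription"] = field(default_factory=list)


def submit(description: JobDescription) -> dict[str, str]:
    """Allocate one Jobid per Job (or per Sub-Job of a Collection Job).

    Returns a mapping Jobid -> state, so that every Jobid can be sent back
    to the Submitter immediately, even for Jobs that fail at once.
    """
    # A Collection Job gets one Jobid per Sub-Job, a plain Job gets one Jobid.
    targets = description.sub_jobs or [description]
    jobs: dict[str, str] = {}
    for target in targets:
        job_id = uuid.uuid4().hex  # assumption: any unique identifier would do
        # A description without an executable can NOT be executed at all:
        # this corresponds to the 'Submitted Failure' transition.
        jobs[job_id] = "Submitted" if target.executable else "Failed-Cancelled"
    return jobs
```

Because every Jobid is returned before any processing starts, the Submitter can request a cancellation at any later time.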
When the Execution Service has found a set of resources adequate for the execution of the Job :
- It decides which storage resources are necessary for Stage-in, Job execution and Stage-out, but does NOT allocate them yet,
- It decides which computing resources will execute the Job, but does NOT allocate them yet,
- It triggers the 'Goes to Pre-processing' transition which sends the Job to the 'Pre-processing' state.
The Execution Service updates the Job status, but does NOT specifically notify the Submitter.

The Job can stay in this 'Submitted' state a long time, for various reasons, for example :

2.1) Incoming
-------------
In this substate, the Execution Service has previously submitted Jobs to process first.

2.2) Waiting
------------
In this substate, the resources required by the Job are NOT simultaneously available yet.

2.3) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Pre-processing' state first.

3) Pre-processing
=================
The Job goes into the 'Pre-processing' state from 2 states :
- From the 'Submitted' state when the Execution Service has triggered the 'Goes to Pre-processing' transition,
- From the 'Pre-processing-Hold' state when the Submitter has notified the Execution Service that the Job can return to the 'Pre-processing' state.

At the beginning of the 'Pre-processing' state, the Execution Service allocates the storage resources which are necessary for Stage-in.

Inside the 'Pre-processing' state, the computing resources are decided, but NOT allocated yet.
This permits the Job to stay inside the 'Pre-processing' state as long as necessary to perform automatic Stage-in, without blocking computing resources.

In case of an unrecoverable error, or if the Submitter has suddenly cancelled the Job, the Execution Service triggers the 'Pre-processing Failure' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'Pre-processing needs User action' transition to send the Job to the 'Pre-processing-Hold' state, for example in the 2 following cases :
- Manual Stage-in is necessary,
- Pre-processing has failed in a manner that can perhaps be corrected by the Submitter.

When the Execution Service has verified that all Stage-in operations have been successfully performed, it triggers the 'Goes to Delegated' transition which sends the Job to the 'Delegated' state.
The Execution Service updates the Job status, but does NOT specifically notify the Submitter.

The Job can stay in this 'Pre-processing' state a long time, for various reasons, for example :

3.1) Incoming
-------------
In this substate, the Execution Service has previous pre-processing Jobs to process first.

3.2) Automatic-Stage-In
-----------------------
In this substate, the Execution Service automatically performs Stage-in.

3.3) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Delegated' state first.

4) Pre-processing-Hold
======================
The Job goes into the 'Pre-processing-Hold' state from the 'Pre-processing' state when the Execution Service has triggered the 'Pre-processing needs User action' transition, for example in the 2 following cases :
- Manual Stage-in is necessary,
- Pre-processing has failed in a manner that can perhaps be corrected by the Submitter.

Inside the 'Pre-processing-Hold' state :
- The storage resources which are necessary for Stage-in are already allocated,
- The computing resources are decided, but NOT allocated yet.
This permits the Job to stay inside the 'Pre-processing-Hold' state as long as necessary to perform manual Stage-in or any other User action, without blocking computing resources.

The Execution Service DOES specifically notify the Submitter that the Job requires User action.
The Execution Service updates the Job status.

In case of an unrecoverable error, or if the Submitter cancels the Job, the Execution Service triggers the 'Pre-processing-Hold Cancel' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'User action for Pre-processing' transition to send the Job to the 'Pre-processing' state when the Submitter has notified the Execution Service that the Job can return to the 'Pre-processing' state.

The Job can stay in this 'Pre-processing-Hold' state a long time, for various reasons, for example :

4.1) Incoming
-------------
In this substate, the Execution Service has previous pre-processing hold Jobs to process first.

4.2) Manual-Stage-In
--------------------
In this substate, the Submitter has to manually perform Stage-in.

4.3) Failed-Recoverable
-----------------------
In this substate, the Submitter has to perform corrective actions.

4.4) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Pre-processing' state first.

5) Delegated
============
The Job goes into the 'Delegated' state from 2 states :
- From the 'Pre-processing' state when the Execution Service has triggered the 'Goes to Delegated' transition,
- From the 'Delegated-Hold' state when the Submitter has notified the Execution Service that the Job can return to the 'Delegated' state.

At the beginning of the 'Delegated' state :
- The storage resources which were necessary for Stage-in are already allocated, and Stage-in has already been successfully completed,
- The Execution Service allocates the storage resources which are necessary for Job execution,
- The Execution Service allocates all necessary computing resources to the Job.
So, the Job should make best use of the allocated computing resources : It should NOT sleep or halt.
The Execution Service should consider most Job failures as unrecoverable.

The 'Delegated' state covers both :
- Execution by a batch system belonging to the original grid infrastructure;
  In that case, the Execution Service must be able to give the Submitter the name of the batch queue (and perhaps the local job id),
- Forwarding to the Execution Service of another grid infrastructure;
  In that case, the Execution Service must be able to give the Submitter the corresponding EPR;
  Detailed follow-up of the Job status inside the other grid infrastructure is NOT performed at all by the original Execution Service, but optionally by the Submitter using the returned EPR.

In case of an unrecoverable error, or if the Submitter has suddenly cancelled the Job, the Execution Service triggers the 'Delegated Failure' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'Delegated needs User action' transition to send the Job to the 'Delegated-Hold' state, in rare cases when Job execution has failed in a manner that can perhaps be corrected by the Submitter.

At the end of the 'Delegated' state, the Execution Service deallocates :
- All computing resources from the Job,
- The storage resources which were required only for Job execution,
- The storage resources which were required only for Stage-in.

When the Execution Service has verified that Job execution has been entirely performed (successfully or with a non-zero return code), it triggers the 'Goes to Post-processing' transition which sends the Job to the 'Post-processing' state.
The Execution Service updates the Job status, but does NOT specifically notify the Submitter.

The Job can stay in this 'Delegated' state a long time, for various reasons, for example :

5.1) Incoming
-------------
In this substate, the Execution Service has previous delegated Jobs to process first, for example allocate computing resources to them.

5.2) Running
------------
In this substate, the Job is in execution inside a batch system belonging to the original grid infrastructure.

5.3) Forwarded
--------------
In this substate, the original Execution Service has forwarded the Job to the Execution Service of another grid infrastructure.

5.4) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to process first, for example deallocate computing resources from them, then send them to the 'Post-processing' state.

6) Delegated-Hold
=================
The Job goes into the 'Delegated-Hold' state from the 'Delegated' state in rare cases when the Execution Service has triggered the 'Delegated needs User action' transition, for example when Job execution has failed in a manner that can perhaps be corrected by the Submitter.

Inside the 'Delegated-Hold' state, all storage and computing resources which are necessary for Job execution are already allocated.
So the Job should stay inside the 'Delegated-Hold' state as briefly as possible.

The Execution Service DOES specifically notify the Submitter that the Job requires urgent User action.
The Execution Service updates the Job status.

In case of an unrecoverable error, or if the Submitter cancels the Job, or in case of time-out, the Execution Service triggers the 'Delegated-Hold Cancel' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'User action for Delegated' transition to send the Job to the 'Delegated' state when the Submitter has notified the Execution Service that the Job can return to the 'Delegated' state.

This 'Delegated-Hold' state contains substates, for example :

6.1) Incoming
-------------
In this substate, the Execution Service has previous delegated hold Jobs to process first.

6.2) Failed-Recoverable
-----------------------
In this substate, the Submitter has to urgently perform corrective actions.

6.3) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Delegated' state first.

7) Post-processing
==================
The Job goes into the 'Post-processing' state from 2 states :
- From the 'Delegated' state when the Execution Service has triggered the 'Goes to Post-processing' transition,
- From the 'Post-processing-Hold' state when the Submitter has notified the Execution Service that the Job can return to the 'Post-processing' state.

At the beginning of the 'Post-processing' state :
- The Execution Service allocates the storage resources which are necessary for Stage-out,
- The computing resources are already deallocated.
This permits the Job to stay inside the 'Post-processing' state as long as necessary to perform automatic Stage-out, without blocking computing resources.

In case of an unrecoverable error, or if the Submitter has suddenly cancelled the Job, the Execution Service triggers the 'Post-processing Failure' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'Post-processing needs User action' transition to send the Job to the 'Post-processing-Hold' state, for example in the 2 following cases :
- Manual Stage-out is necessary,
- Post-processing has failed in a manner that can perhaps be corrected by the Submitter.

At the end of the 'Post-processing' state, the Execution Service deallocates the storage resources which were required only for Stage-out.

When the Execution Service has verified that all Stage-out operations have been successfully performed, it triggers the 'Finishes with Success or Error' transition which sends the Job to the 'Finished' state.
The Execution Service updates the Job status, but does NOT specifically notify the Submitter.

The Job can stay in this 'Post-processing' state a long time, for various reasons, for example :

7.1) Incoming
-------------
In this substate, the Execution Service has previous post-processing Jobs to process first.

7.2) Automatic-Stage-Out
------------------------
In this substate, the Execution Service automatically performs Stage-out.

7.3) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Finished' state first.

8) Post-processing-Hold
=======================
The Job goes into the 'Post-processing-Hold' state from the 'Post-processing' state when the Execution Service has triggered the 'Post-processing needs User action' transition, for example in the 2 following cases :
- Manual Stage-out is necessary,
- Post-processing has failed in a manner that can perhaps be corrected by the Submitter.

Inside the 'Post-processing-Hold' state :
- The storage resources which are necessary for Stage-out are already allocated,
- The computing resources are already deallocated.
This permits the Job to stay inside the 'Post-processing-Hold' state as long as necessary to perform manual Stage-out or any other User action, without blocking computing resources.

The Execution Service DOES specifically notify the Submitter that the Job requires User action.
The Execution Service updates the Job status.

In case of an unrecoverable error, or if the Submitter cancels the Job, the Execution Service triggers the 'Post-processing-Hold Cancel' transition which sends the Job to the 'Failed-Cancelled' state.

The Execution Service can trigger the 'User action for Post-processing' transition to send the Job to the 'Post-processing' state when the Submitter has notified the Execution Service that the Job can return to the 'Post-processing' state.

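The allocation rules stated for the states from 'Submitted' to 'Post-processing-Hold' can be summarised in a small lookup table. This is a sketch only : the resource labels and the blocks_computing() helper are names invented for the example, and keeping the Stage-in storage inside 'Delegated-Hold' is an assumption (the text only says that it is released at the end of 'Delegated').

```python
# Hypothetical labels for the four kinds of resources named in the text.
STAGE_IN = "stage-in storage"
EXECUTION = "execution storage"
COMPUTE = "computing"
STAGE_OUT = "stage-out storage"

# Resources that are allocated (not merely decided) while the Job is in a state.
ALLOCATED = {
    "Submitted": frozenset(),                       # decided, NOT allocated yet
    "Pre-processing": frozenset({STAGE_IN}),
    "Pre-processing-Hold": frozenset({STAGE_IN}),
    "Delegated": frozenset({STAGE_IN, EXECUTION, COMPUTE}),
    # Assumption: Stage-in storage is kept here as well.
    "Delegated-Hold": frozenset({STAGE_IN, EXECUTION, COMPUTE}),
    "Post-processing": frozenset({STAGE_OUT}),
    "Post-processing-Hold": frozenset({STAGE_OUT}),
}


def blocks_computing(state: str) -> bool:
    """Tell whether a Job waiting in this state holds computing resources."""
    return COMPUTE in ALLOCATED[state]
```

The table makes the design intent visible : only 'Delegated' and 'Delegated-Hold' hold computing resources, which is why a Job may wait a long time in the other Hold states but should leave 'Delegated-Hold' quickly.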
The Job can stay in this 'Post-processing-Hold' state a long time, for various reasons, for example :

8.1) Incoming
-------------
In this substate, the Execution Service has previous post-processing hold Jobs to process first.

8.2) Manual-Stage-Out
---------------------
In this substate, the Submitter has to manually perform Stage-out.

8.3) Failed-Recoverable
-----------------------
In this substate, the Submitter has to perform corrective actions.

8.4) Outgoing
-------------
In this substate, the Execution Service has previous Jobs to send to the 'Post-processing' state first.

9) Finished
===========
The Job goes into the 'Finished' state from the 'Post-processing' state when the Execution Service has triggered the 'Finishes with Success or Error' transition.

Inside the 'Finished' state :
- All storage and computing resources are already deallocated from the Job.
- The Submitter can only retrieve the Job log and the Output Sandbox.

The Execution Service DOES specifically notify the Submitter that the Job is finished and that the Output Sandbox is available for retrieval.

The Execution Service can purge the Output Sandbox :
- when the Submitter has successfully retrieved it,
- or after a long defined period (several weeks).
The Execution Service then triggers the 'Purge after Finished' transition to send the Job to the 'Purged' state.

10) Failed-Cancelled
====================
The Job goes into the 'Failed-Cancelled' state from any state :
- when the Execution Service has detected an unrecoverable Job Failure,
- or when the Submitter has cancelled the Job.

Inside the 'Failed-Cancelled' state :
- All storage and computing resources are already deallocated from the Job.
- The Submitter can only retrieve the Job log, and perhaps bits of the Output Sandbox which the Job could generate before failure or cancellation.

The Execution Service DOES specifically notify the Submitter that the Job is 'Failed-Cancelled' and whether bits of the Output Sandbox are available for retrieval.

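The notification rule that runs through the state descriptions (the Job status is always updated, but the Submitter is specifically notified only in some states) can be sketched as below. The enter_state() function, the status dictionary and the notify callback are hypothetical names; treating every state not listed as "no notification" is an assumption where the text is silent.

```python
from typing import Callable

# States for which the text says the Execution Service DOES notify the Submitter.
NOTIFYING_STATES = frozenset({
    "Pre-processing-Hold",
    "Delegated-Hold",
    "Post-processing-Hold",
    "Finished",
    "Failed-Cancelled",
})


def enter_state(status: dict[str, str], job_id: str, new_state: str,
                notify: Callable[[str, str], None]) -> None:
    """Record the new state of a Job and notify the Submitter when required."""
    status[job_id] = new_state        # the Job status is updated in every case
    if new_state in NOTIFYING_STATES:
        notify(job_id, new_state)     # explicit message to the Submitter
```

With this split, a Submitter that only polls the Job status sees every change, while a Submitter that relies on notifications is contacted only when an action or a retrieval is expected.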
The Execution Service can purge the Output Sandbox :
- if it is empty,
- or when the Submitter has successfully retrieved the available bits,
- or after a long defined period (several weeks).
The Execution Service then triggers the 'Purge after Failure or Cancellation' transition to send the Job to the 'Purged' state.

11) Purged
==========
The Job goes into the terminal 'Purged' state from 2 states :
- From the 'Finished' state when the Execution Service has triggered the 'Purge after Finished' transition,
- From the 'Failed-Cancelled' state when the Execution Service has triggered the 'Purge after Failure or Cancellation' transition.

Inside the 'Purged' state :
- All storage and computing resources are already deallocated from the Job.
- The Output Sandbox has already been purged.
- The Submitter should still be able to retrieve the Job log (For example, inside EGEE, if the Job log has been purged from the LB, it must still be available in the JP).
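
The named transitions of this document can be collected into a single table, which may help when checking an implementation against the model. The sketch below uses the state and transition names exactly as written above; the TRANSITIONS table layout and the fire() function are assumptions made for the example, and substates are NOT modelled.

```python
# (current state, transition name) -> next state
TRANSITIONS = {
    ("Submitted", "Submitted Failure"): "Failed-Cancelled",
    ("Submitted", "Goes to Pre-processing"): "Pre-processing",
    ("Pre-processing", "Pre-processing Failure"): "Failed-Cancelled",
    ("Pre-processing", "Pre-processing needs User action"): "Pre-processing-Hold",
    ("Pre-processing", "Goes to Delegated"): "Delegated",
    ("Pre-processing-Hold", "Pre-processing-Hold Cancel"): "Failed-Cancelled",
    ("Pre-processing-Hold", "User action for Pre-processing"): "Pre-processing",
    ("Delegated", "Delegated Failure"): "Failed-Cancelled",
    ("Delegated", "Delegated needs User action"): "Delegated-Hold",
    ("Delegated", "Goes to Post-processing"): "Post-processing",
    ("Delegated-Hold", "Delegated-Hold Cancel"): "Failed-Cancelled",
    ("Delegated-Hold", "User action for Delegated"): "Delegated",
    ("Post-processing", "Post-processing Failure"): "Failed-Cancelled",
    ("Post-processing", "Post-processing needs User action"): "Post-processing-Hold",
    ("Post-processing", "Finishes with Success or Error"): "Finished",
    ("Post-processing-Hold", "Post-processing-Hold Cancel"): "Failed-Cancelled",
    ("Post-processing-Hold", "User action for Post-processing"): "Post-processing",
    ("Finished", "Purge after Finished"): "Purged",
    ("Failed-Cancelled", "Purge after Failure or Cancellation"): "Purged",
}


def fire(state: str, transition: str) -> str:
    """Return the state reached by a transition, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, transition)]
    except KeyError:
        raise ValueError(
            f"transition {transition!r} is not allowed from state {state!r}"
        ) from None
```

A Job that needs manual Stage-in and then runs normally would, for example, fire 'Goes to Pre-processing', 'Pre-processing needs User action', 'User action for Pre-processing', 'Goes to Delegated', 'Goes to Post-processing', 'Finishes with Success or Error' and 'Purge after Finished' in that order. No entry starts from 'Purged', which matches its description as the terminal state.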