
Requirement:
Accounting record type 'R' should have information about resources used by the job so far.
Analysis:
- The current behaviour does not record the resources the job used up to the time it was re-queued.
- As a result, less usage is reported for the job than it actually consumed.
- If resource usage information is added to the 'R' record, the reported usage will be closer to the actual usage, though still not 100% accurate.
- The 'R' record is logged when a job is re-queued because:
  1. The node the job was running on goes down and the node_fail_requeue timeout is hit.
  2. It is rerun using qrerun <job-id>.
  3. It is rerun using qrerun -Wforce <job-id>.
  4. Provisioning for a vnode fails.
  5. mom is restarted without any options or with the '-r' option.
- Currently, the 'R' record contains resources_used information for items 2 and 5 above.
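
To make the idea of adding usage to the 'R' record concrete, here is a minimal sketch in C that appends resources_used.<name>=<value> pairs to a record's message text. The struct res_used type, the append_resources_used() helper, and the sample values in the usage note are hypothetical and for illustration only; they are not taken from the PBS source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name/value pair for one resources_used entry. */
struct res_used {
    const char *name;   /* e.g. "cput" or "walltime" */
    const char *value;  /* value already formatted as text */
};

/*
 * Hypothetical helper: append " resources_used.<name>=<value>" for each
 * entry to the NUL-terminated string in buf (capacity buflen bytes).
 * Returns 0 on success. Returns -1 if an entry does not fit, in which
 * case buf is restored to the text that had been fully appended so far.
 */
static int append_resources_used(char *buf, size_t buflen,
                                 const struct res_used *res, size_t nres)
{
    assert(buf != NULL && res != NULL);

    size_t used = strlen(buf);
    for (size_t i = 0; i < nres; i++) {
        int n = snprintf(buf + used, buflen - used,
                         " resources_used.%s=%s",
                         res[i].name, res[i].value);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0';   /* drop the partially written entry */
            return -1;
        }
        used += (size_t)n;
    }
    return 0;
}
```

For example, starting from a message text of "user=alice" (a made-up field) and two entries, cput=00:00:10 and walltime=00:01:00, the helper leaves "user=alice resources_used.cput=00:00:10 resources_used.walltime=00:01:00" in the buffer.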
Code flow:
Case 1. Job re-queued because node_fail_requeue was triggered:
node_down_requeue() --> discard_job() --> post_discard_job() --> account_jobend()
Case 2, 5. Job is rerun using qrerun <job-id>, or mom is restarted:
on_job_rerun() [job substate == JOB_SUBSTATE_RERUN3] --> account_jobend()
Case 3. Job is rerun using qrerun -Wforce <job-id>:
req_rerunjob() --> req_rerunjob2() --> force_requeue() --> account_jobend()
Case 4. Provisioning of a vnode fails:
check_and_run_jobs() --> fail_vnode_job() --> force_requeue() --> account_jobend()
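
All of the call chains above end in account_jobend(). The sketch below restates them as a C lookup table and checks that property. The requeue_path table, requeue_chain() and ends_in_account_jobend() are hypothetical names used only for illustration; only the function names inside the chain strings come from the flows listed above.

```c
#include <stddef.h>
#include <string.h>

/* One entry per re-queue case, numbered as in the analysis list. */
struct requeue_path {
    int case_no;
    const char *chain;
};

static const struct requeue_path requeue_paths[] = {
    {1, "node_down_requeue() --> discard_job() --> post_discard_job() --> account_jobend()"},
    {2, "on_job_rerun() --> account_jobend()"},
    {3, "req_rerunjob() --> req_rerunjob2() --> force_requeue() --> account_jobend()"},
    {4, "check_and_run_jobs() --> fail_vnode_job() --> force_requeue() --> account_jobend()"},
    {5, "on_job_rerun() --> account_jobend()"},
};

/* Return the call chain for a case number, or NULL if it is unknown. */
static const char *requeue_chain(int case_no)
{
    size_t n = sizeof requeue_paths / sizeof requeue_paths[0];
    for (size_t i = 0; i < n; i++) {
        if (requeue_paths[i].case_no == case_no)
            return requeue_paths[i].chain;
    }
    return NULL;
}

/* Return 1 if the chain's last call is account_jobend(), else 0. */
static int ends_in_account_jobend(const char *chain)
{
    const char *tail = "account_jobend()";
    size_t clen = strlen(chain);
    size_t tlen = strlen(tail);
    return clen >= tlen && strcmp(chain + clen - tlen, tail) == 0;
}
```

Because every case reaches the same final call, account_jobend() looks like a natural single point at which usage could be attached to the 'R' record. That is an inference from the flows above, not a statement about where the change is actually made.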