Analysis for PP-351
https://openpbs.atlassian.net/browse/PP-351
Requirement -
Accounting record type 'R' should have information about resources used by the job so far.
Analysis:
The current behaviour does not record the resources used by the job up to the time it was re-queued.
This results in less usage being reported for the job.
If resource usage information is added to the 'R' record, the reported usage will be closer to the job's actual usage, even though it will still not be 100% accurate.
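The under-reporting can be illustrated with a minimal sketch that sums walltime per job from accounting log lines. It assumes the usual accounting layout of "date time;record type;job id;message text" with space-separated key=value pairs, and it assumes the 'E' record covers only the final run of the job. The sample lines, the job id, and the user= field are made up for illustration, and the parser is simplified (it does not handle values containing spaces).

```python
from collections import defaultdict


def parse_record(line):
    """Split 'date time;type;id;key=value ...' into (type, id, attrs)."""
    _timestamp, rec_type, job_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    for token in message.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            attrs[key] = value
    return rec_type, job_id, attrs


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_per_job(lines):
    """Sum resources_used.walltime over the 'R' and 'E' records of each job."""
    totals = defaultdict(int)
    for line in lines:
        rec_type, job_id, attrs = parse_record(line)
        if rec_type in ("R", "E") and "resources_used.walltime" in attrs:
            totals[job_id] += hms_to_seconds(attrs["resources_used.walltime"])
    return dict(totals)


# Hypothetical sample data: the job ran 30 minutes, was re-queued, then ran 1 hour.
WITHOUT_USAGE_IN_R = [
    "03/01/2016 10:00:00;R;123.server;user=alice",
    "03/01/2016 12:00:00;E;123.server;user=alice resources_used.walltime=01:00:00",
]
WITH_USAGE_IN_R = [
    "03/01/2016 10:00:00;R;123.server;user=alice resources_used.walltime=00:30:00",
    "03/01/2016 12:00:00;E;123.server;user=alice resources_used.walltime=01:00:00",
]
```

With these inputs, walltime_per_job(WITHOUT_USAGE_IN_R) accounts for only the final hour, while walltime_per_job(WITH_USAGE_IN_R) also picks up the 30 minutes consumed before the re-queue.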
The 'R' record type is logged when a job is re-queued because:
1. The node the job was running on goes down and the node_fail_requeue timeout is hit.
2. It is rerun using qrerun <job-id>.
3. It is rerun using qrerun -Wforce <job-id>.
4. Provisioning for a vnode fails.
5. MoM is restarted without any options or with the '-r' option.
Currently, the 'R' record contains resources_used information only for items 2 and 5 above.
Code flow:
Case 1. Job re-queued because node_fail_requeue was triggered:
node_down_requeue() --> discard_job() --> post_discard_job() --> account_jobend()
Case 2, 5. Job is rerun using qrerun <job-id>, or MoM is restarted:
on_job_rerun() [job substate == JOB_SUBSTATE_RERUN3] --> account_jobend()
Case 3. Job is rerun using qrerun -Wforce <job-id>:
req_rerunjob() --> req_rerunjob2() --> force_requeue() --> account_jobend()
Case 4. Provisioning of a vnode fails:
check_and_run_jobs() --> fail_vnode_job() --> force_requeue() --> account_jobend()
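As a quick cross-check of the code flow above, the call chains can be written down as data and compared. This is only a sketch of the analysis, not server code: the dictionary keys are descriptive labels chosen here, and the function names are copied from the chains listed above.

```python
# Call chains for each re-queue cause, copied from the code flow above.
REQUEUE_PATHS = {
    "node_fail_requeue timeout": [
        "node_down_requeue", "discard_job", "post_discard_job", "account_jobend",
    ],
    "qrerun <job-id> or MoM restart": [
        "on_job_rerun", "account_jobend",
    ],
    "qrerun -Wforce <job-id>": [
        "req_rerunjob", "req_rerunjob2", "force_requeue", "account_jobend",
    ],
    "provisioning failure": [
        "check_and_run_jobs", "fail_vnode_job", "force_requeue", "account_jobend",
    ],
}


def common_suffix(chains):
    """Return the longest run of calls shared by the end of every chain."""
    shared = []
    # Walk all chains from the last call backwards, in lock step.
    for calls in zip(*(reversed(chain) for chain in chains)):
        if len(set(calls)) != 1:
            break
        shared.append(calls[0])
    return list(reversed(shared))
```

Here common_suffix(REQUEUE_PATHS.values()) yields only account_jobend, which suggests it is the one place every re-queue path passes through before the 'R' record is written. Cases 3 and 4 additionally share force_requeue().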