What Is a Test Case?
A test case is a written description of the steps to be taken to validate a behavior. It should contain all of the information and instructions the tester needs.
Components of a Test Case
Summary
- Summarize the purpose of the test
- Fewer than 50 characters
Instructions
- Put each step on a separate line (a separate row in Qmetry)
- Provide a high-level summary of what needs to be done in each step
- Do not specify OS-specific commands
Host
- Specify host for each step: server, sched, MoM, comm, client
User
- If relevant, specify whether to run command as root or unprivileged user
- For example, there is usually no need to specify that a hook or qmgr must be run as root. However, specify this when you are checking user-related behavior.
Pre- and post-conditions
- Specify these if there are any beyond the common configuration settings
Expected outcome
- Be clear about what you expect.
- Specify exactly what should happen to an attribute, resource, etc.
- Do not use snapshots of qstat, pbsnodes, or logs
Cleanup
- Specify any steps needed to return the system to its vanilla state
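The components above can be sketched as a simple structure. This is illustrative only; the field names are examples, not a Qmetry schema:

```python
# Illustrative test-case skeleton; field names are examples, not a Qmetry schema.
test_case = {
    "summary": "Job respects ncpus on excl nodes",  # fewer than 50 characters
    "steps": [
        # each step: a high-level action, plus host and user where relevant
        {"action": "Create 2 vnodes with ncpus=2 and sharing=default_excl",
         "host": "server", "user": "root"},
        {"action": "Submit a 1-ncpu job with place=excl",
         "host": "client", "user": "unprivileged"},
    ],
    "preconditions": [],  # only those beyond the common configuration
    "expected_outcome": "First job runs; second stays queued with a comment",
    "cleanup": ["Remove the vnodes", "Reset configuration to default"],
}

assert len(test_case["summary"]) < 50  # summary length guideline
```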
Guidelines
- Add a summary to the test: give a unique test ID and a test summary of fewer than 50 characters. (Open question: is the ID part of the summary, or are these two separate steps?)
- We recommend including the issue ID in the name of the test suite as a reference
- Provide a high-level summary of what needs to be done in each step. Avoid providing literal commands, since they may be system-specific or may change in the future.
- Only specify server/sched and MoM options or log messages that are public/stable or contractual. Avoid specifying anything that is private/unstable.
- Refer to the guides for any upgrade or installation steps. DO NOT add upgrade or installation steps to the test. Also DO NOT reference a specific section of the PBS guides, as section numbering may change in the future.
- Similarly, DO NOT include failover, peer-server, or similar setup steps in the test configuration.
- Make sure that each step is unambiguous.
Notes on Test Execution
- Refer to the related documentation. For example, if testing on a special branch, refer to the branch-specific documentation and EDD along with the generic PBS guides.
- Sometimes new features are not documented in time. If you cannot find information in the PBS guides, refer to the respective EDD. You may also need to follow up with the project team that delivered the feature on the status of documentation and previous testing.
- Link PBS documentation bugs and product bugs to tests that are failing due to missing documentation or product behavior.
Examples
Note: most of these examples do not yet follow the recommendations above and need improvement.
Job Placement Should Respect ncpus on Nodes Marked Exclusive
Prerequisites
Create 2 vnodes, each with ncpus=2 and sharing=default_excl
Test
- Submit 1 job requesting 1 ncpu and place=excl
- Submit another job requesting 3 chunks of 1 ncpu each and place=excl
- Check that the 1st job is running and the 2nd is queued with the following job comment:
Insufficient amount of resource: ncpus (R: 3 A: 2 T: 4)
Note: A is 2 and not 3, because the node on which the first job is running is held exclusively.
Cleanup
Remove the vnodes and reset the configuration to default
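The arithmetic behind that job comment can be sketched as follows. This is a simplified model for illustration, not scheduler code: a vnode held by an exclusive job contributes nothing to the available count.

```python
def ncpus_comment(requested, vnodes):
    """vnodes: list of (ncpus, holds_excl_job). Simplified model of the
    'Insufficient amount of resource' comment for place=excl requests."""
    total = sum(n for n, _ in vnodes)
    # a vnode held exclusively contributes 0 available ncpus
    available = sum(n for n, busy in vnodes if not busy)
    if requested > available:
        return (f"Insufficient amount of resource: ncpus "
                f"(R: {requested} A: {available} T: {total})")
    return None  # the request fits; no comment expected

# two 2-ncpu vnodes; the first holds the exclusive 1-ncpu job
print(ncpus_comment(3, [(2, True), (2, False)]))
# → Insufficient amount of resource: ncpus (R: 3 A: 2 T: 4)
```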
Calendaring When opt_backfill_fuzzy Set to "low"
Prerequisites
Set 2 ncpus on the node
Test
- Set sched attribute opt_backfill_fuzzy to low
- Set server’s backfill_depth = 2
- Set sched attribute strict_ordering to true
- Submit a job consuming 1 ncpu with a walltime of 60 seconds
- Submit a reservation that starts after 60 seconds and consumes both ncpus
- Submit a job with a walltime of 120 seconds
- Verify that this job is calendared to start after the reservation end time by checking its estimated start time
Cleanup
Set ncpus to default and unset backfill_depth, opt_backfill_fuzzy, and strict_ordering.
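The expected estimated start time follows from simple arithmetic. The sketch below is an illustrative model, not the scheduler's algorithm, and the reservation duration (600 seconds) is a hypothetical value not taken from the test:

```python
def earliest_start(now, walltime, resv_start, resv_end):
    """Earliest start for a job whose resources are fully held by a
    reservation (simplified calendaring model, not scheduler code)."""
    if now + walltime <= resv_start:  # the job fits before the reservation
        return now
    return resv_end                   # otherwise it is calendared after it

# reservation starts at t=60; its 600-second duration is hypothetical
resv_start, resv_end = 60, 60 + 600
print(earliest_start(0, 120, resv_start, resv_end))  # → 660
print(earliest_start(0, 50, resv_start, resv_end))   # → 0
```

The 120-second job cannot fit in the 60-second gap before the reservation, so its earliest slot is the reservation's end.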
sched_preempt_enforce_resumption Is True and Job Is topjob_ineligible
Prerequisites
Set 2 ncpus on the node
Test
- Set server’s backfill_depth = 2
- Set sched attribute sched_preempt_enforce_resumption = true
- Submit a job consuming 2 ncpus with a walltime of 2 minutes
- Set the running job's topjob_ineligible=true via qalter
- Create a high-priority queue
- Submit 2 jobs to the high-priority queue, each with a walltime of 1 minute
- Verify that the preempted job is calendared even though topjob_ineligible=true, i.e., it has an estimated start time set
Cleanup
Reset to default configuration
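The behavior under test can be modeled as a one-line rule. This is an illustrative simplification, not scheduler source: when sched_preempt_enforce_resumption is true, a preempted job is calendared regardless of its topjob_ineligible flag.

```python
def has_estimated_start(preempted, topjob_ineligible,
                        sched_preempt_enforce_resumption):
    """Simplified rule for whether a job gets an estimated start time
    (illustrative model only, not scheduler source)."""
    if preempted and sched_preempt_enforce_resumption:
        return True  # preempted jobs are calendared regardless of the flag
    return not topjob_ineligible  # normal top-job eligibility

# the preempted 2-ncpu job from the test above:
print(has_estimated_start(True, True, True))   # → True
# a non-preempted job marked topjob_ineligible is not calendared:
print(has_estimated_start(False, True, True))  # → False
```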
Check Performance with Various opt_backfill_fuzzy Values
Test
1. Submit a reservation of 30 seconds that recurs every minute for a whole day (-r "FREQ=MINUTELY;COUNT=3600")
2. Configure daemons:
   - Set server's backfill_depth = 1000
   - Set ncpus=2 on the MoM
   - Set sched_cycle_length to 600 on the server
   - Set strict_ordering to true on the scheduler
3. Create 3 queues: q1, q2, and q3
4. Set backfill_depth to 1000 on queues q1 and q2
5. Submit the following jobs, each with a walltime of 10 seconds:
   - 1000 jobs to q1
   - 1000 jobs to q2
   - 1000 jobs to q3
   - 2000 jobs to the default workq
6. Set the opt_backfill_fuzzy scheduler attribute to each of off, low, med, and high, and for each value do one of the following:
   - Re-install PBS and rerun all the above steps for each value of opt_backfill_fuzzy. (We recommend this method, since it is easier and faster than the next one.)
   - OR run a scheduling cycle for each value of opt_backfill_fuzzy. Note that setting the attribute does not initiate a new scheduling cycle, so you need to wait until the current scheduling cycle is over, then read the data of the next scheduling cycle. It helps to collect more than one scheduling cycle for each value.
7. Run pbs_loganalyzer on sched_log/<date>
8. Compare the following for each value from the pbs_loganalyzer output:
   cycle_duration, num_jobs_calendared, num_jobs_considered, time_to_calendar
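A sketch for tabulating the comparison, assuming you have already extracted the four metrics per opt_backfill_fuzzy value from the pbs_loganalyzer output. The numbers below are placeholders, not measured results:

```python
FIELDS = ["cycle_duration", "num_jobs_calendared",
          "num_jobs_considered", "time_to_calendar"]

def compare(metrics):
    """metrics: {fuzzy_value: {field: number}}. Returns an aligned table
    for side-by-side comparison of the pbs_loganalyzer figures."""
    rows = ["value      " + "  ".join(f"{f:>20}" for f in FIELDS)]
    for value, m in metrics.items():
        rows.append(f"{value:<10} " + "  ".join(f"{m[f]:>20}" for f in FIELDS))
    return "\n".join(rows)

# placeholder numbers only; substitute the figures pbs_loganalyzer reports
sample = {
    "off": {"cycle_duration": 210.0, "num_jobs_calendared": 1000,
            "num_jobs_considered": 5000, "time_to_calendar": 1.8},
    "low": {"cycle_duration": 95.0, "num_jobs_calendared": 1000,
            "num_jobs_considered": 5000, "time_to_calendar": 0.7},
}
print(compare(sample))
```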
Create a Hook with Event execjob_epilogue
Prerequisites
Have the hook script test.py available (copied from XYZ location)
Test
- Create a hook with event execjob_epilogue and import test.py
- Submit a job and make sure it is running
- Verify that the hook has not updated the job's Variable_List with "BONJOUR=Mounsieur Shlomi" while the job is running
- Verify that resources_available.file = 1gb is not set on the nodes while the job is running
- Wait until the job finishes
- Verify that the hook has updated the following node attribute:
  - added resources_available.file = 1gb
- Check the Variable_List of the finished job; BONJOUR=Mounsieur Shlomi has been appended to the list
- Look for the following messages in the mom_logs of both nodes:
Hook;pbs_python;printing pbs.event() values ---------------------->
Hook;pbs_python;event is EXECJOB_EPILOGUE
Hook;pbs_python;hook_name is test
Hook;pbs_python;hook_type is site
Hook;pbs_python;requestor is pbs_mom
Hook;pbs_python;requestor_host is (hostname)
Cleanup
Delete the hook and reset to default configuration
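Checking for the expected mom_log messages can be sketched like this. The expected strings come from the list above; the sample log excerpt and the parsing approach are illustrative only:

```python
# Expected message fragments taken from the test's mom_log list above
EXPECTED = [
    "Hook;pbs_python;event is EXECJOB_EPILOGUE",
    "Hook;pbs_python;hook_name is test",
    "Hook;pbs_python;hook_type is site",
    "Hook;pbs_python;requestor is pbs_mom",
]

def missing_messages(log_text, expected=EXPECTED):
    """Return the expected hook messages not found in a mom_log excerpt."""
    return [msg for msg in expected if msg not in log_text]

# hypothetical mom_log excerpt in which two expected lines are absent
sample_log = """\
04/01/2024 12:00:01;0008;pbs_mom;Hook;pbs_python;event is EXECJOB_EPILOGUE
04/01/2024 12:00:01;0008;pbs_mom;Hook;pbs_python;hook_name is test
"""
print(missing_messages(sample_log))
```

Run the check once per node, since the messages must appear in the mom_logs of both nodes.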
Creating a Hook as Ordinary User Throws Error
Test
- As an unprivileged user, attempt to create a hook with each of the following events:
- execjob_begin
- execjob_prologue
- execjob_epilogue
- execjob_preterm
- execjob_end
- exechost_periodic
Expected Result
All of the above commands fail with the error "(user)@(fqdn hostname) is unauthorized to access hooks data from server"
PBS Failover Configuration with PBS_PUBLIC_HOST_NAME
Prerequisites
6 nodes, where node1 is the primary server and node2 the secondary server, with failover configured. Node3 and node6 are MoM-only. Node4 and node5 run comm + MoM. Set the following values in pbs.conf:
Node1 - Primary server node with comm
PBS_PUBLIC_HOST_NAME=node1
PBS_LEAF_NAME=node1
Node2 - Secondary server node with comm
PBS_PUBLIC_HOST_NAME=node2
PBS_LEAF_NAME=node2
Node3 - Mom only
PBS_PRIMARY=node1
PBS_SECONDARY=node2
PBS_LEAF_NAME=node3
PBS_LEAF_ROUTERS=node1,node2,node4,node5
Node4 - Comm + Mom
PBS_PRIMARY=node1
PBS_SECONDARY=node2
PBS_LEAF_NAME=node4
PBS_COMM_ROUTERS=node1,node2
Node5 - Comm + Mom
PBS_LEAF_NAME=node5
PBS_LEAF_ROUTERS=node4
PBS_COMM_ROUTERS=node1,node2,node4
PBS_PRIMARY=node1
PBS_SECONDARY=node2
Node6 - Mom only
PBS_PRIMARY=node1
PBS_SECONDARY=node2
PBS_LEAF_NAME=node6
PBS_LEAF_ROUTERS=node1,node2,node4,node5
Test
- Submit 3 jobs, each asking for 2 chunks of 1 ncpu with place=scatter
- Verify all jobs are running
- Bring the primary down and wait for the secondary to take over (approx. 30 seconds)
- Delete a job; the other 2 jobs will continue to run
- Bring the primary up again
- Verify that 2 jobs are still running
- Delete another job; now only one job is running
Verify that Jobs Run Fine After Upgrade
Prerequisites
- Install old version of PBS
- Create 2 queues
- Create a queuejob hook and import queuejob.py
- Create an execjob_epilogue hook and import epi.py
- Add following custom resources
- A type=float
- B type=string_array
- Set resources_available.A=4.5 and resources_available.B="AA,BB,CC" on server
- Add resources A and B in sched_config
Test
- Submit a few jobs asking for resources A and B.
- Verify some are queued and some running depending on the number of ncpus
- Upgrade to the new version of PBS following the steps in the upgrade guide, and requeue the jobs
- After the upgrade, verify that the jobs from the old server continue to run
- Verify that the hooks have executed by checking the server_log and mom_log
- Wait for the jobs to finish, then submit new jobs to the cluster
- Verify that the new jobs are running/queued successfully
Job Continues to Run During Failover
Prerequisites
Configure failover between node1 and node2 where node1 is primary and node2 is secondary.
Test
- Submit a long job and verify it is running.
- Bring the primary down using qterm
- Wait until the secondary takes over (approx. 30 seconds)
- Make sure the job continues to run on the secondary too
- Delete the job and submit another job with a walltime of 120 seconds
- Bring the primary back up
- Verify that the second job is still running
- Wait for the second job to finish (approx. 2 minutes) and make sure its exit status is 0
When Not Running, Job Prints the Reason
Prerequisites
2 ncpus set on execution host
Test
- Submit 1 job requesting 1 ncpu and place=excl, and verify that it is running
- Also verify that the node state is job-exclusive
- Submit another 1-ncpu job
- Verify that the job is queued with a job comment
Expected Result
For PBS <= 10.x : "not enough free vnodes available"
For PBS >10.x : "cannot run job: Insufficient amount of resource: ncpus (R:1 A:0 T:2)"
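The version split can be captured in a small helper. This is illustrative; version handling is simplified to major version numbers only:

```python
def expected_comment(major: int) -> str:
    """Expected job comment for the queued 1-ncpu job, by PBS major
    version (simplified: compares major version numbers only)."""
    if major <= 10:
        return "not enough free vnodes available"
    return ("cannot run job: Insufficient amount of resource: "
            "ncpus (R:1 A:0 T:2)")

print(expected_comment(10))  # → not enough free vnodes available
```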
Questions
Do we have a test suite tool for returning the system to its vanilla state?