PP-1128 Support Cray DataWarp - Job Instances

Introduction

Cray DataWarp

Cray DataWarp provides an intermediate layer of high bandwidth, file-based storage to applications running on compute nodes. It is comprised of commercial SSD hardware and software, Linux community software, and Cray system hardware and software. DataWarp storage is located on server nodes connected to the Cray system's high speed network (HSN). I/O operations to this storage complete faster than I/O to the attached parallel file system (PFS), allowing the application to resume computation more quickly and resulting in improved application performance. DataWarp storage is transparently available to applications via standard POSIX I/O operations and can be configured in multiple ways for different purposes. DataWarp capacity and bandwidth are dynamically allocated to jobs on request and can be scaled up by adding DataWarp server nodes to the system. [Source: XC™ Series DataWarp™ User Guide (CLE 6.0.UP01)]

PBS Professional

PBS Professional is a fast, powerful workload manager designed to improve productivity, optimize utilization & efficiency, and simplify administration for HPC clusters, clouds and supercomputers. PBS Professional automates job scheduling, management, monitoring and reporting, and is the trusted solution for complex Top500 systems as well as smaller cluster owners.

Cray DataWarp Integration with PBS Professional

The integration between Cray DataWarp and PBS Professional is expected to:

  1. Schedule jobs based on the availability of DataWarp storage capacity.
  2. Set up the DataWarp job instance before the job begins execution, such that the following DataWarp functions are executed
    1. paths
    2. setup
    3. data_in
    4. pre_run
  3. Tear down the DataWarp job instance after the job terminates (i.e., normal, error, abort), such that the following DataWarp functions are executed
    1. post_run
    2. data_out
    3. teardown
  4. When a non-successful DataWarp exit code (1) is detected, Altair will attempt to re-queue the job if the job has not started execution, or leave the data and job instance allocation intact if the job had executed, allowing the user/admin to manually resolve any issues.

Cray DataWarp Integration Requirement(s)

  • Installation and configuration of Cray DataWarp hardware
  • Installation and configuration of Cray DataWarp software
    • IMPORTANT For sites running the PBS Professional Server/Scheduler on an external node (e.g., a white-box Linux or eLogin node), the site will need to configure the external API and verify that it works before deploying the integration.
  • PBS Professional 13.0.40x and later (depends on the PBS Professional Plugin Framework)

Future Cray DataWarp Integration with PBS Professional

Supporting Cray DataWarp - Persistent Instances

  • PBS capabilities relying on DataWarp Persistent Instances could be
    • Job Dependencies
    • Job Arrays
    • PBS Advance Reservations

Job Lifecycle

This section walks through the job lifecycle, from what the user is required to do before job submission through the final steps of DataWarp teardown.

Validating DataWarp Directives

It is strongly recommended that the user validate the correctness of the DataWarp directives (#DW) by executing a Cray-supplied utility called dw_wlm_cli prior to job submission.

dw_wlm_cli {-f | --function} job_process {-j | --job} jobscriptfile

The user is expected to correct any errors prior to job submission. Failure to correct the error(s) will result in delayed job execution or, worse, the job exiting the batch queueing system without executing.
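
For example, given a job script, jobscript.sh, that contains #DW directives such as the following (the directive keywords follow the Cray DataWarp documentation; the capacity, access mode, and paths are illustrative only):

#!/bin/bash
#DW jobdw type=scratch access_mode=striped capacity=200GiB
#DW stage_in type=file source=/lus/scratch/user/input.dat destination=$DW_JOB_STRIPED/input.dat
#DW stage_out type=file source=$DW_JOB_STRIPED/output.dat destination=/lus/scratch/user/output.dat
aprun -n 32 ./my_app $DW_JOB_STRIPED/input.dat

the user would validate the directives with

dw_wlm_cli -f job_process -j jobscript.sh

A zero exit status indicates the directives parsed cleanly; a non-zero exit status indicates an error that should be corrected before submission.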

NOTE There is an existing RFE for PBS Professional to allow the queuejob hook to read/parse the user's jobscript, which would allow for automation of this step. 

Job Submission

Upon the submission of the job to the PBS Server, the dw_queuejob_hook will be executed. This hook will be responsible for validating the user's DataWarp capacity and pool request. The validation of the user's request should be nearly instantaneous.

From the user's perspective, they will submit the job with PBS custom resources that are specific to DataWarp. 

required: -l dw_capacity=<value>

optional: -l dw_pool=<value> (may be omitted if resources_default.dw_pool=<value> is configured)

If the job is submitted with -l dw_capacity, then the dw_queuejob_hook will construct the proper job submission request for PBS Professional and create the appropriate attributes for the DataWarp workflow. Otherwise, the dw_queuejob_hook (and other DataWarp hooks) will accept the job "as-is" and it will be assessed by any other site-defined hooks before being accepted by the PBS Server. 
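
For example, a DataWarp job submission might look like the following (the capacity value, pool name, and script name are illustrative):

qsub -l dw_capacity=200GiB -l dw_pool=wlm_pool jobscript.sh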

If the dw_queuejob_hook should time out or fail, the user will receive an error message.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

Job Eligibility (Job Scheduling)

Relying on existing PBS Professional job scheduling capabilities, the scheduler will compare the user's dw_capacity request with the available DataWarp capacity. If there is sufficient capacity and all other scheduler policies are satisfied, the scheduler will inform the Server which nodes the job should be dispatched to and request that the job be executed. If there is insufficient capacity, then the job will remain queued, but will be eligible for scheduling in the next scheduling cycle.

DataWarp Path

Prior to the job being dispatched to the nodes, the Server will execute a runjob hook, dw_paths_hook, that will set up the job's environment with the DataWarp-specific environment variables, which will be referenced in the user's job script. Updating the job's environment with the DataWarp-specific environment variables should be nearly instantaneous. If the setup is successful, then the job will proceed to staging in the data.

If this hook should fail or time out, then the job is re-queued and placed in a hold ('H') state. By placing the job in a hold state, the administrator can investigate why the setup of the job's environment variables failed. Depending on the resolution of the issue, the administrator will be able to release the hold (qrls -h s <pbs_jobid>), and the job will be reconsidered in the next scheduling cycle.
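
For example, the administrator might inspect the held job and, once the underlying issue is resolved, release the system hold (the job ID is a placeholder):

qstat -f <pbs_jobid> | grep -E "job_state|Hold_Types|comment"
qrls -h s <pbs_jobid>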

The dw_paths_hook will record information in the PBS Server logs ($PBS_HOME/server_logs). It is possible to increase verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details.

DataWarp Setup

Before the job begins execution, PBS Professional will need to set up the storage object and stage in the user's data, if specified. The MOM will execute an execjob_prologue hook, dw_setup_hook, that will allocate and configure a new storage object for the job. In practice, the setup of the job's storage object will take a few seconds. A successful setup will transition the job to staging data into the storage object.

If this hook should fail or time out, then the job is re-queued and placed in a hold ('H') state. By placing the job in a hold state, the administrator can investigate why the setup of the job's storage object failed. Depending on the resolution of the issue, the administrator will be able to release the hold (qrls -h s <pbs_jobid>), and the job will be reconsidered in the next scheduling cycle.

The dw_setup_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

DataWarp Stage-in

After a successful setup of the storage object, PBS Professional will initiate the DataWarp data_in function. It is expected that staging in the data will be relatively quick (less than 5 minutes). The MOM will execute an execjob_prologue hook, dw_data_in_hook, that will stage data into the storage object from an external source. A successful stage-in allows the job to proceed to connecting the compute nodes to the DataWarp job instance.

If this hook should fail or time out, then the job is re-queued and placed in a hold ('H') state. By placing the job in a hold state, the user and/or administrator can investigate why the stage-in of the job's data failed. Remember that dw_wlm_cli -f job_process, which the user is expected to execute before job execution, does *not* validate file or directory paths, so it is possible that the user has specified an invalid path.

Depending on the resolution of the issue, the administrator will be able to release the hold (qrls -h s <pbs_jobid>), and the job will be reconsidered in the next scheduling cycle. NOTE: the storage object will be torn down as well.

The dw_data_in_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

DataWarp Pre Run

After a successful stage-in of the user's data, the MOM will execute an execjob_launch hook, dw_pre_run_hook, that will connect the compute nodes to the storage object. In practice, the connection of the compute nodes is nearly instantaneous. A successful pre_run will transition the job to execution.

If this hook should fail or time out, then the job is re-queued and placed in a hold ('H') state. By placing the job in a hold state, the administrator can investigate why the connection of the compute nodes to the storage object failed. Depending on the resolution of the issue, the administrator will be able to release the hold (qrls -h s <pbs_jobid>), and the job will be reconsidered in the next scheduling cycle.

The dw_pre_run_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

DataWarp Post Run

Once the job terminates (i.e., normal, error, or abort), PBS Professional will initiate the DataWarp post_run function. The MOM will execute an execjob_epilogue hook, dw_post_run_hook, that will disconnect the compute nodes from the storage object. In practice, the disconnect of the compute nodes from the storage object will take a few seconds. A successful post_run will transition the job to the next stage of instantiating a stage-out and teardown job.

If this hook should fail or time out, then the job exits the system without initiating the subsequent hook that is responsible for the stage-out and teardown of the storage object. This is an intentional decision to avoid purging the user's data.

The dw_post_run_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details. The administrator will need to investigate why the post_run failed, and will need to manually rectify the issue.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

DataWarp Stage-out & Teardown Job

At this point, the compute nodes have been successfully disconnected from the storage object, and the data can be staged out and the storage object can be torn down. However, it is expected that the data produced by the job will be large (e.g., 100s of GB or more). To avoid having PBS Professional keep the compute nodes allocated to the job while staging out data, the MOM will execute an execjob_epilogue hook, dw_secondary_job_hook, as the user to submit a secondary job. The secondary job will be responsible for triggering the stage-out of the user's data and the teardown of the storage object. In practice, the submission of the secondary job will be nearly instantaneous. A successful submission of the secondary job will allow the original job to exit, provided no other site-specific hooks execute afterward.
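
Conceptually, the secondary job submitted by dw_secondary_job_hook resembles the following sketch (illustrative only; the hook constructs the actual submission, and the job body is simply /bin/true because the real work is performed by the epilogue hook on the MOM):

qsub -l dw_pbs_jobid=<original_pbs_jobid> -- /bin/true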

If this hook should fail or time out, then the job exits the system without initiating the stage-out or teardown of the storage object. This is an intentional decision to avoid purging the user's data.

The dw_secondary_job_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase the verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details.

If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

DataWarp Stage-out & Teardown

The stage-out of the user's data and the teardown of the storage object are initiated by the secondary job that was submitted on behalf of the user. The job itself will execute very quickly (i.e., /bin/true) on the MOM node that executed the original job. Then the MOM will execute an execjob_epilogue hook, dw_data_out_n_teardown_hook, that will stage out the user's data. If the data is staged out successfully, the teardown of the storage object is executed.

If this hook should fail or time out, then the job exits the system without completing the stage-out and teardown of the storage object. This is an intentional decision to avoid purging the user's data. Remember that dw_wlm_cli -f job_process, which the user is expected to execute before job execution, does *not* validate file or directory paths, so it is possible that the user has specified an invalid path.

The dw_data_out_n_teardown_hook will record information in the PBS MOM logs ($PBS_HOME/mom_logs). It is possible to increase the verbosity of the log messages to troubleshoot the issue by enabling debug in the hook. See DataWarp Hook Verbosity for more details. The administrator will need to investigate why the data_out or teardown failed, and will need to manually rectify the issue.

It is expected that the data produced by the original job will be large (e.g., 100s of GB or more) and could take upwards of 30 minutes to stage out. If the hook is timing out, the administrator can increase the duration of the hook's alarm. See the DataWarp Hook Alarms for more information.

Installation

The administrator will need to have created the DataWarp pool(s) prior to configuring PBS Professional. Review the DataWarp manuals for details on how to set up and configure DataWarp pools before proceeding.

After successfully creating the DataWarp pool(s), the administrator, as root, will need to complete a few manual steps to install and configure the integration of DataWarp with PBS Professional.

The instructions below reference $PBS_HOME, which is a variable that can be found in the /etc/pbs.conf file. Assuming the root SHELL is bash, consider sourcing the /etc/pbs.conf with the following command to easily reference PBS-specific variables.

. /etc/pbs.conf 


1. Update /etc/pbs.conf on the PBS Server/Scheduler, the PBS MOMs (login nodes), and any hosts that users have access to for job submission.

PBS_DATAWARP=/opt/cray/dw_wlm/default/bin/dw_wlm_cli

The PBS Professional site-specific hooks require the PBS_DATAWARP variable to be set so that the hooks can execute the dw_wlm_cli command.

NOTE the user will need to validate their script prior to job submission. So, if the user submits their job from a non-Cray XC host, then the dw_wlm_cli command will need to be accessible on that host.
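
After updating /etc/pbs.conf, a quick sanity check on each host is to source the file and confirm the command is present and executable (a suggested check, not a required installation step):

. /etc/pbs.conf
[ -x "$PBS_DATAWARP" ] && echo "dw_wlm_cli found at $PBS_DATAWARP" || echo "dw_wlm_cli NOT found"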


2. Unpack the PBS Professional DataWarp site-specific hooks, and cd into the directory.

cd $PBS_EXEC/../
tar zxvf <package_name>
cd pbspro_datawarp


3. Install each PBS Professional DataWarp site-specific hook. NOTE: each site-specific hook contains its installation and configuration instructions. The steps below assume the PBS Professional command, qmgr, is in root's PATH.

a. dw_queuejob_hook

qmgr << EOF
create hook dw_queuejob_hook
set hook dw_queuejob_hook event = queuejob
set hook dw_queuejob_hook alarm = 10
set hook dw_queuejob_hook order = 1
import hook dw_queuejob_hook application/x-python default dw_queuejob_hook.py
set hook dw_queuejob_hook enabled = true
create resource dw_capacity type=string
create resource dw_capacity_request type=string,flag=r
create resource dw_pool type=string_array
create resource dw_capacity_check type=long
EOF


b. dw_paths_hook

qmgr << EOF
create hook dw_paths_hook
set hook dw_paths_hook event = runjob
set hook dw_paths_hook alarm = 600
set hook dw_paths_hook order = 1
import hook dw_paths_hook application/x-python default dw_paths_hook.py
set hook dw_paths_hook enabled = true
EOF


c. dw_setup_hook

qmgr << EOF
create hook dw_setup_hook
set hook dw_setup_hook event = execjob_prologue
set hook dw_setup_hook alarm = 600
set hook dw_setup_hook order = 50
import hook dw_setup_hook application/x-python default dw_setup_hook.py
set hook dw_setup_hook enabled = true
create resource dw_setup_time type=long,flag=r
EOF


d. dw_data_in_hook

qmgr << EOF 
create hook dw_data_in_hook 
set hook dw_data_in_hook event = execjob_prologue 
set hook dw_data_in_hook alarm = 600 
set hook dw_data_in_hook order = 51 
import hook dw_data_in_hook application/x-python default dw_data_in_hook.py 
set hook dw_data_in_hook enabled = true 
create resource dw_data_in_time type=long,flag=r 
EOF


e. dw_pre_run_hook

qmgr << EOF
create hook dw_pre_run_hook
set hook dw_pre_run_hook event = execjob_launch
set hook dw_pre_run_hook alarm = 600
set hook dw_pre_run_hook order = 50
import hook dw_pre_run_hook application/x-python default dw_pre_run_hook.py
set hook dw_pre_run_hook enabled = true
EOF


f. dw_post_run_hook

qmgr << EOF
create hook dw_post_run_hook
set hook dw_post_run_hook event = execjob_epilogue
set hook dw_post_run_hook alarm = 600
set hook dw_post_run_hook order = 100
import hook dw_post_run_hook application/x-python default dw_post_run_hook.py
set hook dw_post_run_hook enabled = true
create resource dw_post_run_time type=long,flag=r
create resource dw_post_run_pass type=boolean,flag=r
EOF


g. dw_secondary_job_hook

qmgr << EOF
create hook dw_secondary_job_hook
set hook dw_secondary_job_hook event = execjob_epilogue
set hook dw_secondary_job_hook user = pbsuser
set hook dw_secondary_job_hook alarm = 600
set hook dw_secondary_job_hook order = 101
import hook dw_secondary_job_hook application/x-python default dw_secondary_job_hook.py
set hook dw_secondary_job_hook enabled = true
create resource dw_pbs_jobid type=string
EOF


h. dw_data_out_n_teardown_hook

qmgr << EOF
create hook dw_data_out_n_teardown_hook
set hook dw_data_out_n_teardown_hook event = execjob_epilogue
set hook dw_data_out_n_teardown_hook alarm = 1800
set hook dw_data_out_n_teardown_hook order = 101
import hook dw_data_out_n_teardown_hook application/x-python default dw_data_out_n_teardown_hook.py
set hook dw_data_out_n_teardown_hook enabled = true
create resource dw_data_out_time type=long,flag=r
create resource dw_teardown_time type=long,flag=r
EOF


4. Create a PBS custom resource for the DataWarp pool. NOTE: prefix "dw_pool_" to the DataWarp pool name so that it is easier to see in the output of the PBS commands.

qmgr -c "create resource dw_pool_<dw_pool_name> type=size"

Below is an example

qmgr -c "create resource dw_pool_wlm_pool type=size"

If you have multiple DataWarp pools, consider using a for-loop

dw_pool_names="wlm_pool test dev"
for p in $dw_pool_names; do
    qmgr -c "create resource dw_pool_${p} type=size"
done


5. Create a PBS custom resource for the DataWarp pool "granularity". NOTE: prefix "dw_granularity_" to the DataWarp pool name so that it is easier to see in the output of the PBS commands.

qmgr -c "create resource dw_granularity_<dw_pool_name> type=size"

Below is an example                                                                  

qmgr -c "create resource dw_granularity_wlm_pool type=size"

If you have multiple DataWarp pools, consider using a for-loop

dw_pool_names="wlm_pool test dev"
for p in $dw_pool_names; do
    qmgr -c "create resource dw_granularity_${p} type=size"
done


6. Define the available DataWarp pool(s) in qmgr.

qmgr -c "set server resources_available.dw_pool = <dw_pool_name>"

Below is an example

qmgr -c "set server resources_available.dw_pool = wlm_pool"

Below is an example with the three DataWarp pools defined; using a comma-separated list

qmgr -c "set server resources_available.dw_pool = 'wlm_pool,test,dev'"

                        

7. Define the DataWarp "granularity" in qmgr; the value is derived from the output of dw_wlm_cli -f pools. The first example below assumes a single DataWarp pool; the second shows three pools, each with its own granularity.

dw_wlm_cli -f pools
{"pools": [{"free": 8, "granularity": 990526832640, "id": "wlm_pool", "quantity": 8, "units": "bytes"}]}

The "granularity" is 990526832640

qmgr -c "set server resources_available.dw_granularity_wlm_pool = 990526832640"

Below is an example with the three DataWarp pools and their "granularity"

dw_wlm_cli -f pools
{"pools": [{"free": 2, "granularity": 990526832640, "id": "wlm_pool", "quantity": 8, "units": "bytes"}, {"free": 400, "granularity": 1073741824, "id": "dev", "quantity": 400, "units": "bytes"}, {"free": 20000, "granularity": 1048576, "id": "test", "quantity": 2000, "units": "bytes"} ]} 

Define the granularity for each DataWarp pool

qmgr -c "set server resources_available.dw_granularity_wlm_pool = 990526832640"
qmgr -c "set server resources_available.dw_granularity_dev = 1073741824"
qmgr -c "set server resources_available.dw_granularity_test = 1048576"


8. Update the PBS Scheduler configuration file ($PBS_HOME/sched_priv/sched_config) to schedule based on PBS custom resources for the DataWarp pool. Add the PBS custom resource(s), defined in Step 4, to the comma-separated list.

resources: "..., dw_pool_<dw_pool_name>"

Below is an example 

resources: "..., dw_pool_wlm_pool" 

Below is an example with the three DataWarp pools defined.

resources: "..., dw_pool_wlm_pool, dw_pool_test, dw_pool_dev"


9. Continue updating the PBS Scheduler configuration file ($PBS_HOME/sched_priv/sched_config) to query for the DataWarp pool capacity.


a. Add the dw_capacity_check.py script to the "server_dyn_res:" section

server_dyn_res: "dw_capacity_check ! . /etc/pbs.conf ; $PBS_EXEC/../pbspro_datawarp/dw_capacity_check.py"


b. Continue adding a "server_dyn_res:" line for each DataWarp pool, as defined in Step 4.

server_dyn_res: "dw_pool_wlm_pool !/opt/pbs/pbspro_datawarp/dw_capacity_check_pool.py dw_pool_wlm_pool"

Below is an example with the three DataWarp pools defined.

server_dyn_res: "dw_pool_wlm_pool !/opt/pbs/pbspro_datawarp/dw_capacity_check_pool.py dw_pool_wlm_pool"
server_dyn_res: "dw_pool_dev !/opt/pbs/pbspro_datawarp/dw_capacity_check_pool.py dw_pool_dev"
server_dyn_res: "dw_pool_test !/opt/pbs/pbspro_datawarp/dw_capacity_check_pool.py dw_pool_test"


10. Send the HUP signal to the PBS Scheduler daemon (pbs_sched) to have it re-read the configuration file.

kill -HUP `cat $PBS_HOME/sched_priv/sched.lock`


11. [Optional] Configure a default dw_pool attribute. By defining a default dw_pool, the user will not be required to specify the -l dw_pool=<dw_pool_name> request at job submission.

qmgr -c "set server resources_default.dw_pool = wlm_pool"

Configuration

The Cray DataWarp integration with PBS Professional has several configuration options; the site will need to consider whether the defaults are reasonable for their environment.

DataWarp Hook Alarms

In practice, the dw_wlm_cli command can take several seconds to complete execution for several of the functions. The data_in and data_out functions could take multiple minutes, depending on the user's data. Therefore, it is advisable for the site to become familiar with the execution time of the dw_wlm_cli command and adjust the respective hook alarms. If the dw_wlm_cli command exceeds the hook alarm, then the hook will fail.

If one of the DataWarp hooks should time out, the subsequent hooks in the workflow will still execute. This is by design of the PBS Professional Plugin framework. However, each hook has logic to handle a failure scenario.

The DataWarp hooks have the following default alarm attributes.

set hook dw_queuejob_hook alarm = 10
set hook dw_paths_hook alarm = 600
set hook dw_setup_hook alarm = 600
set hook dw_data_in_hook alarm = 1800
set hook dw_pre_run_hook alarm = 600
set hook dw_post_run_hook alarm = 600
set hook dw_secondary_job_hook alarm = 600
set hook dw_data_out_n_teardown_hook alarm = 1800

To change a hook's alarm attribute, the administrator will execute 

qmgr -c "set hook <hook_name> alarm = <value>"

where the value is a positive integer, and the units are in seconds.
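
For example, if stage-out of large data sets routinely exceeds the 30-minute default, a site might raise the alarm on the stage-out/teardown hook (the value shown is illustrative):

qmgr -c "set hook dw_data_out_n_teardown_hook alarm = 3600"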

DataWarp Command Retry

Each DataWarp hook can be configured to retry the DataWarp command, dw_wlm_cli. By default, each DataWarp hook will attempt to retry the command three (3) times.

number_of_tries = 3

NOTE the dw_queuejob_hook does not have this option.

If you change the number_of_tries variable, then you will need to re-import the hook into the PBS Server via the qmgr command.

qmgr -c "import hook <hook_name> application/x-python default <hook_name>.py" 

DataWarp Hook Verbosity

By default, each DataWarp hook is configured to log minimal job-specific information in the daemon logs. However, it is possible to increase the verbosity of the DataWarp hooks to help troubleshoot an issue. Each DataWarp hook has a variable called debug, which is defined after the import declarations of the hook.

debug = False

IMPORTANT there is a separate debug variable for the run_command function, which only enables debug messages when the hook is executing a command, e.g., dw_wlm_cli and PBS commands.

If you change the debug variable to True, then you will need to re-import the hook into the PBS Server via the qmgr command.

qmgr -c "import hook <hook_name> application/x-python default <hook_name>.py"

Troubleshooting

This section will focus on troubleshooting based on the messages seen by the user at submission time or in the daemon log files.

Job Submission

If the DataWarp job is rejected by the dw_queuejob_hook, the user could see the following error messages.


qsub: Missing required option dw_capacity. Submit with -l dw_capacity=<value>.

The user has submitted a job with -l dw_pool, but neglected to submit the job with -l dw_capacity. A DataWarp job submission requires -l dw_capacity. See User Guide, Job Submission for more details.


qsub: Job requested -l dw_capacity - Missing required -l dw_pool. Either resubmit with -l dw_pool=<value>, or request administrator to set the resources_default.dw_pool=<value> attribute.

The user submitted the job with the required dw_capacity attribute; however, a default dw_pool was not configured within qmgr. The administrator can configure the default dw_pool (see Administrator Guide, Installation), or the user can request -l dw_pool at submission time (see User Guide, Job Submission).


qsub: Job requested a DataWarp pool (my_pool) that is not available. Resubmit with eligible DataWarp pool (wlm_pool,dev,test), or admin needs to configure DataWarp pool.

The -l dw_pool request does not match the eligible DataWarp pools configured within qmgr. The user can resubmit the job requesting an eligible DataWarp pool, as provided in the error message (see User Guide, Job Submission). Or, the administrator will need to configure the missing DataWarp pool within qmgr (see Administrator Guide, Installation). 
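
To confirm which DataWarp pools (and any default pool) are currently configured, the administrator can print the server configuration and filter for the DataWarp resources:

qmgr -c "print server" | grep dw_pool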

Job Execution

If the DataWarp job encounters an issue after it has been accepted by the PBS Server, then the DataWarp hook will record a log message in the PBS Professional daemon log. The hook was written to associate the record with the PBS jobid. The user/administrator will be able to parse the logs by jobid, or use the PBS Professional tracejob command.
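
For example, to pull all records for a given job from the daemon logs (the job ID is a placeholder; PBS daemon log files are named by date):

tracejob <pbs_jobid>
grep "<pbs_jobid>" $PBS_HOME/mom_logs/$(date +%Y%m%d)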

The sections below will summarize the common log messages found in the daemon log.

dw_paths_hook

The dw_paths_hook is a runjob hook and will record information in the PBS Server logs ($PBS_HOME/server_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/server_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify the

  1. filesystem where $PBS_HOME is located is not full and is not mounted read-only.
  2. PBS Professional commands are installed.
  3. /etc/pbs.conf file's $PBS_EXEC contains the correct path to the PBS Professional commands.


DataWarp paths failed. Contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f paths function and received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp paths succeed.

Confirms that the DataWarp paths function completed successfully.


DataWarp data_out & teardown job.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction. 


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly.

dw_setup_hook

The dw_setup_hook is an execjob_prologue hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/mom_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp setup failed. Verify #DW directives or contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f setup function and received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp setup succeed.

Confirms that the DataWarp setup function completed successfully.


DataWarp data_out & teardown job. Skipping setup.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job, and it will skip the DataWarp setup function. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction.


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly. 

dw_data_in_hook

The dw_data_in_hook is an execjob_prologue hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/mom_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp data_in failed. Verify #DW directives (file and directory paths) or contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f data_in function and received a non-zero exit code from the command.

Remember that dw_wlm_cli -f job_process, which the user is expected to execute before job execution, does *not* validate file or directory paths, so it is possible that the user has specified an invalid path.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp data_in succeed.

Confirms that the DataWarp data_in function completed successfully.


DataWarp data_out & teardown job. Skipping data_in.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job, and it will skip the DataWarp data_in function. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction.


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly. 

dw_pre_run_hook

The dw_pre_run_hook is an execjob_launch hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/mom_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp pre_run failed. Contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f pre_run function and received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp pre_run succeed.

Confirms that the DataWarp pre_run function completed successfully.


DataWarp data_out & teardown job. Skipping setup.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job, and it will skip the DataWarp setup function. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction.


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly.

dw_post_run_hook

The dw_post_run_hook is an execjob_epilogue hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/mom_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp post_run failed. DataWarp instance remains. Contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f post_run function and received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp post_run succeed.

Confirms that the DataWarp post_run function completed successfully.


DataWarp data_out & teardown job. Skipping post_run.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job, and it will skip the DataWarp post_run function. This log will help the administrator trace the DataWarp workflow that was discussed earlier in Introduction.


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly.

dw_secondary_job_hook

The dw_secondary_job_hook is an execjob_epilogue hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/mom_priv/jobs. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp secondary job submission (data_out and teardown) failed. DataWarp instance remains.

The hook attempted to submit a secondary job on behalf of the user to trigger the DataWarp data_out and teardown. Unfortunately, the submission failed; the hook received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp post_run failed. DataWarp instance remains. Contact your Administrator.

The dw_secondary_job_hook is checking for a successful DataWarp post_run. If the dw_post_run_hook failed, then the dw_secondary_job_hook will acknowledge the failure and leave the DataWarp instance.

The administrator should consider enabling debug within the dw_post_run_hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp secondary job submission (data_out and teardown) succeed.

Confirms that the submission of the secondary job on behalf of the user completed successfully.


DataWarp data_out & teardown job. Skipping secondary job submission.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp data_out and teardown job, and it will skip the submission of the secondary job. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction.


Not a DataWarp job.

Acknowledges that the job is *not* a DataWarp job, and will bail out of the hook quickly.

dw_data_out_n_teardown_hook

The dw_data_out_n_teardown_hook is an execjob_epilogue hook and will record information in the PBS MOM logs ($PBS_HOME/mom_logs).


DataWarp: jobscript did not exist. Contact your Administrator.

The dw_wlm_cli command requires the job script through the life cycle of the job's execution. If the hook cannot locate the job script, then this is a critical issue that will prohibit the integration from executing correctly.

The job script will be found in $PBS_HOME/spool. The job script filename will be based on the PBS jobid with a 'SC' file extension (e.g., 3855.dw01.SC). The log will have the full path to the expected job script.

The administrator should verify that the filesystem where $PBS_HOME is located is not full and is not mounted read-only.


DataWarp: data_out failed. Verify #DW directives (file and directory paths) or contact your Administrator. Skipping teardown.

The hook attempted to execute the dw_wlm_cli -f data_out function and received a non-zero exit code from the command.

Remember that dw_wlm_cli -f job_process, which the user is expected to execute before job execution, does *not* validate file or directory paths, so it is possible that the user has specified an invalid path.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp teardown failed. Contact your Administrator.

The hook attempted to execute the dw_wlm_cli -f teardown function and received a non-zero exit code from the command.

The administrator should consider enabling debug within the hook; specifically, in the run_command function. By enabling debug, the stdout, stderr, and rc will be logged in the daemon log file.


DataWarp data_out succeed.

Confirms that the DataWarp data_out function completed successfully.


DataWarp teardown succeed.

Confirms that the DataWarp teardown function completed successfully.


DataWarp setup, data_in, or post_run job. Skipping data_out and teardown.

Acknowledges that the job is associated to DataWarp. More specifically, the log confirms that the job is a DataWarp setup, data_in, or post_run job, and it will skip the DataWarp data_out and teardown functions. This log will help the administrator trace the DataWarp workflow that was discussed earlier in the Introduction.


Not a DataWarp data_out & teardown job.   

Acknowledges that the job is *not* a DataWarp data_out & teardown job, and the hook will exit quickly.

Jira ticket: PP-1128

Community Discussion: To Be Posted..