There are two ways to monitor your running job:
- Run the Unix command top on the compute node where your job is executing, or
- Query LSF via bjobs for currently running jobs
Each method has advantages and disadvantages, which we'll discuss below. Either one gives you feedback on how your code is behaving while it runs.
Monitoring execution via top on the compute node:
1) Use bjobs to figure out what execution host (EXEC_HOST) your job is running on
2) For each host, get an interactive session on that machine (replace EXEC_HOST with the actual machine name):
bsub -q interactive -Is -W 24:00 -R "rusage[mem=1000]" -m EXEC_HOST /bin/bash
(This command gets a bash shell on the named machine with 1 core for 24 hrs with 1000 MB of RAM)
3) Run top, and watch for your processes.
4) (Optional) Press u and enter your username to view only your processes.
5) (Optional) Press 1 (the number one) to see the utilization of each CPU core or hyperthread.
6) Press ctrl-c to stop monitoring.
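The first two steps above can be scripted. What follows is a minimal sketch rather than part of the documented procedure. It assumes the default bjobs column layout (STAT in column 3, EXEC_HOST in column 6, multi-core hosts shown as N*hostname) and it does not handle jobs that span several hosts. The function names exec_hosts and session_command are hypothetical helpers introduced only for illustration.

```shell
#!/bin/bash

# Read default `bjobs` output on stdin and print the unique execution
# hosts of running jobs. Assumes STAT is column 3 and EXEC_HOST is
# column 6; a leading core count such as "4*" is stripped.
exec_hosts() {
  awk 'NR > 1 && $3 == "RUN" {
         host = $6
         sub(/^[0-9]+\*/, "", host)
         print host
       }' | sort -u
}

# Print the interactive-session command from step 2 for one host.
session_command() {
  printf 'bsub -q interactive -Is -W 24:00 -R "rusage[mem=1000]" -m %s /bin/bash\n' "$1"
}

# Example use on a login node (commented out so that sourcing this
# file does not contact the scheduler):
#   bjobs | exec_hosts | while read -r host; do session_command "$host"; done
```

The loop in the final comment only prints one bsub command per host. Run the printed command yourself to open the session, then start top there as in step 3.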
Monitoring execution via bjobs:
This is less precise because LSF (the scheduler) collects runtime data from the execution hosts only periodically, so the information lags behind what is actually happening. Also, you may not have information for the first 1 to 5 minutes of your job.
1) Use bjobs to figure out the job IDs for your jobs
2) For each job, use bjobs -l jobID to get job details
3) Look at the IDLE_FACTOR statistic. For a 1 core job at 100% efficiency, this will be 1. For a 2 core job, this will be 2, etc. A job is over-threading (using more cores than it requested) when the value is well above the requested core count, for example asking for 2 cores and seeing a value > 2.5 (e.g. 4, 6, etc).
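The check in step 3 can be sketched as a pair of shell functions. This is an illustration under stated assumptions, not an official LSF tool. It assumes that bjobs -l prints a line of the form "IDLE_FACTOR(cputime/runtime): VALUE", and it generalizes the 2-core example above (a value above 2.5) to a threshold of 1.25 times the requested core count. The names idle_factor and over_threaded are hypothetical.

```shell
#!/bin/bash

# Read `bjobs -l JOBID` output on stdin and print the IDLE_FACTOR value.
# Assumes a line like " IDLE_FACTOR(cputime/runtime):   0.99".
idle_factor() {
  awk -F: '/IDLE_FACTOR/ {
             gsub(/[ \t]/, "", $NF)   # strip padding around the number
             print $NF
             exit
           }'
}

# over_threaded FACTOR CORES
# Succeeds (exit status 0) when FACTOR exceeds 1.25 x CORES. The 1.25
# ratio is an assumption taken from the "2 cores, value > 2.5" example.
over_threaded() {
  awk -v f="$1" -v c="$2" 'BEGIN { if (f > c * 1.25) exit 0; exit 1 }'
}

# Example use (commented out so that sourcing this file does not
# contact the scheduler); 12345 is a placeholder job ID:
#   f=$(bjobs -l 12345 | idle_factor)
#   over_threaded "$f" 2 && echo "job 12345 uses more cores than requested"
```

Because LSF gathers this statistic periodically, the value may be missing during the first few minutes of a job. In that case idle_factor prints nothing and the check should be repeated later.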