- Guidelines on Choosing Resources
- Past Jobs Max Memory Info
- RAM Usage in Analysis/Programming Environments
"Take what you need. Need what you take."
Since the compute grid is a shared computing resource, being a good community member is important to ensure that everyone has fair access to do their work. For that purpose, it is important to accurately request the amount of RAM and number of CPU/cores that you wish to use.
Why is this important? The Grid is a complex, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs, that is, find space (RAM + CPUs) on a computer to do your work. If insufficient memory is allocated, your program may crash, often in an unintelligible way; if too much memory is allocated, resources that could be used for other researchers' sessions (interactive or batch) will be wasted. Additionally, your "fairshare", a number used in calculating the priority of getting your work scheduled and running, can be adversely affected by over-requesting. Therefore, it is important to be as accurate as possible when requesting cores and memory through the default application submission scripts or through custom ones.
Many scientific computing tools can take advantage of multiple processing cores, but many cannot. A typical MATLAB, R, or Python script, for example, will not use multiple cores. On the other hand, Stata, a graphical console for statistics, runs substantially faster on multiple cores, though not every Stata function benefits. Please read the program documentation to understand the multicore capabilities or check with the RCS staff before requesting multiple cores for a given application.
Finally, when you start a session interactively or via batch, the RAM and CPU that you've requested are reserved only for your use, even when your session is idle. This means that, until your code finishes or you exit the interactive program, those resources cannot be used by anyone else, and are wasted if sitting idle.
Figuring out the appropriate amount of RAM and CPU/cores takes a little bit of sleuthing, but becomes very easy in time. Here are some general guidelines:
If you have data you've worked with before or work that you're repeating:
- More is not really better, since this is a shared resource
- Use fewer cores (1!) for interactive work, especially if you plan on having a session open over several days. It is considered bad form to leave sessions open for more than 7 days, as no one can use the resources that are reserved exclusively for your use.
- Choosing multiple cores for interactive work is OK if you will be finishing your work in hours to a day or two. Please do not let these sessions sit idle.
- Check your MAX MEM usage (see below) from past job history, and select the best-fit memory footprint.
- A little more difficult: write custom LSF job submission commands that closely match the memory you need. You'll need to do this if you require more than 30 GB of RAM, as the default wrapper scripts allow RAM allocations of at most 30 GB.
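As a rough illustration of the last point, the sketch below assembles a custom bsub command line from a memory request and a core count. The -n, -M, and -R "rusage[mem=...]" options are standard LSF bsub options, but the unit LSF applies to the memory values depends on how the cluster is configured, so please confirm it with RCS before relying on the numbers. The helper name build_bsub_command, the script name analysis.R, and the value 40000 are hypothetical and used only for illustration.

```python
import shlex


def build_bsub_command(program_args, mem_value, cores=1):
    """Assemble (but do not run) an LSF bsub command line.

    mem_value is passed through unchanged; whether LSF reads it as KB, MB,
    or GB depends on the cluster configuration (an assumption to verify).
    """
    return [
        "bsub",
        "-n", str(cores),                    # number of cores to reserve
        "-M", str(mem_value),                # per-job memory limit
        "-R", f"rusage[mem={mem_value}]",    # memory to reserve when scheduling
        *program_args,
    ]


# Hypothetical example: a single-core R script with a request above the
# 30 GB ceiling of the default wrapper scripts.
cmd = build_bsub_command(["Rscript", "analysis.R"], mem_value=40000)
print(shlex.join(cmd))  # copy-and-paste form, quoted for the shell
```

Printing the command instead of executing it makes it easy to review the request before submitting.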
If you really have no idea where to start, try one of the following approaches for approximating RAM and/or CPU usage:
- Remember that MATLAB, R, and Python can only use 1 CPU unless you've programmed it to do otherwise.
- Stata can use multiple CPUs, but be conservative. Again, more is not necessarily better.
- Each language has commands that will give you the memory usage of your data while loaded (in memory).
- Or, if you are not creating new data structures after reading in a data file, try a RAM footprint that is 10x the data file size. If you are creating new ones, try 20x to 30x.
- Or, try a large memory size (e.g. 20G), finish your work, then check the MAX MEM usage and select a best-fit memory footprint next time.
- Give yourself about 20% wiggle room
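The rules of thumb above can be written down as a small calculation. The sketch below is only an illustration: the function names are hypothetical, the multipliers are the 10x and 20x to 30x figures from the list, and rounding the padded value up to a whole GB is an assumption made for convenience.

```python
import math


def estimate_from_file_size(file_size_gb, creates_new_structures=False):
    """Return a (low, high) starting range in GB for a first RAM request."""
    if creates_new_structures:
        # New data structures are built after loading: try 20x to 30x.
        return (20 * file_size_gb, 30 * file_size_gb)
    # Data is only read in: try 10x the file size.
    return (10 * file_size_gb, 10 * file_size_gb)


def with_wiggle_room(measured_gb, wiggle=0.20):
    """Pad a measured MAX MEM value by about 20% and round up to a whole GB."""
    return math.ceil(measured_gb * (1 + wiggle))


print(estimate_from_file_size(0.5))                               # file is only read
print(estimate_from_file_size(0.5, creates_new_structures=True))  # new structures built
print(with_wiggle_room(12))                                       # past job peaked at 12 GB
```

The first function suits a first attempt with unfamiliar data; the second suits repeat work where a MAX MEM value from a past job is already known.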
See our Choosing Resources seminar PDF for more details on these guidelines.
LSF, the scheduling software, makes it easy to figure out how much RAM you've used for a currently running or past job. Use either the bjobs command for currently running jobs or the bhist command for finished jobs, and search the output for the text MAX MEM with grep to determine the maximum amount of RAM your jobs used.
For example, using the -l flag (long format) to display info for currently running jobs:

    [jharvard@rhrcscli01:~]$ bjobs -l | grep -E "Application|IDLE|MAX"
    Job <144795>, User <rfreeman>, Project <XSTATA>, Application <stata-mp4-30g>, S
    IDLE_FACTOR(cputime/runtime): 0.01
    MAX MEM: 56 Mbytes; AVG MEM: 49 Mbytes
The next example can be used to display information for jobs that ran since a particular date. Here we will use the flags -a (all jobs), -l (long format), and -S (submitted date; comma indicates range up to today):

    [jharvard@rhrcscli01:~]$ bhist -a -l -S 2017/09/1, | grep -E "Application|IDLE|MAX"
    Job <158502>, User <jharvard>, Project <STATA-SE>, Application <stata-se-5g>, I
    MAX MEM: 12 Gbytes; AVG MEM: 12 Mbytes
    Job <158547>, Job Name <MATLAB>, User <rfreeman>, Project <MATLAB>, Application
    MAX MEM: 607 Mbytes; AVG MEM: 511 Mbytes
The MAX MEM values can now inform how much RAM to ask for the next time you do similar work or work with the same data.
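If you have many jobs to review, the MAX MEM figures can also be pulled out of saved bjobs or bhist output with a short script. The following is a minimal sketch that assumes the output uses the "MAX MEM: <number> <unit>bytes" form shown in the examples above; the function name and the conversion to MB (using factors of 1024) are illustrative choices.

```python
import re

# Conversion factors to MB for the unit labels seen in LSF output.
UNIT_TO_MB = {"Kbytes": 1 / 1024, "Mbytes": 1, "Gbytes": 1024, "Tbytes": 1024 * 1024}

MAX_MEM_RE = re.compile(r"MAX MEM:\s*(\d+(?:\.\d+)?)\s*([KMGT]bytes)")


def max_mem_mb(lsf_output):
    """Return every MAX MEM value found in the text, converted to MB."""
    return [float(value) * UNIT_TO_MB[unit]
            for value, unit in MAX_MEM_RE.findall(lsf_output)]


# Sample text copied from the bhist example in this section.
sample = """\
Job <158502>, User <jharvard>, Project <STATA-SE>, Application <stata-se-5g>, I
MAX MEM: 12 Gbytes; AVG MEM: 12 Mbytes
Job <158547>, Job Name <MATLAB>, User <rfreeman>, Project <MATLAB>, Application
MAX MEM: 607 Mbytes; AVG MEM: 511 Mbytes
"""

for value in max_mem_mb(sample):
    print(f"{value:.0f} MB")
```

The largest value across similar jobs, plus some wiggle room, is a reasonable memory request for the next run.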
Estimating RAM usage can be easy in the analysis and programming environments that researchers typically use. Each language has commands that will give you the memory usage of your data while loaded (in memory). A few examples are listed below.
NB! In all situations, one should add at least 0.5 - 1 GB of RAM to the values reported by your environments when requesting RAM as a part of your job submission to account for the overhead of running the application.
In Stata, the memory command will display a number of details about the RAM usage. The grand total indicates both the amount of memory actually used and the amount of memory allocated. It is best to take the larger of the two values.
Please see the Stata manual entry for memory for more information.
In R, the mem_used() function of the pryr package reports the memory used by your R objects. Please see the Memory chapter of Advanced R (http://adv-r.had.co.nz/memory.html, by Hadley Wickham) for more information.
In Python, the guppy module is a library and programming environment that currently provides, in particular, the Heapy subsystem, which supports object and heap memory sizing, profiling and debugging. (For Python 3, the package is distributed as guppy3 and is still imported as guppy.)

    from guppy import hpy
    h = hpy()
    print(h.heap())  # the report includes a line such as: Total size = 19909080 bytes.
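If guppy is not installed, Python's standard-library tracemalloc module can give a rough reading instead. This is a substitute technique rather than part of Heapy, and it only counts memory allocated by Python after tracing starts, so treat the result as a lower bound. In the sketch below, the list of floats is a hypothetical stand-in for your data, and the 1 GB added at the end is the upper end of the application overhead suggested earlier in this section.

```python
import tracemalloc

OVERHEAD_BYTES = 1 * 1024**3  # 1 GB of application overhead (upper end of 0.5 - 1 GB)

tracemalloc.start()

# Hypothetical stand-in for reading in and working with your data.
data = [float(i) for i in range(1_000_000)]

# get_traced_memory() returns (current, peak) in bytes since tracing started.
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

request_gb = (peak_bytes + OVERHEAD_BYTES) / 1024**3
print(f"peak traced: {peak_bytes / 1024**2:.1f} MB")
print(f"suggested request: {request_gb:.2f} GB")
```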
Sadly, MATLAB does not have a command common to all platforms that will give memory usage. On Windows, one can use the memory function:

    >> A = magic(1000); B = phantom(500); C = peaks(250); in_use = memory()
    Maximum possible array:            6397 MB (6.708e+09 bytes) *
    Memory available for all arrays:   6397 MB (6.708e+09 bytes) *
    Memory used by MATLAB:             1094 MB (1.148e+09 bytes)
    Physical Memory (RAM):             7861 MB (8.243e+09 bytes)
    * Limited by System Memory (physical + swap file) available.
On Mac and Linux, one must download and run the small script monitor_memory_whos.m in order to determine what is going on under the hood:

    >> A = magic(1000); B = phantom(500); C = peaks(250); in_use = monitor_memory_whos
    in_use = 10.0136
The value reported, in MB, is the memory used by objects in the application space. MATLAB itself will use about 500 MB. Please see this MathWorks technote for more information on MATLAB's memory usage.
If you need help making RAM estimates for an environment not listed above, please feel free to contact RCS.