Major Usability Changes

The following information describes in detail the major changes that will affect usability on the compute grid.

What is Changing & How Will This Affect Me?

What is Changing

How will this affect me?

OS has been upgraded to Red Hat Enterprise Linux 7.5

Pro: Ability to use newer software and technologies

Con: Some older software may no longer work. Packages and modules compiled for Perl, Python, and R may need to be recompiled before they work again.

New NoMachine & user desktop

As with any new OS, some re-acclimation may be necessary to use the file browser, run applications, etc.

Job sandboxing (CGROUPS) will now restrict out-of-control CPU usage

This technology will prevent applications which are using more cores than requested from the scheduler ("threading out") from negatively affecting other jobs.

If this is your application, it will likely run more slowly because it is now confined to the cores it requested. To avoid this, match the number of cores the application uses to the number of cores requested at job submission time.
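As a minimal sketch of matching cores used to cores requested, a Python job could size its thread pool from the job's allocation rather than from the machine's total core count. The variable name `REQUESTED_CORES` is hypothetical and is not something the grid is known to export; substitute whatever your submission script or the scheduler actually provides. The fallback uses the CPU affinity of the process, which may or may not reflect the sandbox limit depending on how it is configured.

```python
import os


def usable_cores(env=os.environ, var="REQUESTED_CORES"):
    """Return how many worker threads or processes this job should start.

    `REQUESTED_CORES` is a hypothetical variable name used for illustration;
    replace it with the variable your submission script actually sets.
    """
    value = env.get(var)
    if value is not None and value.isdigit() and int(value) > 0:
        return int(value)
    # Fallback: number of CPUs this process may run on (Linux only).
    # This can overstate the allocation if the sandbox limits CPU time
    # rather than pinning the job to specific CPUs.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return 1  # conservative default when nothing else is known


if __name__ == "__main__":
    n = usable_cores()
    # Many numeric libraries read these variables when they are first
    # loaded, so set them before importing numpy, scipy, and similar.
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)
    print(f"Using {n} worker thread(s)")
```

The same idea applies to any tool with a thread or worker setting: pass it the number of cores you requested, not the number the node has.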

Hard-coded limits of 12-cores and 80 GB RAM across all jobs have been removed

Removing this restriction lets people make maximum use of the compute grid, whether or not it is fully utilized. Other restrictions are in place to allow immediate access to interactive sessions, prevent compute hogging, and eliminate the resource 'squatting' of the past.

Jobs will (mostly) no longer be allowed to run endlessly

The prior configuration allowed people to start a job (e.g. a Stata session) and leave it running endlessly, and no one else could use those resources while the job sat idle. We have restricted this 'no time limit' behavior to batch jobs only, and only for a limited number of job slots (cores).

Queues have been split: Short run queues will allow more cores but less time, while long run queues will allow the opposite

To provide flexibility for different types of work patterns, 'short' queues have been created for both interactive and normal (batch) work. These allow more cores per job but a shorter run time. Likewise, 'long' queues provide longer run times but with fewer cores available.

Interactive queue has been split into 1-day short and 3-day long queues

24-hr (1d) 'short_int' has a max of 12-cores per job submitted. Max two jobs.

72-hr (3d) 'long_int' has a max of 4-cores per job submitted. Max five jobs.

There is a max 24-core or 3-job limit across all interactive queues

We want to ensure that everyone has the ability to get at least one interactive session to do work. This safeguard will limit the likelihood that enthusiastic persons will crowd out new sessions by taking most of the interactive slots (max 24 hrs on short_int!).
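The interactive limits above can be sketched as a simple pre-submission check. This is only an illustration of how the stated numbers combine, not the scheduler's actual logic; the function name `can_start` is hypothetical, and the sketch assumes the combined limit means a new session is refused once either 24 cores or 3 jobs would be exceeded.

```python
# Per-queue limits copied from the text above.
PER_QUEUE = {
    "short_int": {"max_cores_per_job": 12, "max_jobs": 2},
    "long_int": {"max_cores_per_job": 4, "max_jobs": 5},
}
# Combined limits across all interactive queues (assumed: whichever is hit first).
TOTAL_MAX_CORES = 24
TOTAL_MAX_JOBS = 3


def can_start(queue, cores, running):
    """Return True if one more interactive job would fit under the limits.

    running: list of (queue, cores) tuples for your current interactive jobs.
    """
    limits = PER_QUEUE[queue]
    if cores > limits["max_cores_per_job"]:
        return False
    jobs_in_queue = sum(1 for q, _ in running if q == queue)
    if jobs_in_queue + 1 > limits["max_jobs"]:
        return False
    if len(running) + 1 > TOTAL_MAX_JOBS:
        return False
    if sum(c for _, c in running) + cores > TOTAL_MAX_CORES:
        return False
    return True
```

For example, `can_start("short_int", 12, [("short_int", 12)])` is allowed under this reading, while adding a third 4-core `long_int` session on top of two 12-core `short_int` sessions is not, because it would exceed 24 cores.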

Normal (batch) queue has been split into 3-day short and 7-day long queues

72 hr (3d) 'short' has a max of 16-cores per job submitted. No max job-count limits

168 hr (7d) 'long' has a max of 12-cores per job submitted. No max job-count limits

Jobs dispatched to the batch queues will be affected by Fairshare priority

There are no limits to the number of batch jobs that can be submitted to the 'short' and 'long' queues. Instead, a priority score (Fairshare) determines your dispatch order and is based on your past usage and other factors relative to others who also have submitted jobs. This allows fair access to compute by all persons.
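To make the idea concrete, here is a toy model of usage-based dispatch ordering. The formula is invented purely for illustration and is not the scheduler's actual Fairshare calculation, which depends on "other factors" not described here. The usage figures and the function name `dispatch_order` are likewise hypothetical.

```python
def dispatch_order(pending, recent_usage):
    """Order pending jobs so that lighter recent users are dispatched first.

    pending: list of (job_id, user) tuples in submission order.
    recent_usage: dict mapping user -> recent usage (e.g. core-hours).
    """
    def priority(user):
        # Toy formula: priority falls as recent usage grows.
        return 1.0 / (1.0 + recent_usage.get(user, 0.0))

    # sorted() is stable, so jobs with equal priority keep submission order.
    return sorted(pending, key=lambda job: -priority(job[1]))


if __name__ == "__main__":
    pending = [("a1", "alice"), ("b1", "bob"), ("a2", "alice")]
    usage = {"alice": 100.0, "bob": 5.0}  # made-up numbers
    for job_id, user in dispatch_order(pending, usage):
        print(job_id, user)
```

In this toy model the user with less recent usage moves ahead of a heavier user, even when the heavier user submitted first. Nobody is blocked from submitting; only the order changes.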

Drop-down menus in NoMachine & command-line wrapper scripts now include more Stata choices

We want to provide more flexibility in the power you can use for your job. -SE and -MP4 will run in the 3-day 'long_int', and anything higher will run in the 1-day 'short_int'. Not every Stata routine is parallelized or highly efficient: choose your MP version wisely.
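The rule above can be written as a small lookup. This is a sketch only: the function name `queue_for_stata` is hypothetical, the wrapper scripts may implement the mapping differently, and treating MP flavors below MP4 as 'long_int' is an assumption.

```python
import re


def queue_for_stata(flavor):
    """Map a Stata flavor such as 'SE', 'MP4', or 'MP8' to an interactive queue."""
    name = flavor.strip().lstrip("-").upper()
    if name == "SE":
        return "long_int"
    match = re.fullmatch(r"MP(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized Stata flavor: {flavor!r}")
    cores = int(match.group(1))
    # Up to 4 cores fits the 3-day queue; anything higher goes to the 1-day queue.
    return "long_int" if cores <= 4 else "short_int"
```

The trade-off is run time versus cores: a larger MP version gets more cores but only a 1-day session.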

Drop-down menus in NoMachine & command-line wrapper scripts will have a max 20G memory partition (30G available separately)

Our current compute grid suffers from 'RAM starvation': there are plenty of free CPUs, but not enough RAM to run new jobs. In a significant fraction of 30G jobs, only half or less of the RAM had been used. 20 GB is the new max per drop-down menu job. If you require more than 20 GB, you can use the command line to request it. Please see the website docs or contact RCS.

Command-line wrapper scripts have been standardized

We have tried to ensure that the shell-based wrapper scripts have similar options for parameters. Please see the updated list at https://grid.rcs.hbs.org/command-line. NOTE: this may change over time as we uncover usability issues.

Anaconda environments are used for R and Python

The installations for R and both versions of Python use the Anaconda package management system from Continuum. This means that local, virtual environments are possible for custom, project-based work. Please see http://conda.io for more information. (Note that not all commands will work as a low-privileged user.)

New GUI applications for research & data management

Spyder is packaged with Anaconda's installation of Python. You can now use this GUI IDE for scripting. More info at https://www.spyder-ide.org/.

GitKraken is a GUI for the version control software Git. It uses commands and concepts similar to those of the command-line tools, and works with local and remote repositories like GitHub.com. More info at https://www.gitkraken.com/git-client.

Math Kernel Library available for many programs

Intel's Math Kernel Library (MKL) is a set of pre-compiled routines for various math and analytical functions, highly optimized for Intel CPUs, which our computing environment is built on. The Anaconda distributions of R and Python will use the MKL automatically. If you are using other programming environments or compiling your own software, contact us to learn how you can leverage it. More info at https://software.intel.com/en-us/mkl.

 

Summary of Compute Grid Queue Changes

 

QUEUE            TYPE         LENGTH              # CORES / JOB MAX  # JOBS MAX  FAIRSHARE?
short_int        interactive  24 hours (1 day)    12 cores           2 jobs max  No
long_int         interactive  72 hours (3 days)   4 cores            5 jobs max  No
short            batch        72 hours (3 days)   16 cores           no limit    Yes
long             batch        168 hours (7 days)  12 cores           no limit    Yes
unlimited        batch        no limit            8 cores            no limit    Yes
sas_interactive  interactive  3 days              2 cores            2 jobs max  No
sas_normal       batch        7 days              4 cores            no limit    Yes
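As a rough aid for choosing a queue, the general-purpose rows of the queue summary above can be encoded as data and filtered by what a job needs. This is a sketch under stated assumptions: the helper `candidate_queues` is hypothetical, the two SAS queues are left out on the assumption that they are reserved for SAS work, and job-count limits and Fairshare are not modeled.

```python
from typing import NamedTuple, Optional


class Queue(NamedTuple):
    name: str
    kind: str                  # "interactive" or "batch"
    max_hours: Optional[int]   # None means no run-time limit
    max_cores: int             # maximum cores per job


# Values copied from the queue summary; SAS queues omitted (assumption).
QUEUES = [
    Queue("short_int", "interactive", 24, 12),
    Queue("long_int", "interactive", 72, 4),
    Queue("short", "batch", 72, 16),
    Queue("long", "batch", 168, 12),
    Queue("unlimited", "batch", None, 8),
]


def candidate_queues(kind, cores, hours):
    """Return names of queues whose limits cover the requested cores and hours."""
    return [
        q.name
        for q in QUEUES
        if q.kind == kind
        and cores <= q.max_cores
        and (q.max_hours is None or hours <= q.max_hours)
    ]
```

For instance, a 16-core batch job needing 48 hours fits only 'short', while an 8-core batch job needing 200 hours fits only 'unlimited'. An interactive session asking for 12 cores for 48 hours fits no queue and has to give up either cores or time.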