Compute Grid 2.5

Welcome to the new and improved HBS Compute Grid v2.5!

Introduction

Based on both research computing trends and recommendations from the HBS research computing environment assessment, Research Computing Services (RCS) has been working closely with HBS IT to make improvements to our local compute grid. This new compute grid, v2.5, provides the following updates and enhancements:

  • Improved compute capacity through more hardware and better use of existing hardware
  • Significantly fewer restrictions on compute capacity
  • Increased safeguards to prevent CPU spillover and memory problems
  • Newer OS and software versions, and improved usability
  • More software titles, including GitKraken, a GUI Git version control application, and Spyder, a Python IDE
  • Better command-line submission scripts, to improve productivity

We have summarized below what we believe are the need-to-know points for you to begin your work in the new environment. We ask that you read through this and related documents / web pages thoroughly. Note that both environments will be running side by side during a transition period until the end of December (the anticipated transition close). Thus, there will be differences in directions, hostnames, URLs, etc. We have tried to document these differences as much as possible, but please let us know if there are any omissions. Of course, contact us at any time if you have questions.

Major Changes to Note 

We put together some information to summarize and detail the improvements to our computing environment:

The two major interrelated points to remember as you conduct your work are: 

  • Interactive sessions are now limited to 1 or 3 days. Most batch sessions are limited to 3 or 7 days.  
  • The per-user resource limits have been either raised or eliminated.  

The queue changes and limits are discussed in detail on both the Major Usability Changes page and our LSF Partitions web page.

Finally, PAC is not immediately available in the new environment. We hope to have this running in early 2019. 

Major Considerations 

During the transition period into mid-December, we will migrate compute nodes from the old (v2.0) environment into the new (v2.5) environment. We invite you to use the new environment immediately, but do keep in mind that compute capacity is slightly constrained at the start and will increase over time.

Additionally, the queue and scheduling setup is a work in progress! As we add more compute nodes, watch usage patterns, and determine how the scheduler responds, we may need to adjust the scheduling policies and limits, though we will communicate any changes clearly and with advance notice.

Login

During this transition period, you will need to use new hostnames to log into the new environment: 

  • For SSH / terminal sessions, log in to hbsgrid.hbs.edu. You might receive a warning about a host key/fingerprint. Please accept/save the changes, as this is expected behavior for a new login server.
  • For NoMachine GUI sessions, follow the setup instructions on our website, but use researchnx-new.hbs.edu for the hostname.

Obtaining support and leaving feedback

If you experience any unusual behavior, or if inspiration visits you about new features or enhancements as you work, please contact RCS via email, our preferred communication vehicle. Please try to describe your problem in detail; error messages and screen captures of the problem are immensely helpful! Of course, phone calls to 617-495-6100 or drop-in visits if you are nearby are always welcome!

As mentioned, the new environment is a work in progress. We’ve had other researchers and our own RCS team using the new environment since late August. Testing has been going smoothly, and we believe that we have caught most of the problems. But in complex systems, one cannot catch every problem in advance. We hope you will bear with us as we work out any final, unforeseen problems.

Problem resolution and maintenance 

When problems arise, we will decide case by case whether to fix them immediately or to schedule the fix for a maintenance window. In certain situations, we may need to take measures that jeopardize running sessions, such as killing running jobs or rebooting login or compute nodes, though we will try to avoid this whenever possible. For server reboots, we will make every effort to schedule these during a maintenance window, which we will announce as needed and with appropriate lead time to minimize work interruptions.

Other Items to Note 

Please be aware of the following items that may affect your work: 

  • Compiled software: Since this new environment includes an OS upgrade, software that you compiled on the old (v2.0) grid is not guaranteed to work in the new environment. This includes packages and modules in R and Python. If an add-on is compiled, you are advised to remove the package or module and re-install it. This will force your programming environment to recompile the add-on during installation, which will then be OS-compatible.
  • SAS, SAS server, and SAS Connect: During the transition period, SAS run via interactive and batch sessions may be slower than on the v2.0 environment. This is due to a temporary setup of the storage mounts during the transition period. In mid-December, when the SAS server is transitioned into the new environment, performance should be on par with or exceed that of the v2.0 environment.
  • MariaDB: As of Nov 9th, MariaDB is currently accessible only via the shell mysql client on the compute grid. ODBC, R, and Python functionality should be ready by Nov 19th.
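
For the compiled software item above, the following is a minimal Python sketch of one way to find which of your installed Python packages ship compiled files and are therefore candidates for removal and re-installation. It is not an RCS-provided tool. The function names and the suffix list are assumptions made for illustration, and the check is only a heuristic based on file names recorded at install time.

```python
from importlib import metadata

# Assumption: these suffixes indicate a compiled extension module.
# Versioned shared libraries (e.g. "libfoo.so.1") are not matched.
COMPILED_SUFFIXES = (".so", ".pyd", ".dylib")


def has_compiled_files(file_names):
    """Return True if any file name ends with a compiled-extension suffix."""
    return any(str(name).endswith(COMPILED_SUFFIXES) for name in file_names)


def compiled_distributions():
    """Return sorted names of installed distributions that list compiled files."""
    names = set()
    for dist in metadata.distributions():
        # dist.files is None when the install record is missing.
        files = dist.files or []
        if has_compiled_files(files):
            name = dist.metadata.get("Name")
            if name:
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    # Print one package name per line for the current Python environment.
    for name in compiled_distributions():
        print(name)
```

Run this with the Python interpreter whose packages you want to inspect. Packages that it prints are the ones to consider removing and re-installing in the new environment; pure-Python packages are not listed.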

 

Updated 12/7/2018