Program crashing? What's going on?
Software will occasionally crash while working on the compute grid for no apparent reason, which can be very frustrating. The typical reaction is to launch the program again and continue one's work. But why did this happen in the first place? And will it happen again?
The most common cause is an Out of Memory error. Unlike on a desktop or laptop, events on the compute grid are carefully logged. The big benefit is that we have an audit trail of what happened, so the cause is usually easy to diagnose. This requires a few simple commands in the terminal.
If using NoMachine, launch the terminal from Applications > Accessories > gnome-terminal:
As each program launch is considered a job, find out what jobs you previously ran with the bhist command:
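As a minimal sketch of this step, the listing could be wrapped as shown here. It assumes the scheduler's commands are on your PATH when you are logged in to the grid. The helper name list_recent_jobs is hypothetical, and the -a option (which asks bhist to include finished jobs as well as running ones) is a suggestion for when the crashed job does not appear in the plain listing.

```shell
# List your recent jobs from the scheduler's event log.
# list_recent_jobs is a hypothetical helper name, not part of the grid software.
# Any arguments are passed straight through to bhist.
list_recent_jobs() {
    if command -v bhist >/dev/null 2>&1; then
        bhist "$@"
    else
        # Guard so the sketch prints a hint instead of failing off the grid
        echo "bhist not found: run this in a terminal on the compute grid"
    fi
}

# Plain listing, as used in this article
list_recent_jobs

# If the crashed job is missing, include finished jobs too:
# list_recent_jobs -a
```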
We can see a few programs that have been running recently. If you haven't relaunched your program yet, you can do so now. The top line (#1) listed is your most recent crashed program launch. Let's get more details on that program run:
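The detail lookup might be sketched as follows. The job ID 12345 is a made-up placeholder, to be replaced with the ID shown on the top line of your own listing, and show_job_details is a hypothetical helper name.

```shell
# Show the long-format history for a single job.
# JOB_ID defaults to a placeholder value; set it to your own job's ID.
JOB_ID="${JOB_ID:-12345}"

show_job_details() {
    if command -v bhist >/dev/null 2>&1; then
        # -l requests the long (detailed) format for the given job ID
        bhist -l "$1"
    else
        # Guard so the sketch prints a hint instead of failing off the grid
        echo "bhist not found: run this in a terminal on the compute grid"
    fi
}

show_job_details "$JOB_ID"
```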
Running the command with the -l (long) option gives us quite a few more details. We can see what we ran, when we started it, how many CPUs it is using, and the RAM footprint requested (#2). At the bottom, we can see the run time in seconds, as well as the maximum memory used (#3). But before that, in the date/time log details, we can also see that an error condition was noted: Out of Memory (#4). This makes sense, as the maximum memory listed at the bottom is greater than the RAM footprint requested. On compute clusters such as the HBS Compute Grid, when your memory usage goes beyond the memory you requested, the system (the scheduler) kills your program so that it does not jeopardize the other programs (other people's work) running on that same machine.
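The comparison described above can be sketched as a small helper. The numbers are illustrative only and do not come from a real job; the function name check_memory is hypothetical, and both values are assumed to be whole numbers in the same unit (MB here), read by eye from the detailed output.

```shell
# Report whether a job's peak memory exceeded its request.
# Arguments: maximum memory used, memory requested (both whole MB).
check_memory() {
    max_used_mb=$1
    requested_mb=$2
    if [ "$max_used_mb" -gt "$requested_mb" ]; then
        echo "over request by $((max_used_mb - requested_mb)) MB"
    else
        echo "within request"
    fi
}

# Illustrative values only: 5300 MB used against a 5000 MB request
check_memory 5300 5000   # prints "over request by 300 MB"
```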
And so this is the source of our program crash. The fix: run with a slightly larger memory footprint on your next run. For more information on figuring out how much RAM to ask for, see http://grid.rcs.hbs.org/choosing-your-resources.
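One rough way to choose the new figure is to start from the maximum memory the crashed run reported and add some headroom. The sketch below assumes a 20% margin, which is an arbitrary starting point and not a grid recommendation; suggest_memory_mb is a hypothetical helper name. How you pass the result to your next launch depends on how you start the program, so the resource-selection page linked above remains the reference.

```shell
# Suggest a new memory request: observed peak plus a percentage of headroom,
# rounded up to the next whole MB.
suggest_memory_mb() {
    max_used_mb=$1
    headroom_pct=${2:-20}   # assumed default margin; tune to your workload
    echo $(( (max_used_mb * (100 + headroom_pct) + 99) / 100 ))
}

# Illustrative value only: a run that peaked at 5300 MB
suggest_memory_mb 5300      # prints 6360
```

Keep in mind that the peak recorded for a killed job is only what it reached before being stopped, so the program may need more than this on the next run.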
We hope this information helps you be more productive!