I’ve been struggling with a model error for over a week now, and it all seemed to revolve around this Fortran subroutine “shr_sys_flush()”. The error usually showed up in the log file as a “floating point exception”, but other times I got a “segmentation fault”. I had an epiphany this afternoon that will hopefully help other people with similar errors.
My issues are actually from the DOE ACME model, but it is still pretty similar to the NCAR CESM model in many ways.
The error I got is similar to the one described in this post on the CESM forum.
Below is an example of the error from the main log file.
Opened existing file /lustre/atlas1/cli900/world-shared/cesm/inputdata/lnd/clm2/rtmdata/rdirc_0.5x0.5_simyr2000_slpmxvl_c120717.nc 26
Opened existing file /lustre/atlas1/cli900/world-shared/cesm/inputdata/rof/rtm/initdata/rtmi.ICRUCLM45BGC.2000-01-01.R05_simyr2000_c130518.nc 26
Reading setup_nml
Reading grid_nml
Reading ice_nml
Reading tracer_nml
CalcWorkPerBlock: Total blocks: 1279 Ice blocks: 1279 IceFree blocks: 0 Land blocks: 0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
_pmiu_daemon(SIGCHLD): [NID 03243] [c8-2c1s5n3] [Fri Mar 24 14:54:18 2017] PE RANK 603 exit signal Floating point exception
[NID 03243] 2017-03-24 14:54:18 Apid 14081198: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03236] [c8-2c1s2n2] [Fri Mar 24 14:54:18 2017] PE RANK 751 exit signal Floating point exception
Application 14081198 exit codes: 136
Application 14081198 exit signals: Killed
Application 14081198 resources: utime ~1572s, stime ~1245s, Rss ~335152, inblocks ~19905289, outblocks ~88371084
Exit code 136 is 128 + 8, meaning the process was killed by signal 8 (SIGFPE, a floating point exception). That typically indicates a bad calculation somewhere, such as a division by zero or an integer overflow.
I used a disgusting number of print statements like this:
if (masterproc) write(iulog,*) 'whannah - atm_comp_mct.F90 - atm_init_mct() - FIRST'
to track this down, and whether the crash was in cam_comp.F90 or atm_comp_mct.F90, it always seemed to point to a line with "call shr_sys_flush(iulog)".
The variable “iulog” is just an integer unit number that identifies the log file (atm.log.nnnnnn-nnnnnn) that shows up in the run directory. It took me a while to understand that this subroutine is “flushing” buffered output to the log file on disk. Before shr_sys_flush is called, that output is held in an in-memory buffer. This is presumably done for performance reasons, because frequent disk I/O can dramatically slow down execution.
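The buffered-write-then-flush pattern can be sketched in plain Fortran, using the standard FLUSH statement that shr_sys_flush essentially wraps (the unit number and message here are just placeholders, not actual model code):

```fortran
program flush_demo
   implicit none
   ! In the model, iulog is the unit connected to atm.log; unit 6 (stdout)
   ! stands in for it here.
   integer, parameter :: iulog = 6

   ! This write may sit in an in-memory buffer rather than reaching the
   ! file immediately.
   write(iulog,*) 'checkpoint reached'

   ! Force the buffer out to the file. The Fortran 2003 FLUSH statement
   ! does for one unit what shr_sys_flush(iulog) does portably in CESM.
   flush(iulog)
end program flush_demo
```

If the program dies between the write and the flush, the buffered line may never appear in the log, which is exactly the trap described above.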
After realizing this, I finally understood that the real error was happening in between the flush calls. All the print statements I had put in place were getting stored in the buffer, but the model crashed before that information could be flushed to the iulog file!
There might be another way to write the print statements so that they bypass the buffer and go directly to the log file, but I’m not sure about this. The workaround I used was to add “call shr_sys_flush(iulog)” after each print statement.
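Putting that workaround together with the earlier debug print, the pattern looks like this (the message string is my own placeholder; masterproc, iulog, and shr_sys_flush are the model's existing variables and routine):

```fortran
! Debugging pattern: flush after every print so the output survives
! a crash that happens before the next scheduled flush.
if (masterproc) then
   write(iulog,*) 'whannah - atm_comp_mct.F90 - atm_init_mct() - FIRST'
   call shr_sys_flush(iulog)   ! push the line to atm.log immediately
end if
```

This makes the run slower because of the extra disk I/O, but it guarantees that the last print statement you see in the log really is the last one the model executed before crashing.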