CESM: Apparent Errors from shr_sys_flush()

I’ve been struggling with a model error for over a week now, and it all seemed to revolve around this Fortran subroutine “shr_sys_flush()”. The error usually showed up in the log file as a “floating point exception”, but other times I got a “segmentation fault”. I had an epiphany this afternoon that will hopefully help other people with similar errors.

My issue actually came up in the DOE ACME model, but ACME is still quite similar to the NCAR CESM model in many ways.

The error I got is similar to the error described in this post on the CESM forum.
Below is an example of the error from the main log file.

 Opened existing file
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/lnd/clm2/rtmdata/rdirc_0.5x0.5_simyr2000_slpmxvl_c120717.nc
 26
 Opened existing file
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/rof/rtm/initdata/rtmi.ICRUCLM45BGC.2000-01-01.R05_simyr2000_c130518.nc
 26
 Reading setup_nml
 Reading grid_nml
 Reading ice_nml
 Reading tracer_nml
CalcWorkPerBlock: Total blocks: 1279 Ice blocks: 1279 IceFree blocks: 0 Land blocks: 0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
_pmiu_daemon(SIGCHLD): [NID 03243] [c8-2c1s5n3] [Fri Mar 24 14:54:18 2017] PE RANK 603 exit signal Floating point exception
[NID 03243] 2017-03-24 14:54:18 Apid 14081198: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03236] [c8-2c1s2n2] [Fri Mar 24 14:54:18 2017] PE RANK 751 exit signal Floating point exception
Application 14081198 exit codes: 136
Application 14081198 exit signals: Killed
Application 14081198 resources: utime ~1572s, stime ~1245s, Rss ~335152, inblocks ~19905289, outblocks ~88371084

Exit code 136 corresponds to 128 plus signal 8 (SIGFPE), meaning the run was killed by a floating point exception — typically a division by zero or some other invalid arithmetic operation.

I used a disgusting amount of print statements like this:

if (masterproc) write(iulog,*) 'whannah - atm_comp_mct.F90 - atm_init_mct() - FIRST'

to track this down, and whether it was in cam_comp.F90 or atm_comp_mct.F90 it always seemed to point to a line with:

call shr_sys_flush(iulog)

The variable “iulog” is just an integer unit number that identifies the log file (atm.log.nnnnnn-nnnnnn) that shows up in the run directory. It took me a while to understand that this subroutine “flushes” buffered output to the log file on disk. Before shr_sys_flush is called, that output sits in an in-memory buffer. This buffering is done for performance reasons, since frequent disk I/O can dramatically slow down execution.
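To make the buffering idea concrete, here is a minimal sketch (not the actual CESM implementation, which lives in shr_sys_mod) of what a flush wrapper like shr_sys_flush boils down to, using the standard Fortran 2003 FLUSH statement:

```fortran
! Minimal sketch of a flush wrapper. Hypothetical name "my_flush";
! the real shr_sys_flush in CESM's shr_sys_mod does more bookkeeping.
subroutine my_flush(unit)
   integer, intent(in) :: unit
   ! Anything written to "unit" may sit in an in-memory buffer.
   ! FLUSH forces it out to the file on disk immediately, so the
   ! messages survive even if the program crashes right afterwards.
   flush(unit)
end subroutine my_flush
```

The key point is that without such a call, a crash can discard whatever was still in the buffer.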

After realizing this I finally understood that the real error is happening in between the flush statements. All the print statements that I had put in place were getting stored in the buffer, but the model crashed before this information could get flushed to the iulog file!

There might be another way to write the print statements so that they bypass the buffer and go directly to the log file, but I’m not sure about this. The workaround I used was to add “call shr_sys_flush(iulog)” after each print statement.
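Putting the workaround together, the pattern looks like this — a hedged sketch that reuses the iulog, masterproc, and shr_sys_flush names from the snippets above (shr_sys_flush comes from CESM's shr_sys_mod):

```fortran
! Debugging pattern: follow every diagnostic write with an explicit
! flush, so the message reaches the log file on disk even if the
! model crashes on the very next line of code.
if (masterproc) then
   write(iulog,*) 'whannah - atm_comp_mct.F90 - atm_init_mct() - FIRST'
   call shr_sys_flush(iulog)   ! push the buffered message to disk now
end if
```

This adds extra I/O, so it is worth removing (or commenting out) these flushes once the crash has been located.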
