CESM: Apparent Errors from shr_sys_flush()

I’ve been struggling with a model error for over a week now, and it all seemed to revolve around this Fortran subroutine “shr_sys_flush()”. The error usually showed up in the log file as a “floating point exception”, but other times I got a “segmentation fault”. I had an epiphany this afternoon that will hopefully help other people with similar errors.

My issues are actually from the DOE ACME model, but it is still pretty similar to the NCAR CESM model in many ways.

The error I got is similar to the error described in this post on the CESM forum.
Below is an example of the error from the main log file.

 Opened existing file
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/lnd/clm2/rtmdata/rdirc_0.5x0.5_simyr2000_slpmxvl_c120717.nc
 26
 Opened existing file
 /lustre/atlas1/cli900/world-shared/cesm/inputdata/rof/rtm/initdata/rtmi.ICRUCLM45BGC.2000-01-01.R05_simyr2000_c130518.nc
 26
 Reading setup_nml
 Reading grid_nml
 Reading ice_nml
 Reading tracer_nml
CalcWorkPerBlock: Total blocks: 1279 Ice blocks: 1279 IceFree blocks: 0 Land blocks: 0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
_pmiu_daemon(SIGCHLD): [NID 03243] [c8-2c1s5n3] [Fri Mar 24 14:54:18 2017] PE RANK 603 exit signal Floating point exception
[NID 03243] 2017-03-24 14:54:18 Apid 14081198: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 03236] [c8-2c1s2n2] [Fri Mar 24 14:54:18 2017] PE RANK 751 exit signal Floating point exception
Application 14081198 exit codes: 136
Application 14081198 exit signals: Killed
Application 14081198 resources: utime ~1572s, stime ~1245s, Rss ~335152, inblocks ~19905289, outblocks ~88371084

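Exit code 136 follows the usual 128 + signal number convention: 136 - 128 = 8, which is SIGFPE, the “floating point exception” signal. Despite its name, SIGFPE covers arithmetic faults in general, so it points to some bad calculation, typically a division by zero (integer or floating point), or a floating-point overflow or invalid operation when the model is built with floating-point trapping enabled.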

I used a disgusting number of print statements like this:

if (masterproc) write(iulog,*) 'whannah - atm_comp_mct.F90 - atm_init_mct() - FIRST'

to track this down, and whether I was looking in cam_comp.F90 or atm_comp_mct.F90, the trail always seemed to end at a line with:

call shr_sys_flush(iulog)

The variable “iulog” is just an integer unit number that identifies the log file (atm.log.nnnnnn-nnnnnn) that shows up in the run directory. It took me a while to understand that this subroutine is “flushing” buffered output to the log file on disk. Before shr_sys_flush is called, that output is held in a memory buffer. This is probably done for performance reasons, because frequent small writes to disk can dramatically slow down execution.

After realizing this I finally understood that the real error is happening in between the flush statements. All the print statements that I had put in place were getting stored in the buffer, but the model crashed before this information could get flushed to the iulog file!

There might be another way to write the print statements so that they bypass the buffer and go directly to the log file, but I’m not sure about this. The workaround I used was to add “call shr_sys_flush(iulog)” after each print statement.
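One way to package that workaround is a small helper that writes a message and flushes it right away. The sketch below is a minimal standalone example, not code from CESM or ACME: the names debug_print_mod and debug_msg are made up for illustration, and it uses the standard Fortran 2003 flush statement in place of shr_sys_flush so that it compiles on its own. Inside the model, the helper body could presumably call shr_sys_flush(unit) instead.

```fortran
! Hypothetical helper module for debugging; not part of CESM or ACME.
module debug_print_mod
  implicit none
  private
  public :: debug_msg
contains
  subroutine debug_msg(unit, msg)
    integer,          intent(in) :: unit   ! log unit number, e.g. iulog
    character(len=*), intent(in) :: msg    ! text to record
    write(unit, '(a)') trim(msg)
    ! Fortran 2003 flush statement: push buffered output for this unit
    ! out of the runtime's buffer now, so a later crash cannot discard it.
    flush(unit)
  end subroutine debug_msg
end module debug_print_mod

program demo
  use, intrinsic :: iso_fortran_env, only: output_unit
  use debug_print_mod, only: debug_msg
  implicit none
  call debug_msg(output_unit, 'demo - before suspect code')
  ! ... the code under suspicion would run here ...
  call debug_msg(output_unit, 'demo - after suspect code')
end program demo
```

In the model, a call might then look like "if (masterproc) call debug_msg(iulog, 'whannah - atm_init_mct() - FIRST')". Flushing after every message costs some performance, so a helper like this is best kept to debugging sessions and removed afterwards.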

 

3 thoughts on “CESM: Apparent Errors from shr_sys_flush()”

  1. Bin Peng

    Have you finally solved this “apparent errors”? I also came across a very similar situation when I try to write the pft-level output in CESM/CLM using multiple processes.

    1. Walter Post author

      The point of this post is that there is no problem with shr_sys_flush(), it just looks that way because print statements don’t show up until shr_sys_flush() is called. So it looks like the model is stopping there, but in reality it keeps going. That’s why I labelled this an “apparent error”. There’s nothing to solve. You just have to add more shr_sys_flush() statements or look for another way to debug the code.
