Thoughts on File Organization for Research

This is probably one of the driest topics I could write about, but I’ve changed my system a lot over the last year and the new setup has some notable advantages. I wish someone had told me how cluttered my files were going to get when I started in science! So hopefully this post will get other people thinking more “long term” about their file organization.

My main motivation was that I wanted to start using git to help manage my code across several machines. Some are small laptops with only one hard drive, while others are large servers that mount a RAID array for data storage. I wanted the files to be organized in a consistent way across all these machines to make the code syncing more effective.

The biggest change in my reorganization was to collect all of my code under a ~/Research/ directory and always keep it separate from my data. This is where all of my analysis code and the resulting figure files live, and I use individual git repositories to sync them across machines.
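
For example, syncing one of these repositories between machines looks something like this (the hosting URL is a placeholder; any git server will do):

    # clone a code repository onto a new machine
    cd ~/Research/
    git clone git@example.com:me/code_climate.git

    # after editing on any machine, push the changes...
    cd ~/Research/code_climate/
    git add -A
    git commit -m "update climate analysis"
    git push

    # ...then pull them down on the other machines
    git pull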

I create several folders for code, such as code_climate/, code_composite/, code_wave/, and then have corresponding folders for the figure output, such as figs_climate/, figs_composite/, figs_wave/. This makes it much easier to find all the figures of a particular type, even though they have different file names. I suppose a better way to do this might be to have one “figures” folder and then use some sort of “tagging” system, like in Gmail, to associate figures with a set of categories. I’m not sure how to do this, but it would be cool if it worked.
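
I haven’t built anything like that, but one way to fake it with plain filesystem tools would be symbolic links: keep every figure in a single folder, and make each “tag” a directory of links into it (figs_all/, tags/, and the figure name are hypothetical, just to sketch the idea):

    # all figures live in one place
    mkdir -p ~/Research/figs_all/

    # each "tag" is just a directory of symlinks
    mkdir -p ~/Research/tags/climate/ ~/Research/tags/wave/

    # one figure can then appear under several tags at once
    ln -s ~/Research/figs_all/jet_trend.v1.png ~/Research/tags/climate/
    ln -s ~/Research/figs_all/jet_trend.v1.png ~/Research/tags/wave/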

I also usually have a code_data/ directory to keep any scripts that process the data. Some examples are:

  • mk.var.fv.daily.v1.ncl  (postprocess a given variable (var) from finite volume (fv) grid data)
  • mk.CAPE.v1.ncl  (calculate convective available potential energy)
  • rm.leap.v1.ncl  (remove leap days from data)

Even though I use git for version control, I often end up with more than one way to approach the same problem. So I ALWAYS add “v1”, “v2”, etc. to my file names to indicate the version of the code.

To deal with the different data directories I use symbolic links in my home directory, so that I have a generic path to all of my data (~/Data/). This way, all the analysis code in my ~/Research/ directory can always just point to sub-directories under ~/Data/, no matter which machine it runs on.
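
Setting this up is a single ln -s per machine, and only the link target changes (the storage paths here are made up for illustration):

    # on a small laptop with one hard drive
    ln -s ~/storage/data ~/Data

    # on a server that mounts a RAID array
    ln -s /raid/username/data ~/Data

    # either way, analysis code sees the same generic path
    ls ~/Data/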

I also keep a separate directory for running models under ~/Model/. Within these run directories I can link back to the ~/Data/ directory for “scratch” space when writing out model output.
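
A run directory might look something like this (the run name and scratch path are hypothetical):

    mkdir -p ~/Model/run01/
    # model output actually lands on the big data disk
    ln -s ~/Data/scratch/run01 ~/Model/run01/output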

I’ve drawn up a quick schematic of what this all looks like below (the storage paths and run names are just placeholders).
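
    ~/
    ├── Research/                      # git repos, synced across machines
    │   ├── code_climate/
    │   ├── figs_climate/
    │   ├── code_composite/
    │   ├── figs_composite/
    │   ├── code_wave/
    │   ├── figs_wave/
    │   └── code_data/
    ├── Data -> /raid/username/data    # symlink; target differs per machine
    └── Model/
        └── run01/
            └── output -> ~/Data/scratch/run01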