Readable R code

In my previous post, I talked about managing projects with GitHub and how crucial it is to make your work reproducible. Just as we polish our writing to present ideas clearly, it is equally important to make our code readable for anyone trying to understand the work.

Here are a few points on writing clearer code, mostly from my own experience. You can read more about good and bad R code style in Google’s R Style Guide.

Start writing and keep refining your code

When you are just starting to learn R, don’t worry about code style; just start writing. Before long, you will find yourself doing repetitive work that can be replaced by a simple loop. Basically, if you can do it with copy and paste, you can use a loop! There are lots of guides online on how to write loops, but the easiest way is to start with a small n, check that the loop works, and then scale it up gradually.
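As a small illustration of the copy-and-paste rule, the sketch below computes three column means twice: first by repeating a line, then with a loop. It uses the built-in mtcars data set, and the variable names are only illustrative.

```r
# Copy-and-paste version: the same call repeated for each column
mean_mpg <- mean(mtcars$mpg)
mean_hp  <- mean(mtcars$hp)
mean_wt  <- mean(mtcars$wt)

# Loop version: list the columns once, then repeat the step for each
columns <- c("mpg", "hp", "wt")
column_means <- numeric(length(columns))
names(column_means) <- columns

for (column in columns) {
  column_means[column] <- mean(mtcars[[column]])
}

print(column_means)
```

To follow the "small n" advice, you could first run the loop over `columns[1:2]`, confirm the result, and only then run it over the full vector.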

As you develop your project further, you will find that you use some code chunks repeatedly with slight variations. That is a good time to start writing your own functions, which can be reused across analyses and projects.

Separate functions from scripts

  • Functions serve specific purposes and can be reused elsewhere

Below is a simple function that I wrote for generating long date and yearmon formatted dates from a data frame with variables “year” and “month”.
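The original function is not reproduced here. As a rough stand-in, a helper of this kind might look like the sketch below. The function name `add_dates` and the output column names `date` and `yearmon` are assumptions for illustration, and the year-month value is a plain character label rather than a true yearmon object.

```r
# Hypothetical helper: add a Date column and a year-month label
# to a data frame that has numeric "year" and "month" columns.
add_dates <- function(data) {
  stopifnot(all(c("year", "month") %in% names(data)))

  # Build an ISO date string for the first day of each month
  date_string <- sprintf("%04d-%02d-01",
                         as.integer(data$year),
                         as.integer(data$month))
  data$date <- as.Date(date_string)

  # Character year-month label; with the zoo package installed,
  # zoo::as.yearmon(data$date) would give a true yearmon object instead
  data$yearmon <- format(data$date, "%Y-%m")

  data
}

# Usage on a small made-up data frame
example <- data.frame(year = c(2020, 2021), month = c(1, 12))
example_with_dates <- add_dates(example)
print(example_with_dates)
```

Because the helper only assumes the two input columns, it can be sourced into any script that works with monthly data.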


  • Scripts are used to generate graphs and tables and save them as intermediate products

Scripts are longer and make use of functions.


  • Make the body of the scripts more readable

Keep scripts to a minimum

To avoid the hassle of writing a step-by-step readme document for your readers, keep the number of scripts to a minimum. For example: (1) a cleaning script, (2) an exploratory data analysis script, (3) a modeling script, and (4) a script that generates all the formatted tables and graphs.
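One possible way to make the running order obvious without a readme is to number the files and call them from a short driver script. The sketch below is only an illustration: the file names are hypothetical, and it skips any script that is not found so that it runs even in an empty folder.

```r
# Hypothetical script names, numbered in the order they should run
scripts <- c(
  "01_clean.R",
  "02_explore.R",
  "03_models.R",
  "04_tables_figures.R"
)

for (script in scripts) {
  if (file.exists(script)) {
    message("Running ", script)
    source(script)
  } else {
    # Nothing to run yet; report it instead of stopping with an error
    message("Skipping missing script: ", script)
  }
}
```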

Code comments

  • Comment at the top about the purpose of the code, input, and output

  • Load packages and source functions at the start

  • Comment on each step

  • Use meaningful variable or function names
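Put together, the points above might look like the short script below. This is a sketch rather than a template to copy exactly: it uses the built-in mtcars data set so that it runs as is, and the commented-out `source()` path is hypothetical.

```r
# Purpose: summarise fuel economy by number of cylinders (illustrative)
# Input:   mtcars, a data set that ships with R
# Output:  mpg_by_cyl, a data frame with one row per cylinder count

# Packages and helper functions are loaded first
library(stats)  # attached by default; shown only to mark where library() calls go
# source("R/functions.R")  # hypothetical path to your own helper functions

# Step 1: keep only the columns needed for the summary
fuel_data <- mtcars[, c("cyl", "mpg")]

# Step 2: compute the median mpg for each cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = fuel_data, FUN = median)

# Step 3: sort from highest to lowest median mpg
mpg_by_cyl <- mpg_by_cyl[order(-mpg_by_cyl$mpg), ]

print(mpg_by_cyl)
```

Names such as `fuel_data` and `mpg_by_cyl` say what the object holds, which saves the reader from tracing back through the script to find out.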

Use dplyr and piping

dplyr is my go-to package for managing data frames. Its underlying C++ code makes everyday data processing fast.

What’s more, the “piping” grammar provided by the dplyr package makes a series of processing steps on a data frame much more concise. After the pipe operator “%>%”, you no longer need to repeat the name of the data frame in each step.

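Example: list the median size of each type, keeping only types whose median is at least 3, in decreasing order

library(dplyr)

# data_frame is assumed to have the columns `type` and `size`
data_frame %>%
  group_by(type) %>%
  summarise(median_size = median(size, na.rm = TRUE)) %>%  # median size per type
  filter(median_size >= 3) %>%                             # keep medians of at least 3
  arrange(desc(median_size)) %>%                           # largest first
  select(type, median_size)
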
Yujun Zhou
Data Scientist

Currently works at Facebook Applied AI Research as a Data Scientist. Trained in Applied Economics and Machine Learning.