1 Introduction to R Markdown

1.1 Introduction

In this chapter, we turn an R script into a fully reproducible R Markdown report using the NYPD Shooting Incident Data from NYC Open Data. We load the dataset through an API, clean and prepare the data, explore patterns, and create tables and visualizations using kable() and ggplot2. The goal is to practice building a clear, reproducible workflow that combines code, narrative text, and results in a single document.

1.2 Required Packages

First, we load the packages used in this report.

library(tidyverse)
library(lubridate)
library(stringr)
library(tidyr)
library(ggplot2)
library(dplyr)
library(knitr)

1.3 Data Ingestion via API

Next, we pull the NYPD shooting incident data directly from NYC Open Data using httr::GET() and the endpoint below.

endpoint <- "https://data.cityofnewyork.us/resource/833y-fsy8.json"

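resp <- httr::GET(endpoint, query = list("$limit" = 30000, "$order" = "occur_date DESC"))

# Stop early with an informative error if the request failed
httr::stop_for_status(resp)

shooting_data <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)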

This request returns up to 30,000 records, sorted from newest to oldest by occur_date. The dataset covers incidents from 2006-01-01 to 2024-12-31.
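The date range quoted above can be computed from the data instead of being typed by hand. The following is a minimal sketch, assuming occur_date arrives as timestamp strings in the format shown; the toy vector is illustrative and is not taken from the dataset.

```r
# Toy timestamps in the same format as occur_date (illustrative values only)
occur_date <- c("2024-12-31T00:00:00.000",
                "2015-06-15T00:00:00.000",
                "2006-01-01T00:00:00.000")

# Keep the first 10 characters (YYYY-MM-DD) and convert them to Date
dates <- as.Date(substr(occur_date, 1, 10))

# Earliest and latest dates, formatted as text for the narrative
date_range <- format(range(dates))
date_range
```

Applied to shooting_data$occur_date, the same two steps would give the first and last incident dates without the trailing time component.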

1.4 Cleaning Data

Now that the dataset is loaded, we begin the data cleaning process.

1.4.1 Removing NA rows in perp_race

# Check how many missing values each column has
colSums(is.na(shooting_data))
##                incident_key                  occur_date                  occur_time 
##                           0                           0                           0 
##                        boro           loc_of_occur_desc                    precinct 
##                           0                       25596                           0 
##           jurisdiction_code          loc_classfctn_desc               location_desc 
##                           2                       25596                       14977 
##     statistical_murder_flag              perp_age_group                    perp_sex 
##                           0                        9344                        9310 
##                   perp_race               vic_age_group                     vic_sex 
##                        9310                           0                           0 
##                    vic_race                  x_coord_cd                  y_coord_cd 
##                           0                           0                           0 
##                    latitude                   longitude        geocoded_column.type 
##                          97                          97                          97 
## geocoded_column.coordinates 
##                           0
# Check missing values specifically in perp_race
sum(is.na(shooting_data$perp_race))
## [1] 9310
# Remove rows where perp_race is missing or marked as unavailable
shooting_clean <- shooting_data %>%
  filter(
    !is.na(perp_race),
    !(perp_race %in% c("(NULL)", "UNKNOWN", "(null)"))
  )

# Confirm that missing values were removed
sum(is.na(shooting_clean$perp_race))
## [1] 0

Here, we check how many missing values are present across the dataset. We then focus on the perp_race column, which originally contains 9310 missing values. After filtering out rows with missing or unavailable values, we check the column again to confirm that the cleaning step is successful.
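To see why both conditions are needed, here is a small sketch on toy data (illustrative values only, not drawn from the dataset). It assumes the same placeholder strings used in the filter above.

```r
library(dplyr)

# Toy data: two usable values, one true NA, and two placeholder strings
toy <- data.frame(
  incident_key = 1:5,
  perp_race = c("BLACK", NA, "(null)", "UNKNOWN", "WHITE")
)

# is.na() catches true missing values; %in% catches placeholder strings
toy_clean <- toy %>%
  filter(!is.na(perp_race),
         !(perp_race %in% c("(NULL)", "UNKNOWN", "(null)")))

toy_clean$perp_race
```

Only the rows with usable values remain. Note that is.na() alone would have kept the "(null)" and "UNKNOWN" rows, because they are ordinary strings rather than missing values.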

1.4.2 Making perp_race Values Lowercase

Next, we standardize the perp_race column by converting all values to lowercase. This prevents duplicate categories that differ only by capitalization.

shooting_clean<-shooting_clean %>% mutate(
  perp_race=str_to_lower(perp_race))
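
As a quick illustration of the effect, the sketch below uses made-up labels (not values from the dataset) that differ only by capitalization.

```r
library(stringr)

# Toy labels (illustrative only) that differ only by capitalization
race <- c("BLACK", "Black", "black", "WHITE HISPANIC", "White Hispanic")

# Number of distinct labels before standardizing
n_before <- length(unique(race))

# Number of distinct labels after converting to lowercase
race_lower <- str_to_lower(race)
n_after <- length(unique(race_lower))

c(before = n_before, after = n_after)
```

Five apparent categories collapse into two once the case is standardized.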

1.4.3 Creating a time_of_day Column

After some initial cleaning and standardizing, we now create a new column called time_of_day that groups each incident into broader time categories.

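# Split occur_time into Hour, Minute, and Second
# Note: this pipes from shooting_data, so the perp_race filtering and
# lowercasing above are not carried over; the outputs below reflect all rows.
# Pipe from shooting_clean instead to keep those cleaning steps.
shooting_clean <- shooting_data %>% separate(
  col = occur_time,
  into = c("Hour", "Minute", "Second"),
  sep = ":"
)

# Create a time_of_day category based on the Hour value
# (Hour is a character column after separate(), so convert it before comparing)
shooting_clean <- shooting_clean %>% mutate(
  time_of_day = case_when(
    as.integer(Hour) < 12 ~ "Morning",
    as.integer(Hour) < 18 ~ "Afternoon",
    as.integer(Hour) >= 18 ~ "Night"
  ))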

To create the time_of_day column, we first split the occur_time variable into separate Hour, Minute, and Second columns. Then we group the Hour values into three categories: Morning (hours 0 to 11), Afternoon (hours 12 to 17), and Night (hours 18 to 23). The number of shootings in each group is: Morning 12222, Afternoon 5439, Night 12083.
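
One detail worth checking is the type of Hour. separate() returns character columns by default, and comparing a character value with a number falls back to string comparison, which only gives the intended result while hours are zero-padded. The sketch below, on toy times (illustrative only), converts Hour to an integer first.

```r
library(dplyr)
library(tidyr)

# Toy times covering each category boundary (illustrative values only)
toy <- data.frame(occur_time = c("00:15:00", "09:30:00", "12:00:00",
                                 "17:59:00", "18:00:00", "23:45:00"))

toy <- toy %>%
  separate(col = occur_time, into = c("Hour", "Minute", "Second"), sep = ":") %>%
  mutate(
    # Convert the character hour to an integer before comparing
    Hour = as.integer(Hour),
    time_of_day = case_when(
      Hour < 12 ~ "Morning",
      Hour < 18 ~ "Afternoon",
      Hour >= 18 ~ "Night"
    )
  )

toy$time_of_day
```

Writing the last condition as Hour >= 18 rather than a catch-all keeps rows with a missing hour as NA instead of labeling them Night.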

1.5 Insights

1.5.1 Time of Day

Next, we summarize how often shootings occur during each time of day.

# View the names of the columns in the NYPD Shooting dataset
colnames(shooting_clean)
##  [1] "incident_key"                "occur_date"                 
##  [3] "Hour"                        "Minute"                     
##  [5] "Second"                      "boro"                       
##  [7] "loc_of_occur_desc"           "precinct"                   
##  [9] "jurisdiction_code"           "loc_classfctn_desc"         
## [11] "location_desc"               "statistical_murder_flag"    
## [13] "perp_age_group"              "perp_sex"                   
## [15] "perp_race"                   "vic_age_group"              
## [17] "vic_sex"                     "vic_race"                   
## [19] "x_coord_cd"                  "y_coord_cd"                 
## [21] "latitude"                    "longitude"                  
## [23] "geocoded_column.type"        "geocoded_column.coordinates"
## [25] "time_of_day"
# Count shootings by time of day in descending order
shooting_clean %>% count(time_of_day)%>% arrange(desc(n))
##   time_of_day     n
## 1     Morning 12222
## 2       Night 12083
## 3   Afternoon  5439
# Count shootings by time of day and borough in descending order
shooting_clean %>% count(time_of_day,boro) %>% arrange(desc(n))
##    time_of_day          boro    n
## 1        Night      BROOKLYN 4793
## 2      Morning      BROOKLYN 4554
## 3        Night         BRONX 3761
## 4      Morning         BRONX 3517
## 5    Afternoon      BROOKLYN 2338
## 6      Morning        QUEENS 2072
## 7      Morning     MANHATTAN 1709
## 8        Night     MANHATTAN 1648
## 9        Night        QUEENS 1575
## 10   Afternoon         BRONX 1556
## 11   Afternoon        QUEENS  779
## 12   Afternoon     MANHATTAN  620
## 13     Morning STATEN ISLAND  370
## 14       Night STATEN ISLAND  306
## 15   Afternoon STATEN ISLAND  146
# Create a summary table with counts and percentages for each time of day category
time_summary <- shooting_clean %>%
  filter(!is.na(time_of_day)) %>%
  count(time_of_day, name = "n") %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))
time_summary
##   time_of_day     n  pct
## 1     Morning 12222 41.1
## 2       Night 12083 40.6
## 3   Afternoon  5439 18.3

We count incidents in the Morning, Afternoon, and Night categories and arrange them from highest to lowest. The largest share of incidents occurs during Morning (12222 cases; 41.1%).
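
A reported figure like this one can be assembled from the summary table so that it updates whenever the data change. The sketch below rebuilds time_summary from the counts printed above; top_time and top_sentence are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Counts copied from the time_summary output above
time_summary <- data.frame(
  time_of_day = c("Morning", "Night", "Afternoon"),
  n = c(12222L, 12083L, 5439L)
) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))

# Take the first row (largest count) and format it as text
top_time <- time_summary %>% slice(1)
top_sentence <- sprintf("%s (%d cases; %.1f%%)",
                        top_time$time_of_day, top_time$n, top_time$pct)
top_sentence
```

In an R Markdown document, a string like top_sentence can be placed in the narrative with inline R code instead of being typed manually.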

1.5.2 Sex of Perpetrator

We also summarize the distribution of perpetrator sex.

# View the names of the columns in the NYPD Shooting dataset 
colnames(shooting_clean)
##  [1] "incident_key"                "occur_date"                 
##  [3] "Hour"                        "Minute"                     
##  [5] "Second"                      "boro"                       
##  [7] "loc_of_occur_desc"           "precinct"                   
##  [9] "jurisdiction_code"           "loc_classfctn_desc"         
## [11] "location_desc"               "statistical_murder_flag"    
## [13] "perp_age_group"              "perp_sex"                   
## [15] "perp_race"                   "vic_age_group"              
## [17] "vic_sex"                     "vic_race"                   
## [19] "x_coord_cd"                  "y_coord_cd"                 
## [21] "latitude"                    "longitude"                  
## [23] "geocoded_column.type"        "geocoded_column.coordinates"
## [25] "time_of_day"
# Remove missing and unavailable perp_sex values
shooting_clean_sex <- shooting_clean %>%
  filter(!is.na(perp_sex),
         !(perp_sex %in% c("U","(null)")))

# Count shootings by perpetrator sex and borough in descending order
shooting_clean_sex %>% count(perp_sex,boro)%>% arrange(desc(n))
##    perp_sex          boro    n
## 1         M      BROOKLYN 5971
## 2         M         BRONX 5279
## 3         M        QUEENS 2502
## 4         M     MANHATTAN 2484
## 5         M STATEN ISLAND  609
## 6         F      BROOKLYN  146
## 7         F         BRONX  134
## 8         F     MANHATTAN   87
## 9         F        QUEENS   79
## 10        F STATEN ISLAND   15
# Count male perpetrators by borough (after removing missing boroughs)
male_by_boro <- shooting_clean_sex %>%
  filter(perp_sex == "M", !is.na(boro)) %>%
  count(boro, name = "n") %>%
  arrange(desc(n)) %>%
  mutate(boro = str_to_title(boro))
male_by_boro
##            boro    n
## 1      Brooklyn 5971
## 2         Bronx 5279
## 3        Queens 2502
## 4     Manhattan 2484
## 5 Staten Island  609

We clean the perp_sex variable by removing missing and unavailable values. Then, we count how many shootings involved each sex in each borough. After that, we focus on male perpetrators and summarize the number of incidents with a male perpetrator by borough. The borough with the highest number of these incidents is Brooklyn (5971 cases; 35.4% of incidents with a male perpetrator).
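
The percentage quoted above is not produced by the code shown, so here is one way it could be derived. This is a sketch that rebuilds male_by_boro from the counts printed above and adds a hypothetical pct column.

```r
library(dplyr)

# Counts copied from the male_by_boro output above
male_by_boro <- data.frame(
  boro = c("Brooklyn", "Bronx", "Queens", "Manhattan", "Staten Island"),
  n = c(5971L, 5279L, 2502L, 2484L, 609L)
) %>%
  # Share of all incidents with a male perpetrator, in percent
  mutate(pct = round(100 * n / sum(n), 1))

# Borough with the largest count
top_boro <- male_by_boro %>% arrange(desc(n)) %>% slice(1)
top_boro
```

The denominator here is the total across the five boroughs after removing missing and unavailable perp_sex values, which is the population the percentage describes.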

1.6 Tables & Graphs

1.6.1 Table (kable)

Now that we have an overview of our data, we create a table to neatly display a portion of the cleaned dataset.

# Filter out missing or unavailable perpetrator sex values
shooting_top <- shooting_clean %>%
  filter(!is.na(perp_sex), !(perp_sex %in% c("U", "(null)"))) %>%
  mutate(
    # Convert occur_date to a Date format
    occur_date = as.Date(str_remove(occur_date, "T.*")),
    # Recode perpetrator sex labels for readability
    perp_sex = case_when(
      perp_sex == "M" ~ "Male",
      perp_sex == "F" ~ "Female",
      TRUE ~ perp_sex
    )
  ) %>%
  # Select key columns using base R indexing and keep the first 10 rows
  .[, c("occur_date", "boro", "time_of_day", "perp_sex", "perp_race")] %>%
  dplyr::slice_head(n = 10)

# Display the cleaned preview table
shooting_top
##    occur_date      boro time_of_day perp_sex      perp_race
## 1  2024-12-31  BROOKLYN       Night     Male          BLACK
## 2  2024-12-31  BROOKLYN       Night     Male          BLACK
## 3  2024-12-30     BRONX   Afternoon     Male          BLACK
## 4  2024-12-30  BROOKLYN       Night     Male          BLACK
## 5  2024-12-30     BRONX       Night     Male          BLACK
## 6  2024-12-29     BRONX   Afternoon     Male          BLACK
## 7  2024-12-28 MANHATTAN       Night     Male          BLACK
## 8  2024-12-28 MANHATTAN       Night   Female          BLACK
## 9  2024-12-27     BRONX       Night     Male BLACK HISPANIC
## 10 2024-12-27     BRONX       Night     Male BLACK HISPANIC
# Identify the most common perpetrator sex in this 10-row preview
top_sex <- shooting_top %>% count(perp_sex, sort = TRUE) %>% slice(1)

# Create a kable table
kable(shooting_top,
      caption = "Preview of 10 cleaned NYPD shooting records showing the date, borough, time of day, perpetrator sex, and perpetrator race. This table provides a quick view of the variables used in the analysis after cleaning.") 
Table 1.1: Preview of 10 cleaned NYPD shooting records showing the date, borough, time of day, perpetrator sex, and perpetrator race. This table provides a quick view of the variables used in the analysis after cleaning.
occur_date  boro       time_of_day  perp_sex  perp_race
2024-12-31  BROOKLYN   Night        Male      BLACK
2024-12-31  BROOKLYN   Night        Male      BLACK
2024-12-30  BRONX      Afternoon    Male      BLACK
2024-12-30  BROOKLYN   Night        Male      BLACK
2024-12-30  BRONX      Night        Male      BLACK
2024-12-29  BRONX      Afternoon    Male      BLACK
2024-12-28  MANHATTAN  Night        Male      BLACK
2024-12-28  MANHATTAN  Night        Female    BLACK
2024-12-27  BRONX      Night        Male      BLACK HISPANIC
2024-12-27  BRONX      Night        Male      BLACK HISPANIC

We remove rows with missing or unavailable perpetrator sex values, convert occur_date to a date format without the time stamp, and recode sex labels to ‘Male’ and ‘Female’ for readability. We then select key columns and display the first 10 rows of the cleaned dataset. The most common perpetrator sex in this subset is Male.
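
The date conversion and recoding steps can be tried in isolation on a couple of rows. The sketch below uses toy values (illustrative only); toy_table and toy_kable are hypothetical names, and the caption is a placeholder.

```r
library(dplyr)
library(stringr)
library(knitr)

# Toy rows in the same raw format as the API output (illustrative only)
toy <- data.frame(
  occur_date = c("2024-12-31T00:00:00.000", "2024-12-28T00:00:00.000"),
  perp_sex = c("M", "F")
)

toy_table <- toy %>%
  mutate(
    # Drop everything from the "T" onward, then convert to Date
    occur_date = as.Date(str_remove(occur_date, "T.*")),
    # Recode the single-letter codes; leave any other value unchanged
    perp_sex = case_when(
      perp_sex == "M" ~ "Male",
      perp_sex == "F" ~ "Female",
      TRUE ~ perp_sex
    )
  )

# kable() returns a formatted table object that knitr renders in the report
toy_kable <- kable(toy_table, caption = "Toy preview table (illustrative values only).")
toy_kable
```

The TRUE ~ perp_sex branch is a fallback that passes through any code other than "M" or "F", so an unexpected value stays visible instead of silently becoming NA.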

1.6.2 Graphs (ggplot2)

1.6.2.1 Time of Day Plot

To better understand patterns in the data, we visualize shooting counts by time of day using a bar chart.

shooting_time<- shooting_clean %>% 
  group_by(time_of_day,boro) %>% 
  summarize(total=n())

ggplot(shooting_time, aes(x = time_of_day, y = total, fill = time_of_day)) +
  geom_col() +
  labs(title = "Time of Shootings in NYC",
       x = "Time of Day", y = "Number of Shootings",fill="Time of Day") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(size = 17, family = "Georgia", face = "bold"),
        axis.title.x = element_text(size = 12, family = "Georgia"),
        axis.title.y = element_text(size = 12, family = "Georgia"))
Figure: Bar chart showing the total number of NYPD shooting incidents by time of day (Morning, Afternoon, Night). This figure helps show how shootings are distributed across the day.

Interpretation: We group shootings by time of day and borough, count the number of incidents, and create a bar chart showing total shootings by time of day. The fewest shooting incidents occur during the Afternoon.
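
One optional refinement, not part of the original report: ggplot2 orders a character variable alphabetically on a discrete axis (Afternoon, Morning, Night). Converting time_of_day to a factor with explicit levels puts the bars in chronological order. The sketch below assumes the counts from the time_summary output earlier in the chapter; time_counts and p are hypothetical names.

```r
library(ggplot2)

# Counts copied from the time_summary output earlier in the chapter
time_counts <- data.frame(
  time_of_day = c("Morning", "Night", "Afternoon"),
  total = c(12222, 12083, 5439)
)

# Set the factor levels so the x-axis follows the order of the day
time_counts$time_of_day <- factor(time_counts$time_of_day,
                                  levels = c("Morning", "Afternoon", "Night"))

p <- ggplot(time_counts, aes(x = time_of_day, y = total, fill = time_of_day)) +
  geom_col() +
  labs(x = "Time of Day", y = "Number of Shootings", fill = "Time of Day") +
  theme_minimal(base_size = 12)
p
```

Because this version plots one row per category, each bar is a single rectangle rather than a stack of per-borough segments.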

1.6.2.2 Sex of Perpetrator Plot

Next, we visualize the number of shooting incidents by perpetrator sex across boroughs using a faceted bar chart.

shooting_clean_perp_sex<- shooting_clean_sex %>% 
  group_by(perp_sex,boro) %>% 
  summarize(total=n())

shooting_clean_perp_sex <- shooting_clean_perp_sex %>%
  mutate(
    perp_sex = factor(perp_sex, levels = c("F","M"),
                      labels = c("Female","Male")))

ggplot(shooting_clean_perp_sex, aes(x = perp_sex, y = total, fill = perp_sex)) +
  geom_col() +
  facet_wrap(~ boro) +
  labs(
    title = "Shootings by Sex of Perpetrator (Faceted by Borough)",
    x = "Perpetrator Sex", y = "Number of Shootings", fill = "Perpetrator Sex"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title   = element_text(size = 17, family = "sans", face = "bold"),
    axis.title.x = element_text(size = 12, family = "sans"),
    axis.title.y = element_text(size = 12, family = "sans")
  )
Figure: Bar charts showing the number of NYPD shooting incidents by perpetrator sex, separated by borough. This figure allows for comparison of shooting counts by sex across different boroughs.

Interpretation: We group the data by perpetrator sex and borough, count incidents, and recode sex labels to ‘Female’ and ‘Male.’ We then plot a faceted bar chart showing shootings by perpetrator sex for each borough. Among records with a known perpetrator sex, the borough with the fewest shootings is Staten Island (624 incidents).
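
The borough total quoted above comes from adding the male and female counts for each borough. A sketch of that step, using the counts printed in the perp_sex-by-borough output earlier in the chapter (sex_by_boro, boro_totals, and fewest are hypothetical names):

```r
library(dplyr)

# Counts copied from the perp_sex-by-borough output earlier in the chapter
sex_by_boro <- data.frame(
  perp_sex = rep(c("M", "F"), each = 5),
  boro = c("BROOKLYN", "BRONX", "QUEENS", "MANHATTAN", "STATEN ISLAND",
           "BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
  total = c(5971L, 5279L, 2502L, 2484L, 609L, 146L, 134L, 87L, 79L, 15L)
)

# Add male and female counts within each borough, smallest total first
boro_totals <- sex_by_boro %>%
  group_by(boro) %>%
  summarize(total = sum(total), .groups = "drop") %>%
  arrange(total)

# Borough with the fewest incidents
fewest <- boro_totals %>% slice(1)
fewest
```

Setting .groups = "drop" returns an ungrouped result, so the later arrange() and slice() calls operate on the whole table.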

1.7 Reflection

Learning how to create an R Markdown document will be very helpful when I begin working with my thesis dataset. It allows me to keep my code and explanations organized in a clear, step-by-step workflow, making it easy to see how each part of the analysis was carried out. When I return to the project later, the document serves as a built-in guide that helps me understand my previous decisions and continue the work without confusion. This structure also supports reproducibility and makes it easy to share my workflow with others.