class: inverse, center, title-slide, middle # QSURE R Training 2022 ### Karissa Whiting & Mike Curry <p align="center"><img src="images/core_hex_stickers.png"width=20%></p> --- class: inverse, center, title-slide, middle # `Lesson 1` ## June 7th, 2022 <p align="center"><img src="images/core_hex_stickers.png"width=20%></p> --- # Workshop Goal .xxlarge[ Fill potential gaps in your R knowledge and help you get properly set up to conduct impactful and reproducible research during your time at MSK (and after!) ] ??? --- # Survey Results (> 75% Proficient in R! ) .pull-left[ <p align="center"><img src="images/r-basics.png" width=90%></p> <p align="center"><img src="images/what-to-learn.png" width=100%></p> ] .pull-right[ <p align="center"><img src="images/prefer-to-learn.png"width=100%></p> **Based on your feedback we will:** - Short review on basic R vocab (today) - Skip dplyr basics but include some advanced dplyr/data cleaning (today) - Focus on project setup (today) - Focus on coding statistical analyses (next session) - Optional 3rd session on R package making/Github ] --- # Training Agenda - **Lesson 1** – 6/7/2022 - R Basics (Quick Review) - Project Setup & Reproducibility - Guided Example - Data Cleaning - Exploring your Data - **Lesson 2** - 6/9/2022 - Guided Example (Continuation) - Analyzing/Modeling the Data - Reporting Your Results - **Lesson 3** - TBD - Github - Intro to Package Development --- class: inverse, center, title-slide, middle # R Basics --- # R, Rstudio, Open source philosophy - **R** is an object-oriented **open-source** programming language used mostly for statistical computing and graphics - **Open source** means the original source code is freely available, and can be distributed and modified. Also, users can contribute to the usefulness by creating packages, which implement specialized functions for all kinds of uses (e.g. statistical methods, graphical capabilities, reporting tools). Added Bonus: vibrant R community! - **RStudio** is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting code editor that supports direct code execution, as well as tools for plotting, history, debugging and work space management. --- <center> <img src="images/r-cars.jpg" width="100%"/> </center> --- # R Basics: General Things - `<-` is the assignment operator (`=` also works) ```r v1 <- c(1, 2, 3) v1 ``` ``` ## [1] 1 2 3 ``` - The `%>%` (pipe) from the magrittr package is a useful way to link functions together to make your code more succinct and easier to read ```r library(dplyr) mtcars %>% select(mpg) %>% filter(mpg == max(mpg)) ``` - `?` is your friend if you want to look at documentation! (e.g. type `?mean()` in the console) - R is case sensitive, bE cArEfUl! --- # R Basics: Data Structures and Basic Syntax R basic data types: * Logical (`TRUE`) * integer (`1`) * numeric (a.k.a. double) (`1.2`) * character (`"Purple"`) * factor ("a") * complex (nobody ever uses these really) --- # R Basics: Beware Data Type Coercion - Since columns of a data.frame must be of the same type, some data may be coerced in unexpected ways when reading in a csv or excel file. - character is often the default for mixed data types ```r x <- c("apple", 3) str(x) ``` ``` ## chr [1:2] "apple" "3" ``` ```r y <- c(3, 2, "twenty") y ``` ``` ## [1] "3" "2" "twenty" ``` ```r #sum(y) ``` --- # R Basics: Data Structures and Basic Syntax R has 5 basic data structures: .pull-left[ 1. vector 2. matrix 3. list 4. array 5. data.frame/tibble ] --- # R Basics: Data Structures and Basic Syntax .pull-left[ .bold[1. vector] ] .pull-right[ - only 1 data type allowed ```r # character c("apple", "orange") ``` ``` ## [1] "apple" "orange" ``` ```r # numeric c(1:15) ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ``` ] --- # R Basics: Data Structures and Basic Syntax .pull-left[ 1. vector 2. **matrix** ] .pull-right[ 2d, only 1 data type allowed ```r letters <- c("a","b","c","d", "e", "f") matrix(letters, nrow=2, ncol=3) ``` ``` ## [,1] [,2] [,3] ## [1,] "a" "c" "e" ## [2,] "b" "d" "f" ``` ] --- # R Basics: Data Structures and Basic Syntax .pull-left[ 1. vector 2. matrix 3. **list** ] .pull-right[ - any data type allowed ```r my_list <- list("a", 2, TRUE) str(my_list) ``` ] --- # R Basics: Data Structures and Basic Syntax .pull-left[ 1. vector 2. matrix 3. list 4. **array** ] .pull-right[ - n-dimensions, of 1 data type ```r # Create two vectors of different lengths. vector1 <- c(5,9,3) vector2 <- c(10,11,12,13,14,15) array(c(vector1,vector2),dim = c(3,3,2)) ``` ``` ## , , 1 ## ## [,1] [,2] [,3] ## [1,] 5 10 13 ## [2,] 9 11 14 ## [3,] 3 12 15 ## ## , , 2 ## ## [,1] [,2] [,3] ## [1,] 5 10 13 ## [2,] 9 11 14 ## [3,] 3 12 15 ``` ] --- # R Basics: Data Structures and Basic Syntax .pull-left[ 1. vector 2. matrix 3. list 4. array 5. **data.frame/tibble** ] .pull-right[ - any data type is allowed, but each column has to have the same type - the most important for data analysts. Most similar to an excel spreadsheet/statistical data file ```r head(iris, 4) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ``` ] --- # Exploring Your Data: Identify Data Types - `colnames()` - will give you the column names - `ncol()` and `nrow()` - will give you the total count of columns and rows respectively - `class()`, `str()`, `attributes()` will give you meta-information on the object - `head()`, `tail()` show the top or bottom rows of your df - `View()` will show the whole dataframe - `table()` will summarise variables ```r str(iris) ``` ``` ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... ``` ```r nrow(iris) ``` ``` ## [1] 150 ``` --- # Exploring Your Data: Identify Data Types cont. ```r colnames(iris) ``` ``` ## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" ``` ```r class(iris) ``` ``` ## [1] "data.frame" ``` ```r head(iris[,1:3], 3) ``` ``` ## Sepal.Length Sepal.Width Petal.Length ## 1 5.1 3.5 1.4 ## 2 4.9 3.0 1.4 ## 3 4.7 3.2 1.3 ``` ```r table(iris$Species) ``` ``` ## ## setosa versicolor virginica ## 50 50 50 ``` --- # Intro to tidyverse The tidyverse package is a collection of R packages designed for data analysis, all of which share a similar design, grammar, and structure. ```r # load it library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ``` ``` ## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ## ✔ tibble 3.1.6 ✔ dplyr 1.0.8 ## ✔ tidyr 1.2.0 ✔ stringr 1.4.0 ## ✔ readr 2.1.2 ✔ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ``` ```r # check out the cute logo tidyverse_logo() ``` ``` ## ⬢ __ _ __ . ⬡ ⬢ . ## / /_(_)__/ /_ ___ _____ _______ ___ ## / __/ / _ / // / |/ / -_) __(_-</ -_) ## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ ## ⬢ . /___/ ⬡ . ⬢ ``` --- # Intro to tidyverse - readr: data import/export - tibble: re-imagined data frames - dplyr: data manipulation - tidyr: data manipulation - ggplot2: graphics and visualization - purrr: functional programming toolkit, replaces the need for many loops - stringr: string manipulation - forcats: re-imagined factor data types There are several additional packages which are installed as part of the tidyverse, but are not loaded by default. --- # Intro to tidyverse Overall tidyverse helps with code readability and has shortcuts for some common data manipulation tasks tidyverse has been developed and significantly improved in the last few years, with a lot of ongoing work being done to further increase usability. --- # tidyverse: data cleaning with dplyr The dplyr package is a data manipulation and cleaning package. A few of the key functions (verbs) in dplyr are: - select() - mutate() - filter() - arrange() - group_by() - summarize() All take a data frame as input, and return a data frame as output. **We will briefly review during case study** --- # The R Analysis Workflow - Setup Your Project - Clean and Explore Data - `tidyverse` - `lubridate` - Analyze it - `stats` - `survival` - `lme4`/ `nlme` - `ggplot2` - Report Your Findings - R Markdown / `knitr` - `gt` / `gtsummary` - `broom` - Iterate, Share, and Collaborate! - `git`/ `github` - `devtools` & `usethis` --- class: inverse, center, title-slide, middle # Project Setup ### How to organize your project to maximize reproducibility --- # Reproducibility -- .xlarge[ *Why is a Reproducible R Workflow Important?* ] -- - Allows you to show evidence of your results - Encourages transparency about decisions made during analysis - Enables others to check and use/extend your methods and results - Enables FUTURE YOU to check and use/extend your methods and results --- # Reproducibility .xlarge[ *Why is a Reproducible R Workflow Important?* ] - Allows you to show evidence of your results - Encourages transparency about decisions made during analysis - Enables others to check and use/extend your methods and results - Enables FUTURE YOU to check and use/extend your methods and results .center[ .medium[ ***"You mostly collaborate with yourself, and me-from-two-months-ago never responds to email"*** ] ] .pull-right[ .medium[ ~ *Dr. Mark Holder, Computational Biologist* ] ] --- # How do we ensure our R code is reproducible? - Have a clear project structure and defined workflow - **Comment and document your code** - Use reproducible reporting practice (e.g. Rmd inline text) - Avoid absolute file paths (e.g. `~/Users/Whiting/Projects...`) - Version control (document changes you make, or use git!) --- class: inverse, center, title-slide, middle # Coding Exercise ## Case Study: Diabetes Risk Factors --- # Case Study - Modeling Diabetes Risk Factors - Analyze risk factors for diabetes - **Research Question:** Is waist to hip ratio a risk factor for diabetes? - Outcome: glycosolated hemoglobin level (> 7 is considered diabetic) - Available Variables of interest: - cholesterol level - stabilized glucose - location of individual - age - gender - height/weight - body frame - waist in inches - hip in inches ```r library(faraway) ?diabetes ``` --- # Project Setup: Anatomy of a Project .pull-left[ <p align="center"><img src="images/project-folders.png"width=100%></p> ] .pull-right[ - keep raw and processed data separate (`raw-data`, vs. `data`) - folder for `scripts` ordered or labelled descriptively - optionally can have `admin` for project notes, etc and `outputs` for final reports and figures - **README** - text file that introduces and explains a project (`usethis::use_readme_md()`) - **R Project (.Rproj file) ** - tells RStudio all your files belong to one project and sets working directory for entire project. ] --- # Project Setup 1. Create a new folder on your computer and name it "your-initials-case-study-2022" 2. Create subfolders within your project folder called: `admin`, `raw-data`, `scripts`, `data`, `outputs` 3. Create a new R Project in Rstudio from this folder (**File > New Project > Existing Directory**) 4. Go to https://github.com/karissawhiting/qsure-case-study-2022 and click **Code > Download ZIP** 5. Create a `README.md` using `usethis::use_readme_md()` 6. Drag contents of `raw-data` and `scripts` from qsure-case-study-2022 to your new Project. --- class: inverse, center, title-slide, middle # Thank You! --- # Resources - These materials will available on Github - Data wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf - Questions? Reach out to myself or Mike on teams or via email - Create an R help channel on teams- help each other! --- --- # Bonus Slides: Why Write a Function? - D.R.Y. - limit copy pasting and potential mistakes - Automate common tasks - You only need to update code in one place - If you are copy pasting code > 3 times, write a function - Give your function a useful name --- # Bonus Slides: Function Example ```r add_one <- function(number) { result = number + 1 return(result) } add_one(7) ``` ``` ## [1] 8 ``` ```r #add_one("hai") ``` --- # Bonus Slides: Function Practice Write a function to calculate BMI given height (inchs) and weight (lbs) BMI = (weight * 703) / (height)^2 ```r calc_bmi <- function() { ????????????? } ``` --- class: inverse, center, title-slide, middle # Case Study https://github.com/karissawhiting/qsure-case-study-2022 --- --- class: inverse, center, title-slide, middle ## `Lesson 2` #### June 9th, 2022 <p align="center"><img src="images/core_hex_stickers.png"width=20%></p> --- ## Model Statistical analyses .xxlarge[ Will cover: - linear model - logistic model - survival analyses ] --- ## General modeling formula .xxlarge[ - in general `~` is used to separate your outcome on the left hand side and your predictors on the right hand side - your outcome will always be on the left side of the `~` - only some univariate tests like `chisq.test()` do not use the `~` notation - general notation: model(outcome ~ covariates, data) ] - the `stats` package is already loaded in R which will make it easier to use common statisitcal tests --- ## Example of linear model .xxlarge[ - Continuous outcome - Specifying interactions ] ```r mtcars$vs <- as.character(mtcars$vs) mtcars$cyl <- as.character(mtcars$cyl) mod1 <- lm(mpg ~ vs * cyl, data = mtcars) class(mod1) # class of lm which is a list ``` ``` ## [1] "lm" ``` ```r names(mod1) ``` ``` ## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "contrasts" "xlevels" "call" "terms" ## [13] "model" ``` --- ## Check model diagnositcs ```r library(car) #model diagnositics #multi colinearity #vifmod <- car::vif(mod1) #will not work with interaction present #influencers cutoff <- 4/((nrow(mtcars)-length(mod1$coefficients)-2)) ``` --- ## Check model diagnositcs ```r #many options for which! plot(mod1, which=1, cook.levels=cutoff) ``` ![](index_files/figure-html/unnamed-chunk-21-1.png)<!-- --> --- ## Example cont. ```r summary(mod1) ``` ``` ## ## Call: ## lm(formula = mpg ~ vs * cyl, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.3300 -1.4437 0.0875 1.5250 7.1700 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 26.000 3.318 7.836 2e-08 *** ## vs1 0.730 3.480 0.210 0.83541 ## cyl6 -5.433 3.831 -1.418 0.16757 ## cyl8 -10.900 3.434 -3.174 0.00374 ** ## vs1:cyl6 -2.172 4.305 -0.504 0.61801 ## vs1:cyl8 NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.318 on 27 degrees of freedom ## Multiple R-squared: 0.7361, Adjusted R-squared: 0.697 ## F-statistic: 18.82 on 4 and 27 DF, p-value: 1.696e-07 ``` - while it is nice to see the summary results, you wouldn't present them in this fashion --- ## Example cont. .xxlarge[ - `broom` and `gt`/`gtsummary` will help - `broom` is a package that helps tidy model results into data.frames - this helps with reporting and you can further format the data.frame and present with `gt` ```r moddf <- broom::tidy(mod1) %>% #didn't load broom just called one function mutate(p.value = round(p.value,3)) %>% select(-std.error) gt::gt(moddf) ```
term
estimate
statistic
p.value
(Intercept)
26.000000
7.8364568
0.000
vs1
0.730000
0.2097843
0.835
cyl6
-5.433333
-1.4182193
0.168
cyl8
-10.900000
-3.1738857
0.004
vs1:cyl6
-2.171667
-0.5044923
0.618
vs1:cyl8
NA
NA
NA
] --- ### gtsummary .xxlarge[ - `gtsummary` makes it easier to report model and descriptive statistics (more on this in the example) - For more helpful examples: https://www.danieldsjoberg.com/gtsummary/index.html ] --- ### gtsummary ```r tbl_regression(mod1) ```
Characteristic
Beta
95% CI
1
p-value
vs
0
—
—
1
0.73
-6.4, 7.9
0.8
cyl
4
—
—
6
-5.4
-13, 2.4
0.2
8
-11
-18, -3.9
0.004
vs * cyl
1 * 6
-2.2
-11, 6.7
0.6
1 * 8
1
CI = Confidence Interval
--- ### Customizing gtsummary ```r library(labelled) var_label(mtcars$cyl) <- "Cylinder" newmodsum <- lm(mpg ~ vs * cyl, data = mtcars) %>% tbl_regression() %>% bold_labels() %>% modify_caption("New title for model") ``` --- ### Customizing gtsummary ```r newmodsum ```
<caption>New title for model</caption>
Characteristic
Beta
95% CI
1
p-value
vs
0
—
—
1
0.73
-6.4, 7.9
0.8
Cylinder
4
—
—
6
-5.4
-13, 2.4
0.2
8
-11
-18, -3.9
0.004
vs * Cylinder
1 * 6
-2.2
-11, 6.7
0.6
1 * 8
1
CI = Confidence Interval
--- ## Logistic models .xxlarge[ - binary outcome (0/1) - R will model the '1' as the event by default make sure your variable is coded correctly ```r mtcars$vs <- as.numeric(mtcars$vs) mtcars$am <- as.character(mtcars$am) mod2 <- glm(vs ~ am , data = mtcars, family = "binomial") ``` ] --- ## Summarize logistic model ```r tbl_regression(mod2, exponentiate = TRUE) %>% bold_labels() ```
Characteristic
OR
1
95% CI
1
p-value
am
0
—
—
1
2.00
0.48, 8.76
0.3
1
OR = Odds Ratio, CI = Confidence Interval
--- ## Survival analysis .xxlarge[ - Outcome is both time and an event (e.g death, progression) - Have to specify both in model - Former MSK employee Emily Zabor put together great materials for survival analysis here: https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html ] --- ## Survival analysis ```r library(survival) ``` ``` ## ## Attaching package: 'survival' ``` ``` ## The following objects are masked from 'package:faraway': ## ## rats, solder ``` ```r lung <- lung %>% mutate(ph.ecog = as.character(ph.ecog), sex = as.character(sex)) mod3 <- coxph(Surv(time, status)~ph.ecog+sex,data = lung) mod4 <- survfit(Surv(time, status) ~ sex,data = lung) ``` --- ## Survival analysis ```r tbl_regression(mod3, exponentiate = TRUE) ```
Characteristic
HR
1
95% CI
1
p-value
ph.ecog
0
—
—
1
1.52
1.03, 2.25
0.036
2
2.58
1.66, 4.01
<0.001
3
7.76
1.04, 58.0
0.046
sex
1
—
—
2
0.58
0.42, 0.81
0.001
1
HR = Hazard Ratio, CI = Confidence Interval
--- ### Survival analysis ```r mod4 ``` ``` ## Call: survfit(formula = Surv(time, status) ~ sex, data = lung) ## ## n events median 0.95LCL 0.95UCL ## sex=1 138 112 270 212 310 ## sex=2 90 53 426 348 550 ``` --- ## Survival analysis - test proportional hazards ```r cox.zph(mod3) ``` ``` ## chisq df p ## ph.ecog 5.64 3 0.130 ## sex 2.56 1 0.110 ## GLOBAL 7.83 4 0.098 ``` --- # Coding Exercise ## Case Study: Diabetes Risk Factors --- class: inverse, center, title-slide, middle # Thank You!