-
Notifications
You must be signed in to change notification settings - Fork 6
/
Handout.Rmd
1847 lines (1272 loc) · 60.4 KB
/
Handout.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Handout"
header-includes:
- \pretitle{\begin{center}\LARGE\includegraphics[width=12cm]{Images/Clinician Coders Branding_FINAL_CMYK_Colour.png}\\[\bigskipamount]}
- \posttitle{\end{center}}
output: pdf_document
number_sections: true
linkcolor: blue
urlcolor: blue
citecolor: blue
---
\newpage
\tableofcontents
```{r, include = FALSE}
# We will need this data later.
library(tidyverse)
library(lubridate)
cchic <- read_csv("clean_CCHIC.csv")
```
# Licensing
This course is licensed under a Creative Commons share-alike licence.
[Creative Commons](https://creativecommons.org/licenses/by-sa/4.0/)
The course has been adapted from the Data Science for Doctors course.
[DS4D](http:https://datascibc.org)
\newpage
# Workshop 1 - Introduction to R.
## Introduction
### What is R?
R is a FREE and open source software for statistical computing and graphics. It works with variety of platforms:
- UNIX
- MacOS
- Windows
R uses a computer ‘language’ that is thought to make it particularly suitable for statistics. Often the challenge new users face is learning that language. Most people are often used to a graphical user interface (GUI) where they can point and click. With R, you need to describe to the computer the steps you want it to take.
It does get better with practice!
But why bother with the practice for R? You are not limited to pre-designed command process of a particular GUI package – you can give R your data and ask it to work through almost any calculation or process.
### Reproducible science
One of the main criticisms of data science may be that logging the steps of your analysis can be arduous or contain errors, so if a third party were to try and reproduce your findings they may well get different results. Also, how many times have you been working with a large database and found that there was either an error in the data or you made an error early on in your workings. You have then had to go back and re do the whole analysis from the point at which you made an error.
With R:
- The code you write to analyse your data is also the record of your analysis.
- If you make a mistake early on, this can be corrected and the subsequent code still valid and likely adaptable.
- You can use the similar analytical methods on a whole other database, by saving and re-using code.
\newpage
### R Studio
This is what the R studio interface looks like:
<p>
![R Studio Interface](Images/RStudio.png)
</p>
The windows (clockwise from bottom right) are:
*Console*
When you type in the "Console", you are typing into R itself. Commands are executed, but not saved.
Here `Enter` executes a command
*Source*
This is a text file. You can type commands in it, and send them to the "Console", where they are executed in R. You can save the text file on your computer so that you have a record of the commands you run.
Here `Cmd + Enter` (Mac) or `Ctrl + Enter` (Windows) executes your command.
*Environment*
This lists the objects you are working with in your R environment.
*Supporting Panes*
The supporting panes include:
- Help pages
- Plot outputs
- Browser for your files (this is your computer, if you delete something here it will also be deleted from your computer drive. BE CAREFUL)
\newpage
## R building blocks
We propose that there are three main building blocks to R:
1. Names
2. Data
3. Functions
### 1. Names
Names are assigned to a value/function/file.
The assignment operator is `<-`.
You can then recall the value/function/file by using the name
```{r}
# Assign the value of 2+2+6 to the name `practice`. Notice there is no output in the
# console, but 'practice' is added to the environment window.
practice <- 2 + 2 + 6
# Recall the value using the name.
practice
```
You can also assign names to text, but note that the text needs to be surrounded with "". Also note that object names can't have spaces in them.
```{r}
say_hello <- "hello"
say_hello
```
### 2. Data
R is a statistical computing language; statistics uses vectors and matrices.
#### *Vector*
A vector is a structure containing elements of the same type of data.
* Collection of numbers (integer vector). Example - age.
```{r, echo = FALSE}
c(50, 25, 32, 67, 46, 19, 48)
```
* Collection of real numbers (numeric vector). Example - temperature
```{r}
c(37.5, 37.1, 37.3, 38.3, 37.4, 38.9)
```
* Collection of characters (character vector). Example - name.
```{r, echo = FALSE}
c("Adam", "Sally", "Eve", "John", "James", "Jennifer")
```
* Collection of true/false values (logical vector). Example - 'blood results avaiblable?'
```{r, echo = FALSE}
c(TRUE, FALSE, FALSE, TRUE, TRUE)
```
\newpage
You can also assign names to vectors. You need to use `c()` in order to combine the elements of the vector. Remember, characters need to be surrounded with `""`.
```{r}
age <- c(50, 25, 32, 67, 46, 19)
# Recall the whole vector using the name.
age
# You can do the same thing for a character vector.
name <- c("Adam", "Sally", "Eve", "John", "James", "Jennifer")
name
gender <- c("male", "female", "female", "male", "male", "female")
```
You can recall specific elements of a vector using the name and `[]`.
```{r}
# The age of the second person.
age[2]
# The name of the fifth person.
name[5]
```
#### *Vectors – Other properties.*
If you add a `character` to a vector of integers, R will change the whole vector to a character vector.
```{r}
test <- c(1, 2, 3, 4)
# Note that `test` is a numeric vector.
str(test)
# Add a character to `test`
test <- c(1, 2, 3, 4, "x")
# Note that `test` is now a character vector.
str(test)
```
If you add a `logical` to an integer vector, `TRUE` is converted to 1 and `FALSE` is converted to 0. The vector remains numeric.
```{r}
test2 <- c(1, 2, 3, 4, TRUE)
# Note that the last element is `1`.
test2
# Note that the vector is numeric.
str(test2)
```
\newpage
#### *Matrix*
A matrix is collection of elements of the same data type, arranged into rows and columns.
```{r, echo = FALSE}
matrix(1:9, nrow = 3, ncol = 3)
```
#### *Data Frame*
A data frame is just a list of vectors. We are all familiar with data frames, as a properly formatted spreadsheet is a data frame. Here is an example:
```{r, echo = FALSE}
data.frame(name, age, gender)
```
In the table above there are two character vectors and one numeric vector.
You can create a data frame in R with the command `data.frame()`. R arranges the vectors as names across columns, and also numbers each row automatically.
```{r}
# Combining the vectors we created before into a data frame.
patients <- data.frame(name, age, gender)
patients
```
Remember, if you try and create a data frame with unequal number of values in each vector R will try and fill in the missing values by filling them in for you.
```{r}
name <- name <- c("Adam", "Sally", "Eve", "John", "James", "Jennifer")
gender <- c("male", "female")
# You probably don't want this to happen.
data.frame(name, gender)
```
You can review a specific vector in a data frame using `$`.
Try `patients$age`.
\newpage
### 3. Functions
Finally, functions.
Think of functions as factories. Each factory is designed to have a certain output and needs a specific input. A function may conduct more than one step on the data you input e.g. `table()` counts values and presents them by the group parameter you set (we will use `table()` later).
Functions will try and process most of the arguments you give it but there will be constraints. The help pages for the functions (accessed with either `?function_name` or `??function_name`) will detail these constraints.
Examples of some functions which are useful for exploring your dataset are:
- `head(dataset_name)` - prints the first 6 rows.
- `tail(dataset_name)` - prints the last 6 rows.
- `summary()` - gives you infomration about either a data frame or a vector.
- `mean()` - calculates the mean of a numeric vector.
R has many inbuilt functions. Packages are repositories of functions separate to this inbuilt content. To access these packages from the online package repository, you need the commands:
`install.packages("package.name")`
`library(package.name)`
You only need to use the `install.packages()` command once. You will need to use `library()` to reload repositories you want to use each time you restart R.
## R vocabulary.
The main terms to be get you started are:
* Names
* Vectors
* Data Frames
* Functions
* Packages
#### R Studio cheat sheet
Available here:
https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf
\newpage
# Workshop 2 - Keeping a clean dataset.
## Introduction
Spreadsheets offer a simple and effective method to collect data. However, their ease of use often leads to a sloppy approach to data entry.
Good data discipline is an essential foundation to data analysis, and a small investment at this stage can prevent you running into trouble when trying to analyse and publish your study. This tutorial will take you through the essential dos and don’ts of managing your data in a spreadsheet.
As a word of caution, do not forget, that any data collection should be in line with the [Data Protection Act](legislation.gov.uk) and the [General Data Protection Regulation](ico.org.uk). Please ensure you manage your data appropriately.
This is a good opportunity to use the tutorial and compare the points below to your own original data. Take the time to clean your data as you read through the document. We have provided a ‘dirty’ sample of the Critical Care Health Informatics Collaborative (CC-HIC) public dataset to practice cleaning if you would prefer (see slack channel).
Prior to using any of the CC-HIC extracts please ensure you have signed the End User Licence Agreement (EULA).
### Spreadsheet applications
There are many different spreadsheets to use. Almost everyone will be familiar with Excel, although it is worth considering if this is actually the right choice for you.
If you are planning a major collaboration, where many people need to enter data simultaneously, then an excel spreadsheet is probably not for you. Have a look at googlesheets as an alternative. As your requirements grow, you may want to explore a relational database. For now, we will assume you are using excel.
\newpage
## Data Types
To know how to best record and manage data, it is important to have an understanding of the type of data you are working with.
Have a look at this diagram from derangedphysiology.com.
<p>
![Types of data](Images/DataTypes.png)
<p>
R supports all these data types. Look at the diagram below and consider how you may represent these data types in R.
If you intend to collect data on a spreadsheet program (e.g. excel), then import into R for further analysis, it is helpful to understand how R 'sees' data and what it is able to do with different types of data.
\newpage
### Data types in R
#### *Integers and Decimals*
The values recorded in numeric variables can be viewed as either integers or decimals.
- Integers are discrete whole numbers (counting numbers) e.g. 1, 2, 3
- You can specify an integer in R by writing an “L” after the integer e.g. 1L, 2L, 3L. This might seem strange, but it guarantees that the value you are recording is an integer, and not a decimal.
- Decimals are numbers that appear in the real world e.g. 2.1, 2.0, 1.9
- Decimals are often called floats or reals in computer science
- R refers to decimals as numerics
Computers apply maths differently to these two numerical data types.
#### *Factors*
The values recorded in categorical variables can be either nominal (unordered names) or ordinal (ordered names). R stores both these variables as nominal and ordinal factors respectively.
Example of a nominal factor vector:
- Apples
- Oranges
- Pears
R will store this as:
1. Apples
2. Oranges
3. Pears
The numbers here do not indicate a relationship between the items. It is only used for indexing the list. R will not understand this as 1 > 2 > 3, which makes no sense in this context.
Example of an ordinal factor vector:
- strongly disagree
- disagree
- neither agree/disagree
- agree
- strongly agree
This may be familiar to you as the Likert scale. Here the values are named, but there is an intrinsic order that may be relevant. Where there is an order to the values R stores the data as follows:
1. strongly disagree
2. disagree
3. neither agree/disagree
4. agree
5. strongly agree
For ordinal factors, R understands that 1 < 2 < 3 < 4 < 5
\newpage
#### *Strings*
A string is any sequence of characters. Literally anything you can type can be represented as a string. This is the default data type in excel, and is what data typically defaults to if there is any confusion about what the data is. This is because anything can be represented as a string.
Take the example of recording the fluid balance of patients on the day of admission to ITU. If we write "480ml" excel will store this value as a string. It has no choice, since the "ml" adds data that cannot possibly be interpreted numerically.
When dealing with data that is numerical in nature, it is best practice to keep all units elsewhere and keep the data as an integer or decimal.
#### *Dates and Times*
Dates and times are a special case of integers and decimals. Computers store date information as integers being counted from a specific point in time. This isn’t helpful for us, as we have no idea how many seconds there are since 1st January 1970. R stores the integer value, and presents this as a formatted date when we ask for the value.
You will be using an extract from the CC-HIC in the next few workshops. You will note that all the patients are admitted on 1st January 2000. This was one of the measures to protect patient identification. Thus events (e.g. discharge dates, death dates etc.) are recorded as respective to 1st January 2000.
The time is stored as a fraction of the date, but 10467.5674 is not intuitive to read. R will format this correctly when asked to present the data.
Depending on the cell format, typing the same date in excel will give you different appearances, or even a different actual date.
<p>
![Dates in Excel](Images/ExcelDates.png)
<p>
#### *Booleans*
These are binary statements, i.e. TRUE or FALSE, which are recorded by the computer as 1 or 0 respectively.
Of note, FALSE is always 0. TRUE can be stored as any non-zero number. When recording TRUE or FALSE statements for any variable (e.g. on antibiotics, had steroids etc.), it is best to stick to this single conventional practice.
\newpage
## Formatting
#### What is a CSV file?
A comma separated values file is as it says on the tin! The structure of columns and rows stays the same but data values on a single row are separated by commas instead of walls of a cell.
Here is what a CSV file looks like if you were to look at the data raw:
![](Images/UnformattedCSV.png)
Here is what you see when you open up your CSV file in a spreadsheet application:
![](Images/FormattedCSV.png)
Excel opens it up to view as above but also has formatting which makes it easier to read, and an interface where we can interact with the program to manipulate the data.
![](Images/ExcelCSV.png)
\newpage
#### What to do about formatting?
It is often easier for files to be converted into CSV before importing into R. This ensures that the data is being 'translated' into a universal structure that both applications can understand. Therefore, coloured cells and font formatting will not be translated when converted to CSV.
Remove all coloured cells and formatting from the data you are using for this workshop.
### Common Mistakes to Avoid and Correct
The following are a list of common mistakes to avoid. See if you can find examples in your data and correct as you go along:
- Multiple tables
- Multiple tables in one sheet will confuse the computer
- The computer will still identify each row as a different single identity across multiple tables
- Using the same field name twice - this will confuse the computer.
\newline
- Multiple tabs
- E.g. a different tab for each day of your wet lab experiment
- Recording across different tabs means you may make mistakes or have inconsistencies in your recording
- When you come to merge all the tabs together make sure to change the headings.
- Multiple tabs can be avoided by accepting that your data sheet is going to be potentially quite large.
- If you must use multiple tabs, consider before creating each one whether the information could be recorded on the single sheet.
- You do not have to merge the multiple tabs made available in the CC-HIC dirty extract. Just have a look and think how you would do so.
\newline
- Repeating headers
- Never ever do this. You will not be able to run calculations on these variables as the computer won’t know which column you are referring to.
- In order to scroll across headers and columns, you can freeze them.
- Freezing rows and columns, view > freeze top row
\newline
- Not filling in zeros
- If a value of a recording actually is ‘zero’, record that ‘0’ e.g. number of infections during admission on ITU.
- If there is no value in the cell (even if the value was counted but nothing observed) the computer will interpret that as missing
\newline
- Using bad null values
- Essentially, don’t use numerical values (‘9999’ or ‘0’)
- Blank or NA are reasonable choices
\newline
- Using formatting to convey information/make the sheet look pretty
- The formatting (e.g. coloured cells) won’t be conveyed in R, try and change this to another column that gives you this information without the need for formatting.
- Merging cells, including for headings, compromises your computer’s ability to read your data. Always avoid merging, consider instead adding pre/suffix to each heading.
\newpage
- Using punctuation or units inside cells
- Leaving units or comments may be a helpful aide memoire at the time but will confuse your data analysis process.
- Create another variable/column
\newline
- More than one piece of information in a cell
- E.g. BP 120/80, 90/60
- Separate this out into 2 columns; DBP, SBP
\newline
- Special characters in data
- E.g. tabs for format lines of data
- Importing this in R could result in a line being cut in half/lost data/extra rows
\newline
- Inclusion of metadata in data table
- It’s natural, and good practice, to store information about your data
- But, it doesn’t (and shouldn’t) be on the same sheet, as this can disrupt how the computer will interpret the data.
- Store this in a different file.
\newline
- Field name problems
- Avoid spaces and special characters
- R doesn’t like spaces
- Some programs don’t like text string names that start with a number
- Generally consider using snake_case (this_is_snake_case) or CamelCase (ThisIsCamelCase) for all headers for consistency
- Avoid uncommon abbreviations – you never know how long you may use this data for and keep referring to it or who you may need to share it with.
<p>
![Examples of names](Images/NameExamples.png)
<p>
\newpage
## Recording and Cleaning Data
### Tips for recording data
1. Keep three sheets/tabs in a file:
- Sheet 1: This is your data
- Sheet 2: Dictionary – you can define the terms for each of the variables. It is good practice to define each variable as what may seem obvious now, may not in 6,12,18 months etc..
- Sheet 3: readme – a file for string where you can record further information
- Metadata goes in a different file.
\newline
2. Put all your variables in columns - the thing you’re measuring, like ‘weight’, ‘temperature’ or ‘SBP’. Break things down into their most basic constituents, and keep units in your headers only.
\newline
3. Put each observation in its own row. Think very carefully about what constitutes your basic observation. Often it’s your patient, but it may not be as intuitive as you think.
- e.g. for labour ward projects each row could indicate a delivery or a procedure and not necessarily a patient.
\newline
4. Don’t combine multiple pieces of information in one cell.
\newline
5. Leave the raw data raw - don’t mess with it! That means no formulas anywhere in your spreadsheet!
\newline
6. Dates and Times:
- Absolutely separate out dates and times
- To make life very easy you can consider separate columns for: day, month, year.
- This is definitely more intensive at the beginning but saves you time later on in the analysis.
\newline
7. True/False, Yes/No (Boolean or Binary) statements, where there are only 2 possible ‘answers’ to the variable in question (e.g. ‘severe CVS disease’ – ‘did the patient have known severe CVS disease when admitted?’):
- Best to record as 1 and 0
- You can of course just record TRUE and FALSE, but if typing in the answers 1 and 0 are quicker.
\newline
8. Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
\newpage
### Tips for cleaning data
1. Understand what each row and column is meant to be recording:
- Is each row a patient, or a patient event?
- Does each column record a data point that applies to each patient
- If a column contains information about other columns (and not specifically about the row unit) this is called metadata (data about data).
\newline
2. Understand the types of data being recorded in your spreadsheet
- Assess each column, does it contain:
- Integers
- Decimals
- Categorical: Nominal/Ordinal
- Boolean
- Dates
- Times
- (Dates and Times)
- Strings
- Ensure for each of the data you have identified, that your spreadsheet application has formatted for that column correctly.
- E.g. numbers can be stored as string. They don’t look any different. BUT when imported into R, mathematical calculations aren’t possible as R does not see them as numbers.
- Formatting in the spreadsheet is mostly straightforward with dropdown options. Just beware when formatting dates/times to ensure no major changes.
\newline
3. Go through each variable/column:
- Review headers; rename duplicates, remove spaces
- Decide how you are going to record your missing values, ‘NA’ is a good option.
- Ensure that you aren’t recording a single thing more than one way, e.g. ‘F’ and ‘f’ for female. Pick one and ensure all females are recorded the same way.
- Remove units, you can create a different column to hold units if needed, though it is best to just convert all values to a common unit
- Put ‘0’s where they are meant to be
- Remove punctuation/formatting
\newline
4. Remove “extra headers” – combine these into variable names
\newline
5. Remove colour formatting
\newline
6. Merge tabs if possible – then all you data can exist on one data frame
\newline
7. Remove columns containing calculations.
- You can either change these to “values only” if you still really need the data
- OR remove completely keeping only raw data, knowing you are going to be creating calculations in R.
\newpage
# Workshop 3 - Importing Data into R
## Introduction
### Reproducibility
We strongly believe that the greatest strength you will gain from the work today is reproducibility. Learning coding is a pain, and for a 'one off' process there will nearly always be some tempting point-and-click alternative. However, learning to code, means being forced to write down what you did to get from your data to your result. You'll only have to reproduce something once to appreciate the value of having everything documented. Whether it is an error in your data, a new observation, or your supervisor or collaborator asking you to repeat an analysis with a twist, you will always be revisiting and revising your work. And the initial investment in writing (i.e. coding) your analysis means that these adjustments become simple.
#### What you will learn
What are files and directories?
Importing spreadsheet data into R
Initial exploration of your data
Tips on using Google sheets with R
## Files and Directories
Directories are folders, they contain other folders and files.
Their relevance will become obvious once we need to instruct R, using code, where to find our stored databases. Therefore we will touch on this again later with greater detail.
\newpage
## Importing Your Data into R
There are three main methods for getting data into R:
1. Point and click
2. Using commands
- Better for reproducibility
- Someone else would just have to replace the source code to run the code on their own computer
3. Connecting to an online database like google spreadsheet
- But first you need to save your data in a format that can be understood by R.
#### Saving your data as a CSV file
We need a 'common data language' that all spreadsheet languages can write to, and all statistical and programming languages can read. Arguably, one of the oldest and most ubiquitous is the 'comma separated values' (CSV) file. This is easily exportable from Microsoft Excel, Apple Numbers, Open Office, Google Sheets...etc.
It's a simple format. The top line are the column names, each separated by a comma. The following lines are the observations in those columns, again, separated by a comma.
Its strength is in its simplicity. It only has data, no formulas, no tricks and is very well recognised amongst software packages as it is very easily supported. R has excellent support for CSV files.
<p>
![](Images/UnformattedCSV.png)
<p>
\newpage
Spreadsheet programs allow you to save your data in a .csv format.
<p>
![](Images/saveAscsv2.png)
<p>
\newpage
### Point and Click
You can either use
1. `Import Dataset` tab in the `Environment` pane
OR
2. Navigate with the `Files` tab in the `Supporting Pane`
**The point and click method is not reproducible**
<p>
![Notice how there is no text in the source window to describe how you got to this data](Images/Point and click.png)
<p>
\newpage
### Importing Data with Commands
##### Finding and setting your working directory
To use commands for importing data you need to instruct R where to find your database file (this is where directories come in).
- R would have chosen a working directory for you. You can check what it is using the `getwd()` command.
```{r, eval=FALSE}
getwd()
```
You can set your working directory to be the location of your file.
```{r, eval=FALSE}
setwd()
```
<p>
![](Images/Setwd.png)
<p>
\newpage
Remember
- You need quotation marks (`""`) around your file path name
- Start with a root directory and then specify branches
- If you have a file in the directory you wish to use, you can find out the file path name by right clicking on the file:
- `Properties` in Windows
- `Get Info` in Mac
<p>
![](Images/FileProperties.png)
<p>
Note - Folders within directories are specified with:
- `/` in Mac
- `\\` in Windows
However, the correct notation in R for both is `/`
\newpage
##### The `readr` package
- `readr` is a package containing functions to help R read your external files
- It is automatically installed with `tidyverse`
- Install and load the package as follows
```{r, eval=FALSE}
install.packages("tidyverse")
library(tidyverse)
```
##### You are now ready to import the file.
You can then tell R to read a CSV file by specifying the full file path name. You can see an example of this below, but the path will be unique to your computer.
```{r, eval=FALSE}
cchic <- read_csv("clean_CCHIC.csv")
```
Now look for the `tibble` in your console.
You have now imported the files into your R environment. How can you look at the contents of the dataset within R?
```{r, eval=FALSE}
# Looking at the file in R
View(cchic)
```
This gives you a neater view, which is more human legible than the output printed in the console.
<p>
![](Images/view.png)
<p>
##### Shortcuts
- There are notations to shorten the code for the file path.
- Use a single dot for a file within the current working directory.
```{r, eval=FALSE}
read_csv("./clean_CCHIC.csv")
```
- Use a double dot `..` for a file within the directory above the currnet working directory.
\newpage
## Reviewing Your Data in R
Spreadsheets in R are called data frames.
You can get to know your data frame with the following functions:
- `head(cchic)`
- `tail(cchic)`
- `names(cchic)`
- `str(cchic)`
```{r}
head(cchic)
```
- Returns the first 6 rows of the dataset.
- Lists other column headings at the bottom
```{r}
tail(cchic)
```
- Like `head()` but gives you the last 6 rows as a tibble
- Useful to look for missing data at the end of the dataset
\newpage
```{r}
names(cchic)
```
- Lists the names of all the variables
- Useful if you want to check for the spelling of any variables
```{r, eval = FALSE}
str(cchic)
```
- Reports the format of each variable in R.
- You can therefore check vectors before running functions. This would prevent you, for example, from trying to run a mean on a character vector
\newpage
#### Co-ordinates
You can also use code in the form of coordinates to help you look at specific items within your `data frame`.
- For example:
```{r}
cchic[25, 5]
```
This is the value of the 5th variable on row 25.
#### $ to review column
You can extract a specific variable using `$`.
- Enter the `data_frame_name$variable_name`
- For example
```{r, eval = FALSE}
cchic$weight
```
This lists all of the data in the weight variable.
### Description Functions
Some packages contain functions that help describe the whole data frame. These normally describe the type of each variable, and may include summary statistics such as the mean, or the range of data values.
Examples include:
- `Desc()` from `DescTools` package.
- This function also outputs plots for each variable.
- `describe()` from `Hmisc` package.
To use these you have to install the appropriate packages
\newpage
## Exercises
1. What are the variable formats in `cchic`?
2. Display the `discharge` vector in `cchic`.
3. How many men and women are in the database?
4. How many survived and how many died?
5. What does the function `ls()` do?
- Use it to see how many vectors contain information about "temp"
\newpage
### Solutions
1. What types of variables do you have in the `cchic` data frame?
```{r, eval = FALSE}
str(cchic)
```
2. Display the `discharge` vector from `cchic`
```{r, eval = FALSE}
cchic$discharge
```
3. How many men and women are in the database?
```{r}
table(cchic$sex)
```
4. How many survived and how many died?
```{r}
table(cchic$vital_status)
```
5. What does the function `ls()` do?
```{r}
ls(cchic, pattern = "temp")
```
It lists all variable names with the word `temp` in them.
\newpage
## Addendum: Importing from Google Sheets
Googlesheets are spreadsheets which allow for collaboration and live-updating.
It is possible to load and use the data in a googlesheet in R. For this you need the `googlesheets` package.
```{r, eval=FALSE}
install.packages("googlesheets")
library(googlesheets)
```
Then you can have a look at the files in your google sheets account with the function `gs_ls()`.
You will see a tibble of the list of files, here is an example of what that can look like:
<p>
![](Images/Googlesheetstibble.png)
<p>
- The first time you run this you will be asked to authenticate.
- A new browser will open and you will be asked to sing-in to your google account
- A code will then be displayed, which you will have to input into your console.
- Behind the scenes, R now saves a hidden file into your working directory.
- The next time you run your script, as long as you haven't moved your code to a new directory, it won't need to ask for authentication again.
Then, as before, use `read` commands in R to import your data as a tibble.
```{r, eval=FALSE}
file_name <- gs_title("googlesheet_title")
data_frame_name <- gs_read("file_name")
```
\newpage
# Workshop 4 - Data Wrangling.
## Introduction
### What is Data Wrangling?
The term data wrangling describes the process of transforming your raw data into other formats, which can then be used for analysis. Analysis can include summary counts, plots and statistical tests.
Data Wrangling includes:
- Cleaning Data
- Unifying complex and messy data
- Converting one form of raw data into other variables
Data wrangling is a step in the data pipeline. The term data pipeline refers to the processes your raw data is subjected to in order for you to reach your end output with your data (this could be a presentation, a journal article, a report etc).
<p>
![The data pipeline](Images/PipelineWrangling.png)
<p>
### Data wrangling with dplyr
You do not need the dplyr package to wrangle your data but it uses intuitive grammar that simplifies data wrangling/manipulation.
```{r, eval = FALSE}
# You only need to do this once
install.packages("dplyr")
```
```{r}
# You need to do this each time you open R
library(dplyr)
```
The main concept in dplyr, is of piping your data with the code `%>%`. ‘Piping’ here refers to taking the object to left of the code and applying more commands to it. This is not the same as the term data pipeline.
With the pipe function, `%>%`, you can essentially do the following:
`data_frame_name %>% [select certain patients/rows] %>% [function]`
The keyboard shortcuts for the pipe are:
- `Cmd + Shift + M` (Mac)
- `Ctrl + Shift + M` (Windows)
## Common dplyr functions
- `filter()`
- `select()`
- `group_by()`
- `summarise()`
\newpage
### Filter
This functions allows you to select rows based on the criteria you specify.
<p>
![Sourced from 'datanovia'](Images/Filter.png)
<p>
```{r}
cchic %>%
filter(age_years >= 65)
```
You can then create a whole other data frame to work with a subset of data:
```{r}
elder_cchic <- cchic %>%
filter(age_years >= 65)
```
\newpage
### Select
This allows you to select only certain columns, that you can also use to create a separate data frame if you wanted.
<p>
![Image from 'datanovia'](Images/Select.png)
<p>
```{r}
cchic %>%
select(sex)
```
This gives you just the variable named `sex` for all patients.
### Filter and select together
You can also use `filter()` and `select()` together but be careful about the order as you will not be able to `filter()` according to a variable if it has been inadvertently removed by a `select()` command.
```{r}
cchic %>%
filter(age_years >= 65) %>%
select(sex)
```
\newpage
### The `group by` and `summarise` functions.
`group_by()` breaks down a dataset into specified groups of rows. When you then apply operations on the resulting object they’ll be automatically applied “by group”. Most importantly, all this is achieved by using the same exact syntax you’d use with an ungrouped object.
You can use the `summarise()` function to apply a summary statistic to a vector.
```{r}
cchic %>%
group_by(sex) %>%
summarise(mean_urea = mean(urea, na.rm = TRUE))
# Some summary statistic functions do not work with missing values, so you need to add the
# `na.rm` argument to remove these.
```
### Using the pipe with other functions
Piped data can be piped on to be used in almost any function in R, for example:
```{r}
cchic %>%
filter(is.na(age_years) == FALSE, age_years >= 65) %>%
select(sex) %>%
head()
```