#Course Project for Getting and Cleaning Data
##Contents:
- run_analysis.R -- The script for creating tidy data set
- CodeBook.md -- Describes the variables, data, and transformations
- avedata.csv -- Tidy dataset resulting from R script
- README.md -- This file
##About the Project Data:

This script operates on the "Human Activity Recognition Using Smartphones Dataset", Version 1.0, downloaded from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip on 10-August-2014.
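Fetching the archive can be sketched in R as follows. This is only an illustration, not part of run_analysis.R: the helper name `fetch_har_dataset` and the local zip file name are assumptions, and the URL is assumed to still be reachable.

```r
# URL copied from this README; everything else in this block is illustrative.
dataset_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

# Hypothetical helper: download and unzip the archive into `dest_dir`,
# unless the extracted folder is already there.
fetch_har_dataset <- function(dest_dir = ".", url = dataset_url) {
  data_dir <- file.path(dest_dir, "UCI HAR Dataset")
  if (!dir.exists(data_dir)) {
    zip_path <- file.path(dest_dir, "UCI_HAR_Dataset.zip")  # assumed file name
    download.file(url, destfile = zip_path, mode = "wb")    # "wb" keeps the zip intact on Windows
    unzip(zip_path, exdir = dest_dir)
  }
  data_dir
}
```

Calling `fetch_har_dataset()` a second time does nothing beyond returning the folder path, so it is safe to leave at the top of an interactive session.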
By default, the dataset will extract into a folder titled "UCI HAR Dataset". The structure of the folder is as follows:

- UCI HAR Dataset/
    - README.txt
    - features_info.txt
    - features.txt
    - activity_labels.txt
    - train/
        - X_train.txt
        - y_train.txt
        - subject_train.txt
        - Inertial Signals/
    - test/
        - X_test.txt
        - y_test.txt
        - subject_test.txt
        - Inertial Signals/

The folders named "Inertial Signals/" are not used in this project. Within the "test/" and "train/" folders, the files "subject_*.txt" list the ID of the subject for each observation. The files "y_*.txt" hold the ID of the activity being measured, e.g. standing, walking, or sitting. The files "X_*.txt" contain the measurement vectors, one vector per row; each row corresponds to the subject and activity on the same row of the other two files.
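The way the three files line up can be sketched with a small reader. The helper name `read_split` and the column names "Subject" and "ActivityID" are assumptions made for this example, not necessarily what run_analysis.R uses.

```r
# Hypothetical helper: read one split ("train" or "test") from `data_dir`
# and line up subject, activity ID, and measurements row by row.
read_split <- function(data_dir, split) {
  path <- function(prefix) {
    file.path(data_dir, split, paste0(prefix, "_", split, ".txt"))
  }
  subject  <- read.table(path("subject"), col.names = "Subject")
  activity <- read.table(path("y"), col.names = "ActivityID")
  measures <- read.table(path("X"))  # columns default to V1, V2, ...
  # The files only make sense together if they have the same number of rows.
  stopifnot(nrow(subject) == nrow(activity),
            nrow(activity) == nrow(measures))
  cbind(subject, activity, measures)
}
```

For example, `read_split("UCI HAR Dataset", "test")` would return one data frame with the subject, the activity ID, and the measurement columns side by side.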
##About run_analysis.R:

The script run_analysis.R should be placed in the folder "UCI HAR Dataset". When this script is executed, it will output a file called "avedata.csv" in the same location which contains the tidy dataset.
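Before running the script, it can help to confirm that it sits next to the files it needs. The check below is a sketch; `missing_inputs` is a hypothetical helper, and the file list is taken from the folder listing in this README rather than from the script itself.

```r
# Input files named in this README, relative to the "UCI HAR Dataset" folder.
required_inputs <- c(
  "features.txt",
  "activity_labels.txt",
  file.path("train", c("X_train.txt", "y_train.txt", "subject_train.txt")),
  file.path("test",  c("X_test.txt",  "y_test.txt",  "subject_test.txt"))
)

# Hypothetical helper: return the required files that are missing from `dir`.
missing_inputs <- function(dir = ".") {
  required_inputs[!file.exists(file.path(dir, required_inputs))]
}

# Typical use from inside the dataset folder (not run here):
# if (length(missing_inputs()) == 0) source("run_analysis.R")
```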
The steps taken by the script are as follows:
1. Read the "features.txt" file, which describes the columns of the measurement vector. Then pick out the columns that contain the mean and standard deviation of each measurement.
2. Read the three files in each of the "test/" and "train/" folders, which contain the subject, activity, and measurements. Rename the subject and activity columns so that the script is easier to read.
3. Merge the train and test data together.
4. Extract the measurements for mean and standard deviation using the column names derived in Step (1).
5. Read "activity_labels.txt" and substitute the activity ID in each row with its label.
6. Bind the data columns together, beginning with Subject and Activity, followed by the mean and std measurements.
7. Make a new dataset that contains the average of each variable for each activity and each subject.
8. Simplify the column names in the new dataset. As a result of the aggregation, the grouping columns are called "Group.1" and "Group.2". These are renamed to "Activity" and "Subject". The column names for the measurement variables are also reformatted to be easier to read, e.g. "[Measurement].[Axis] Ave mean()" or "[Measurement].[Axis] Ave std()", where [Axis] is one of "X", "Y", or "Z" and [Measurement] is a shortened form of the values found in "features.txt".
9. Finally, write out the new dataset as a csv file.
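
The steps above can be sketched end to end on a toy dataset. This is not the code in run_analysis.R: the data values are made up, the variable names are assumptions, the renaming pattern is only a guess at the format described above, and the output goes to a temporary folder instead of the dataset folder.

```r
# Toy stand-ins for the real files; all values below are made up.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-energy()-X")
activity_labels <- data.frame(id = 1:2, label = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)
train <- data.frame(Subject = c(1, 1, 3), ActivityID = c(1, 1, 2),
                    V1 = c(0.2, 0.4, 0.1), V2 = c(1, 3, 5), V3 = c(9, 9, 9))
test  <- data.frame(Subject = c(2, 2), ActivityID = c(2, 2),
                    V1 = c(0.5, 0.7), V2 = c(2, 4), V3 = c(9, 9))

# Step 1: flag the features that are a mean() or std() measurement.
keep <- grepl("mean()", features, fixed = TRUE) |
  grepl("std()", features, fixed = TRUE)

# Step 3: stack the train and test rows.
merged <- rbind(train, test)

# Step 4: keep the flagged measurement columns (they start at column 3).
measures <- merged[, 2 + which(keep), drop = FALSE]
names(measures) <- features[keep]

# Step 5: replace each activity ID with its label.
activity <- activity_labels$label[match(merged$ActivityID, activity_labels$id)]

# Step 6: Subject, Activity, then the mean and std measurements.
combined <- data.frame(Subject = merged$Subject, Activity = activity,
                       measures, check.names = FALSE)

# Step 7: average every measurement per activity and subject.
averaged <- aggregate(combined[, -(1:2), drop = FALSE],
                      by = list(combined$Activity, combined$Subject),
                      FUN = mean)

# Step 8: rename the grouping columns, then apply an assumed pattern that
# turns "tBodyAcc-mean()-X" into "tBodyAcc.X Ave mean()".
names(averaged)[1:2] <- c("Activity", "Subject")
names(averaged)[-(1:2)] <- sub("^(.*)-(mean|std)\\(\\)-?(.*)$",
                               "\\1.\\3 Ave \\2()", features[keep])

# Step 9: write the result as csv (to a temporary folder in this sketch).
out_path <- file.path(tempdir(), "avedata.csv")
write.csv(averaged, out_path, row.names = FALSE)
```

In this toy input, subject 1 has two WALKING rows, so its row in `averaged` holds the mean of those two rows, while subjects 2 and 3 each contribute one SITTING row.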