Title: Codebook
Author: Juan C. López-Tavera
Date: 2/26/2017
Output: html_document (keep_md: true)

Code book

This is the code book for the UCI HAR Dataset. It is intended as a descriptive guide for future readers: it explains, step by step, how the data was obtained and processed into the final tidy data set, and it documents the structure of that data set, its variables and their units.

About the data

The following information was taken entirely from the UCI HAR dataset page:

The Human Activity Recognition Using Smartphones Data Set is a database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.

Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The experiments have been video-recorded to label the data manually. The obtained data set has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data.

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
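The windowing arithmetic described above can be checked with a short R sketch. It is illustrative only: the helper name `window_starts` and the 10-second example recording are assumptions made here, not part of the original processing pipeline.

```r
# Sampling rate and window parameters quoted in the dataset description.
sampling_rate_hz <- 50      # readings per second
window_seconds   <- 2.56    # window width in seconds
overlap_fraction <- 0.5     # 50% overlap between consecutive windows

# Readings per window: 2.56 s * 50 Hz = 128.
window_size <- round(window_seconds * sampling_rate_hz)

# Step between consecutive window starts: half a window = 64 readings.
step_size <- round(window_size * (1 - overlap_fraction))

# Hypothetical helper: 1-based start indices of every complete window
# that fits in a signal of n_readings samples.
window_starts <- function(n_readings, size = window_size, step = step_size) {
  if (n_readings < size) return(numeric(0))
  seq(1, n_readings - size + 1, by = step)
}

# Example: a 10-second recording holds 500 readings.
starts <- window_starts(10 * sampling_rate_hz)
print(starts)
```

With these parameters, consecutive windows share 64 readings, so each reading away from the edges of the recording falls into two windows.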

For each record in the data set, the following is provided:

  • Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
  • Triaxial Angular velocity from the gyroscope.
  • A 561-feature vector with time and frequency domain variables.
  • Its activity label.
  • An identifier of the subject who carried out the experiment.

Features are normalized and bounded within [-1,1]. Each feature vector is a row in the text file. The units used for the accelerations (total and body) are 'g's (standard gravity, 9.80665 m/s^2). The gyroscope units are rad/s.

Raw data

The original raw data files were downloaded from this link, provided in the "Peer-graded Assignment: Getting and Cleaning Data Course Project" Coursera page, although they can (and should) be downloaded from the original source.

The raw data used for processing was structured as follows:

  • features.txt: List of all features, which was used as column headers to name each variable (the names were later modified).
  • activity_labels.txt: Links the class labels with their activity name. This file was used to label each activity using the factor() R function.
  • train/X_train.txt: Training set, contains all the variables or features.
  • train/y_train.txt: Training labels, each row identifies an activity performed, its range is from 1 to 6.
  • train/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
  • test/X_test.txt: Test set, contains all the variables or features.
  • test/y_test.txt: Test labels, each row identifies an activity performed, its range is from 1 to 6.
  • test/subject_test.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
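As a minimal sketch of how files with this layout could be read and combined in base R, the example below first writes a tiny stand-in for the raw folder to a temporary directory and then reads it back. The directory name `har_sketch`, the three features and all numeric values are made up for illustration; the actual scripts in the repository may do this differently.

```r
# Build a tiny stand-in for the raw layout in a temporary directory.
# All values below are made up for illustration.
raw_dir <- file.path(tempdir(), "har_sketch")
dir.create(file.path(raw_dir, "train"), recursive = TRUE, showWarnings = FALSE)

writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 tBodyAcc-max()-X"),
           file.path(raw_dir, "features.txt"))
writeLines(c("1 WALKING", "2 WALKING_UPSTAIRS"),
           file.path(raw_dir, "activity_labels.txt"))
writeLines(c("0.28 -0.99 -0.93", "0.27 -0.98 -0.94"),
           file.path(raw_dir, "train", "X_train.txt"))
writeLines(c("1", "2"), file.path(raw_dir, "train", "y_train.txt"))
writeLines(c("1", "3"), file.path(raw_dir, "train", "subject_train.txt"))

# Read the two lookup tables.
features <- read.table(file.path(raw_dir, "features.txt"),
                       col.names = c("index", "name"),
                       stringsAsFactors = FALSE)
labels <- read.table(file.path(raw_dir, "activity_labels.txt"),
                     col.names = c("code", "activity"),
                     stringsAsFactors = FALSE)

# Read the measurements and name the columns after features.txt.
# check.names = FALSE keeps names such as "tBodyAcc-mean()-X" unchanged.
train_x <- read.table(file.path(raw_dir, "train", "X_train.txt"),
                      col.names = features$name, check.names = FALSE)
train_y <- read.table(file.path(raw_dir, "train", "y_train.txt"),
                      col.names = "activity")
train_subject <- read.table(file.path(raw_dir, "train", "subject_train.txt"),
                            col.names = "subject")

# Combine subject, activity and features, then label the activity codes
# with factor(), using activity_labels.txt as the lookup.
train <- cbind(train_subject, train_y, train_x)
train$activity <- factor(train$activity,
                         levels = labels$code, labels = labels$activity)
print(train)
```

The test files follow the same pattern with `test/` in place of `train/`.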

Tidy data

The tidy data in this repository follows Hadley Wickham's definition of tidy data:

"Each variable is a column, each observation is a row, and each type of observational unit is a table."

The data set produced is a data table of 813621 rows by 9 columns, stored in a comma-separated values file:

jclopeztavera/human-activity/data/tidy/tidy_data.csv

The variables in the data set are: subject, activity, signal, acceleration, instrument, domain, parameter, axis, value. In more detail:

| Variable | Type | Value | Number of distinct values |
|---|---|---|---|
| subject | integer | Subject ID number | 30 |
| activity | character | Activities of daily living | 6 |
| signal | character | Jerk, magnitude or jerk magnitude | 4 |
| acceleration | character | Gravity or body | 2 |
| instrument | character | Gyroscope or accelerometer | 2 |
| domain | character | Time or frequency | 2 |
| parameter | character | Mean or standard deviation | 3 |
| axis | character | x, y, z | 4 |
| value | numeric | i in [-1, 1] | 783226 |

The tidy data contains only the mean and standard deviation values for each feature, as required by the project criteria. All features are normalized and bounded within [-1,1]. In the raw files, each feature vector is a row of the text file; in the tidy file, each row holds a single value together with its descriptors. The units of the underlying acceleration signals (total and body) are 'g's (standard gravity, 9.80665 m/s^2), and those of the gyroscope signals are rad/s.
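To illustrate how the descriptor columns relate to the raw feature names, here is a hedged sketch in base R. The helper `describe_feature` and its matching rules are assumptions made for this example, not the author's code; in particular, encoding a missing signal or axis as NA is a guess about how the tidy file represents features without those components.

```r
# Hypothetical helper: derive the descriptor columns of the tidy table
# from raw feature names such as "tBodyAcc-mean()-X".
# The matching rules are assumptions, not the repository's implementation.
describe_feature <- function(feature) {
  data.frame(
    feature      = feature,
    # Raw names start with "t" (time domain) or "f" (frequency domain).
    domain       = ifelse(startsWith(feature, "t"), "time", "frequency"),
    acceleration = ifelse(grepl("Gravity", feature), "gravity", "body"),
    instrument   = ifelse(grepl("Gyro", feature), "gyroscope", "accelerometer"),
    signal       = ifelse(grepl("JerkMag", feature), "jerk magnitude",
                   ifelse(grepl("Jerk", feature), "jerk",
                   ifelse(grepl("Mag", feature), "magnitude", NA_character_))),
    parameter    = ifelse(grepl("std()", feature, fixed = TRUE),
                          "standard deviation", "mean"),
    # Axis is the trailing -X, -Y or -Z, when present.
    axis         = ifelse(grepl("-[XYZ]$", feature),
                          tolower(sub(".*-([XYZ])$", "\\1", feature)),
                          NA_character_),
    stringsAsFactors = FALSE
  )
}

described <- describe_feature(c("tBodyAcc-mean()-X", "fBodyGyroJerkMag-std()"))
print(described)
```

Each raw measurement then becomes one row of the tidy table: the subject, the activity, these descriptors, and the value.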

Data processing: from raw to tidy

  1. The raw data was manually downloaded from this link, and placed in the data folder (see the Get data commit).
  2. Two separate data sets were constructed using the original data files: test and train. This task was performed using the data-gette.R script.
  3. The train and test sets were merged to create one data set, using the bind_rows function. This task was performed using the data-cleane.R script (line 9).
  4. The columns were renamed using the feature list provided in features.txt. This task was performed using the data-cleane.R script (lines 13-20).
  5. Only the mean and standard deviation of each measurement were extracted. This task was performed using the data-cleane.R script (lines 24-29).
  6. The activities in the data set were given descriptive names, using the activity descriptors provided in activity_labels.txt. This task was performed using the data-cleane.R script (lines 32-37).
  7. The data set was labeled with descriptive variable names, following the R Style convention on variable identifiers. This task was performed using the data-cleane.R script (lines 40-54).
  8. Created two descriptive tables that summarised the average values grouped by activity and by subject. This task was performed using the summarise.R script.
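The merge, extraction and summary steps above can be sketched on made-up data as follows. This is not the code in the repository scripts: base `rbind()` and `aggregate()` are used here as stand-ins so the example runs without extra packages, whereas the steps above name `bind_rows` for the merge. All values and the two small input tables are invented for illustration.

```r
# Made-up stand-ins for the train and test tables (same columns in both).
train <- data.frame(subject = c(1, 1), activity = c("WALKING", "SITTING"),
                    "tBodyAcc-mean()-X" = c(0.2, 0.4),
                    "tBodyAcc-std()-X"  = c(-0.9, -0.7),
                    "tBodyAcc-max()-X"  = c(-0.5, -0.3),
                    check.names = FALSE, stringsAsFactors = FALSE)
test <- data.frame(subject = c(2, 2), activity = c("WALKING", "WALKING"),
                   "tBodyAcc-mean()-X" = c(0.6, 0.8),
                   "tBodyAcc-std()-X"  = c(-0.5, -0.3),
                   "tBodyAcc-max()-X"  = c(-0.1, 0.1),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Step 3: stack the two sets (base rbind() as a stand-in for bind_rows).
merged <- rbind(train, test)

# Step 5: keep the identifiers plus the mean() and std() columns.
is_id       <- names(merged) %in% c("subject", "activity")
is_mean_std <- grepl("mean\\(\\)|std\\(\\)", names(merged))
selected    <- merged[, is_id | is_mean_std]

# Step 8: average every remaining feature by activity and by subject.
feature_cols <- selected[, !(names(selected) %in% c("subject", "activity"))]
by_activity <- aggregate(feature_cols,
                         by = list(activity = selected$activity), FUN = mean)
by_subject  <- aggregate(feature_cols,
                         by = list(subject = selected$subject), FUN = mean)
print(by_activity)
print(by_subject)
```

The pattern `mean\\(\\)|std\\(\\)` matches only names containing the literal `mean()` or `std()`; whether other mean-like features should also be kept is a choice this sketch does not settle.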

To run the whole analysis at once, including the generation of this code book, source the run_analysis.R script.

References

  • Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012
  • Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10