title | author | date | output |
---|---|---|---|
Codebook | Juan C. López-Tavera | 2/26/2017 | |
This is the code book for the UCI HAR Dataset. It is intended as a descriptive guide for future readers on how the data was obtained and processed, step by step, up to the final tidy data set. It also describes the structure of the data set, its variables, and their units.
The following information was taken entirely from the UCI HAR dataset page:
The Human Activity Recognition Using Smartphones Data Set is a database built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.
Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained data set has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
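The window arithmetic in the paragraph above can be checked with a short sketch. The three constants come from the dataset description; the helper `window_starts` and the 10-second example recording are hypothetical and only illustrate how 2.56 s windows with 50% overlap map onto sample indices.

```r
# Constants taken from the dataset description.
sampling_rate_hz <- 50
window_seconds   <- 2.56
overlap_fraction <- 0.5

# Samples per window: 2.56 s * 50 Hz = 128 readings.
window_size <- as.integer(round(window_seconds * sampling_rate_hz))

# With 50% overlap, consecutive windows start half a window apart.
step_size <- as.integer(window_size * (1 - overlap_fraction))

# Hypothetical helper: 1-based start indices of every full window
# that fits in a signal of n_samples readings.
window_starts <- function(n_samples, size = window_size, step = step_size) {
  if (n_samples < size) return(integer(0))
  seq.int(from = 1L, to = n_samples - size + 1L, by = step)
}

# Example (assumed): a 10-second recording holds 500 samples.
starts <- window_starts(500L)
print(starts)
```

Each start index marks a window covering `window_size` consecutive readings, so neighbouring windows share half of their samples.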
For each record in the data set it is provided:
- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial Angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.
Features are normalized and bounded within [-1, 1]. Each feature vector is a row in the text file. The units used for the accelerations (total and body) are g (standard gravity, 9.80665 m/s²). The gyroscope units are rad/s.
The original raw data files were downloaded from this link, provided on the "Peer-graded Assignment: Getting and Cleaning Data Course Project" Coursera page, although they can (and should) be downloaded from the original source.
The raw data that was processed is structured as follows:
- `features.txt`: List of all features, used as column headers to name each variable (the names were later modified).
- `activity_labels.txt`: Links the class labels with their activity names. This file was used to label each activity using the `factor()` R function.
- `train/X_train.txt`: Training set; contains all the variables or features.
- `train/y_train.txt`: Training labels. Each row identifies the activity performed; values range from 1 to 6.
- `train/subject_train.txt`: Each row identifies the subject who performed the activity for each window sample. Values range from 1 to 30.
- `test/X_test.txt`: Test set; contains all the variables or features.
- `test/y_test.txt`: Test labels. Each row identifies the activity performed; values range from 1 to 6.
- `test/subject_test.txt`: Each row identifies the subject who performed the activity for each window sample. Values range from 1 to 30.
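As a rough illustration of how these files fit together, the sketch below reads one split with base R and attaches feature names and activity labels. The helper `read_har_split` and the tiny demo directory are hypothetical stand-ins written for this example, not the repository's actual script; only the file names and the use of `factor()` come from the description above.

```r
# Hypothetical helper: read one split ("train" or "test") from a directory
# laid out like the raw data described above.
read_har_split <- function(data_dir, split) {
  features <- read.table(file.path(data_dir, "features.txt"),
                         col.names = c("index", "name"),
                         stringsAsFactors = FALSE)
  labels <- read.table(file.path(data_dir, "activity_labels.txt"),
                       col.names = c("code", "activity"),
                       stringsAsFactors = FALSE)

  x <- read.table(file.path(data_dir, split, paste0("X_", split, ".txt")))
  y <- read.table(file.path(data_dir, split, paste0("y_", split, ".txt")),
                  col.names = "code")
  subject <- read.table(file.path(data_dir, split,
                                  paste0("subject_", split, ".txt")),
                        col.names = "subject")

  # Use the feature list as column headers, keeping the raw names as-is.
  names(x) <- features$name

  # factor() maps the numeric class codes to activity names.
  activity <- factor(y$code, levels = labels$code, labels = labels$activity)

  cbind(subject, activity = activity, x)
}

# Tiny made-up directory with the same layout, so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "har_demo")
dir.create(file.path(demo_dir, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X"),
           file.path(demo_dir, "features.txt"))
writeLines(c("1 WALKING", "4 SITTING"),
           file.path(demo_dir, "activity_labels.txt"))
writeLines(c("0.28 -0.99", "0.27 -0.98"),
           file.path(demo_dir, "train", "X_train.txt"))
writeLines(c("1", "4"), file.path(demo_dir, "train", "y_train.txt"))
writeLines(c("1", "1"), file.path(demo_dir, "train", "subject_train.txt"))

train <- read_har_split(demo_dir, "train")
print(train)
```

Calling the same helper with `"test"` on the real directory would read the test split in the same way, which keeps the two data sets column-compatible for merging.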
The tidy data in this repository follows Hadley Wickham's definition of tidy data:
"Each variable is a column, each observation is a row, and each type of observational unit is a table."
The data set produced is an 813621-row by 9-column data table stored in a comma-separated values file:
jclopeztavera/human-activity/data/tidy/tidy_data.csv
The variables in the data set are: subject, activity, signal, acceleration, instrument, domain, parameter, axis, value. In more detail,
Variable | Type | Value | Number of distinct values |
---|---|---|---|
subject | integer | Subject ID number | 30 |
activity | character | Activities of daily living | 6 |
signal | character | Jerk, magnitude or jerk magnitude | 4 |
acceleration | character | Gravity or body | 2 |
instrument | character | Gyroscope or accelerometer | 2 |
domain | character | Time or frequency | 2 |
parameter | character | Mean or standard deviation | 3 |
axis | character | x, y, z | 4 |
value | numeric | i in [-1, 1] | 783226 |
The tidy data contains only the mean and standard deviation values for each feature, as required by the project criteria. As in the raw data, all values are normalized and bounded within [-1, 1]. The underlying accelerations (total and body) were measured in units of g (standard gravity, 9.80665 m/s²), and the gyroscope signals in rad/s.
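To make the relationship between a raw feature name and the tidy variables more concrete, here is a minimal sketch that splits names such as `tBodyAcc-mean()-X` into the columns listed in the table. The helper `parse_feature` and the exact value labels it emits (for example `"time"` or `"jerk magnitude"`) are assumptions made for illustration; the coding used in `tidy_data.csv` may differ.

```r
# Hypothetical helper: derive the tidy variables from raw feature names.
# The value labels are assumed, not read from the tidy data file.
parse_feature <- function(name) {
  data.frame(
    feature      = name,
    # A leading "t" marks time-domain signals; otherwise frequency domain.
    domain       = ifelse(startsWith(name, "t"), "time", "frequency"),
    acceleration = ifelse(grepl("Gravity", name), "gravity", "body"),
    instrument   = ifelse(grepl("Gyro", name), "gyroscope", "accelerometer"),
    signal       = ifelse(grepl("JerkMag", name), "jerk magnitude",
                   ifelse(grepl("Jerk", name), "jerk",
                   ifelse(grepl("Mag", name), "magnitude", NA_character_))),
    parameter    = ifelse(grepl("std()", name, fixed = TRUE), "std",
                   ifelse(grepl("mean()", name, fixed = TRUE), "mean",
                          NA_character_)),
    # Magnitude features carry no axis suffix, so their axis is NA.
    axis         = ifelse(grepl("-[XYZ]$", name),
                          tolower(sub(".*-([XYZ])$", "\\1", name)),
                          NA_character_),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_feature(c("tBodyAcc-mean()-X",
                          "fBodyBodyGyroJerkMag-std()",
                          "tGravityAcc-std()-Z"))
print(parsed)
```

Under this sketch, features without an axis or a derived signal get `NA` in the corresponding column, which is one possible reason a variable could show one more distinct value than its description lists.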
- The raw data was manually downloaded from this link and placed in the data folder (see the Get data commit).
- Two separate data sets were constructed using the original data files: test and train. This task was performed using the `data-gette.R` script.
- The train and test sets were merged to create one data set using `bind_rows`. This task was performed using the `data-cleane.R` script (line 9).
- The columns were renamed using the feature list provided in `features.txt`. This task was performed using the `data-cleane.R` script (lines 13-20).
- Only the measurements on the mean and standard deviation for each measurement were extracted. This task was performed using the `data-cleane.R` script (lines 24-29).
- Descriptive activity names were used to name the activities in the data set, using the activity descriptors provided in `activity_labels.txt`. This task was performed using the `data-cleane.R` script (lines 32-37).
- The data set was labeled with descriptive variable names, following the R Style convention on variable identifiers. This task was performed using the `data-cleane.R` script (lines 40-54).
- Two descriptive tables were created that summarise the average values grouped by activity and by subject. This task was performed using the `summarise.R` script.
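The merge, extract, and summarise steps above can be sketched on toy data as follows. This is not the code from the repository's scripts: the values are made up, and base R's `rbind()` and `aggregate()` are used as stand-ins for `bind_rows` and whatever summarising functions the scripts actually call, so that the example runs without extra packages.

```r
# Toy stand-ins for the train and test sets (values are made up).
train <- data.frame(subject = c(1L, 1L, 2L),
                    activity = c("WALKING", "WALKING", "SITTING"),
                    `tBodyAcc-mean()-X` = c(0.2, 0.4, 0.1),
                    `tBodyAcc-std()-X` = c(-0.9, -0.7, -0.8),
                    `tBodyAcc-max()-X` = c(0.5, 0.6, 0.3),
                    check.names = FALSE)
test <- data.frame(subject = 3L,
                   activity = "WALKING",
                   `tBodyAcc-mean()-X` = 0.3,
                   `tBodyAcc-std()-X` = -0.6,
                   `tBodyAcc-max()-X` = 0.7,
                   check.names = FALSE)

# Merge the two sets; rbind() stands in for dplyr::bind_rows() here.
merged <- rbind(train, test)

# Keep the identifiers plus the mean() and std() measurements only.
is_id      <- names(merged) %in% c("subject", "activity")
is_measure <- grepl("mean\\(\\)|std\\(\\)", names(merged))
selected   <- merged[, is_id | is_measure]

# Two summary tables: average of each measurement by activity and by subject.
measure_cols <- names(selected)[!names(selected) %in% c("subject", "activity")]
by_activity <- aggregate(selected[measure_cols],
                         by = list(activity = selected$activity), FUN = mean)
by_subject  <- aggregate(selected[measure_cols],
                         by = list(subject = selected$subject), FUN = mean)

print(by_activity)
print(by_subject)
```

The regular expression matches the literal `mean()` and `std()` suffixes, so columns such as the made-up `tBodyAcc-max()-X` are dropped before averaging.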
To run the whole analysis at once, including the generation of this code book, source the `run_analysis.R` script.
- Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.
- Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10