*This section documents all the scripts in the chapter_4_modeling folder.*

## Machine learning definitions 
| Term | Definition | 
| ------- | ------- |
| [features](https://en.wikipedia.org/wiki/Feature_(machine_learning)) | descriptive numerical representations to describe an object. | 
| [machine learning](https://en.wikipedia.org/wiki/Machine_learning) |  the process of teaching a machine something that is useful. | 
| [classification model](https://en.wikipedia.org/wiki/Statistical_classification) | If the goal is to separate out into classes (e.g. male or female), then this is known as a classification problem. | 
| [regression model](https://en.wikipedia.org/wiki/Regression_analysis) | if the end goal is to measure some correlation with a variable and the output is more a numerical range (e.g. often between 0 and 1), then this is more of a regression problem. | 
| [deep learning models](https://en.wikipedia.org/wiki/Deep_learning) | models that are trained using a neural network. |
| [unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) | if machines do not require labels (e.g. just need features), this is known as a unsupervised learning problem. | 
| [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) | if machines require labels (e.g. male or female as separate feature arrays), this is known as a supervised learning problem. | 
| [training set](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets#training_set) | Machines are fed training data in the form of feature arrays and compress patterns in these feature arrays into models through algorithms.| 
| [testing set](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets) | data that is left out during training so that the accuracy can be calculated using cross-validation techniques. | 
| [validation set](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets#Validation_set) | data that is left out during training to tune hyperparameters (often used in deep learning modeling. | 
| [label](https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label/40899529) | a tag of an featurized audio sample (e.g. male or female) to aid in supervised learning.| 
| [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) | how the performance of ML models are assessed (in terms of accuracy).| 

## Obtaining training data 
make_playlist.py (from CLI)
```
cd ~
cd voicebook/chapter_4_modeling/youtube_scrape
python3 make_playlist.py 
what is the name of this playlist?
what is the playlist id or URL?
… [‘n’ to stop making playlist]
```
download_playlist.py (from CLI)
```
python3 download_playlist.py 
what is the name of the playlist to download?
… downloads playlist to /playlist folder 
```
## Labeling training data 
label_samples.py (from CLI)
### labeling YouTube data in spreadsheets 
label_samples.py (from CLI)
```
python3 label_samples.py
what is the master label (e.g. stressed)? 
stressed
sample number: 0
what is the URL of the video? 
https://www.youtube.com/watch?v=47HLiAxHgdo
how long is the audio sample in seconds? (e.g. 20) 
20
what are the stop and start times of the video (e.g. 0:13-0:33)
0:05-0:25            
is this person stressed? 1 for yes, 0 for no
1
is this person a child (c, <13) or adolescent (d, 13-18) or adult  (a, >18 <70) or elderly (e, >70)? 
a
is this person male (m) or female (f)? 
m
does this person have an American (a) or foreign (f) accent? 
a
what is the audio quality? (1 - poor, 2 - moderate, 3 - good quality, 4 - high quality)3
is the environment indoors (i) or outdoors (o)?i
sample number: 1
what is the URL of the video?
...After entering [‘’] here, it ends the script and outputs excel sheet below.
```
### downloading YouTube data from spreadsheets
y_scrape.py (from CLI)
```
Run script in terminal...
python3 y_scrape.py
Get file name to parse 
what is the file name? 
Stressed_1.xlsx
```
All the files are then downloaded (Pafy module) and converted to .wav format with FFmpeg ...

## Classification models 
### building optimized classification models
train_audioclassify.py (from CLI)
```
cd ~
cd voicebook/chapter_4_modeling
python3 train_audioclassify.py 
# insert number of classes and class names 
how many classes are you training?2
what is the folder name for class 1?schizophrenia
what is the folder name for class 2?controls
# now all the classes will featurize 
SCHIZOPHRENIA - featurizing snipped38_start_2_end_22.wav
making 0.wav
[-4.51487917e+02  1.32250653e+02 -6.48964827e+02 -2.16927909e+02...
  9.57062705e-04  4.54699943e-02 -5.85259705e-02  5.74577384e-02]
...
Decision tree accuracy (+/-) 0.20779263167344933
0.5733333333333334
Gaussian NB accuracy (+/-) 0.1305543735171076
0.7866666666666667
SKlearn classifier accuracy (+/-) 0.039999999999999994
0.48
Adaboost classifier accuracy (+/-) 0.22666666666666668
0.6366666666666667
Gradient boosting accuracy (+/-) 0.1319090595827292
0.6599999999999999
Logistic regression accuracy (+/-) 0.07557189365836424
0.7366666666666667
Hard voting accuracy (+/-) 0.2341889076033373
0.6766666666666666
K Nearest Neighbors accuracy (+/-) 0.12666666666666668
0.5633333333333332
Random forest accuracy (+/-) 0.2758824226207808
0.7333333333333333
svm accuracy (+/-) 0.13556466271775172
0.7533333333333333

most accurate classifier is Gaussian NB with audio features (mfcc coefficients).
saving classifier to disk. 

Summarizing session…
GaussianNB(priors=None)

['gaussian-nb', 0.7866666666666667, 0.1305543735171076]
```
### loading classification models 
load_audioclassify.py (from CLI)
```
python3 load_audioclassify.py
```
This results in an output:
```
{"filename": "348.wav", "filetype": "audio file", "class": ["controls"], "model": ["schizophrenia_controls_sc_audio.pickle"], "model accuracies": [0.7866666666666667], "model deviations": [0.1305543735171076], "model types": ["gaussian-nb"], "features": [[-322.9664360980726, 59.53868288968913, -462.5294083924505, -166.3993076206564, 131.38738649438437, 52.44671783868567, -33.74398658437562, 227.8102207133376, 9.52738149362727, 28.505927165579884, -90.65927286414657, 71.52976680142815, 9.73530102063688, 25.62432182324615, -66.02663398503707, 73.87513246074612, -1.596002360610912, 22.81632350096357, -87.30807566263049, 41.72876898633217, 0.8865486997595385, 17.735652130525168, -65.99456073539176, 52.43567091641821, -14.286216477070838, 14.128449781073533, -59.836804831757654, 18.175026917411316, -9.131276510645463, 13.701302570519355, -57.44541029310883, 25.74622598177111, -4.545971824836885, 10.899138142787697, -42.116927063121395, 29.536967420470695, -3.4558647963609186, 10.31513522815575, -36.17230935229129, 26.551369428146693, -3.6667095757279236, 10.079488079876286, -33.78123311320836, 26.14112294381864, 5.366060779304841, 8.570956061981061, -19.248854886451802, 38.20513572569962, -5.458667628428172, 7.490745204714798, -31.338790159786562, 12.539046082339311, 0.024288590342538358, 10.584946850085212, -34.52340818393254, 38.15078289969128, -0.156898762979172, 11.158828455811786, -34.10403400345244, 30.973152153233336, 0.020648845552068328, 5.827064754672902, -22.052042500906857, 16.81872640844321, -0.06170338085832314, 5.229174923928, -14.518978383592026, 14.845857302315114, -0.04962607796690964, 4.5211806494022735, -14.998074177634704, 12.378100326632655, 0.07415513595268168, 3.724070455888158, -9.939566189661432, 10.85577098792062, -0.017072005372266726, 2.7463908847692204, -6.600475000502117, 6.524786791283427, -0.02310274039018664, 2.7092557498939636, -7.467322311111723, 7.481090337383571, 0.04464197716713606, 2.198722832501255, -6.88438775831641, 7.844106037059699, 0.045382707259550105, 2.0580935158253872, -6.638462605186588, 5.991186816663746, -0.013702557713332408, 1.9496130791163644, -6.458246324901151, 5.7716202748695, -0.007340250450717803, 1.6409103586116958, -5.380714141939734, 5.539025057788075, 0.011411587050311969, 1.3949062816882583, -4.390308824019425, 4.13132941219398, -291.2947346432915, 49.04737058565422, -381.6816501283554, -222.9638855557117, 158.3460978309033, 23.15415034729552, 99.62697329203677, 189.70121020164896, 7.287058326977949, 30.77474443760493, -38.71222832828984, 56.208286170618955, -1.0950341842073796, 21.3498811006992, -34.41685805740065, 31.926254848624147, -9.172025861857653, 11.511454213511039, -29.874153138705573, 8.203596981294625, 2.6663941698626865, 6.753684660513026, -5.9061357505887955, 19.305474480034082, -14.088225581455214, 17.47630600064678, -49.8886801840349, 8.935818425975743, -13.521963272886959, 8.25999525518404, -24.851695100203774, -0.11752456737790722, -12.762992506945213, 8.598616338770906, -29.72115313687536, 0.05275012294025435, -4.531403069755177, 11.8713757531457, -24.376936764599744, 12.207624665298002, -2.6914750628989266, 14.673164819510685, -22.308447521294887, 17.767626038347583, 11.80700932417913, 10.516802160193405, -13.092759032892214, 24.963056992755536, -10.390953114902164, 6.1887066403103965, -20.39253124562046, 2.7941268719402848, -4.41480192601625, 7.0550461587501, -15.045545852884578, 5.8468320221431656, 0.22555437964894862, 4.881477566532211, -4.990490269946867, 8.519079155558249, 3.4745028409138827, 2.7045163211953187, 0.5391937155699558, 8.988399905874912, 0.45051536549204274, 4.824805683998831, -4.424922867740668, 9.67554223394205, -0.8502687288362012, 3.1351941328536777, -4.844124443962841, 4.754766492721427, 0.870140923131266, 1.1137966493666094, -1.4131441258277446, 2.418345086057676, 2.4254793474500635, 1.2058772715931956, 0.5825294849801214, 4.536777131050609, 0.10251353649615984, 1.51146113365032, -1.4592806547585204, 3.291502702928505, -1.075428938064348, 1.0559521971759946, -2.4408814841825865, 1.12308565480587, -0.3002420005778045, 2.4751693616737347, -3.6333810904861688, 3.34737386167248, 0.17805269515377548, 3.7250267108754236, -5.189309157660288, 5.579262003298437, 0.24091712079378458, 2.451817967640338, -5.215650064568107, 2.3865116769275567, 0.003640041240486553, 1.4235044885102617, -2.379919268715038, 1.5581599658532437]], "count": 0, "errorcount": 0}
```

## Regression models
### training regression models 
train_audioregression.py (from CLI)
```
cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioregression.py
what is the name of the file in /data directory you would like to analyze? 
africanamerican_controls.json
RESULTS: 
+-------------------------------------------+-----------+----------------------+
|                model type                 | R^2 score | Mean Absolute Errors |
+-------------------------------------------+-----------+----------------------+
|             linear regression             |  -1.672   |        0.656         |
+-------------------------------------------+-----------+----------------------+
|             ridge regression              |   0.047   |        0.367         |
+-------------------------------------------+-----------+----------------------+
|                   LASSO                   |   0.426   |        0.273         |
+-------------------------------------------+-----------+----------------------+
|                elastic net                |   0.483   |        0.255         |
+-------------------------------------------+-----------+----------------------+
|       Least angle regression (LARS)       |   0.065   |        0.478         |
+-------------------------------------------+-----------+----------------------+
|                LARS lasso                 |  -0.025   |        0.502         |
+-------------------------------------------+-----------+----------------------+
|     orthogonal matching pursuit (OMP)     |  -0.032   |         0.39         |
+-------------------------------------------+-----------+----------------------+
|            logistic regression            |  -0.019   |        0.253         |
+-------------------------------------------+-----------+----------------------+
|     stochastic gradient descent (SGD)     |  -0.153   |         0.41         |
+-------------------------------------------+-----------+----------------------+
|                perceptron                 |  -7.297   |        1.131         |
+-------------------------------------------+-----------+----------------------+
|        passive-agressive algorithm        |   0.316   |        0.329         |
+-------------------------------------------+-----------+----------------------+
|                  RANSAC                   |   0.316   |        0.329         |
+-------------------------------------------+-----------+----------------------+
|                 Theil-Sen                 |  -1.672   |        0.674         |
+-------------------------------------------+-----------+----------------------+
|             huber regression              |  -0.582   |         0.49         |
+-------------------------------------------+-----------+----------------------+
|      polynomial (linear regression)       |  -0.582   |         0.49         |
+-------------------------------------------+-----------+----------------------+
logistic regression has the lowest mean absolute error (0.25252525252525254)
saving file to disk (africanamerican_controls_regression.pickle)...
```

### loading regression models 
load_audioregression.py
```
python3 load_audioregression.py 
1.0
controls
```
## Deep learning models 
keras_mlp.py
```python3 
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
```
### from CLI 
train_audiokeras.py
```
cd ~
cd voicebook/chapter_4_modeling 
python3 train_audiokeras.py
folder name 1 
africanamerican     
folder name 2 
controls
...
[[1.]]
Epoch 1/20
149/149 [==============================] - 0s 2ms/step - loss: 3.8728 - acc: 0.3423
Epoch 2/20
149/149 [==============================] - 0s 29us/step - loss: 0.3178 - acc: 0.3624
Epoch 3/20
149/149 [==============================] - 0s 26us/step - loss: -0.0068 - acc: 0.4228

...
 final acc: 50.34% 
...
 Saved africanamerican_controls_dl_audio.json model to disk
summarizing data...
testing loaded model

'Loaded model from disk'

[[1.]]
```

## AutoML approaches 
### using TPOT for classification 
train_audioTPOT.py
```
cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioTPOT.py
classification (c) or regression (r) problem?
c
what is the name of class 1? 
africanamerican
what is the name of class 2? 
controls
Generation 1 - Current best internal CV score: 0.9056433904259992               
Generation 2 - Current best internal CV score: 0.9100878348704435               
Generation 3 - Current best internal CV score: 0.9100878348704435               
Generation 4 - Current best internal CV score: 0.9100878348704435               
Generation 5 - Current best internal CV score: 0.9191787439613526               
                                                                                
Best pipeline: LogisticRegression(LogisticRegression(MinMaxScaler(StandardScaler(input_matrix)), C=1.0, dual=False, penalty=l1), C=5.0, dual=True, penalty=l2)
saving classifier to disk
```
Loading TPOT classification models: load_audioTPOT.py
```
Jims-MBP:~ jimschwoebel$ cd voicebook/chapter_4_modeling
Jims-MBP:chapter_4_modeling jimschwoebel$ python3 load_audiotpot.py
making 0.wav
making 1.wav
making 2.wav
...
making 36.wav
making 37.wav
making 38.wav
controls
```
### Using TPOT for regression
train_audioTPOT.py (from CLI)
```
cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioTPOT.py

classification (c) or regression (r) problem?
r
what is the name of class 1? 
africanamerican
what is the name of class 2? 
Controls
Generation 1 - Current best internal CV score: -0.06707070707070706             
Generation 2 - Current best internal CV score: -0.06707070707070706             
Generation 3 - Current best internal CV score: -0.06707070707070706             
Generation 4 - Current best internal CV score: -0.062207740346188735            
Generation 5 - Current best internal CV score: -0.062207740346188735            
                                                                                
Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=4, p=1, weights=distance)
saving classifier to disk
```
Loading TPOT regression models: load_audioTPOT.py
```
Jims-MBP:~ jimschwoebel$ cd voicebook/chapter_4_modeling
Jims-MBP:chapter_4_modeling jimschwoebel$ python3 load_audiotpot.py
making 0.wav
making 1.wav
making 2.wav
...
making 36.wav
making 37.wav
making 38.wav
controls
controls
```

## Resources
If you are interested to read more on any of these topics, check out the documentation below.

**Datasets**
* [Common Voice Dataset](https://www.kaggle.com/mozillaorg/common-voice)
* [Google Audioset](https://research.google.com/audioset/)
* [NeuroLex Disease Dataset](https://github.com/neurolexdiagnostics/train-diseases)

**Data labeling**
* [Pandas](http://pandas.pydata.org)
* [Xlsxwriter](https://xlsxwriter.readthedocs.io/)
* [Pytube](https://github.com/nficano/pytube)

**Featurization**
* [SpeechRecognition](https://pypi.org/project/SpeechRecognition/)
* [Librosa](https://librosa.github.io)
* [PyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis)
* [Spacy](https://spacy.io)
* [NLTK](http://www.nltk.org)
* [Gensim](https://radimrehurek.com/gensim/)

**Classification models** 
* [Numpy](http://www.numpy.org)
* [Scikit-learn](https://youtu.be/2kT6QOVSgSg)

**Regression models**
* [Statsmodels](https://www.statsmodels.org/stable/index.html)
* [Scikit-learn](https://youtu.be/2kT6QOVSgSg) 

**Deep learning** 
* [Keras](https://keras.io)
* [Tensorflow](https://www.youtube.com/watch?time_continue=1202&v=t1A3NTttvBA)
* [Deep learning book](http://neuralnetworksanddeeplearning.com/index.html)
* [Udacity class](https://www.udacity.com/course/deep-learning--ud730)

**AutoML** 
* [Autokeras](https://autokeras.com/)
* [TPOT](https://github.com/EpistasisLab/tpot)
* [Devol](https://github.com/joeddav/devol)
* [Clarifai](https://clarifai.com/)
* [H20.ai](https://www.h2o.ai/)
* [DataRobot](https://www.datarobot.com/)
* [Google Cloud ML engine](https://cloud.google.com/ml-engine/)
* [Microsoft Azure ML](https://azure.microsoft.com/en-us/services/machine-learning-studio/)