Data

We transcribed and analyzed speech from individuals with psychosis and healthy controls. Each participant described three images for one minute each.

Aims of the project

This project aims to extract language features from speech samples. The extracted features are used to predict (1) continuous Thought and Language Index (TLI) impoverishment and disorganization scores and (2) categorical participant group (healthy controls vs. patients with first-episode psychosis).

Extract language features

Sentence-level coherence

'senN_4', 'senN_3', 'senN_2', 'senN_1'

Disfluency

'N_fillers', 'N_immediate_repetation', 'false_starts', 'self_corrections'
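As a rough illustration of how two of these counts could be derived from a transcript, here is a minimal sketch. The filler inventory, the tokenization rule, and the function name are assumptions made for this example; the README does not specify how the project defines them.

```python
import re

# Hypothetical filler inventory; the project's actual list is not given here.
FILLERS = {"um", "uh", "er", "hmm"}


def count_disfluencies(transcript: str) -> dict:
    """Count filler words and immediate word repetitions in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n_fillers = sum(tok in FILLERS for tok in tokens)
    # An immediate repetition is a token identical to the one right before it.
    n_repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return {"fillers": n_fillers, "immediate_repetitions": n_repeats}


result = count_disfluencies("Um, the the man is, uh, walking walking a dog")
```

False starts and self-corrections usually need annotation or more elaborate rules, so they are left out of this sketch.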

Word-level association

'n_1', 'n_2', 'n_3', 'n_4', 'n_5' (#similarity between each word and its preceding N words)
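The sketch below shows one possible reading of these features: for lag N, average the cosine similarity between each word vector and the vector N positions earlier. This lag-based reading, the toy 2-dimensional vectors, and the function names are assumptions for illustration; in the project the vectors would come from a word embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity_at_lag(vectors, n):
    """Mean cosine similarity between each word and the word n positions back."""
    sims = [cosine(vectors[i], vectors[i - n]) for i in range(n, len(vectors))]
    return sum(sims) / len(sims) if sims else float("nan")


# Toy embeddings for a four-word utterance (made-up values).
word_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
n_1 = mean_similarity_at_lag(word_vectors, 1)
n_2 = mean_similarity_at_lag(word_vectors, 2)
```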

Lexical-level

'type_token_ratio','average_word_frequency'
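For reference, the type-token ratio is the number of distinct words divided by the total number of words. A minimal sketch, assuming whitespace tokenization (the project's tokenizer is not described here):

```python
def type_token_ratio(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


tokens = "the dog chased the cat".split()
ttr = type_token_ratio(tokens)  # 4 unique words out of 5
```

Average word frequency additionally needs a frequency norm table, which is not specified in this README, so it is not sketched here.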

Discourse topic-level

'entropyApproximate' (#the diversity of the topic distribution)
's0_mean' (#similarity between every sentence and the picture label)
'consec_mean' (#similarity between the current sentence and its previous sentence)

For more details on BERTopic, see my post.
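To make the topic-diversity measure concrete, the sketch below computes plain Shannon entropy over a topic probability distribution. Treating 'entropyApproximate' as Shannon entropy of a per-sample topic distribution is an assumption; the function name and the example distributions are hypothetical.

```python
import math


def topic_entropy(probs):
    """Shannon entropy (in nats) of a topic distribution.

    The weights are normalized first, so unnormalized scores also work.
    """
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


spread_out = topic_entropy([0.25, 0.25, 0.25, 0.25])  # speech spread over 4 topics
focused = topic_entropy([1.0, 0.0, 0.0, 0.0])         # speech on a single topic
```

Higher values mean the description is spread across more topics; a value of zero means a single topic dominates completely.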

Syntactic complexity

'clause_density', 'dependency_distance', 'content_function_ratio'
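As an illustration of one of these measures, mean dependency distance can be computed from the head index of each token. The sketch assumes the parse is already available as a list of head indices, with the root pointing to itself (the convention spaCy uses); the function name and the hand-written parse are hypothetical.

```python
def mean_dependency_distance(heads):
    """Average absolute distance between each token and its syntactic head.

    heads[i] is the 0-based index of token i's head; the root points to itself
    and is excluded from the average.
    """
    distances = [abs(i - h) for i, h in enumerate(heads) if i != h]
    return sum(distances) / len(distances) if distances else 0.0


# "The dog chased the cat": The->dog, dog->chased, chased=root, the->cat, cat->chased
heads = [1, 2, 2, 4, 2]
mdd = mean_dependency_distance(heads)
```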

Other relevant variables

'n_segment', 'length_utter','num_all_words', 'num_content_words', 'num_repetition'

Exploratory data analysis (EDA)

Check for missing values -> ignore all cognitive function measures

Feature selection

Visualize the distribution of all variables and identify outliers

Remove data points with average_word_frequency < 4.5, N_fillers > 20, content_function_ratio > 2.0
Remove variables with skewed distributions: N_immediate_repetition
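The outlier thresholds above could be applied with a boolean mask, as in this sketch. The table and its values are made up for illustration; only the column names and cut-offs come from the text.

```python
import pandas as pd

# Toy rows standing in for the per-sample feature table (values are made up).
df = pd.DataFrame(
    {
        "average_word_frequency": [5.2, 4.1, 5.6, 5.0],
        "N_fillers": [3, 5, 25, 8],
        "content_function_ratio": [1.1, 0.9, 1.0, 2.4],
    }
)

# Keep only rows that pass all three thresholds.
keep = (
    (df["average_word_frequency"] >= 4.5)
    & (df["N_fillers"] <= 20)
    & (df["content_function_ratio"] <= 2.0)
)
cleaned = df[keep].reset_index(drop=True)
```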

Check pairwise correlation

Inspect the pairwise correlation matrix and remove or combine variables that are highly correlated:

n_1, n_2, n_3, n_4 and n_5: calculate the means to represent local word associations
sen_1, sen_2, sen_3, sen_4: calculate the means to represent local semantic coherence
Use the variance inflation factor (VIF) to identify highly collinear variables (VIF > 10): 'num_all_words', 'num_content_words', 'length_utter'
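For readers unfamiliar with VIF: each variable is regressed on all the others, and VIF = 1 / (1 - R²) of that regression. The sketch below implements this directly with NumPy on synthetic data; it is an illustration, not the project's code. statsmodels also provides a `variance_inflation_factor` function that could be used instead.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for every column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)


# Synthetic example: column c is nearly a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
scores = vif(np.column_stack([a, b, c]))
```

In this toy example the two near-duplicate columns get very large VIF values, while the independent column stays close to 1.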

Before feature selection:

After feature selection:

Visualize data patterns

Continuous variables

Visualize how the continuous dependent variables correlate with the language features.

Categorical variables

Visualize how the categorical dependent variable relates to the language features.

Model continuous measures (TLI_IMPOV and TLI_DISORG)

Lasso Regression for Feature Selection

Use cross-validation to identify the best hyperparameters for the lasso regression:

Best alpha for IMPOV: 0.02848035868435799
Best Mean Absolute Error (IMPOV): 0.15227086130242762
Best alpha for DISORG: 0.05462277217684337
Best Mean Absolute Error (DISORG): 0.26148430772065

Use the identified hyperparameters to test the model:

Mean Absolute Error (IMPOV): 0.33674784603513247
Mean Absolute Error (DISORG): 0.5009238250371127
R2 (IMPOV): 0.15227086130242762
R2 (DISORG): 0.26148430772065
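The workflow above (cross-validate alpha by mean absolute error, then evaluate on held-out data) could look roughly like the following scikit-learn sketch. The synthetic data, the alpha grid, the 80/20 split, and the 5-fold setting are all assumptions for illustration, not the settings that produced the numbers reported here.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 10 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize features so the L1 penalty treats them on the same scale.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": np.logspace(-3, 0, 20)},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["lasso__alpha"]
cv_mae = -search.best_score_  # the scorer is negated MAE

y_pred = search.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Features with non-zero coefficients are the ones the lasso kept.
coefs = search.best_estimator_.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)
```

Reporting the cross-validated MAE and the held-out MAE and R² as separate quantities, as done here, makes it easier to spot values that were copied into the wrong row.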

Most Predictive Variables

For TLI_IMPOV:

'type_token_ratio', 'num_repetition', 'entropyApproximate', 'average_word_frequency', 'Age', 'Gender_M', 'self_corrections'

For TLI_DISORG:

's0_mean', 'num_repetition', 'false_starts', 'type_token_ratio', 'clause_density', 'N_fillers', 'consec_mean', 'Age', 'self_corrections', 'Gender_M', 'dependency_distance'

Model categorical data (HC vs. FEP)

Deal with unbalanced data (36 HC vs. 64 FEP)

Use SMOTE to oversample the minority class (HC)
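SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified NumPy illustration of that idea, not the imbalanced-learn implementation (whose `SMOTE` class would normally be used); the function name, the toy features, and k = 5 are assumptions.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))       # pick a random minority sample
        j = neighbors[i, rng.integers(k)]  # pick one of its k nearest neighbors
        gap = rng.random()                 # position along the segment i -> j
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic


# 36 minority-class (HC) samples with 4 made-up features, upsampled to 64.
rng = np.random.default_rng(1)
X_hc = rng.normal(size=(36, 4))
X_new = smote_like(X_hc, n_new=64 - 36)
X_hc_balanced = np.vstack([X_hc, X_new])
```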

Try out different models

Random forest

Model performance: 85% accuracy

Ranking the predictors based on their importance
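A ranking of this kind can be obtained from the impurity-based importances of a fitted random forest. The sketch below uses synthetic data and hypothetical feature names (f0 to f3); the number of trees and the train/test split are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: only f0 and f1 carry signal, and f0 matters most.
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
rf_accuracy = forest.score(X_test, y_test)

# Impurity-based importances sum to 1; sort from most to least important.
rf_ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```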

L1 regularized logistic regression

Model performance: 65% accuracy

Ranking the predictors based on their importance
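For an L1-regularized logistic regression, predictors can be ranked by the absolute size of their coefficients once the features are standardized. A minimal sketch on synthetic data follows; the feature names, the regularization strength C = 0.5, and the split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f0 and f1 carry signal, and f0 matters most.
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The liblinear solver supports the L1 penalty; smaller C means stronger shrinkage.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
model.fit(X_train, y_train)
lr_accuracy = model.score(X_test, y_test)

# Rank features by absolute coefficient; zeroed coefficients were dropped by L1.
coefs = model.named_steps["logisticregression"].coef_[0]
lr_ranking = sorted(
    zip(feature_names, np.abs(coefs)),
    key=lambda pair: pair[1],
    reverse=True,
)
```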

linwangmeyer/scz_LLMs
