linwangmeyer/scz_LLMs

Data

We transcribed and analyzed speech from individuals with psychosis and from healthy controls. Each participant described three images for one minute each.

Aims of the project

This project aims to extract language features from speech samples. The extracted features are used to predict (1) continuous Thought and Language Index (TLI) impoverishment and disorganization scores; (2) categorical participant group (healthy controls vs. patients with first-episode psychosis).

Extract language features

Sentence-level coherence

'senN_4', 'senN_3', 'senN_2', 'senN_1'

Disfluency

 
'N_fillers', 'N_immediate_repetation', 'false_starts', 'self_corrections'
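
As a rough illustration of how two of these counts could be obtained from a transcript, here is a minimal sketch. The filler inventory, the tokenization rule, and the function and key names are assumptions made for the example; the project's own definitions are not given in this README.

```python
import re

# Hypothetical filler inventory (assumption, not the project's list).
FILLERS = {"um", "uh", "er", "erm", "hmm"}

def count_disfluencies(text):
    """Count fillers and immediate word repetitions in a transcript string."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    n_fillers = sum(tok in FILLERS for tok in tokens)
    # An immediate repetition is a token identical to the one right before it.
    n_repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return {"fillers": n_fillers, "immediate_repetitions": n_repeats}

counts = count_disfluencies("Um, I I see a a dog, uh, running")
```

False starts and self-corrections usually need annotation or more elaborate rules, so they are left out of this sketch.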

Word-level association

'n_1', 'n_2', 'n_3', 'n_4', 'n_5' (#similarity between every word and its preceding N words)
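
One way to read these features is as lagged cosine similarities between word embeddings. The sketch below assumes that 'n_k' is the mean cosine similarity between each word and the word k positions before it; that reading, the helper names, and the toy 2-d vectors (standing in for real embeddings) are all assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lagged_similarity(vectors, lag):
    """Mean cosine similarity between each word vector and the one `lag` positions earlier."""
    sims = [cosine(vectors[i], vectors[i - lag]) for i in range(lag, len(vectors))]
    return float(np.mean(sims)) if sims else float("nan")

# Toy 2-d vectors standing in for real word embeddings (assumption).
toy_vectors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
features = {f"n_{k}": lagged_similarity(toy_vectors, k) for k in range(1, 4)}
```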

Lexical-level

'type_token_ratio','average_word_frequency'
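
For reference, the type-token ratio is the number of distinct word forms divided by the total number of words. A minimal sketch, assuming whitespace tokenization and case folding (the project may tokenize or lemmatize differently):

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    tokens = [t.lower() for t in tokens]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# 4 distinct words out of 6 tokens.
ttr = type_token_ratio("the dog chased the other dog".split())
```

The average word frequency would additionally need a frequency lexicon, which this README does not name.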

Discourse topic-level

'entropyApproximate' (#the diversity of the topic distribution)
's0_mean' (#similarity between every sentence and the picture label)
'consec_mean' (#similarity between the current sentence and its previous sentence)
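
The sketch below illustrates two of these ideas on toy inputs: Shannon entropy as a diversity measure for a topic distribution, and the mean cosine similarity between consecutive sentence vectors. It is an assumption that the project's features are computed this way; the function names and toy inputs are hypothetical.

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution; higher means more diverse."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()   # normalize in case the weights do not sum to 1
    p = p[p > 0]      # zero-probability topics contribute nothing
    return float(-(p * np.log2(p)).sum())

def consecutive_mean_similarity(sentence_vectors):
    """Mean cosine similarity between each sentence vector and the previous one."""
    v = np.asarray(sentence_vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.sum(v[1:] * v[:-1], axis=1)))

uniform_entropy = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely topics
peaked_entropy = shannon_entropy([1.0, 0.0, 0.0, 0.0])       # a single topic
consec = consecutive_mean_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```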

For more details on BERTopic, see my post.

Syntactic complexity

'clause_density', 'dependency_distance', 'content_function_ratio'

Other relevant variables

'n_segment', 'length_utter','num_all_words', 'num_content_words', 'num_repetition'

Exploratory data analysis (EDA)

Check for missing values and, as a result, ignore all cognitive function measures

01_EDA_MissingValues
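
A missing-value check of this kind can be sketched with pandas as follows. The toy data frame, the 'cog_score' column, and the 50% threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data; 'cog_score' is a hypothetical cognitive measure with many gaps.
df = pd.DataFrame({
    "type_token_ratio": [0.61, 0.55, 0.70, 0.48],
    "N_fillers": [3, 7, 1, 12],
    "cog_score": [np.nan, np.nan, 21.0, np.nan],
})

# Fraction of missing values per column, largest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

# Drop columns missing more than half of their values (threshold is an assumption).
sparse_cols = missing_fraction[missing_fraction > 0.5].index.tolist()
df_clean = df.drop(columns=sparse_cols)
```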

Feature selection

Visualize the distribution of all variables and identify outliers

  • Remove data points with average_word_frequency < 4.5, N_fillers > 20, content_function_ratio > 2.0
  • Remove variables with skewed distributions: N_immediate_repetition

02_EDA_DistributionOutlier
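
The thresholds listed above can be applied as a row filter. This sketch assumes that a data point is removed when it meets any one of the three conditions; the toy values are made up.

```python
import pandas as pd

# Toy values; only the first row passes all three thresholds.
df = pd.DataFrame({
    "average_word_frequency": [5.2, 4.1, 5.0, 5.6],
    "N_fillers": [4, 6, 25, 9],
    "content_function_ratio": [1.1, 0.9, 1.3, 2.4],
})

# Flag a row as an outlier if it meets any of the three conditions (assumption).
is_outlier = (
    (df["average_word_frequency"] < 4.5)
    | (df["N_fillers"] > 20)
    | (df["content_function_ratio"] > 2.0)
)
df_filtered = df.loc[~is_outlier].reset_index(drop=True)
```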

Check pairwise correlation

Check pairwise correlation matrix to remove or combine variables that are highly correlated

  • n_1, n_2, n_3, n_4 and n_5: calculate the means to represent local word associations
  • sen_1, sen_2, sen_3, sen_4: calculate the means to represent local semantic coherence
  • Use VIF to identify highly correlated variables (VIF > 10): 'num_all_words', 'num_content_words', 'length_utter'
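
The variance inflation factor of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the others. The sketch below computes it with plain NumPy on synthetic data and also shows a row-wise mean over the 'n_1' to 'n_5' columns. The synthetic data, the `vif` helper, and the 'n_mean' column name are assumptions.

```python
import numpy as np
import pandas as pd

def vif(df):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        X = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[col] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out)

# Synthetic data: the content-word count is nearly a linear function of the word count.
rng = np.random.default_rng(0)
n = 200
num_all_words = rng.normal(150, 30, n)
toy = pd.DataFrame({
    "num_all_words": num_all_words,
    "num_content_words": 0.5 * num_all_words + rng.normal(0, 1, n),
    "type_token_ratio": rng.normal(0.6, 0.05, n),
})
vifs = vif(toy)
high_vif = vifs[vifs > 10].index.tolist()

# Collapse the five word-association columns into one mean ('n_mean' is a hypothetical name).
word_cols = ["n_1", "n_2", "n_3", "n_4", "n_5"]
assoc = pd.DataFrame(rng.uniform(0, 1, (3, 5)), columns=word_cols)
assoc["n_mean"] = assoc[word_cols].mean(axis=1)
```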

Before feature selection:

03_EDA_PairwiseRawVars

After feature selection:

04_EDA_PairwiseNewVars

Visualize data patterns

Continuous variables

Visualize how the continuous dependent variables correlate with the language features.

05_EDA_pairplot_continuousVars

Categorical variables

Visualize how the categorical dependent variables correlate with the language features.

06_EDA_byPateintCategory

Model continuous measures (TLI_IMPOV and TLI_DISORG)

Lasso Regression for Feature Selection

  • Use cross-validation to identify the best hyperparameters for the lasso regression:
      Best alpha for IMPOV: 0.02848035868435799
      Best Mean Absolute Error (IMPOV): 0.15227086130242762
      Best alpha for DISORG: 0.05462277217684337
      Best Mean Absolute Error (DISORG): 0.26148430772065
  • Use the identified hyperparameter to test the model:
      Mean Absolute Error (IMPOV): 0.33674784603513247
      Mean Absolute Error (DISORG): 0.5009238250371127
      R2 (IMPOV): 0.15227086130242762
      R2 (DISORG): 0.26148430772065
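
The two steps above (tune alpha by cross-validation, then evaluate on held-out data) can be sketched with scikit-learn as below. This is not the project's script: the synthetic data, the alpha grid, the 75/25 split, and the standardization step are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of eight features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: pick alpha by 5-fold cross-validation, scored by mean absolute error.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
grid = GridSearchCV(
    pipe,
    {"lasso__alpha": np.logspace(-3, 0, 20)},
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_train, y_train)
best_alpha = grid.best_params_["lasso__alpha"]
cv_mae = -grid.best_score_

# Step 2: evaluate the refitted best model on the held-out test set.
y_pred = grid.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Features with non-zero coefficients are the ones the lasso kept.
coefs = grid.best_estimator_.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)
```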

Most Predictive Variables

For TLI_IMPOV:

'type_token_ratio', 'num_repetition', 'entropyApproximate', 'average_word_frequency', 'Age', 'Gender_M', 'self_corrections'

For TLI_DISORG:

's0_mean', 'num_repetition', 'false_starts', 'type_token_ratio', 'clause_density', 'N_fillers', 'consec_mean', 'Age',
 'self_corrections', 'Gender_M', 'dependency_distance'

Model categorical data (HC vs. FEP)

Deal with unbalanced data (36 HC vs. 64 FEP)

  • Use SMOTE to oversample the minority class (HC)
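
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy version of that core idea on synthetic data, not the project's code; the `smote_like` helper, k=5, and the toy features are assumptions. In practice the `SMOTE` class from the imbalanced-learn package would typically be used instead.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Synthetic stand-ins for the 36 HC and 64 FEP feature vectors.
rng = np.random.default_rng(1)
X_hc = rng.normal(0.0, 1.0, size=(36, 4))
X_fep = rng.normal(1.0, 1.0, size=(64, 4))

X_hc_new = smote_like(X_hc, n_new=len(X_fep) - len(X_hc))
X_balanced = np.vstack([X_hc, X_hc_new, X_fep])
y_balanced = np.array([0] * (len(X_hc) + len(X_hc_new)) + [1] * len(X_fep))
```

Oversampling is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data.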

Try out different models

Random forest

Model performance: 85% accuracy

ML_05_Accuracy_RandomForest

Ranking the predictors based on their importance

ML_03_RandomForest_PatientCat_beta
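
A random forest classifier with an importance ranking can be sketched as follows. The synthetic data (where only the first feature is informative), the hyperparameters, and the split are assumptions, so the numbers it produces are unrelated to the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, balanced toy data; the feature names are borrowed from this README.
rng = np.random.default_rng(0)
feature_names = ["type_token_ratio", "N_fillers", "s0_mean", "clause_density"]
y = np.array([0] * 64 + [1] * 64)
X = rng.normal(size=(128, 4))
X[:, 0] += 2.0 * y  # make the first feature informative in the toy data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Rank features by impurity-based importance, largest first.
ranking = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```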

L1 regularized logistic regression

Model performance: 65% accuracy

ML_06_FeatureImportance_LogisticRegression

Ranking the predictors based on their importance

ML_04_Lasso_PredictPANSS_Pos_beta
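
An L1-regularized logistic regression with predictors ranked by absolute coefficient can be sketched like this. The synthetic data, the regularization strength C=0.5, and the use of 5-fold cross-validated accuracy are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, balanced toy data; only the first feature separates the groups.
rng = np.random.default_rng(0)
feature_names = ["type_token_ratio", "N_fillers", "s0_mean", "clause_density"]
y = np.array([0] * 64 + [1] * 64)
X = rng.normal(size=(128, 4))
X[:, 0] += 2.0 * y

# Standardize first so coefficient magnitudes are comparable across features.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
cv_accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

model.fit(X, y)
coefs = model.named_steps["logisticregression"].coef_.ravel()

# Rank predictors by absolute coefficient; exact zeros were dropped by the L1 penalty.
order = np.argsort(-np.abs(coefs))
ranking = [(feature_names[i], float(coefs[i])) for i in order]
```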
