We transcribed and analyzed speech from individuals with psychosis and healthy controls as they described three images for one minute each.
This project extracts language features from speech samples. The extracted features are used to predict (1) continuous Thought and Language Index (TLI) impoverishment and disorganization scores and (2) categorical participant group (healthy controls vs. patients with first-episode psychosis).
'senN_4', 'senN_3', 'senN_2', 'senN_1'
'N_fillers', 'N_immediate_repetation', 'false_starts', 'self_corrections'
'n_1', 'n_2', 'n_3', 'n_4', 'n_5' (#similarity between every word and its preceding N words)
'type_token_ratio','average_word_frequency'
'entropyApproximate' (#the diversity of the topic distribution)
's0_mean' (#similarity between each sentence and the picture label)
'consec_mean' (#similarity between the current sentence and its previous sentence)
For more details on BERTopic, see my post.
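As a rough illustration of a topic-diversity measure, the sketch below computes the Shannon entropy of a topic probability distribution. It assumes that 'entropyApproximate' is an entropy of this kind; the function name `topic_entropy` and the example distributions are hypothetical, and the project's actual computation may differ.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a topic probability distribution."""
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("topic_probs must contain positive mass")
    # Normalize so the weights sum to 1, then skip zero-probability topics.
    probs = [p / total for p in topic_probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions: one dominated by a single topic, one spread evenly.
focused = topic_entropy([0.97, 0.01, 0.01, 0.01])
diverse = topic_entropy([0.25, 0.25, 0.25, 0.25])
```

A speech sample that stays on one topic yields a low value, while a sample spread evenly across topics yields the maximum, log(number of topics).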
'clause_density', 'dependency_distance', 'content_function_ratio'
'n_segment', 'length_utter','num_all_words', 'num_content_words', 'num_repetition'
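Two of the lexical features listed above can be sketched in a few lines. The block assumes that 'type_token_ratio' is unique words divided by total words, and that 'n_k' is the mean cosine similarity between each word vector and the one k positions earlier; both readings are assumptions, and the helper names and toy vectors are hypothetical.

```python
import numpy as np

def type_token_ratio(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_similarity_at_lag(vectors, lag):
    """Mean cosine similarity between each word vector and the one `lag` words earlier.

    Assumes every vector has a nonzero norm.
    """
    vectors = np.asarray(vectors, dtype=float)
    if len(vectors) <= lag:
        return float("nan")
    current, previous = vectors[lag:], vectors[:-lag]
    norms = np.linalg.norm(current, axis=1) * np.linalg.norm(previous, axis=1)
    return float((np.sum(current * previous, axis=1) / norms).mean())

# Toy input: four tokens with made-up 2-d "embeddings".
tokens = ["the", "dog", "the", "cat"]
toy_vectors = [[1, 0], [0, 1], [1, 0], [0, 1]]

ttr = type_token_ratio(tokens)
n_1 = mean_similarity_at_lag(toy_vectors, lag=1)
n_2 = mean_similarity_at_lag(toy_vectors, lag=2)
```

In practice the vectors would come from a word-embedding model rather than hand-written values.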
- Remove data points with:
  - average_word_frequency < 4.5
  - N_fillers > 20
  - content_function_ratio > 2.0
- Remove variables with skewed distributions: N_immediate_repetition
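The cleaning steps above could look roughly like this in pandas. The sketch assumes a row is dropped when it meets any one of the three conditions (the list does not say whether they are combined with "and" or "or"), and the small data frame is made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names and thresholds come from the notes above.
df = pd.DataFrame({
    "average_word_frequency": [5.2, 4.1, 5.0, 5.5],
    "N_fillers": [3, 2, 25, 1],
    "content_function_ratio": [1.1, 1.0, 0.9, 2.4],
    "N_immediate_repetition": [0, 0, 7, 0],
})

# Assumption: a row is an outlier if ANY of the three conditions holds.
is_outlier = (
    (df["average_word_frequency"] < 4.5)
    | (df["N_fillers"] > 20)
    | (df["content_function_ratio"] > 2.0)
)
filtered = df.loc[~is_outlier].reset_index(drop=True)

# Drop the variable flagged as having a skewed distribution.
filtered = filtered.drop(columns=["N_immediate_repetition"])
```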
Check the pairwise correlation matrix to remove or combine highly correlated variables:
- n_1, n_2, n_3, n_4 and n_5: calculate the means to represent local word associations
- sen_1, sen_2, sen_3, sen_4: calculate the means to represent local semantic coherence
- Use the variance inflation factor (VIF) to identify multicollinear variables (VIF > 10): 'num_all_words', 'num_content_words', 'length_utter'
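For reference, the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix, so it can be computed with NumPy alone. The sketch below uses synthetic data in which one variable is nearly a linear function of another; the function name `vif` and the generated values are hypothetical and are not the project's data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example: num_content_words is almost determined by num_all_words.
rng = np.random.default_rng(0)
num_all_words = rng.normal(100, 20, size=200)
num_content_words = 0.5 * num_all_words + rng.normal(0, 1, size=200)
clause_density = rng.normal(1.5, 0.3, size=200)

X = np.column_stack([num_all_words, num_content_words, clause_density])
vifs = vif(X)

# Columns whose VIF exceeds the threshold of 10 used above.
high_vif_columns = np.flatnonzero(vifs > 10)
```

Here the two word-count columns exceed the threshold while the independent column stays close to 1.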
Visualize how the continuous dependent variables correlate with the language features.
Visualize how the categorical dependent variable relates to the language features.
- Use cross-validation to identify the best hyperparameters for the lasso regression:
Best alpha for IMPOV: 0.02848035868435799
Best Mean Absolute Error (IMPOV): 0.15227086130242762
Best alpha for DISORG: 0.05462277217684337
Best Mean Absolute Error (DISORG): 0.26148430772065
- Test the model with the selected hyperparameters:
Mean Absolute Error (IMPOV): 0.33674784603513247
Mean Absolute Error (DISORG): 0.5009238250371127
R2 (IMPOV): 0.15227086130242762
R2 (DISORG): 0.26148430772065
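A minimal sketch of this tuning-and-testing workflow with scikit-learn is shown below. It assumes a grid search over alpha scored by mean absolute error, with features standardized beforehand; the alpha grid, the 75/25 split, the fold count, and the synthetic data are all assumptions and do not reproduce the numbers reported above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Cross-validate alpha on the training split, scoring by (negated) MAE.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": np.logspace(-3, 0, 20)},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["lasso__alpha"]
cv_mae = -search.best_score_

# Evaluate the refitted best model on the held-out split.
y_pred = search.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

# Features with nonzero coefficients are the ones the lasso kept.
coefs = search.best_estimator_.named_steps["lasso"].coef_
selected = np.flatnonzero(coefs)
```

Because the lasso shrinks uninformative coefficients to exactly zero, the indices in `selected` serve as the list of retained features.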
For TLI_IMPOV:
'type_token_ratio', 'num_repetition', 'entropyApproximate', 'average_word_frequency', 'Age', 'Gender_M', 'self_corrections'
For TLI_DISORG:
's0_mean', 'num_repetition', 'false_starts', 'type_token_ratio', 'clause_density', 'N_fillers', 'consec_mean', 'Age', 'self_corrections', 'Gender_M', 'dependency_distance'
- Use SMOTE to oversample the class with fewer samples
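To show the idea behind SMOTE, the block below is a simplified NumPy stand-in, not a library implementation such as the `SMOTE` class in imbalanced-learn. It creates each synthetic sample by interpolating between a minority-class point and one of its k nearest minority neighbors. The function name `smote_like_oversample`, the value of k, and the toy points are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating toward nearest neighbors.

    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; mask the diagonal so a point is not its own neighbor.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point, one of its neighbors, and a position between them.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class with five 2-d samples.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_oversample(minority, n_new=10)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.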