- The goal is to determine how gender, race/ethnicity, parental level of education, lunch, and test preparation course affect student performance (test scores).
- The data was gathered from Kaggle: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams
- A series of data checks was performed to ensure that the data was clean, complete, and in the correct format. This included checking for missing values, duplicate rows, and outliers, as well as verifying data types and counting the unique values in each column.
- The data was analyzed to understand its structure, patterns, and relationships. This involved computing summary statistics, exploring correlations between variables, and separating numerical from categorical columns, along with the number of unique values in each categorical column.
- Visualizations were created to identify trends and patterns that may be difficult to see in tabular format, helping to gain insights quickly and communicate results effectively to others.
- The data was transformed to make it suitable for use with machine learning models. This involved techniques such as scaling, normalization, feature selection, or feature engineering.
- Machine learning models were built using the pre-processed data. The data was split into training and test sets, and the training set was used to train the models.
- The performance of the models, including a `RandomForestClassifier`, was evaluated using metrics such as accuracy, the confusion matrix (`confusion_matrix`), and the per-class precision, recall, and F1 scores from `classification_report`. This helped to determine which models were performing best.
- Based on the evaluation results, the best-performing model was chosen for predicting student performance.
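
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses a small synthetic table instead of the Kaggle CSV, and the column names (`gender`, `lunch`, `test_prep`, `score`), the pass mark of 65, and the helper names `make_synthetic_students`, `run_data_checks`, and `train_and_evaluate` are all assumptions made for the example. Only the pandas and scikit-learn calls are real library APIs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_synthetic_students(n=200, seed=0):
    """Build a made-up table; the columns are assumptions, not the Kaggle schema."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "gender": rng.choice(["female", "male"], size=n),
        "lunch": rng.choice(["standard", "free/reduced"], size=n),
        "test_prep": rng.choice(["none", "completed"], size=n),
    })
    # Synthetic score: random noise plus a bump for completing test prep.
    score = rng.normal(65, 10, size=n) + 8 * (df["test_prep"] == "completed")
    df["score"] = score.clip(0, 100).round()
    return df


def run_data_checks(df):
    """Summarize missing values, duplicate rows, dtypes, and unique counts."""
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "n_unique": df.nunique().to_dict(),
    }


def train_and_evaluate(df, pass_mark=65, seed=0):
    """One-hot encode the categorical columns, fit a random forest, and score it."""
    cat_cols = ["gender", "lunch", "test_prep"]
    X = df[cat_cols]
    # Hypothetical target: whether the score reaches the assumed pass mark.
    y = (df["score"] >= pass_mark).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = Pipeline([
        ("encode", ColumnTransformer(
            [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)]
        )),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
        "report": classification_report(y_test, y_pred, zero_division=0),
    }


if __name__ == "__main__":
    students = make_synthetic_students()
    print(run_data_checks(students))
    results = train_and_evaluate(students)
    print(results["confusion_matrix"])
    print(results["report"])
```

Wrapping the encoder and the classifier in a single `Pipeline` keeps the preprocessing fitted on the training split only, so nothing from the test set leaks into the encoding step. To compare several models, the same function could be called with a different final estimator and the resulting metrics compared side by side.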