akashraj11/Lab-10

Lab10

Overview of your assignment

Explore advanced Python tools for machine learning. Use publicly available datasets from the provided GitHub repositories to explore and preprocess data, implement machine learning models, and visualize the results, using Python only.

Dataset repositories:

• https://github.com/awesomedata/awesome-public-datasets
• https://github.com/openai/gpt-2-output-dataset

Dataset selection rationale

Dataset 1:

• Titanic dataset
• The Titanic dataset is a well-known, widely used dataset in machine learning and data analysis. It records information about the Titanic's passengers, such as survival status, socioeconomic class, age, gender, and family relationships. Its value for machine learning lies in predictive modeling and classification. The main task it supports is binary classification: predicting whether a passenger survived the Titanic tragedy (1) or did not survive (0) from the supplied attributes. This kind of study can help identify the factors that influenced survival.
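
The binary classification task described above can be sketched with a rule-based baseline before any model is trained. This is a minimal sketch, not the notebook's actual method: the column names `Survived`, `Pclass`, and `Sex` are assumed to follow the common Kaggle-style CSV layout, the helper names are hypothetical, and the inline rows are made-up illustrative data rather than real passenger records. Only the standard library is used.

```python
import csv
import io
from collections import defaultdict

# Made-up illustrative rows (NOT real passenger records). Column names are
# an assumption based on the common Kaggle-style Titanic CSV layout.
SAMPLE_CSV = """Survived,Pclass,Sex
1,1,female
0,3,male
1,2,female
0,3,male
1,3,male
0,1,male
"""


def load_rows(handle):
    """Read CSV rows into dicts, converting the label to an int."""
    rows = []
    for row in csv.DictReader(handle):
        row["Survived"] = int(row["Survived"])
        rows.append(row)
    return rows


def survival_rate_by(rows, column):
    """Return {group value: fraction of passengers who survived}."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        survived[row[column]] += row["Survived"]
    return {key: survived[key] / totals[key] for key in totals}


def baseline_predict(row):
    """Hypothetical rule-based baseline: predict 1 (survived) for women."""
    return 1 if row["Sex"] == "female" else 0


def accuracy(rows, predict):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(predict(row) == row["Survived"] for row in rows)
    return correct / len(rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
rates = survival_rate_by(rows, "Sex")
baseline_accuracy = accuracy(rows, baseline_predict)
print(rates)
print(baseline_accuracy)
```

A baseline like this gives a reference accuracy that any trained classifier should beat; to use the real data, pass an open file handle for the downloaded CSV to `load_rows` instead of the inline sample.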

Dataset 2:

• gpt-2-output-dataset
• This dataset was chosen because it is large enough (approximately 250k text samples) to train and test a suitable model. It is relevant to a range of natural language processing (NLP) tasks, such as text generation, language translation, text summarization, and sentiment analysis. Because the dataset likely consists of diverse and varied texts, it could be valuable for training and evaluating models across these language-related tasks.
• Potential for analysis: Analyzing the GPT-2 output dataset could provide insights into the performance and limitations of the GPT-2 model itself. Researchers might examine the generated text's quality, coherence, grammaticality, and ability to maintain context. Such analysis could help refine and improve language models.

Instructions for running the code

  1. Download the datasets from the repositories above and place them under the /datasets path.
  2. Download the GPT-2 small dataset from https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/small-117M.train.jsonl
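
The two steps above can be sketched in code as follows. This is a hedged example, not the notebook's own loader: the relative `datasets` folder, the helper names, and the `text` field name are assumptions (the JSONL file is expected to hold one JSON object per line), while the URL is the one given in step 2.

```python
import json
import os
import urllib.request

# URL taken from the instructions above; the local directory name is an
# assumption (a relative "datasets" folder next to the notebook).
DATA_URL = "https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/small-117M.train.jsonl"
DATA_DIR = "datasets"


def download_if_missing(url=DATA_URL, data_dir=DATA_DIR):
    """Download the file into data_dir unless it is already there."""
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
    return target


def iter_records(path, limit=None):
    """Yield one dict per non-empty JSONL line, stopping after `limit`."""
    with open(path, encoding="utf-8") as handle:
        count = 0
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if limit is not None and count >= limit:
                break
            yield json.loads(line)
            count += 1


def mean_text_length(records, field="text"):
    """Average character length of `field` (assumed name) across records."""
    lengths = [len(record[field]) for record in records]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Example usage (downloads a large file on first run):
# path = download_if_missing()
# print(mean_text_length(iter_records(path, limit=1000)))
```

Reading the file lazily with a `limit` keeps memory use small while exploring, since the full training file does not need to be loaded at once.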

Running cells

Once you have a notebook, you can run a code cell using the Run icon to the left of the cell and the output will appear directly below the code cell.

To run code, you can also use keyboard shortcuts in both command and edit mode. To run the current cell, use Ctrl+Enter. To run the current cell and advance to the next, use Shift+Enter.

Run multiple cells

Select multiple cells and choose Run All, Run All Above, or Run All Below.

Save Jupyter Notebook

Save the Jupyter Notebook with the keyboard shortcut Ctrl+S or with File > Save.

Export Jupyter Notebook

Export a Jupyter Notebook as a Python file (.py), a PDF, or an HTML file. To export, select the Export action on the main toolbar. You'll then be presented with a dropdown of file format options.

Note: For PDF export, you must have TeX installed. If you don't, you will be notified that you need to install it when you select the PDF option. Also, be aware that SVG-only outputs in your notebook will not be displayed in the PDF. To have SVG graphics in a PDF, either ensure that your output includes a non-SVG image format, or first export to HTML and then save as PDF using your browser.
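
As a scripted alternative to the Export action, the code cells of a notebook can be pulled out directly, because an .ipynb file is plain JSON. The sketch below assumes the nbformat 4 layout, where each cell carries `cell_type` and `source`; the function names are hypothetical, and this is not what the editor's Export action does internally.

```python
import json


def notebook_to_script(notebook):
    """Join the source of every code cell into one Python script string.

    `notebook` is the parsed JSON of an .ipynb file (nbformat 4 layout,
    where each cell has "cell_type" and "source").
    """
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be a single string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


def export_notebook(ipynb_path, py_path):
    """Read an .ipynb file and write its code cells to a .py file."""
    with open(ipynb_path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    with open(py_path, "w", encoding="utf-8") as handle:
        handle.write(notebook_to_script(notebook))
```

Markdown cells and cell outputs are dropped, and notebook magics such as lines starting with `%` are copied through unchanged, so the resulting file may need small edits before it runs as a plain script.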
