Name		Name	Last commit message	Last commit date
parent directory ..
batches		batches
cleaned-data		cleaned-data
raw-data		raw-data
README.md		README.md
adjudicated.py		adjudicated.py
conversions.py		conversions.py
rate_checker.py		rate_checker.py
user_input.py		user_input.py

README.md

Data Collection

This subdirectory contains the logic for collecting and formatting Python code snippets from popular GitHub repositories that contain Python projects.

Scripts

`adjudicated.py`

The main script to run is adjudicated.py, which makes GitHub API requests and cleans the data so that it only includes repositories that have Python files with version 3.6.0 or greater (as determined by searching through the setup.py file for each repository). It takes in user input for the minimum number of stars and forks per repository, the maximum number of files to consider across all repositories, the max number of files to consider in each repository, and the max number of snippets to be considered from each file.

The collected snippets are categorized as either a "Class" or a "Function" and are exported to a .tsv file named 'adjudicated.tsv' with the following format:

UID: unique identifier for each snippet
Category: class or function
Snippet: the code snippet

Cleaned data will be saved to /cleaned-data with a unique filename. Repository data for each API request is saved to /raw-data.

`conversions.py`

With conversions.py, you can revert the .txt adjudicated data to .tsv. You can also revert the .tsv adjudicated data to .txt, but this is more of a manual process. This is exceptionally useful if you want to edit the data in .tsv format and convert back to .txt.

Acknowledgments

Although the program is intended to gather a significant amount of data of high quality, it is important to keep in mind that some misclassifications may still occur.

Note that the GitHub API has rate limits that restrict the number of requests that can be made within a certain time period.

Usage

Make sure you have the required packages installed, which are handled by pipenv. To install the required packages, run the following command:

```
pipenv install
```
Before running adjudicated.py, create a classic GitHub personal access token with the "repo" scope. Then, create a file named .env in the root directory with the following:
```
GITHUB_API_TOKEN=your_access_token
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data-collection

data-collection

README.md

Data Collection

Scripts

`adjudicated.py`

`conversions.py`

Acknowledgments

Usage

Files

data-collection

Directory actions

More options

Directory actions

More options

Latest commit

History

data-collection

Folders and files

parent directory

README.md

Data Collection

Scripts

adjudicated.py

conversions.py

Acknowledgments

Usage

`adjudicated.py`

`conversions.py`