GitHub - sazeka/cep-analysis: Analyzing NYCDOE schools Comprehensive Education Plans (starting with year 2018-19)

CEP Analysis

Downloaded all NYC Department of Education schools' Comprehensive Education Plans (CEPs) from the iPlan Portal. All public schools are required to submit CEPs annually, except charter schools (which are still considered public because they receive public funding). 1,579 schools filed plans for the 2018-19 school year.
Converted all PDFs to text files with PDFMiner's pdf2txt.py tool.
Removed newlines, leading spaces and tabs, and trailing spaces and tabs, because they were hampering construction of the CSV of CEP questions. When copy-pasting the questions asked of schools into the cep1819-structure.csv file, I encountered unexpected newline characters that caused Python to read the CSV improperly, which in turn made extracting answers to the questions impossible.
Built a CSV of CEP text questions with their section headers to parse the short answer data from the clean text files.
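
The cleaning and question-matching steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual parse.py. The function names `clean_text` and `extract_answers` are hypothetical, and so is the assumption that each question appears verbatim and in order in the cleaned text, with its answer running until the next question.

```python
from __future__ import annotations


def clean_text(raw: str) -> str:
    """Strip leading/trailing spaces and tabs from each line and drop empty lines."""
    lines = (line.strip(" \t\r") for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_answers(text: str, questions: list[str]) -> dict[str, str]:
    """Map each question to the text between it and the next question found.

    Assumption: questions appear verbatim and in the same order as in the list.
    """
    # Collapse all whitespace so a question split across lines still matches.
    flat = " ".join(text.split())

    # Locate each question, searching only after the previous match.
    positions = []
    cursor = 0
    for question in questions:
        needle = " ".join(question.split())
        idx = flat.find(needle, cursor)
        if idx == -1:
            continue  # question not present in this plan
        positions.append((idx, len(needle), question))
        cursor = idx + len(needle)

    # The answer is whatever sits between one question and the next.
    answers = {}
    for i, (idx, length, question) in enumerate(positions):
        start = idx + length
        end = positions[i + 1][0] if i + 1 < len(positions) else len(flat)
        answers[question] = flat[start:end].strip()
    return answers
```

Normalizing whitespace on both the question and the document before matching is one way to sidestep the stray-newline problem described above, since a question copied with an embedded line break would still be found.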

Remove page headers and page numbers from clean text files
Build a text file of CEP-specific "stop text": text that can be safely stripped out of the question-answer CSV
Build a CSV for the CEP short answer data with the structure: district-borough-number, question, answer
Extract tabular data to CSV with tabula-py
Topic analysis (Discovering and Visualizing Topics in Texts)
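
A few of the planned steps above (page-number removal, stop-text stripping, and the short-answer CSV) might look something like the sketch below. Everything in it is an assumption made for illustration: the helper names, the idea that page numbers appear as lines containing only digits, the column names `dbn`, `question`, and `answer`, and the placeholder school code `00X000`.

```python
import csv
import io
import re

# Assumption: a page number sits alone on its own line.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def drop_page_numbers(text):
    """Remove lines that consist only of digits."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


def strip_stop_text(answer, stop_phrases):
    """Delete each stop phrase from an answer and tidy the leftover whitespace."""
    # Longest phrases first, so a short phrase cannot break up a longer one.
    for phrase in sorted(stop_phrases, key=len, reverse=True):
        answer = answer.replace(phrase, " ")
    return " ".join(answer.split())


def write_rows(rows, out):
    """Write (dbn, question, answer) tuples to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["dbn", "question", "answer"])  # hypothetical column names
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    # "00X000" is a made-up placeholder, not a real school code.
    write_rows([("00X000", "Example question?", "Example answer")], buffer)
    print(buffer.getvalue())
```

Letting the `csv` module handle quoting means answers that contain commas or quotation marks should still load cleanly later.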

txt
txt_clean
.gitignore
README.md
analyze.py
cep-data.csv
cep1819-structure.csv
clean_charters.sh
download.sh
parse.py
pdf2txt.sh
redownload.sh
remove_new_lines.sh
requirements.txt