#XSAVI 750 – Mining the Web: How to Scrape, Analyze & Map Open Data
###Pratt Institute, Center for Continuing and Professional Studies Spatial Analysis and Visualization Initiative (SAVI)
Instructor: Richard Dunks
Location: ISC Building, Lower Level, Room 003
Continuing Education Units (C.E.U.s): 3.0
Click for more information and to register
###Navigation
- Course Overview
- Learning Objectives
- Course Requirements
- Course Readings
- Class Format
- Submitting Assignments
- Assessment
- Class Policies
- Resources
- Course Outline
- Suggested Reading
##Administrivia
###Course Overview
This course introduces the tools, techniques, and general approaches used to acquire, clean, analyze, and visualize open data, with particular emphasis on using web-based technologies and open-source tools at each step of the process.
We will be working with the community preservation group Save Harlem Now! to help collect, organize, and visualize data related to historic preservation in Harlem. There is no requirement to participate in this project and each student is free to pursue their own projects in class. The work with Save Harlem Now! is an opportunity to work on a real-world problem related to the collection, analysis, and visualization of data.
######back to top
###Learning Objectives
- You will learn to formulate and articulate a meaningful research question with public open data, as well as meaningfully critique the work of others
- You will learn how to acquire data through open data portals, application programmer interfaces (APIs), and scraping data from web sites
- You will learn how to clean data using open source tools in preparation for analysis and visualization
- You will learn how to conduct exploratory data analysis using descriptive statistics
- You will learn to visualize your analytical findings in meaningful and visually-engaging graphics, as well as meaningfully critique the work of others
- You will learn the basics of cartographic design as it relates to visualizing open data
######back to top
###Course Requirements
All students will need to bring their own laptop for exercises during class. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts. Please update your system to the latest version of your preferred operating system prior to the first day of class to ensure you're able to successfully install and use the tools in class.
You will be required to have free accounts with the following services:
Time will be set aside to help you register and set up these accounts, but please try to come to the first session having already registered for these services.
In addition, please install the following applications prior to class:
- Slack
- OpenRefine
- A free text editor of your choice, such as:
  - Sublime Text (All systems)
  - TextWrangler (Mac)
  - Notepad++ (Windows)
######back to top
###Course Readings
The required readings for this course consist of book chapters, newspaper articles, and short blog posts. The intention is to help give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available through the class portal. Recommended readings are suggestions if you wish to study further the topics covered in class. The books listed in the Suggested Reading section below offer even more depth and an extended discussion of the material we cover in class. Readings are due for the class under which they're listed.
######back to top
###Class Format
Class runs from 6:30pm to 9:30pm, with the class time broken up into two 85-minute blocks with a single 10-minute break around the halfway point of the class. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class.
I will also be available for questions or further assistance before and after class. You will have ample time in class to work on practical exercises based on the information presented in lectures. When possible, the final half hour of class will be set aside for any additional questions or additional tutorials in tools, skills, or techniques. Please plan on attending the full class time.
######back to top
###Submitting Assignments
All assignments will be submitted by adding your content to the class page and issuing a "pull request" in the class repository. All of this will be explained, set up, and otherwise clarified on the first day of class. Assignments aren't considered submitted until the pull request has been issued. We will have ample time in class to address any technical issues, and a reference guide for the process will be available.
######back to top
###Assessment
Area | Total Points |
---|---|
Attendance | 20 |
Class Participation | 20 |
Visualization Critiques | 20 |
Visualizations | 20 |
Final Project | 20 |
Total | 100 |
######back to top
###Class Policies
####Attendance and Tardiness
I expect you to attend every class, arriving on time and staying for the entire duration of class. Daily attendance counts 2 points toward your final grade. Excused absences won't result in points being lost.
####Participation
I expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. I will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered. Daily participation counts 2 points toward your final grade.
####Late Assignments
All assignments are due before the start of the class in which they are to be presented. Points will be taken off late assignments.
####Office hours
I won’t be holding regular office hours, but I’m happy to set up a time to meet in person, over the phone, or via Skype/Google Hangout if you have any problems. Please use Slack to reach out to me. I will also be available before or after class to provide any assistance you may need.
######back to top
###Resources
- Technical
  - Stack Overflow Q&A community of technology pros
  - GIS Stack Exchange (same as above but just for mapping)
- (Some) Open Data Sources
- Visualizations
- Reference
######back to top
##Course Outline
Topics will be covered that day in class. Reading Assignments are to be read before class in preparation for the lecture and exercises. Assignments are due before the start of the next class and build on the information presented in class.
######back to top
##Week 1: Acquiring Data
###Class 1 - April 11, 2016
####Topics
- What is open data?
- Data on the web
- Introduction to mapping
- Introduction to open source tools and services for mapping and visualization
- Complete the visualization started in class with data from an open data portal. Style the map in CartoDB and have it ready to present in class.
- Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
######back to top
###Class 2 - April 13, 2016
####Topics
- Introduction to HTML and CSS
- Introduction to Git and GitHub
- Guest lecture on Save Harlem Now!
- Interactive Data Visualization for the Web, pg 15 – 23
- Matt A.V. Chaban, "Much to Save in Harlem, but Historic Preservation Lags, a Critic Says"
- Complete the online CartoDB “Online Mapping for Beginners” course.
- Create a second visualization or improve on your first, using new data or exploring a data set from the Save Harlem Now! project. Write 2-3 paragraphs discussing any challenges you encountered working with the data and/or creating your visualization in CartoDB.
- Codecademy HTML and CSS Course
- W3Schools HTML Tutorial
- A tutorial for getting started with Git and GitHub
######back to top
##Week 2: More Acquiring Data/Data Cleaning
###Class 3 - April 18, 2016
####Topics
- Web scraping
- Introduction to APIs
- Introduction to OpenRefine
- Chris Whong, "FOILing NYC's Taxi Trip Data"
- Thomas Levine, Introduction to web scraping
- Introduction to APIs ch 1-5
- Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
- Identify a question or topic you'd like to explore in this class, with the intention of creating a map related to the topic as part of your final project. Write 2-3 paragraphs on why the topic is interesting to you, what data you'd like to use to explore it, and what you hope to contribute with your work.
######back to top
###Class 4 - April 20, 2016
####Topics
- Overview of social media data
- Collecting social media data from APIs
- Introduction to Python for querying APIs
- TBD
- Using an API, either from an open data portal such as the NYC Open Data Portal or from some other open data source, create a visualization of the data in CartoDB. Write a short (2-3 paragraph) description of the data, the API you used to access it, how you styled it, and the resulting visualization. Discuss other data you'd like to use or other techniques for cleaning the data to get your desired result. Submit your API code via the Slack channel in the format "lastname-assignment2.py" if you did your API query in Python or "lastname-assignment2.txt" if you did your query in OpenRefine.
- Update your project plan for your final project with additional questions, data sources, ideas for visualizing, or other issues/challenges you've discovered.
- CartoDB Academy
- McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012, "Appendix Python Language Essentials"
- Codecademy Python Course
- MIT Introduction to Computer Science and Programming with Python (free course)
- Codecademy Learn to Code for APIs
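
For the API assignment above, a query in Python might start along these lines. This is only a minimal sketch, assuming the portal exposes a Socrata-style (SODA) JSON endpoint; the helper names `build_query_url` and `fetch_rows`, the dataset ID `xxxx-xxxx`, and the `borough` column are hypothetical placeholders, so substitute the values for the dataset you choose.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Socrata-style (SODA) endpoint on the NYC Open Data Portal.
BASE_URL = "https://data.cityofnewyork.us/resource"


def build_query_url(dataset_id, limit=100, where=None):
    """Return a query URL for one dataset, with an optional row filter."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    # safe="$" keeps the "$" in parameter names readable instead of "%24".
    return f"{BASE_URL}/{dataset_id}.json?{urlencode(params, safe='$')}"


def fetch_rows(url):
    """Download the query result and decode it into a list of dicts.

    Requires network access, so it is not called in the example below.
    """
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    # "xxxx-xxxx" is a placeholder, not a real dataset ID.
    print(build_query_url("xxxx-xxxx", limit=5, where="borough = 'MANHATTAN'"))
```

Keeping the URL construction separate from the download makes it easy to check the query you are about to send before submitting the file as "lastname-assignment2.py".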
######back to top
##Week 3: Cleaning/Analyzing Data
###Class 5 - April 25, 2016
####Topics
- Introduction to SQL for cleaning data
- Cleaning Data with APIs
- Obe, Regina, and Leo Hsu. PostGIS in action. Manning Publications Co., 2011, Pg 3-8.
- Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
- Complete the SQL and PostGIS in CartoDB course.
######back to top
###Class 6 - April 27, 2016
####Topics
- Python for querying Geoclient API
- SQL for cleaning and analysis
- TBD
- Create a new visualization or improve on your previous visualization with additional data and provide analysis of the data you've found. Write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented.
######back to top
##Week 4: Visualizing Data
###Class 7 - May 2, 2016
####Topics
- A (re-)introduction to statistics
- Introduction to visualization design
- Hon, Keone. “An Introduction to Statistics.” Ch. 1 and 2.
- Ben Wellington "Mapping the Sharing Economy"
- Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
######back to top
###Class 8 - May 4, 2016
####Topics
- Advanced CartoDB (guest lecture)
- Heer, Jeffrey, Michael Bostock, and Vadim Ogievetsky. "A tour through the visualization zoo." Commun. ACM 53.6 (2010): 59-67.
- Munzner, Tamara. Chapter 27 – “Visualization”, p 675-707, of Fundamentals of Computer Graphics, Third Edition, by Peter Shirley and Steve Marschner. AK Peters, 2009.
- CartoDB “Introduction to Map Design”
- Create your final presentation. Have mockups ready for class on Monday ahead of presentations on Wednesday. More detailed requirements will be provided in class.
######back to top
##Week 5: Advanced Topics/Final Presentations
###Class 9 - May 9, 2016
####Topics
- Course review
- Advanced topics, to possibly include:
  - Introduction to Interactive Visualization of Data with D3 and Leaflet
  - Introduction to Spatial Databases
  - Visualizing social media data
- Find an interesting or visually compelling visualization online and write 2-3 paragraphs on the visualization, discussing the data source(s), the visual style, and how well the data was represented. Feel free to use the visualization resources listed above. Submit your text to the class page following the example shown.
######back to top
###Class 10 - May 11, 2016
####Topics
- Final presentations
######back to top
##Suggested Reading
- Fry, Ben. Visualizing Data: Exploring and Explaining Data with the Processing Environment. O'Reilly Media, Inc., 2007.
- Garrard, Chris. Geoprocessing with Python. Manning Publications Co., forthcoming.
- Janert, Philipp K. Data Analysis with Open Source Tools. O'Reilly Media, Inc., 2010.
- McCallum, Q. Ethan. Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. O'Reilly Media, Inc., 2012.
- McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
- Munzner, Tamara. Visualization Analysis and Design. AK Peters, 2014.
- Murray, Scott. Interactive data visualization for the Web. O'Reilly Media, Inc., 2013.
- Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.
######back to top