There are plenty of repos that provide data sets, or links to sites that do. So why yet another one? What sets this repo apart is that it not only shows you how to get the data sets, but also how to populate your own PostgreSQL database server with them, which means this repo contains the necessary SQL scripts. I've also included instructions on how to install and set up your PostgreSQL database server using either Docker or a native Windows installation, for those who don't have access to Docker or don't want to mess with it at the moment. Why PostgreSQL? It is free, it is a full-fledged database server that has been around for decades, and it is relatively easy to install. If you want to use a different database, it is fairly straightforward to convert my SQL scripts to work with the database of your choice; you'll just have to account for differences in data types and constraints.
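For example, once the server is up (whether via Docker or a native Windows install), a quick connectivity check from Python might look like the sketch below. The host, port, and credentials are placeholders for whatever you chose during setup, and psycopg2 is just one of several client libraries you could use.

```python
# Minimal connectivity check, assuming a local PostgreSQL server and
# placeholder credentials. Adjust everything to match your own install.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="postgres",  # assumption: whatever password you set during setup
    dbname="postgres",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # prints the server's version string
conn.close()
```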
As a bonus, I will also provide Python Jupyter notebooks that show how you can query the data sets using PySpark or the Ibis framework. Why? Well, why not? It is good, once in a while, to step away from the SQL world and venture into the world of dataframe-based APIs. In most cases you'll want to stick with SQL, but sometimes a dataframe API is more convenient. One such convenience is being able to write/save a dataframe directly to the database without having to execute DDL statements (CREATE TABLE) beforehand, which is especially handy when the dataframe contains many columns. Other conveniences, such as being able to select columns by regex pattern or by data type, are also super nice, since they would be difficult or impossible to express in SQL. Because of these conveniences, I use a dataframe API early on in a project, or when doing a POC, ad-hoc analysis, or something quick-n-dirty; in later stages, or when things need to be put into production, I converge on SQL. I think having both SQL and dataframe API knowledge is valuable for any data professional.
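To make those two conveniences concrete, here is a rough PySpark sketch (not the repo's actual code). It assumes a local PostgreSQL server, a hypothetical `some_dataset.csv` file, and that the PostgreSQL JDBC driver jar is on Spark's classpath.

```python
# Rough sketch: write a dataframe to PostgreSQL with no CREATE TABLE,
# then select columns by regex and by data type. All names and
# credentials below are placeholders.
import re
from pyspark.sql import SparkSession
from pyspark.sql.types import NumericType

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("some_dataset.csv", header=True, inferSchema=True)

# 1) Spark derives the DDL from the dataframe's schema, so no
#    CREATE TABLE statement is needed beforehand.
df.write.jdbc(
    url="jdbc:postgresql://localhost:5432/postgres",
    table="some_dataset",
    mode="overwrite",
    properties={"user": "postgres", "password": "postgres",
                "driver": "org.postgresql.Driver"},
)

# 2) Select columns by regex pattern or by data type, which is
#    awkward at best in plain SQL.
amount_cols = [c for c in df.columns if re.search(r"amount|price", c)]
numeric_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, NumericType)]
df.select(*dict.fromkeys(amount_cols + numeric_cols)).show()
```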
Why Ibis? Well, I want to learn and use something that is agnostic to the underlying backend technology. PySpark fits the bill, but I don't particularly like its dependency on the JVM. What about pandas? Pandas is purely an in-memory, single-machine dataframe library, so it doesn't scale well. There's also Polars, which handles up to "medium" data very well but can't handle big data. So I feel Ibis is a good, balanced dataframe library that is likely to be "future-proof". I could be wrong, since Ibis isn't widely used or popular at the moment. I am still invested in learning PySpark, only because that is where the job market is now and for the foreseeable future. That is why I chose to provide both PySpark and Ibis examples. As marketable skills, I believe you can't go wrong with these two dataframe frameworks. Plus, once you know one dataframe API, it is very easy to pick up another.
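To illustrate the backend-agnostic appeal, here is a minimal Ibis sketch. The table and column names are made up; the point is that only the connection line changes when you swap backends, while the query code stays the same.

```python
# Minimal Ibis sketch with hypothetical table/column names. Only the
# connect() call differs between backends; the query code is identical.
import ibis

# In-process DuckDB backend for quick-n-dirty local work...
con = ibis.duckdb.connect()

# ...or point the very same query code at PostgreSQL instead:
# con = ibis.postgres.connect(
#     host="localhost", user="postgres",
#     password="postgres", database="postgres",
# )

t = con.table("some_dataset")  # assumes the table already exists
expr = (
    t.filter(t.amount > 100)
     .group_by("category")
     .aggregate(total=t.amount.sum())
)
print(expr.execute())  # Ibis compiles the expression to the backend's SQL
```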