Python API for consuming the Metropolitan Museum of Art's publicly available dataset.

atheis4/pymet

PyMet (WIP)

A Python API for consuming the Metropolitan Museum of Art's publicly available dataset.

Introduction

Finding good source images of paintings to work with is difficult. The quality of Google image search results is highly variable.

The Met

Thankfully, the Metropolitan Museum of Art in New York City has graciously made its works available for download and free use via its openaccess repo.

Unfortunately, the repo provides only a massive CSV (256 MB), with no instructions on how to download or use the collection.

How to download

Because of the size of the data file, you'll need to use git's Large File Storage extension to properly download the full collection.
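If the repo is cloned without LFS installed, Git leaves a small text pointer in place of the real CSV. The sketch below is one way to check for that before trying to parse the file. The function name `is_lfs_pointer` is made up for this example and is not part of PyMet; the prefix it checks is the first line of a Git LFS pointer file.

```python
from pathlib import Path

# First line of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real data.

    Hypothetical helper for illustration; not part of PyMet.
    """
    with Path(path).open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If `is_lfs_pointer("MetObjects.csv")` returns True, the full collection has not been downloaded yet and the file needs to be fetched through LFS.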

Update the CsvPaths constants

Once you have the dataset on your local machine, you'll need to update the file paths located in the constants module.
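As a rough illustration of what the updated constants might look like, here is a minimal sketch. The attribute names `MET_OBJECTS` and `MET_PAINTINGS`, the `DATA_DIR` variable, and the directory itself are assumptions for this example; check the actual constants module for the real names.

```python
from pathlib import Path

# Assumption: the directory where the Met's openaccess repo was cloned.
DATA_DIR = Path.home() / "openaccess"


class CsvPaths:
    """Hypothetical layout of the path constants; adjust to your machine."""

    MET_OBJECTS = DATA_DIR / "MetObjects.csv"
    MET_PAINTINGS = DATA_DIR / "MetPaintings.csv"
```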

MetObjects.csv represents the full collection. MetPaintings.csv contains only those rows where Object Name == 'Painting'.
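One way to derive a paintings-only file from the full collection is to stream the CSV row by row, which avoids loading all 256 MB into memory. This is a sketch and not necessarily how PyMet builds MetPaintings.csv; the function name `write_paintings` and the `utf-8-sig` encoding (which tolerates a leading byte-order mark) are assumptions.

```python
import csv


def write_paintings(objects_csv, paintings_csv):
    """Copy rows whose 'Object Name' is 'Painting' into a new CSV.

    Hypothetical helper for illustration. Returns the number of rows kept.
    """
    kept = 0
    with open(objects_csv, newline="", encoding="utf-8-sig") as src, \
            open(paintings_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" skips stray extra fields in malformed rows.
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            if row["Object Name"] == "Painting":
                writer.writerow(row)
                kept += 1
    return kept
```

Calling `write_paintings("MetObjects.csv", "MetPaintings.csv")` writes the filtered file and returns how many paintings were found.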

The Collection

The full collection contains almost 500,000 rows of unique works. The data is not clean or uniform. Be wary of encoding issues if using Python 2. The original column names were written to be human readable, which makes them awkward to access from data science libraries such as pandas.
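A common way to make headers like "Object Name" easier to work with is to convert them to snake_case. The sketch below shows that idea. It is an assumption about what such a transformation could look like, not necessarily the one PyMet applies, and `to_snake_case` is a made-up name.

```python
import re


def to_snake_case(column):
    """Turn a human-readable header such as 'Object Name' into 'object_name'.

    Hypothetical helper for illustration; not part of PyMet.
    """
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", column.strip())
    return cleaned.strip("_").lower()


print(to_snake_case("Object Name"))  # object_name
```

With pandas, the same function can be passed directly to `DataFrame.rename(columns=to_snake_case)`, after which a column can be read as an attribute, for example `df.object_name`.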

A list of all the columns (after transformation) in the dataset can be found here.

MetPaintings

MetPaintings is an object intended to make all the paintings in the collection more accessible for study. For this reason I have limited the columns that are contained in this dataset. It covers 6,100 individual works in 750 mediums.

This is the primary object I will be developing and working with.

Incapsula

The Met prevents web scraping of its collection through the Incapsula service.

At first I tried using incapsula-cracker-py3 to handle my requests to their server, but this did not bypass Incapsula.

My next plan is to try using PhantomJS and Selenium to better impersonate a non-bot user.

PhantomJS

PhantomJS on Mac: brew cask install phantomjs

PhantomJS is deprecated.

Chrome WebDriver

I downloaded the Chrome WebDriver and found success using Selenium with this new driver. Because it drives a real browser, the software doesn't get recognized as a bot.

But this is pretty heavyweight for our purposes. I need to better define what my purpose is.
