A Python API for consuming the Metropolitan Museum of Art's publicly available dataset.
Finding sources for working with images of paintings is very difficult. The quality of Google image search results is highly variable.
Thankfully, the Metropolitan Museum of Art in New York City has graciously made their works available for download and free use via their openaccess repo.
Unfortunately, they only provide a massive CSV (256 MB) without any instructions on how to download or use the collection.
Because of the size of the data file, you'll need to use git's Large File Storage extension to properly download the full collection.
Once you have the dataset on your local machine, you'll need to update the file paths located in the constants module.
MetObjects.csv represents the full collection.
MetPaintings.csv represents only those rows where Object Name == 'Painting'.
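As a rough illustration of how MetPaintings.csv relates to MetObjects.csv, the filter can be sketched with the standard library alone. This is not necessarily how this project builds the file: the function name `filter_paintings` is hypothetical, and the sample rows are made up so the snippet runs without the real 256 MB download.

```python
import csv
import io


def filter_paintings(source, dest):
    """Copy rows whose 'Object Name' is 'Painting' from source to dest.

    Both arguments are open text-mode file objects.
    Returns the number of rows written.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(dest, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["Object Name"] == "Painting":
            writer.writerow(row)
            kept += 1
    return kept


# Tiny in-memory stand-in for MetObjects.csv (made-up rows, not real data).
sample = io.StringIO(
    "Object Name,Title\n"
    "Painting,Example A\n"
    "Vase,Example B\n"
    "Painting,Example C\n"
)
output = io.StringIO()
count = filter_paintings(sample, output)
```

With the real files, you would pass objects opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, so adjust it if you hit decoding errors.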
The full collection contains almost 500,000 rows of unique works. The data is not clean or uniform. Be wary of encoding issues if using Python 2. Columns were named to be human readable and are not easily accessed by data science libraries (pandas).
A list of all the columns (after transformation) in the dataset can be found here.
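One common way to make human-readable headers easier to use from pandas is to convert them to snake_case. The sketch below assumes that convention; the exact transformation used by this project may differ, `to_snake_case` is a hypothetical helper, and the headers other than Object Name are illustrative.

```python
import re


def to_snake_case(column_name):
    """Turn a header like 'Object Name' into 'object_name'."""
    # Replace each run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", column_name.strip())
    # Drop any leading or trailing underscore, then lowercase.
    return cleaned.strip("_").lower()


# Illustrative headers; only 'Object Name' is taken from this README.
headers = ["Object Name", "Is Public Domain", "Artist Display Name"]
renamed = [to_snake_case(h) for h in headers]
```

Names in this form can be used as attributes on a pandas DataFrame (for example `df.object_name`), which is not possible when the header contains spaces.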
MetPaintings is an object intended to make all the paintings in the collection more accessible for study. For this reason, I have limited the columns that are contained in this dataset. It represents 6100 individual works in 750 mediums.
This is the primary object I will be developing and working with.
The Met uses the Incapsula service to block access to their collection by web scraping.
At first I tried using incapsula-cracker-py3 to handle my requests to their server, but this did not bypass Incapsula.
My next plan is to try using PhantomJS and Selenium to better impersonate a non-bot user.
###PhantomJS
PhantomJS on Mac: brew cask install phantomjs
PhantomJS is deprecated.
###Chrome WebDriver
I downloaded the Chrome WebDriver and found success using Selenium with this new driver. By mimicking the browser, the software doesn't get recognized as a bot.
But this is pretty heavyweight for our purposes. I need to better define what my purpose is.
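For reference, the Selenium and Chrome WebDriver approach described above can be sketched roughly as follows. This is a minimal sketch, not the code used in this project: `fetch_page_source` is a hypothetical helper, and it assumes the `selenium` package is installed and a Chrome driver matching your browser version is available on your machine.

```python
def fetch_page_source(url):
    """Load url in a real Chrome window via Selenium and return the HTML."""
    # Imported inside the function so this module still loads when
    # Selenium is not installed (it is an optional, heavy dependency).
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source holds the page as the browser currently sees it.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Calling `fetch_page_source(some_url)` opens a browser window for each request, which is part of why this approach is heavy for bulk downloads.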