Twitter is great; I use it all the time. But it has two huge issues:
- Its search feature never finds anything older than a few days
- The API won't let you access anything beyond your 3000 most recent tweets
I often want to find a tweet that neither of those can reach, to follow a link or grab an image or whatnot. So sometime before I hit the 3000 mark I wrote a tool that would scrape the Twitter website and save the tweets into an HTML file. It was a dirty hack; I knew Twitter had an API, but I didn't use it because I was lazy.
Then "New Twitter" came around and my dirty hack broke. At that moment a counter started; when I hit 3000 tweets after that I'd lose things. And I couldn't lose things; I'm obsessive like that.
So I grabbed the Twitter Gem and made a real solution. This one can read directly from the API, or from a JSON file stored locally, or from its own "tweetlib" format, or from the legacy "tweet-archive" format from the broken scraper. It can output to JSON or tweetlib files. It can mix and match inputs and outputs. It makes julienne fries.*
It's also just a fun way to spend some time writing Ruby and playing around with TDD and RSpec.
* Product does not actually make julienne fries.
Using Alexandria is simple:
alexandria.rb update TALlama
That will pull down as much history as it can for the user TALlama and update their local tweetlib. If a tweetlib already exists, it will notice and pull tweets from there first; it stops hitting the API once it finds a duplicate from the file, so subsequent updates are simple and fast.
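The stop-at-first-duplicate idea can be sketched roughly like this. It is not Alexandria's actual code: `fetch_new_tweets`, the `max_pages` parameter, and the hash-shaped tweets are all made up for illustration, and the block stands in for whatever really fetches a page of tweets (newest first) from the API.

```ruby
require 'set'

# Hypothetical helper, not part of Alexandria: walk pages newest-first and
# stop as soon as we see a tweet id that is already in the local library.
def fetch_new_tweets(known_ids, max_pages: 50)
  fresh = []
  (1..max_pages).each do |page|
    batch = yield(page)                  # caller supplies the page fetcher
    break if batch.nil? || batch.empty?  # ran out of history
    batch.each do |tweet|
      # First duplicate: everything older is already on disk, so bail out.
      return fresh if known_ids.include?(tweet[:id])
      fresh << tweet
    end
  end
  fresh
end

# Stand-in for the API: three pages of fake tweets, newest first.
fake_pages = {
  1 => [{ id: 105, text: 'e' }, { id: 104, text: 'd' }],
  2 => [{ id: 103, text: 'c' }, { id: 102, text: 'b' }],
  3 => [{ id: 101, text: 'a' }]
}

known = Set.new([103, 102, 101])         # ids already in the local file
fresh = fetch_new_tweets(known) { |page| fake_pages[page] }
puts fresh.map { |t| t[:id] }.inspect    # prints [105, 104]
```

In this sketch page 3 is never requested, which is the whole point: once the local file and the API overlap, there is no reason to keep paging.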
You can also tell it to pull from specific places, or in specific orders:
# pull from an old-school Twitter HTML page, then hit the API
alexandria.rb update TALlama --source archive --source api
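One way to picture reading sources in order is the sketch below. It assumes that when two sources hold the same tweet, the earlier source wins; that is a guess for illustration, not something Alexandria promises. `merge_sources` and the lambda-shaped sources are hypothetical names.

```ruby
# Hypothetical sketch: each source is anything that responds to #call and
# returns an array of tweet hashes. Sources are read in the order given,
# and the first copy of a given tweet id is the one that is kept.
def merge_sources(sources)
  seen = {}
  sources.each do |source|
    source.call.each do |tweet|
      seen[tweet[:id]] ||= tweet         # keep the earlier source's copy
    end
  end
  seen.values.sort_by { |tweet| -tweet[:id] }  # newest first
end

# Fake stand-ins for "--source archive --source api".
archive = -> { [{ id: 1, text: 'old' }, { id: 2, text: 'archived copy' }] }
api     = -> { [{ id: 3, text: 'new' }, { id: 2, text: 'api copy' }] }

merged = merge_sources([archive, api])
puts merged.map { |t| "#{t[:id]}: #{t[:text]}" }
```

Swapping the order of the array passed to `merge_sources` would make the API copy of tweet 2 win instead.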
And you can tell it what format to output:
# don't save to HTML; just save to JSON
alexandria.rb update TALlama --dest json
Or you can just tell it what filename to use:
alexandria.rb update TALlama --opt lib_file Tweets.html
And you can even pull from one file and output to another, if you want to re-parse for some reason:
alexandria.rb update TALlama --opt in_lib_file OldTweets.html --opt out_lib_file NewTweets.html
This code is released under the MIT License; use it as you wish.
Find something wrong? Tell me!