Issues with datetime to retrieve historical data via twarc #640
-
Hello, I am new to twarc and would really appreciate it if someone could guide me here. I will put two chunks of code below: the first one works, but the date range is limited to now; the second one doesn't work when I change the dates. I am not sure if I am using the wrong start date format. Any help would be appreciated.

#First code, works fine

import json
from datetime import datetime, timezone, timedelta
import pandas as pd
from twarc.client2 import Twarc2
from twarc_csv import CSVConverter

# Your bearer token here
t = Twarc2(bearer_token="xxxxx")

# Start and end times must be in UTC
start_time = datetime.now(timezone.utc) + timedelta(hours=-3)
# end_time cannot be immediately now, has to be at least 30 seconds ago.
end_time = datetime.now(timezone.utc) + timedelta(minutes=-1)

query = "$SPX lang:en -is:retweet"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")

# search_results is a generator; max_results is max tweets per page, not total.
# 100 is the max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)

# Get all results page by page:
for page in search_results:
    # Do something with the page of results:
    with open("dogs_results.jsonl", "w+") as f:
        f.write(json.dumps(page) + "\n")
    print("Wrote a page of results...")

print("Converting to CSV...")

## This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(infile, outfile)
        converter.process()

col_list = ['created_at', 'author.username', 'text', 'public_metrics.retweet_count',
            'author.public_metrics.followers_count', 'author.verified',
            'author.public_metrics.following_count', 'public_metrics.like_count']
df = pd.read_csv("dogs_output.csv", usecols=col_list)
df.to_csv('/content/drive/MyDrive/Sentiment/SP', header=True)
columns = ['DateTime', 'text', 'likes', 'retweet', 'username', 'followers', 'following', 'verified']
df.columns = columns
print("Finished.")

#Second code, only changed start and end time

# Your bearer token here
t = Twarc2(bearer_token="XXX")

import datetime
start_time = datetime.datetime(2020, 3, 21, 0, 0, 0, 0, datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, 0, 0, 0, 0, datetime.timezone.utc)

query = "$SPX lang:en -is:retweet"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")

# search_results is a generator; max_results is max tweets per page, not total.
# 100 is the max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)

# Get all results page by page:
for page in search_results:
    # Do something with the page of results:
    with open("dogs_results.jsonl", "w+") as f:
        f.write(json.dumps(page) + "\n")
    print("Wrote a page of results...")

print("Converting to CSV...")

## This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(infile, outfile)
        converter.process()

col_list = ['created_at', 'author.username', 'text', 'public_metrics.retweet_count',
            'author.public_metrics.followers_count', 'author.verified',
            'author.public_metrics.following_count', 'public_metrics.like_count']
df = pd.read_csv("dogs_output.csv", usecols=col_list)
df.to_csv('/content/drive/MyDrive/Sentiment/SP', header=True)
columns = ['DateTime', 'text', 'likes', 'retweet', 'username', 'followers', 'following', 'verified']
df.columns = columns
print("Finished.")

The error I get is the following for the second code:

Searching for "$SPX lang:en -is:retweet" tweets from 2020-03-21 00:00:00+00:00 to 2021-03-22 00:00:00+00:00...
Unexpected HTTP response: <Response [400]>

HTTPError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     15
     16 # Get all results page by page:
---> 17 for page in search_results:
     18     # Do something with the page of results:
     19     with open("dogs_results.jsonl", "w+") as f:

5 frames
/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    939
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942
    943     def close(self):

HTTPError: 400 Client Error: Bad Request for url: https://api.twitter.com/2/tweets/search/recent?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&media.fields=alt_text%2Cduration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&start_time=2020-03-21T00%3A00%3A00%2B00%3A00&end_time=2021-03-22T00%3A00%3A00%2B00%3A00&query=%24SPX+lang%3Aen+-is%3Aretweet&max_results=100
Replies: 2 comments 1 reply
-
It's interesting you chose to use twarc as a library rather than as a command line tool. It was designed for that, but I think that path is less traveled. Since you are searching for tweets from previous years you will want to use search_all() instead of search_recent(). You will need to have access to the Academic Research Product Track for that to work.
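A minimal sketch of that change, reusing the query and dates from the question (this assumes your bearer token has academic access, since search_all() hits the full-archive endpoint; the paging and CSV steps from your script stay the same):

import datetime
from twarc.client2 import Twarc2

t = Twarc2(bearer_token="XXX")  # must be an academic-access bearer token

start_time = datetime.datetime(2020, 3, 21, tzinfo=datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, tzinfo=datetime.timezone.utc)
query = "$SPX lang:en -is:retweet"

# Same keyword arguments as search_recent(), but pages through the full archive:
for page in t.search_all(query=query, start_time=start_time, end_time=end_time, max_results=100):
    print(len(page["data"]), "tweets in this page")

If you'd rather let the command line tool manage the paging and file writing, something like twarc2 search --archive should do the same job.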
-
Hi there,
The search_recent method maps to the Twitter recent search API, which only allows searching the last seven days. To search the full archive you need both Twitter academic access and to use the search_all method.
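To make the seven-day limit concrete, here is a small sketch (dates taken from the question) of why the second script's start_time is rejected with a 400 before any tweets come back:

import datetime

# Recent search only reaches back about seven days from "now".
window_start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)

start_time = datetime.datetime(2020, 3, 21, tzinfo=datetime.timezone.utc)
if start_time < window_start:
    print("start_time is outside the recent search window; use search_all() instead")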