Issues with datetime to retrieve historical data via twarc #640
-
Hello, I am new to twarc and would really appreciate it if someone could guide me here. I will put two chunks of code below: the first one works, but the date range is limited to now; the second one doesn't work when I change the dates. I am not sure if I am using the wrong start date format. Any help would be appreciated.

#First code, works fine

import json
from datetime import datetime, timezone, timedelta
import pandas as pd
from twarc.client2 import Twarc2
from twarc_csv import CSVConverter

# Your bearer token here
t = Twarc2(bearer_token="xxxxx")

# Start and end times must be in UTC
start_time = datetime.now(timezone.utc) + timedelta(hours=-3)
# end_time cannot be immediately now, has to be at least 30 seconds ago.
end_time = datetime.now(timezone.utc) + timedelta(minutes=-1)

query = "$SPX lang:en -is:retweet"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")

# search_results is a generator; max_results is max tweets per page, not total.
# 100 is the max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)

# Get all results page by page:
for page in search_results:
    # Do something with the page of results:
    with open("dogs_results.jsonl", "w+") as f:
        f.write(json.dumps(page) + "\n")
    print("Wrote a page of results...")

print("Converting to CSV...")

## This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(infile, outfile)
        converter.process()

col_list = ['created_at', 'author.username', 'text', 'public_metrics.retweet_count',
            'author.public_metrics.followers_count', 'author.verified',
            'author.public_metrics.following_count', 'public_metrics.like_count']
df = pd.read_csv("dogs_output.csv", usecols=col_list)
df.to_csv('/content/drive/MyDrive/Sentiment/SP', header=True)
columns = ['DateTime', 'text', 'likes', 'retweet', 'username', 'followers', 'following', 'verified']
df.columns = columns
print("Finished.")

#Second code, only changed start and end time

# Your bearer token here
t = Twarc2(bearer_token="XXX")

import datetime
start_time = datetime.datetime(2020, 3, 21, 0, 0, 0, 0, datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, 0, 0, 0, 0, datetime.timezone.utc)

query = "$SPX lang:en -is:retweet"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")

# search_results is a generator; max_results is max tweets per page, not total.
# 100 is the max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)

# Get all results page by page:
for page in search_results:
    # Do something with the page of results:
    with open("dogs_results.jsonl", "w+") as f:
        f.write(json.dumps(page) + "\n")
    print("Wrote a page of results...")

print("Converting to CSV...")

## This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(infile, outfile)
        converter.process()

col_list = ['created_at', 'author.username', 'text', 'public_metrics.retweet_count',
            'author.public_metrics.followers_count', 'author.verified',
            'author.public_metrics.following_count', 'public_metrics.like_count']
df = pd.read_csv("dogs_output.csv", usecols=col_list)
df.to_csv('/content/drive/MyDrive/Sentiment/SP', header=True)
columns = ['DateTime', 'text', 'likes', 'retweet', 'username', 'followers', 'following', 'verified']
df.columns = columns
print("Finished.")

The error I get is the following for the second code:

Searching for "$SPX lang:en -is:retweet" tweets from 2020-03-21 00:00:00+00:00 to 2021-03-22 00:00:00+00:00...
Unexpected HTTP response: <Response [400]>

HTTPError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     15
     16 # Get all results page by page:
---> 17 for page in search_results:
     18     # Do something with the page of results:
     19     with open("dogs_results.jsonl", "w+") as f:

5 frames
/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    939
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942
    943     def close(self):

HTTPError: 400 Client Error: Bad Request for url: https://api.twitter.com/2/tweets/search/recent?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&media.fields=alt_text%2Cduration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&start_time=2020-03-21T00%3A00%3A00%2B00%3A00&end_time=2021-03-22T00%3A00%3A00%2B00%3A00&query=%24SPX+lang%3Aen+-is%3Aretweet&max_results=100
Replies: 2 comments 1 reply
-
It's interesting you chose to use twarc as a library rather than as a command line tool. It was designed for that, but I think that path is less traveled. Since you are searching for tweets from previous years you will want to use search_all() instead of search_recent(). You will need to have access to the Academic Research Product Track for that to work.
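A minimal sketch of that change, reusing the query and dates from the question (this assumes your bearer token has academic access, since search_all() hits the full-archive endpoint; the paging and CSV steps from your script stay the same):

import datetime
from twarc.client2 import Twarc2

t = Twarc2(bearer_token="XXX")  # must be an academic-access bearer token

start_time = datetime.datetime(2020, 3, 21, tzinfo=datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, tzinfo=datetime.timezone.utc)
query = "$SPX lang:en -is:retweet"

# Same keyword arguments as search_recent(), but pages through the full archive:
for page in t.search_all(query=query, start_time=start_time, end_time=end_time, max_results=100):
    print(len(page["data"]), "tweets in this page")

If you'd rather let the command line tool manage the paging and file writing, something like twarc2 search --archive should do the same job.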
-
Hi there,
The search_recent method maps to the Twitter recent search API, which only allows searching the last seven days. To search the full archive you need both Twitter academic access and to use the search_all method.
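To make the seven-day limit concrete, here is a small sketch (dates taken from the question) of why the second script's start_time is rejected with a 400 before any tweets come back:

import datetime

# Recent search only reaches back about seven days from "now".
window_start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)

start_time = datetime.datetime(2020, 3, 21, tzinfo=datetime.timezone.utc)
if start_time < window_start:
    print("start_time is outside the recent search window; use search_all() instead")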