This script will:
To run a block of code, click the block, then press Shift+Enter. Generally, code blocks must be run in order, from the top of the notebook to the bottom, because later code blocks often depend upon things that took place in earlier code blocks. You may also run the entire notebook by clicking Cell / Run All.
Script by Ken Blake, https://drkblake.com/
Required add-ons: The script requires a once-per-environment installation of Tweepy, Pandas, and XlsxWriter. After the first time you run this script in a given Jupyter Notebook environment, you can speed up execution of the script by changing tweepy, pandas, and XlsxWriter in the pip install lines below to #tweepy, #pandas, and #XlsxWriter. Doing so tells the script to bypass these installations in future executions.
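# Tweepy is pinned below 4.0 because this script relies on Tweepy 3.x names
# (api.search, wait_on_rate_limit_notify, tweepy.TweepError) that 4.0 removed.
pip install "tweepy<4"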
pip install pandas
pip install XlsxWriter
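As an alternative to commenting out the install lines by hand, a cell could first check which packages are already importable. The sketch below is an assumption and not part of the original script: the helper name missing_packages is hypothetical, and it relies only on the standard library's importlib.util.find_spec.

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names used by this script (XlsxWriter is imported as "xlsxwriter").
needed = ["tweepy", "pandas", "xlsxwriter"]
print("Packages still to install:", missing_packages(needed))
```

An empty list suggests the pip install lines can safely be skipped in that environment.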
import tweepy
from tweepy import OAuthHandler
import pandas as pd
from pandas import ExcelWriter
import time
from datetime import datetime
print("All imports completed")
Twitter developer credentials: In the code below, replace PASTEYOURACCESSTOKENHERE, PASTEYOURACCESSTOKENSECRETHERE, PASTEYOURCONSUMERKEYHERE, and PASTEYOURCONSUMERSECRETHERE with your unique access token, access token secret, consumer key, and consumer secret, respectively. Be sure to keep the ' marks around each credential. You may obtain these credentials for free by applying for a Twitter developer account. To apply, see: https://developer.twitter.com/en/apply-for-access.
access_token = 'PASTEYOURACCESSTOKENHERE'
access_token_secret = 'PASTEYOURACCESSTOKENSECRETHERE'
consumer_key = 'PASTEYOURCONSUMERKEYHERE'
consumer_secret = 'PASTEYOURCONSUMERSECRETHERE'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
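A forgotten placeholder is an easy mistake to make when pasting four credentials. The following is a minimal sketch of a check that could be run before authenticating; the helper name unreplaced_placeholders and the dictionary layout are illustrative assumptions, not part of the original script.

```python
def unreplaced_placeholders(credentials):
    """Return the names of credentials that still hold a PASTEYOUR... placeholder."""
    return [name for name, value in credentials.items()
            if value.startswith("PASTEYOUR")]

# The four values exactly as they appear before editing.
credentials = {
    "access_token": "PASTEYOURACCESSTOKENHERE",
    "access_token_secret": "PASTEYOURACCESSTOKENSECRETHERE",
    "consumer_key": "PASTEYOURCONSUMERKEYHERE",
    "consumer_secret": "PASTEYOURCONSUMERSECRETHERE",
}

leftover = unreplaced_placeholders(credentials)
if leftover:
    print("Still need real values for:", ", ".join(leftover))
```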
Search and iteration settings:
Edit query to specify the search query you want to use. For help constructing your query, see: https://twitter.com/search-advanced. The default query includes -filter:retweets, which excludes retweets, and min_retweets:100, which limits the capture to tweets that have been retweeted at least 100 times. You may edit or omit these criteria if you like.
Edit tweets_wanted to indicate how many tweets you would like to retrieve in each iteration.
Edit since to specify a starting date for your search. By default, the Twitter API will sample only about the last seven days.
Edit Iterations to specify how many times you want the program to pull tweets from Twitter.
Edit SecondsBetweenIterations to specify how many seconds you want the program to wait between iterations. For example, specifying 3600 will cause the program to query Twitter every hour, because an hour = 60 seconds x 60 minutes = 3600 seconds. The default settings in the code below run 24 iterations one hour apart, so the final pull happens about 23 hours after you launch the code, and each iteration gathers up to 900 tweets, each retweeted at least 100 times.
query = "Congress -filter:retweets min_retweets:100"
tweets_wanted = 900
since = "2021-01-01"
Iterations = 24
SecondsBetweenIterations = 3600
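To see how these settings combine, the arithmetic can be sketched as below. The function name summarize_schedule is hypothetical. It assumes the script waits between iterations but not after the last one, and it ignores the time each search takes.

```python
def summarize_schedule(iterations, seconds_between, tweets_wanted):
    """Estimate total waiting time and the maximum number of rows collected."""
    waits = max(iterations - 1, 0)  # no wait after the final iteration
    return {
        "hours_waiting": waits * seconds_between / 3600,
        "max_tweets_total": iterations * tweets_wanted,
    }

# Default settings from the cell above.
summary = summarize_schedule(24, 3600, 900)
print(summary)
```

With the defaults this gives 23 hours of waiting and at most 21,600 rows across all saved files.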
Search, structure, save, repeat: This code runs your search, prints a running count of the number of tweets retrieved per iteration, and timestamps and saves the data from each iteration. There is nothing to edit or configure here, unless you want to find the line of code that reads ('TweetFile_{}.xlsx'.format(datetime.today() and change the generic file name prefix TweetFile to a name that is more descriptive of your project. For example, if you were grabbing tweets about Congress, you could change ('TweetFile_{}.xlsx'.format(datetime.today() to ('Congress_{}.xlsx'.format(datetime.today(). Be sure to change nothing other than TweetFile. After the script runs, the data will be available on your computer, in the same directory as the script, in one file per iteration, each labeled with the file name prefix you selected plus the date and time of the file's creation.
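The naming scheme described above can be tried on its own before running a long capture. This is a small sketch; the helper name build_filename is an assumption introduced for illustration, while the format string and the strftime pattern are the ones the script uses.

```python
from datetime import datetime

def build_filename(prefix, when):
    """Combine a project prefix with a year-month-day-hour-minute timestamp."""
    return '{}_{}.xlsx'.format(prefix, when.strftime('%Y-%m-%d-%H-%M'))

# A fixed date and time is used here so the result is predictable.
example = build_filename('Congress', datetime(2021, 1, 15, 9, 30))
print(example)  # Congress_2021-01-15-09-30.xlsx
```

Because the timestamp resolves only to the minute, two iterations less than a minute apart would write to the same file name.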
icount = 1
while icount <= Iterations:
    tweets = []
    count = 0  # tweets retrieved in this iteration
    for tweet in tweepy.Cursor(api.search,
                               q=query,
                               count=450,
                               since=since,
                               tweet_mode="extended").items(tweets_wanted):
        #print(count)
        count += 1
        try:
            data = [tweet.created_at,
                    tweet.id_str,
                    tweet.retweet_count,
                    tweet.favorite_count,
                    tweet.full_text,
                    tweet.user._json['screen_name'],
                    tweet.user._json['name'],
                    tweet.user._json['created_at'],
                    tweet.user._json['followers_count'],
                    tweet.entities['urls']]
            data = tuple(data)
            tweets.append(data)
        except tweepy.TweepError as e:
            print(e.reason)
            continue
        except StopIteration:
            break
    now = datetime.now()
    print("Tweets captured at ", now, ": ", count)
    df = pd.DataFrame(tweets, columns = ['created_at',
                                         'id_str',
                                         'retweet_count',
                                         'favorite_count',
                                         'full_text',
                                         'screen_name',
                                         'name',
                                         'account_creation_date',
                                         'followers_count',
                                         'urls'])
    # The with block saves and closes the workbook when it exits.
    with pd.ExcelWriter('TweetFile_{}.xlsx'.format(datetime.today()
                        .strftime('%Y-%m-%d-%H-%M')),
                        engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Sheet1')
    icount = icount + 1
    # Skip the wait after the final iteration.
    if icount > Iterations:
        break
    time.sleep(SecondsBetweenIterations)
print("All iterations completed")
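The repeat-and-wait pattern used above can be tried without Twitter credentials by swapping in stand-ins for the search and the pause. The sketch below is illustrative only: run_iterations, fetch, and the sleep parameter are hypothetical names, and the fake fetch simply returns a fixed list.

```python
import time

def run_iterations(iterations, seconds_between, fetch, sleep=time.sleep):
    """Call fetch() once per iteration, pausing between iterations but not after the last."""
    results = []
    for i in range(1, iterations + 1):
        results.append(fetch())
        if i < iterations:
            sleep(seconds_between)
    return results

# Record the requested pauses instead of actually sleeping.
recorded_waits = []
batches = run_iterations(3, 3600,
                         fetch=lambda: ["tweet"],
                         sleep=recorded_waits.append)
print(len(batches), "batches;", len(recorded_waits), "waits")
```

Passing the sleep function as a parameter keeps the dry run instant while leaving the control flow the same as a real capture.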