Author: Ken Blake, Ph.D.

Description: This script will capture metadata from the GDELT 2.0 API for articles matching specified date and search term criteria. The dates and search terms are embedded in URLs that are generated by an accompanying "URL Generator for GDELT Headline Grab" Excel spreadsheet, downloadable from https://tinyurl.com/y8bhdswj. For more about using the GDELT API, see: https://drkblake.com/gdeltintro/.

Note: The code directly below will install the requests module in your current Jupyter Notebook environment. The installation is necessary only once per environment. After you have installed requests, you can speed up the script's execution somewhat by commenting the line out, that is, by changing %pip install requests to #%pip install requests.

In [ ]:
%pip install requests
In [ ]:
from IPython.display import clear_output
In [ ]:
import requests
In [ ]:
import time

Directions: Paste the URLs from the spreadsheet here, between the [] in the URLlist = [] code. Enclose each URL in a pair of ' characters, and separate the URLs with commas. I have left a few example URLs in the code block to show the pattern and to give you an easy way to test the script. Delete them before pasting in the URLs you want. (If you would rather generate the URL list in Python instead of using the spreadsheet, see the optional sketch after the next cell.)

In [ ]:
URLlist = [
'https://api.gdeltproject.org/api/v2/doc/doc?query=("Donald Trump" OR "President Trump") domainis:apnews.com&startdatetime=20200501000000&enddatetime=20200502000000&mode=artlist&maxrecords=250&sort=datedesc&format=csv',
'https://api.gdeltproject.org/api/v2/doc/doc?query=("Donald Trump" OR "President Trump") domainis:apnews.com&startdatetime=20200502000000&enddatetime=20200503000000&mode=artlist&maxrecords=250&sort=datedesc&format=csv',
'https://api.gdeltproject.org/api/v2/doc/doc?query=("Donald Trump" OR "President Trump") domainis:apnews.com&startdatetime=20200503000000&enddatetime=20200504000000&mode=artlist&maxrecords=250&sort=datedesc&format=csv',
]
print("URL list loaded")

Directions: The code below will save the article metadata to a comma-separated values file called PickAFile.csv. After the script has run, you will find the file on your computer, in the same directory as the script. You may change the PickAFile.csv filename to any filename you like. Because the script opens the file in append mode, re-running it will add to any existing PickAFile.csv rather than replace it, so delete or rename the old file if you want to start fresh. Running the script while the file is open in some other application will produce an error message. To avoid the error, close the file before running the script.

The time.sleep(6) line in the code below produces a six-second pause between retrieval operations. The pause keeps you under the GDELT API's rate limit, which allows no more than one retrieval every five seconds. You probably could speed the program up slightly by changing the code to time.sleep(5), but I haven't tested a pause shorter than six seconds. Either way, if you are downloading a full month of data for each of the nine sources, plan on letting the program run for several hours.

A handy counter will appear after the first retrieval, telling you how many of your URLs have been processed. When the last URL has been processed, the script will report that "The dataset is ready to view."

In [ ]:
completed = 0
OutOf = len(URLlist)
for URL in URLlist:
    clear_output(wait=True)
    # Retrieve the article metadata for this URL's date window
    myfile = requests.get(URL)
    # Append the response to the output file ('ab' = append, binary mode)
    with open('PickAFile.csv', 'ab') as f:
        f.write(myfile.content)
    completed = completed + 1
    print("Completed:", completed, "out of", OutOf)
    # Pause to stay under the GDELT API's one-request-per-five-seconds limit
    time.sleep(6)
print("The dataset is ready to view.")
In [ ]: