An extension for extracting and downloading press articles for text mining.
Cite this software
If you use this extension for your research, please reference it as follows:
Moncomble, F. (2024). Press Corpus Scraper (Version 0.9) [JavaScript]. Arras, France: Université d’Artois. Available at: https://fmoncomble.github.io/press-corpus-scraper/
Installation
Firefox (recommended: automatic updates)
Chrome/Edge
- Download .zip archive
- Unzip the archive
- Open the extensions manager: chrome://extensions or edge://extensions
- Activate “developer mode”
- Click “Load unpacked”
- Select the unzipped folder
Instructions for use
- Navigate to the site of a supported newspaper:
- or your institution’s Europresse portal: list of supported institutions
- French websites and Europresse:
- Perform a simple or advanced search by keyword
- A box appears at the top of the results page. Example for Le Monde:
- The Guardian and The New York Times:
- Click the PCScraper button in the top right-hand corner of the site’s menu bar to open the search window
- Build a query in the interface, then click Search
- Click the
- Select the desired file format: TXT or XML/XTZ (for import into TXM with the XML-TEI Zero + CSV module)
- Click Extract
- Paywalled articles are not downloaded but listed as links
- Articles that the extension fails to process are listed as links
- When extraction is complete:
- Firefox: the .zip archive containing the files is automatically downloaded to the default folder
- Chrome/Edge: select the destination folder for the .zip archive
- Unzip the resulting archive to load the files into text analysis software
Known issues and limitations
French sites
Even with an active subscription, the extension does not have access to the full text of paywalled articles (the cookie is not accepted by the remote server). Only free-access articles are therefore retrieved, the others being listed as links.
Europresse…
- handles article metadata inconsistently, with no dedicated HTML elements, so fields can end up misplaced in downloaded files (for example, a subhead where the author name belongs). This is a limitation of Europresse, not of the add-on.
- only allows scraping 20 pages of results (1000 articles) at a time.
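To see what the 20-page cap means for a large result set, the arithmetic can be sketched as follows. The function and constant names are hypothetical and are not taken from the extension's code. The figure of 50 results per page is inferred from "20 pages (1000 articles)" above, and the sketch assumes a large query can be split into several separate runs (for instance by narrowing the date range).

```javascript
// Limits stated in the README for Europresse.
const MAX_PAGES_PER_RUN = 20;
const MAX_ARTICLES_PER_RUN = 1000;

// Inferred, not documented: 1000 articles / 20 pages = 50 results per page.
const ARTICLES_PER_PAGE = MAX_ARTICLES_PER_RUN / MAX_PAGES_PER_RUN;

// Number of result pages a query with `totalResults` hits spans.
function pagesNeeded(totalResults) {
  return Math.ceil(totalResults / ARTICLES_PER_PAGE);
}

// Number of separate extraction runs needed to cover `totalResults` hits,
// assuming each run is capped at MAX_ARTICLES_PER_RUN articles.
function runsNeeded(totalResults) {
  return Math.ceil(totalResults / MAX_ARTICLES_PER_RUN);
}

console.log(pagesNeeded(2350)); // 47
console.log(runsNeeded(2350));  // 3
```

In practice, a query that returns more than 1000 results is best narrowed before extraction so that each run stays under the cap.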
Guardian and New York Times
The query and extraction process relies on the APIs offered by these two publications. An access key is required; it can be obtained free of charge from the following links:
The New York Times
An active subscription is required to access the full text of all articles, so you need to be logged into your account first. The remote server accepts the cookie sent by the extension (for the time being), but there are a number of limitations and security features:
- requests can only return 10 results at a time, and the API only authorises 5 requests per minute: these are therefore spaced 12 seconds apart to avoid any blocking.
- the server blocks fetch requests that are too numerous and too fast: to avoid that, article content is only retrieved at a rate of 1 article per second. Despite this, a block may occur: the extension then invites you to click on a link to prove that you are not a robot…
- the subscriber account can be disconnected at any time: the extension then pauses and prompts you to click on an authentication link before resuming content retrieval.
- On Firefox: due to the way Firefox handles the dynamic loading of the NYT’s homepage, it needs to be opened in a new tab or window for the button to appear. In any other case, the button is likely to pop up briefly before disappearing.
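The pacing described in the list above (10 results per request, 5 requests per minute, 1 article per second) can be illustrated with a minimal sketch. This is not the extension's actual code: `runSpaced`, `estimateMinimumSeconds` and the constant names are hypothetical. The time estimate is only a lower bound, because it ignores network time and assumes the search requests and the article fetches run one after the other.

```javascript
// Limits stated in the README for The New York Times.
const RESULTS_PER_REQUEST = 10;   // results returned per API request
const REQUESTS_PER_MINUTE = 5;    // API rate limit
const REQUEST_INTERVAL_MS = 60000 / REQUESTS_PER_MINUTE; // 12000 ms between requests
const ARTICLE_INTERVAL_MS = 1000; // one article fetch per second

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task(item)` for each item in order, pausing `intervalMs` between
// the end of one task and the start of the next.
// `wait` can be swapped for a fake in tests (hypothetical parameter).
async function runSpaced(items, intervalMs, task, wait = sleep) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await wait(intervalMs);
    results.push(await task(items[i]));
  }
  return results;
}

// Lower bound, in seconds, on the waiting time needed for `articleCount`
// articles, assuming both phases run sequentially (an assumption).
function estimateMinimumSeconds(articleCount) {
  const searchRequests = Math.ceil(articleCount / RESULTS_PER_REQUEST);
  const searchWaitMs = Math.max(searchRequests - 1, 0) * REQUEST_INTERVAL_MS;
  const articleWaitMs = Math.max(articleCount - 1, 0) * ARTICLE_INTERVAL_MS;
  return (searchWaitMs + articleWaitMs) / 1000;
}

console.log(estimateMinimumSeconds(100)); // 207
```

Under these assumptions, 100 articles take 10 search requests (9 pauses of 12 seconds) plus 99 one-second pauses between article fetches, so at least about three and a half minutes. Any block or re-authentication prompt adds to that.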