Press corpus scraper (English version)

I am pleased to announce the availability of Press corpus scraper, a new tool for collecting corpora of press articles.

Press corpus scraper is a browser extension (Firefox and Chrome/Edge compatible) that injects an interface into supported newspaper websites. After a keyword search, this interface lets you extract and download the text of all or some of the articles, one file per article. Two file formats are offered: .txt for raw text, and .xml for texts prepared for import into the TXM textometry software (XML-TEI Zero + CSV import module).

The sites supported to date are Le Monde, Le Figaro, Le Point and L’Humanité on the French side, as well as The Guardian and The New York Times. Other publications will be added later.

Europresse is also supported.

This is what the interface injected by the extension looks like on the search results page of a site like Le Monde:

You can download all of the search results, or only those on the page currently displayed on the screen.

For The Guardian and The New York Times, things are a little different since these two newspapers offer an API that allows you to build queries and directly retrieve article data and metadata. In their case, the extension adds a simple button to the site’s menu bar:

Clicking the button brings up a search interface:

You can build a custom query with a combination of keywords, sections, start and end dates, etc. before launching the actual extraction process.
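To give an idea of what such a query looks like behind the scenes, here is a minimal sketch in Python that assembles a search URL for The Guardian's Content API from keywords, a section and a date range. It is not the extension's own code: the function name `build_guardian_query` and the choice of returned fields are assumptions made for illustration, and the key shown is a placeholder for your personal key.

```python
from urllib.parse import urlencode

# Search endpoint of The Guardian's Content API.
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_guardian_query(api_key, keywords, section=None,
                         from_date=None, to_date=None, page=1):
    """Return a search URL (hypothetical helper, not part of the extension).

    Dates are expected as YYYY-MM-DD strings.
    """
    params = {
        "q": keywords,
        "api-key": api_key,
        "page": page,
        # Assumption: ask for the plain body text and the byline of each article.
        "show-fields": "bodyText,byline",
    }
    # Optional filters are only added when they are provided.
    if section:
        params["section"] = section
    if from_date:
        params["from-date"] = from_date
    if to_date:
        params["to-date"] = to_date
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_guardian_query(
        "YOUR-KEY",
        "climate change",
        section="environment",
        from_date="2024-01-01",
        to_date="2024-01-31",
    ))
```

Fetching the resulting URL returns the matching articles as JSON, one page at a time; increasing `page` walks through the remaining results.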

NB. Searching The Guardian or The New York Times requires the creation of a personal key, which can be obtained free of charge: here for The Guardian and there for The NYT.

At the end of the process, a .zip archive containing all the .txt or .xml files is downloaded. Simply unzip the archive and load the files into the textual analysis software of your choice, such as TXM:

Files are named YYYY-MM-DD_Source_Author for easy sorting. As can be seen above, XML files encode four types of metadata (source, author, title and date) and a link to the article on the original site.
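The naming scheme can be illustrated with a short Python sketch that builds such a file name and reads the metadata back from it, which may be handy when preparing a corpus outside TXM. The helper names `article_filename` and `parse_article_filename` are hypothetical, and the way spaces and file extensions are handled here is an assumption; the extension's actual output may differ in those details.

```python
from datetime import date


def article_filename(published, source, author, extension="txt"):
    """Build a name following the YYYY-MM-DD_Source_Author pattern."""
    return f"{published.isoformat()}_{source}_{author}.{extension}"


def parse_article_filename(filename):
    """Split a file name back into date, source, author and format."""
    # Everything after the last dot is treated as the extension.
    stem, _, extension = filename.rpartition(".")
    # Split on the first two underscores only, so that any underscore
    # inside the author part is preserved.
    day, source, author = stem.split("_", 2)
    return {
        "date": date.fromisoformat(day),
        "source": source,
        "author": author,
        "format": extension,
    }


if __name__ == "__main__":
    names = [
        article_filename(date(2024, 2, 15), "The Guardian", "Jane Doe"),
        article_filename(date(2023, 11, 3), "Le Monde", "Jean Dupont", "xml"),
    ]
    # Because the date comes first in ISO format, alphabetical order
    # is also chronological order.
    for name in sorted(names):
        print(name, parse_article_filename(name))
```

Putting the ISO date first is what makes a plain alphabetical sort in a file manager match the order of publication.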

NB. This resource is a work in progress by a completely self-taught amateur! It is possible, even probable, that some bugs remain, and that the code can be optimized. Any constructive feedback will therefore be welcome!

To find out more and install the extension: > click <
