Scholarly APIs and Datasets
Some of these resources allow large data downloads for research purposes; others are intended for smaller-scale research activities such as experimenting with visualization and developing tools to query journal and citation databases.
Most of these resources require some programming skills.
Datasets and APIs for Full-Text Retrieval
arXiv provides open access to electronic pre-prints in Astrophysics, Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Automated downloads using robots, spiders, accelerators, and similar tools are not permitted. arXiv does, however, provide several mechanisms for machine access to metadata and full text. See arXiv's Bulk Data Access guide for more information.
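As a rough illustration of the metadata route, the sketch below builds a query URL for arXiv's public API and pulls titles out of an Atom response using only the Python standard library. The endpoint `http://export.arxiv.org/api/query` and the `search_query`, `start`, and `max_results` parameters come from arXiv's API documentation rather than from this guide, so they should be checked against the current manual. The helper names and the sample feed are invented for the example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint, taken from arXiv's API documentation; verify before use.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def arxiv_query_url(search_query, start=0, max_results=10):
    """Build a query URL; keep max_results small to stay within polite use."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def parse_titles(atom_xml):
    """Return the entry titles from an Atom feed, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    return [
        " ".join(entry.findtext(ATOM + "title", default="").split())
        for entry in root.iter(ATOM + "entry")
    ]


# Invented sample feed, used so the parser can be tried without a network call.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>An Invented
    Example Title</title></entry>
</feed>"""

print(arxiv_query_url("all:electron"))
print(parse_titles(SAMPLE_FEED))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose; any real harvesting should follow the rate limits described in arXiv's own guidance.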
Chronicling America provides access to information about historic newspapers and select digitized newspaper pages. Use the API to search the newspaper directory and digitized page contents, or take advantage of the site's stable URL pattern. Digitized newspaper content from grant awardees in the National Digital Newspaper Program is delivered in the form of batches, which are made available in various views. Bulk access to Chronicling America's OCR data is available here: http://chroniclingamerica.loc.gov/ocr/.
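The stable URL pattern mentioned above can be sketched as a small helper. The path layout used here (`/lccn/<LCCN>/<date>/ed-<edition>/seq-<page>/`, optionally followed by a file name such as `ocr.txt`) is an assumption based on Chronicling America's published documentation and may change as the Library of Congress updates the site. The function name and the sample LCCN are placeholders.

```python
from datetime import date

BASE = "https://chroniclingamerica.loc.gov"


def page_url(lccn, issue_date, edition=1, sequence=1, suffix=""):
    """Build the address of one digitized page.

    lccn       -- Library of Congress Control Number of the newspaper title
    issue_date -- datetime.date of the issue
    edition    -- edition number within that day
    sequence   -- page sequence number within the issue
    suffix     -- optional file name (e.g. "ocr.txt"); assumed, not guaranteed
    """
    return (
        f"{BASE}/lccn/{lccn}/{issue_date.isoformat()}"
        f"/ed-{edition}/seq-{sequence}/{suffix}"
    )


# Placeholder title identifier and date, for illustration only.
print(page_url("sn86069873", date(1900, 1, 5), sequence=2, suffix="ocr.txt"))
```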
The Digital Public Library of America (DPLA) provides access to a wide range of content from America’s museums, libraries, and archives. The DPLA API offers metadata (and meta-metadata) on two types of resources: items, which represent single physical objects indexed by a DPLA data provider, and collections, which are logical groupings of items. All data in the DPLA repository is available for download as zipped JSON files: https://dp.la/info/developers/download/
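To make the items/collections distinction concrete, the sketch below builds search URLs for each resource type. The base address `https://api.dp.la/v2`, the `q` parameter, and the `api_key` parameter are assumptions drawn from DPLA's developer documentation, not from this guide. The key shown is a dummy value; a real one has to be requested from DPLA.

```python
from urllib.parse import urlencode

DPLA_API = "https://api.dp.la/v2"  # assumed base URL; check DPLA's developer docs
RESOURCE_TYPES = ("items", "collections")  # the two resource types described above


def dpla_search_url(resource, query, api_key, page_size=10):
    """Build a keyword-search URL for either items or collections."""
    if resource not in RESOURCE_TYPES:
        raise ValueError(f"resource must be one of {RESOURCE_TYPES}")
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return f"{DPLA_API}/{resource}?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder, not a working credential.
print(dpla_search_url("items", "suffrage posters", "YOUR_API_KEY"))
```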
Europeana acts as an interface to millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe. The Europeana REST API is suited for dynamic search and retrieval of data. All Europeana datasets can be explored and queried through a SPARQL endpoint provided by Ontotext. The Annotations API, currently an alpha release, is an extension to the Europeana REST API that allows for the management of annotations to metadata or media. The Europeana OAI-PMH Service, currently in beta with limited service, allows harvesting of all Europeana metadata, or a selection of it, using the OAI-PMH protocol.
Google Books Ngram Viewer creates graphs that show the number of times certain keywords appear in publications over a defined time range. Searchable corpora were generated in July 2009 and July 2012. The ngram data is available for download: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
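Once downloaded, the ngram files can be processed line by line. The sketch below assumes the version 2 layout described on the download page, in which each line holds four tab-separated fields: the ngram, the year, the match count, and the volume count. That layout should be confirmed against the files actually downloaded; the sample rows are made up.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    ngram: str
    year: int
    match_count: int   # occurrences of the ngram in that year
    volume_count: int  # number of volumes containing it in that year


def parse_ngram_line(line):
    """Split one tab-separated line into a typed row (assumed v2 layout)."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRow(ngram, int(year), int(match_count), int(volume_count))


def total_matches(lines, ngram):
    """Sum match counts for one ngram across all years present in `lines`."""
    rows = map(parse_ngram_line, lines)
    return sum(row.match_count for row in rows if row.ngram == ngram)


# Invented rows in the assumed layout; real counts will differ.
SAMPLE = [
    "library\t1900\t120\t45\n",
    "library\t1901\t130\t50\n",
    "archive\t1900\t15\t9\n",
]
print(total_matches(SAMPLE, "library"))
```

Because the real files are large, passing an open file object as `lines` keeps memory use flat, since the rows are consumed one at a time.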
Note: This is a beta data release.
This dataset contains notable or informative characteristics (features) of the public domain volumes of the HathiTrust Digital Library. A number of useful features have been processed and are provided per-page, including part-of-speech tagged token counts, header and footer identification, and various line-level information.
Note: This is a beta data release.
This dataset represents a first attempt to provide word counts for English-language fiction, drama, and poetry published between 1700 and 1922 and contained in the HathiTrust Digital Library. Word counts come from the HathiTrust Research Center (HTRC); information about genre comes from a parallel project led by Ted Underwood and supported by the National Endowment for the Humanities and the American Council of Learned Societies.
Data for Research (DfR) is a free service for researchers wishing to analyze content on JSTOR through a variety of lenses and perspectives. Metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents can be downloaded. Larger datasets may be available upon request.
The Springer OpenAccess API provides metadata, full-text content, and images for over 280,000 open access articles from BioMed Central and SpringerOpen journals. The Springer Metadata and Springer Meta APIs allow users to retrieve metadata for more than 7 million online documents.
APIs for Limited Data Retrieval
The Google Books API lets users programmatically perform most of the operations that are possible interactively on the Google Books website, but it is not intended to be used as a replacement for commercial services.
The HathiTrust Bibliographic API returns bibliographic, copyright, and volume information (including permanent URLs) when queried with a variety of standard identifiers (e.g., ISBN, LCCN, or OCLC number). The API's intended use is to retrieve information about small numbers of items at a time.
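A single-identifier lookup might be built as follows. The URL layout (`/api/volumes/<brief|full>/<id type>/<id>.json` on `catalog.hathitrust.org`) is an assumption taken from HathiTrust's API documentation, and only the three identifier types named above are accepted here, although the service may support others. The function name and the sample OCLC number are illustrative.

```python
BIB_API = "https://catalog.hathitrust.org/api/volumes"  # assumed base URL
ID_TYPES = {"isbn", "lccn", "oclc"}   # identifier types named in this guide
DETAIL_LEVELS = {"brief", "full"}     # assumed response variants


def bib_url(id_type, identifier, detail="brief"):
    """Build a JSON lookup URL for one identifier."""
    id_type = id_type.lower()
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type}")
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"detail must be one of {sorted(DETAIL_LEVELS)}")
    return f"{BIB_API}/{detail}/{id_type}/{identifier}.json"


# Placeholder OCLC number, for illustration only.
print(bib_url("OCLC", "424023"))
```

The helper deliberately takes one identifier per call, in keeping with the API's intended use for small numbers of items.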
HathiTrust's Data API makes it possible to retrieve page images, OCR text, rights information, and a variety of other data about objects in the repository. The API's intended use is for those who have an item identifier and need the corresponding data or metadata. The API only accepts one ID per request and is meant for burst activities, not large-scale retrieval of content. Information about obtaining full-text OCR datasets of public domain works can be found here: https://www.hathitrust.org/datasets
The Article-Level Metrics (ALM) API provides access to article information, including online usage, citations, social bookmarks, notes, comments, ratings and blog coverage. The API is not intended for high-volume retrieval of data for all articles.
The PLOS Search API allows PLOS content to be queried using any of the twenty-three terms in the PLOS Search. Requests should be limited to those that return fewer than 100 rows.
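One way to respect the row limit is to enforce it when the request is built. In the sketch below, the endpoint `https://api.plos.org/search` and the Solr-style `q`, `rows`, `start`, and `wt` parameters are assumptions based on PLOS's API documentation; the helper name and the `title` field used in the example are illustrative.

```python
from urllib.parse import urlencode

PLOS_SEARCH = "https://api.plos.org/search"  # assumed endpoint
MAX_ROWS = 99  # "fewer than 100 rows", per the guidance above


def plos_search_url(field, term, rows=50, start=0):
    """Build a fielded search URL, refusing requests over the row limit."""
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError(f"rows must be between 1 and {MAX_ROWS}")
    params = {"q": f"{field}:{term}", "rows": rows, "start": start, "wt": "json"}
    return PLOS_SEARCH + "?" + urlencode(params)


print(plos_search_url("title", "DNA"))
```

Larger result sets can then be paged through by increasing `start` in steps of `rows`, rather than by raising `rows` itself.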
API Directories and Data Set Lists
Awesome Public Datasets is a list of public data sources collected and tidied from blogs, answers, and user responses. Most of the data sets are free, but some are not.
Data.gov includes a catalog of APIs from across government.
The National Library of Medicine (NLM) supports several APIs which allow a variety of types of access to data. More data resources can be found on their Databases, Resources & APIs page: https://eresources.nlm.nih.gov/nlm_eresources/
ProgrammableWeb.com is "the world's leading source of news and information about Internet-based application programming interfaces (APIs)." Their API directory contains information about more than 14,000 APIs and can be filtered by category or protocol.
Chronicle is a tool for graphing the usage of words and phrases in New York Times reporting.
Google Books Ngram Viewer creates graphs that show the number of times certain keywords appear in publications over a defined time range. Searchable corpora were generated in July 2009 and July 2012. Multiple keywords can be graphed at the same time. Advanced features include wildcard search, inflection search, case insensitive search, part-of-speech tags, and ngram compositions.
The HathiTrust Research Center (HTRC) enables computational access to the public domain corpus of the HathiTrust Digital Library for nonprofit and educational researchers. The HTRC provides an infrastructure to search, collect, analyze, and visualize the full text of nearly 3 million public domain works.