arXiv Bulk Data Access

arXiv provides open access to electronic pre-prints in Astrophysics, Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics.

Due to practical and financial constraints on arXiv services, automated downloads using robots and similar tools are not permitted. Instead, arXiv provides several mechanisms for machine access to metadata and full text.

Bulk Metadata Access

Available information

The following information is available for every arXiv article:

  • Title of the article
  • The date on which the first version of the article was submitted and processed
  • The abstract for the article
  • The article authors, in order of authorship
  • The article category (using arXiv, the ACM Computing Classification System, and/or the Mathematics Subject Classification)
  • A link to the article's abstract page
  • A link to the article PDF

The following information may be available for a given arXiv article:

  • Author affiliation
  • The DOI for the article
  • A DOI link
  • The primary arXiv classification
  • Author comments
  • Journal reference for the article
  • The date on which the retrieved version of the article was submitted and processed

For more information about arXiv metadata, read about formatting of metadata fields for arXiv submissions and/or the outline of an Atom feed.
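
As a rough illustration of how the fields listed above might be read from a single Atom entry, here is a minimal Python sketch using only the standard library. The element names, the arXiv namespace URI, and the mapping of published/updated to the two dates are assumptions about the feed layout and should be checked against the Atom feed outline; the sample entry is invented for illustration and does not describe a real article.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. The Atom URI is the standard one; the arXiv extension
# URI is an assumption about the feed's schema.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hypothetical entry, invented for illustration only.
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/0000.00000v2</id>
  <title>An Example Title</title>
  <published>2020-01-01T00:00:00Z</published>
  <updated>2020-02-01T00:00:00Z</updated>
  <summary>An example abstract.</summary>
  <author><name>A. Author</name></author>
  <author><name>B. Author</name></author>
  <category term="cs.DL"/>
  <arxiv:primary_category term="cs.DL"/>
  <link rel="alternate" type="text/html"
        href="http://arxiv.org/abs/0000.00000v2"/>
  <link title="pdf" rel="related" type="application/pdf"
        href="http://arxiv.org/pdf/0000.00000v2"/>
</entry>
"""


def _text(node, path):
    """Return the stripped text of the first match for path, or None."""
    found = node.find(path, NS)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_entry(xml_text):
    """Extract always-present and optional metadata from one Atom entry."""
    entry = ET.fromstring(xml_text)
    links = entry.findall("atom:link", NS)
    abstract_url = next(
        (l.get("href") for l in links if l.get("rel") == "alternate"), None
    )
    pdf_url = next(
        (l.get("href") for l in links if l.get("type") == "application/pdf"),
        None,
    )
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "title": _text(entry, "atom:title"),
        "first_version_date": _text(entry, "atom:published"),
        "retrieved_version_date": _text(entry, "atom:updated"),
        "abstract": _text(entry, "atom:summary"),
        # Authors are kept in document order, i.e. order of authorship.
        "authors": [
            _text(a, "atom:name") for a in entry.findall("atom:author", NS)
        ],
        "categories": [
            c.get("term") for c in entry.findall("atom:category", NS)
        ],
        "primary_category": primary.get("term") if primary is not None else None,
        "abstract_url": abstract_url,
        "pdf_url": pdf_url,
        # Optional fields: None when the entry does not carry them.
        "doi": _text(entry, "arxiv:doi"),
        "comment": _text(entry, "arxiv:comment"),
        "journal_ref": _text(entry, "arxiv:journal_ref"),
    }
```

Calling parse_entry(SAMPLE_ENTRY) returns a dictionary in which the optional fields (DOI, author comments, journal reference) are None, since the sample entry omits them.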

Accessing bulk metadata

arXiv offers three ways to access metadata in bulk. If you use a programming language (such as Python, Perl, Ruby, or PHP) to access APIs, or if you are a developer who wants to build a search interface for the archive, see the arXiv API or the OAI protocol for metadata harvesting (OAI-PMH).

If you want to find out about new additions to the archive without programming, check out the arXiv RSS feeds. Information about constructing arXiv RSS news feed URLs can be found here.

Categories and Subject Classes

To locate category names or subject class letters, look at the subject archive that you wish to subscribe to.

Subjects are listed on the arXiv home page (https://arxiv.org/). Open a given category archive page to see a list of available subcategories. For example, Physics (category: physics) has subcategories including Accelerator Physics (subject class: acc-ph), Fluid Dynamics (subject class: flu-dyn), and Physics Education (subject class: ed-ph).

Examples

https://arxiv.org/rss/astro-ph?version=0.91 is the URL for the RSS page (/rss) with version 0.91 output (version=0.91) of the Astrophysics (/astro-ph) archive.

https://arxiv.org/rss/cs.DL is the URL for the RSS page (/rss) of the Computer Science archive (/cs) Digital Libraries subject class (.DL).
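
The URL pattern in the two examples above can be captured in a small helper. This is a sketch only: the function name rss_url and its parameters are hypothetical, and it simply reproduces the pattern shown in the examples rather than any official client.

```python
from urllib.parse import urlencode

# Base path of the RSS pages, taken from the example URLs above.
RSS_BASE = "https://arxiv.org/rss/"


def rss_url(archive, subject_class=None, version=None):
    """Build an RSS feed URL for an archive, optionally narrowed to a
    subject class and/or pinned to an RSS output version."""
    # A subject class is appended to the archive name with a dot.
    name = archive if subject_class is None else f"{archive}.{subject_class}"
    url = RSS_BASE + name
    # The output version, if requested, goes in the query string.
    if version is not None:
        url += "?" + urlencode({"version": version})
    return url


print(rss_url("astro-ph", version="0.91"))
print(rss_url("cs", subject_class="DL"))
```

The two calls print the two example URLs given above.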

Bulk Full-Text Access

High Energy Physics - Theory papers, 1992 - 2003

The KDD Cup 2003 was a knowledge discovery and data mining competition held in conjunction with the Ninth Annual ACM SIGKDD Conference. Approximately 29,000 papers from 1992 through 2003 from the High Energy Physics - Theory (hep-th) portion of arXiv were used for the 2003 competition. This dataset may be found here and includes extracted citation data.

Bulk PDF access

Processed PDF and source files are available from the Amazon Simple Storage Service (Amazon S3). When using Amazon S3 Requester Pays Buckets, the requester of the data pays for the data transfer and the request. Pricing information can be found here.

PDFs and source files are available in the "arxiv" Requester Pays bucket. New content is added on an approximately monthly schedule, with updates to existing files happening less frequently.
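
A minimal sketch of fetching one object from the Requester Pays bucket with Python might look like the following. It assumes the third-party boto3 package is installed and that AWS credentials are configured for an account that will be billed for the transfer. The function names are hypothetical, and the object key is left as a parameter because the key layout inside the bucket is not described here.

```python
# Bucket name as given in the text above.
BUCKET = "arxiv"


def requester_pays_args():
    """Extra arguments telling S3 that the caller accepts the
    request and data-transfer charges."""
    return {"RequestPayer": "requester"}


def download_object(key, dest_path):
    """Download one object from the bucket to a local file.

    key is the object's key within the bucket (hypothetical here);
    dest_path is where the file is written locally.
    """
    # Imported lazily so the helper above can be used without boto3.
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(BUCKET, key, dest_path, ExtraArgs=requester_pays_args())
```

Requests to a Requester Pays bucket that do not declare the requester as payer are rejected, which is why the extra argument is passed on every call.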

More information about bulk access to arXiv full-text articles and source files can be found here.

License Information

Most articles are submitted to arXiv with the default arXiv license, which does not assign copyright to arXiv, nor grant arXiv the right to grant any specific rights to others. Therefore, you must link back to arXiv for downloads of full-text when building indexes or other tools. Additional license information is available here.