Environmental Scan

While social media archiving is still an emerging field within the archival profession, there are several existing social media archival programs that could inform future efforts. Additionally, increasingly sophisticated proprietary and open source software and services are available to support social media archiving.

Efforts to Collect Social Media Content

In the past decade, there have been several national initiatives to collect social media content. Depending on the purpose and resources of the institution, these archival efforts employ varying methods to harvest content deemed significant by the institution.

In 2008, The National Archives of the United Kingdom began a beta test of a Twitter archival program designed to collect tweets from official government organizations. Using tools it did not publicly name, the Archives collected original tweets from official government accounts. Retweets by official accounts were excluded from capture, as were all tweets and retweets from external accounts. Linked content from official UK government web pages was also captured and preserved. By the time the initial test ended in 2013, 65,000 tweets from 43 government accounts, as well as seven other accounts affiliated with the 2012 London Olympic and Paralympic Games, had been archived and made available to the public.[1] The National Archives also created a similar compilation of YouTube videos.[2]

In 2010, the Library of Congress (LOC) announced an agreement with Twitter that would initially provide the Library with all publicly published tweets from Twitter’s founding in 2006 through 2010. The agreement also stipulated that Twitter would continue to provide all future public tweets to the LOC. A 2013 white paper on the Twitter archive stated that the LOC had collected 170 billion tweets by that time.[3] Currently, the tweets are not publicly accessible, due in part to their sheer volume and to the lack of software able to organize and search the accumulated data in a form usable by researchers.[4]

Organized in 2011 by the Department of Informatics at the Aristotle University of Thessaloniki in Thessaloniki, Greece, and funded by the European Commission, BlogForever worked alongside 12 European organizations with “multidisciplinary skills and ongoing work in weblogs, digital preservation, web archiving, semantics, analytics and software engineering skills” to create a site that would “harvest, preserve, manage and reuse blog content.”[5] Their work eventually became a blog archiving toolkit designed for use by businesses and individuals. The BlogForever Platform consists of a digital repository and a web crawler (“Spider”). There are two parallel versions of the Spider, one built on an open source Python framework and the other on Microsoft’s proprietary .NET framework. As the name implies, BlogForever was built solely to harvest blog content, although both the Spider and repository components of the platform can be paired with other repositories and crawlers. The project ended in 2013, and BlogForever has not since shared details on the platform’s adoption.[6]
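The harvester/repository split at the heart of the BlogForever Platform can be illustrated with a minimal sketch. This is written in Python rather than drawn from the project’s actual codebase, and all names are illustrative: a crawler parses a blog’s RSS feed into normalized records, and a repository stores them, deduplicating re-crawls by URL.

```python
import xml.etree.ElementTree as ET

def harvest_feed(rss_xml: str) -> list[dict]:
    """Parse an RSS feed document into normalized blog-post records."""
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return posts

class Repository:
    """Minimal in-memory repository keyed by post URL, so re-crawling
    the same feed does not create duplicate records."""
    def __init__(self):
        self._store = {}

    def ingest(self, posts):
        for post in posts:
            self._store[post["link"]] = post

    def count(self):
        return len(self._store)

SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>First post</title><link>http://example.org/1</link>
        <pubDate>Mon, 01 Jul 2013 00:00:00 GMT</pubDate></item>
  <item><title>Second post</title><link>http://example.org/2</link>
        <pubDate>Tue, 02 Jul 2013 00:00:00 GMT</pubDate></item>
</channel></rss>"""

repo = Repository()
repo.ingest(harvest_feed(SAMPLE_FEED))
repo.ingest(harvest_feed(SAMPLE_FEED))  # a second crawl adds nothing new
print(repo.count())  # 2
```

Because the two components touch only through the normalized record format, either side can be swapped out, which mirrors the source’s point that the Spider and repository can each be used with other crawlers and repositories.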

The incorporation of social media content into archival collections is becoming increasingly prominent among local and regional cultural heritage institutions in the United States. One common practice is to collect social media immediately relevant to the institution, such as a government documenting official tweets and blog posts, or a university documenting the social media presence of campus organizations and departments. More recently, institutions have also begun collecting social media content generated during significant current events in order to document public sentiment in real time. These events may relate not to the institution itself but to a wider national narrative. While institutions are collecting social media content themselves, using harvesting tools or partnering with third-party vendors, they are also accepting social media data from donors alongside physical collections.

When Love Makes a Family, Inc., closed its doors in 2009, the foundation donated all of its records to Yale University. Included in those records were Facebook, Twitter, and YouTube data, along with other born-digital records. Access to these digital records is limited; they can be viewed onsite via a Yale University Library Manuscripts and Archives computer.[7]

Some institutions have incorporated social media into their collections by actively soliciting “donations” at the onset of events of interest. In June 2012, University of Virginia (UVA) President Teresa Sullivan was forced to resign after two years in the position.[8] University staff and students protested, and Sullivan was eventually reinstated. In the midst of the protest, UVA’s University Archives began to collect physical and digital materials related to the event and called on the public to contribute their own materials to the historical record. The UVA Library created a website for the collection where previously contributed items can be viewed and new contributions can be made. The collection’s scope includes “public Facebook events and support groups [and] tweets that use hash tags related to the controversy.” As of June 2015, only a few social media contributions had been made; the collection consists of online news reports and other published articles, photographs, e-mails, letters, and links to web pages containing video and audio recordings. According to the website, the option to contribute remains open.[9]

In September 2011, at the onset of the Occupy Wall Street movement, people protesting social and economic inequality gathered in New York City’s Lower Manhattan. The movement’s slogan, “We are the 99%,” emphasized unrest over U.S. wealth distribution and the outsized power wielded over government and big business by the nation’s wealthiest 1%. The movement actively used social media, especially Twitter, to spread its message and organize protests. Emory University, along with other universities and cultural heritage organizations, including New York University’s Tamiment Library and Robert F. Wagner Labor Archives, George Mason University’s Roy Rosenzweig Center for History and New Media, and the New-York Historical Society, began collecting content related to the movement, especially social media posts.[10] Scott Turnbull, a software engineering manager at Emory Libraries’ Digital Scholarship Commons (DiSC), developed Twap to gather tweets, and DiSC fellows created maps from the accumulated data that visualized the locations of protestors and their discussions. Although the archive ultimately contained around 10 million tweets, they cannot be viewed individually, as “legal, ethical, privacy and copyright concerns constrain the distribution of the collected tweets.” Instead, the tweets are compiled into data sets for researchers.[11] In addition to Emory’s Twitter archive, the Roy Rosenzweig Center for History and New Media maintains the Occupy Archive, built with Omeka. This archive collects “stories, photos, video, and sounds from those participating in, organizing, or observing Occupy Movements” worldwide.[12]

In response to the protests in Ferguson, Missouri, in 2014 and 2015 over the police shooting death of Michael Brown, Washington University worked with local universities, community organizations, and cultural heritage institutions to preserve material related to Brown’s death and the ensuing protests. Using Omeka and Archive-It, these groups created Documenting Ferguson, an online repository that documents events in Ferguson and provides public access for researchers. Content can be submitted through the website, which aims to “facilitate dialog and encourage educational outreach and community reconciliation within greater St. Louis.”[13] The related #blacklivesmatter web archive collects born-digital records, including social media content, websites, and blog posts. Related physical objects are being collected by the Missouri History Museum and Washington University.[14]

Some institutions, while not actively collecting social media at this time, are working toward collecting across various social media platforms. One such institution is the University of Texas at Austin,[15] where records management is mandatory. In 2013, a cross-departmental group was assembled to formulate a plan for archiving the “institutional domain,” which they determined included social media content.[16] The group is currently in the process of developing a records retention policy for web-based records. They believe the best outcome would combine historical preservation and records management under the same web archiving solution. Once completed, this policy will be reviewed for approval by University of Texas’s Information Technology Governance Committee and Texas State Records Management. After it is approved, the policy will be published online.

Tools and Services

Archive-It and ArchiveSocial are two popular services that offer institutions fee-based subscriptions to archive and maintain their web content, including social media. Archive-It, built by the Internet Archive, focuses on preserving web content broadly. The Oklahoma Department of Libraries is one institution that uses Archive-It, collecting Twitter and Facebook posts from four sources: the Go Green Oklahoma Twitter account, the Go Green homepage, the Oklahoma Government Twitter account, and the Oklahoma Government Facebook account. Go Green tweets are private, so although Archive-It has been harvesting the account since 2011, the content is not available to the public.[17] Other governments and universities that use the service to harvest social media content include Johns Hopkins University, the University of Wisconsin-Madison, and the Texas State Library and Archives Commission (TRAIL).

North Carolina-based ArchiveSocial focuses on social media preservation, and provides subscribers with the option to harvest data from Facebook, Twitter, LinkedIn, YouTube, and Instagram. In December 2012, using ArchiveSocial, the State Archive of North Carolina launched one of the first digital archives of social media content from selected government agencies.[18] Additional examples of institutions that use this platform include the South Carolina State Library, the City of Austin, Texas, and Snohomish County, Washington.

Instead of using a subscription program or a pre-existing open source program, some institutions have created new tools and platforms. The applications implemented by such institutions can serve as examples for others who consider creating their own harvesting application.

The George Washington University (GWU) Libraries created the Social Feed Manager software to fulfill the needs of the university. First developed for departmental research purposes, Social Feed Manager is a Django application that automates the collection and management of multiple feeds of social media data via Twitter’s public API. After discussions with other institutions, its potential for use beyond GWU was quickly recognized, and an Institute of Museum and Library Services Sparks Ignition Grant was secured for further development. One of the software’s primary goals is now to be useful to other cultural heritage organizations that want to collect social media data. To that end, GWU Libraries has made it available as an open source application downloadable via GitHub, along with user guidelines and documentation.[19] Among the software’s current uses at GWU is the University Archives’ effort to capture aspects of student life on social media by harvesting tweets from university offices and student organizations. GWU Libraries hopes this software will make it possible for other institutions to identify, collect, and preserve social media data useful to students, scholars, archivists, and librarians for research purposes and for future use.[20]
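A core task of any tool in Social Feed Manager’s niche is flattening the nested JSON that Twitter’s API returned at the time into flat records suitable for archival storage and researcher export. The sketch below is illustrative, not Social Feed Manager’s actual code; the field names (`id_str`, `created_at`, `user.screen_name`, `text`) follow Twitter’s API of that era.

```python
import csv
import io

def flatten_tweet(tweet: dict) -> dict:
    """Reduce a nested tweet payload to a flat archival record."""
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "screen_name": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "retweet_count": tweet.get("retweet_count", 0),
    }

def to_csv(tweets) -> str:
    """Write flattened tweets to CSV for delivery to researchers."""
    rows = [flatten_tweet(t) for t in tweets]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = {
    "id_str": "123",
    "created_at": "Wed Jun 24 12:00:00 +0000 2015",
    "user": {"screen_name": "gwtweets"},
    "text": "Commencement this weekend!",
    "retweet_count": 4,
}
print(to_csv([sample]).splitlines()[0])
# id,created_at,screen_name,text,retweet_count
```

In a real deployment a scheduler would call the API on a timer and append new records to storage; the flattening step is what makes the resulting data sets usable in spreadsheets and analysis tools.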

North Carolina State University (NCSU) Libraries has developed Lentil, a Ruby on Rails engine that harvests images through the Instagram API. The public interface provides several ways to browse and share the images, and lets users select their favorites. There is an administrative interface for moderating harvested images and a system for generating donor agreements in preparation for ingest into external repositories. Lentil is a flexible open source application designed for use across devices, including mobile phones, tablets, desktops, and large screens.[21] The application was originally created for the My #HuntLibrary project, a platform to foster student and community engagement with NC State University’s new James B. Hunt Jr. Library and to promote feelings of ownership among its users. This crowdsourced documentation effort, and the Instagram images it generated, will become part of NCSU’s permanent digital collection. Programs such as My #HuntLibrary let users contribute to the university’s historical record through image submissions, and also share the task of records curation via “like” and “battle” (i.e., this-or-that) voting tools.[22]
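The harvest-then-moderate workflow described above can be sketched as two small steps: follow the API’s pagination cursors until a tag’s results are exhausted, then filter to the images an administrator has approved. This is a Python sketch under stated assumptions, not Lentil’s Ruby code; `fetch_page`, the `next_cursor` field, and the ID values are all illustrative stand-ins for real Instagram API calls.

```python
def harvest_tag(fetch_page, tag: str) -> list[dict]:
    """Follow pagination cursors until the API reports no next page."""
    images, cursor = [], None
    while True:
        page = fetch_page(tag, cursor)
        images.extend(page["data"])
        cursor = page.get("next_cursor")
        if cursor is None:
            return images

def moderate(images, approved_ids):
    """Keep only images an administrator has approved for display/ingest."""
    return [img for img in images if img["id"] in approved_ids]

# Stand-in for a real API client, returning two pages of results.
def fake_fetch(tag, cursor):
    if cursor is None:
        return {"data": [{"id": "a", "tag": tag}, {"id": "b", "tag": tag}],
                "next_cursor": "page2"}
    return {"data": [{"id": "c", "tag": tag}]}

all_images = harvest_tag(fake_fetch, "HuntLibrary")
print(len(all_images))                                      # 3
print([i["id"] for i in moderate(all_images, {"a", "c"})])  # ['a', 'c']
```

Keeping harvesting and moderation separate, as Lentil’s public and administrative interfaces do, means nothing unreviewed reaches the public gallery or the repository ingest queue.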

NCSU Libraries has also developed a pre-configured package of social media harvesting tools aimed at enabling social media archiving at institutions with limited IT resources. This project, the Social Media Combine, pre-assembles Lentil and Social Feed Manager, along with the web servers and databases they need, into a single package that can be deployed on desktop and laptop computers running Windows, OS X, or Linux. An early version of this system has been released under an open source license.


[1] “UK Government Web Archive: Twitter,” The National Archives, accessed June 30, 2015, http://webarchive.nationalarchives.gov.uk/twitter/.

[2] “UK Government Web Archive: Videos,” The National Archives, accessed June 30, 2015, http://webarchive.nationalarchives.gov.uk/video/.

[3] “Update on the Twitter Archive at the Library of Congress,” Library of Congress, January 2013, accessed June 30, 2015, http://www.loc.gov/today/pr/2013/files/twitter_report_2013jan.pdf.

See entry in the annotated bibliography.

[4] Victor Luckerson, “What the Library of Congress Plans to Do With All Your Tweets,” Time, February 25, 2013, accessed June 20, 2015, http://business.time.com/2013/02/25/what-the-library-of-congress-plans-to-do-with-all-your-tweets/.

See entry in the annotated bibliography.

[5] BlogForever homepage, 2011-2013, accessed June 23, 2015, http://blogforever.eu/.

[6] J. Garcia Llopis, et al., “BlogForever D4.8: Final BlogForever Platform,” October 26, 2013, accessed June 23, 2015, https://zenodo.org/record/7497/#.VYlvh1VVhBd; to download the open source Spider, see also Vangelis Banos, “BlogForever Platform Released,” BlogForever, October 8, 2013, accessed June 23, 2015, http://blogforever.eu/blog/2013/10/08/blogforever-platform-released/.

[7] Mary Caldera, “Guide to the Love Makes a Family Records: MS 1962,” Yale University Library Manuscripts and Archives, January 2011, accessed July 14, 2015, http://drs.library.yale.edu/fedora/get/mssa:ms.1962/PDF.

[8] Gretchen Gueguen, “Capturing the Zeitgeist,” October 10, 2012, accessed June 30, 2015, https://www.slideshare.net/guegueng/capturing-the-zeitgeist.

See entry in the annotated bibliography.

[9] “The Call: Help Preserve the Historical Records of President Sullivan’s Resignation and Reinstatement,” University of Virginia University Archives, accessed June 23, 2015, http://sullivan.lib.virginia.edu/about.

[10] John Del Signore, “Museums Archiving Occupy Wall Street: Historical Preservation or ‘Taxpayer-Funded Hoarding?,’” Gothamist, December 26, 2011, accessed June 30, 2015, http://gothamist.com/2011/12/26/occupy_wall_street_the_museum_exhib.php.

See entry in the annotated bibliography.

[11] Leslie King, “Emory Digital Scholars Archive Occupy Wall Street Tweets,” Emory News Center, September 21, 2012, accessed June 23, 2015, http://news.emory.edu/stories/2012/09/er_occupy_wall_street_tweets_archive/campus.html; see also Jennifer Schuessler, “Occupy Wall Street: From the Street to the Archives,” New York Times, May 2, 2012, accessed April 10, 2015, https://artsbeat.blogs.nytimes.com/2012/05/02/occupy-wall-street-from-the-streets-to-the-archives/?_r=0; Amy Roberts, “Occupy Wall Street Archival Project,” Occupy Wall Street Library, accessed April 10, 2015, https://peopleslibrary.wordpress.com/; Howard Besser, “Archiving Media from the ‘Occupy’ Movement: Methods for Archives trying to manage large amounts of user generated audiovisual media,” in Imatge I Recerca, 12es Jornades Antoni Varés 2012, Ponencies, Experiencies I Communicacions, (proceedings of conference in Girona, Catalunya, 20-23 Novembre, 2012, pages 106-110), accessed April 10, 2015, http://besser.tsoa.nyu.edu/howard/Papers/besser-girona-occupy-paper.pdf.

See entry in the annotated bibliography.

[12] “#Occupy Archive: Archiving the Occupy Movements from 2011,” Occupy Archive, 2011, accessed June 23, 2015, http://occupyarchive.org/about.

[13] “Project Explanation and Purpose,” Documenting Ferguson, accessed June 30, 2015, http://digital.wustl.edu/ferguson/DFP-Plan.pdf; see also Emanuele Berry, “Washington University Libraries Builds Ferguson Digital Archives,” St. Louis Public Radio, September 21, 2014, accessed June 30, 2015, http://news.stlpublicradio.org/post/washington-university-libraries-builds-ferguson-digital-archives.

See entry in the annotated bibliography.

[14] Erica Smith, “Wash U, History Museum Seeking Ferguson Artifacts,” St. Louis Public Radio, January 30, 2015, accessed June 30, 2015, http://news.stlpublicradio.org/post/wash-u-history-museum-seeking-ferguson-artifacts; see also Emanuele Berry, “Missouri History Museum Looks For Ferguson Artifacts in Burnt Down Building,” January 29, 2015, accessed June 30, 2015, http://news.stlpublicradio.org/post/missouri-history-museum-looks-ferguson-artifacts-burnt-down-building.

See entry in the annotated bibliography.

[15] Elliot Williams, “Web Archiving for University Records,” PowerPoint presentation, Society of Southwest Archivists, Austin, TX, May 22-25, 2013, accessed June 23, 2015, https://societyofsouthwestarchivists.wildapricot.org/Resources/Documents/SSA2013Presentations/Williams_SSA_2.pdf.

See entry in the annotated bibliography.

[16] Christie Peterson, et al., “RMRT/WebArch Hangout,” SAA Records Management Roundtable video, 52:13, July 9, 2014, accessed June 30, 2015, https://www.youtube.com/watch?v=vDN1vvvW0q0.

See entry in the annotated bibliography.

[17] “www.ok.gov (Oklahoma’s Official Website) Social Media,” Archive-It, 2014, accessed June 30, 2015, https://archive-it.org/collections/2303.

[18] “State of North Carolina- Social Media Archive,” NC State Government Web Site Archives & Access Program, 2014, accessed April 10, 2015, http://nc.gov.archivesocial.com/; see also Colin Wood, “North Carolina Archives Social Media,” Government Technology, December 2, 2012, accessed June 30, 2015, http://www.govtech.com/e-government/North-Carolina-Archives-Social-Media.html; Nathan Dickerson, “Case Study: North Carolina Archives Social Media to Comply with Public Records Law,” The Council of State Governments Knowledge Center, August 1, 2012, accessed June 30, 2015, http://knowledgecenter.csg.org/kc/content/case-study-north-carolina-archives-social-media-comply-public-records-law.

[19] Daniel Chudnov, “Project Background: Social Feed Manager,” George Washington University Libraries, May 5, 2014, accessed June 23, 2015, https://library.gwu.edu/scholarly-technology-group/posts/project-background-social-feed-manager.

[20] “Welcome to Social Feed Manager!,” George Washington University Libraries, 2015, accessed June 23, 2015, https://social-feed-manager.readthedocs.io/en/m5_003/; see also “Introduction,” George Washington University Libraries, 2015, accessed June 23, 2015, https://social-feed-manager.readthedocs.io/en/m5_003/intro.html.

[21] “NCSU-Libraries/lentil,” GitHub, Inc., 2015, accessed June 23, 2015, https://github.com/NCSU-Libraries/lentil.

[22] “My #HuntLibrary,” NCSU Libraries, August 12, 2014, accessed June 23, 2015, /projects/my-huntlibrary.