Social Media Harvesting Tools

Harvesting and preserving social media content is a growing concern in the archival field. Companies and institutions have taken note of the interest in social media data and have developed programs that aid in harvesting it.

Two third-party subscription-based programs are popular for archiving social media: ArchiveSocial and Archive-It. ArchiveSocial preserves content from Facebook, Instagram, Twitter, LinkedIn, and YouTube. The service is commonly used by government agencies to comply with Freedom of Information Act (FOIA) requirements and is also geared toward the financial services sector, helping firms maintain compliance with regulations enforced by the Securities and Exchange Commission (SEC) and the Financial Industry Regulatory Authority (FINRA).[1] ArchiveSocial captures an original social media post along with comments made by other users, and mirrors the functionality of the original post, allowing archive viewers to expand posts, comments, and pictures. The service continually mines the aforementioned social media sites throughout the day so that it does not miss content that is later deleted. Institutions that use this service include the State Archives of North Carolina, the South Carolina State Library, the city of Austin, Texas, and Snohomish County, Washington.

Created in 2006, Archive-It is affiliated with the Internet Archive and, unlike ArchiveSocial, collects web content beyond social media. For each subscribing institution, Archive-It sets a schedule that regulates the harvesting of predetermined sites. The resulting captures are static and lack the functionality needed for viewers to interact with the captured content, such as comments or links that are not separately archived. Social media platforms that can be harvested include Facebook, Twitter, and Instagram. The Archive-It license includes storage and a discovery layer. Archive-It is commonly used as a social media harvesting tool; institutions that use it for that purpose include Johns Hopkins University, the University of Wisconsin-Madison, the Texas State Library and Archives Commission (TRAIL), and the Oklahoma State Library's Oklahoma Digital Prairie collection.

Several free, open-source programs also allow for the harvesting of social media data, and institutions can adopt them as part of a social media content archiving program.

Dan Chudnov and the team at George Washington University Libraries created Social Feed Manager, which harvests public tweets using Twitter's Application Programming Interface (API). Social Feed Manager allows tweets to be harvested by "specific users, search[ed] keyword, and filter[ed] by geolocation." The program divides tweets into sets for easier handling and allows the data to be exported as CSV files to other programs for further analysis.[2] In the summer of 2013 the university was awarded an IMLS Sparks! Ignition grant that allowed for further development and growth of the program, followed in 2014 by a three-year National Historical Publications and Records Commission grant.[3]
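The CSV export step can be illustrated with a short script. The sketch below is not Social Feed Manager's code; it simply flattens a few invented records, shaped like the `id_str`, `created_at`, `user.screen_name`, and `text` fields of Twitter's v1.1 JSON, into CSV text suitable for a spreadsheet or analysis tool.

```python
import csv
import io

# Invented sample records shaped like Twitter's v1.1 API JSON (for illustration only).
tweets = [
    {"id_str": "1001", "created_at": "Tue Mar 24 10:15:00 +0000 2015",
     "user": {"screen_name": "archivist_a"}, "text": "Harvesting social media today"},
    {"id_str": "1002", "created_at": "Tue Mar 24 11:02:00 +0000 2015",
     "user": {"screen_name": "archivist_b"}, "text": "New collection online"},
]

def tweets_to_csv(records):
    """Flatten tweet dicts into CSV text, one row per tweet."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "created_at", "screen_name", "text"])
    for t in records:
        writer.writerow([t["id_str"], t["created_at"],
                         t["user"]["screen_name"], t["text"]])
    return out.getvalue()

print(tweets_to_csv(tweets))
```

Once in CSV form, the harvested set can be opened in any spreadsheet or statistics package for supplemental analysis, which is the workflow the CSV export is meant to support.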

In 2013, North Carolina State University (NCSU) Libraries developed Lentil, an open-source program available on GitHub that captures images posted to Instagram along with their corresponding metadata. The software powered My #HuntLibrary, a crowdsourced storytelling site created for the opening of the new James B. Hunt Jr. Library at NC State. Students tagged Instagram photos of or from the new library with #HuntLibrary; the photos were harvested and uploaded to an NCSU Libraries website where visitors could vote on their favorites.[4]

Tweets are an emerging data source for researchers, but given the vast quantity of tweets, harvesting a usable data pool can be difficult. Developers have created open-source programs that draw on Twitter's search index, which only covers roughly the week following a tweet's creation. This short capture window is problematic for harvesting, as it requires archivists and researchers to be aware, in advance, of hashtags, events, and accounts relevant to their collecting scope. Additionally, pictures attached to tweets are usually not captured by the openly available tools.
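The consequence of that window can be made concrete: given a tweet's `created_at` timestamp (in the format Twitter's v1.1 API used), the sketch below computes the approximate date after which the tweet is no longer reachable through search. The seven-day figure follows the text above; the exact window is an assumption, and Twitter does not guarantee it.

```python
from datetime import datetime, timedelta

# Twitter's v1.1 created_at format, e.g. "Wed Aug 27 13:08:45 +0000 2008".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def search_deadline(created_at, window_days=7):
    """Return the approximate time after which a tweet drops out of
    Twitter's search index (assuming a seven-day window)."""
    posted = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return posted + timedelta(days=window_days)

print(search_deadline("Tue Mar 24 10:15:00 +0000 2015"))
```

In practice this means a harvest must be scheduled, and run repeatedly, within days of the events being documented rather than after the fact.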

Twitter Archiving Google Sheet (TAGS) is a free Google Sheets template that uses the Twitter API to run an automatic collection of tweets matching a chosen hashtag. On his blog, MASHe, Martin Hawksey provides a useful guide to setting up and using TAGS. Twarc, developed by Ed Summers, is "a command line tool and Python library for archiving Twitter JSON data" and is available on GitHub.[5]
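Twarc writes its output as line-oriented JSON, one tweet per line, which makes downstream processing straightforward. The sketch below, assuming records with a `text` field as in Twitter's v1.1 JSON, tallies hashtag frequencies from a small invented sample in that line-oriented format.

```python
import json
import re
from collections import Counter

# Invented sample in the line-oriented JSON format twarc produces (one tweet per line).
jsonl = """\
{"id_str": "1", "text": "Processing new accessions #archives"}
{"id_str": "2", "text": "Web harvesting workshop #archives #digipres"}
{"id_str": "3", "text": "No hashtags in this one"}
"""

def count_hashtags(lines):
    """Parse line-oriented JSON and count hashtag occurrences in tweet text."""
    counts = Counter()
    for line in lines.splitlines():
        if not line.strip():
            continue
        tweet = json.loads(line)
        counts.update(tag.lower() for tag in re.findall(r"#\w+", tweet["text"]))
    return counts

print(count_hashtags(jsonl))
```

A tally like this is a common first step in judging whether a chosen hashtag is actually capturing the conversation within a collecting scope.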

Programs like Social Feed Manager and Twarc are built on the API platforms that social media sites provide. Sites whose public APIs allow for straightforward social media harvesting include Instagram, Twitter, Facebook, YouTube, and LinkedIn.

Additionally, Facebook and Twitter provide their own applications that allow users to download their data. For additional information on these applications, as well as third-party sites and tools that let users archive their own social media content, please see the Facebook and Twitter Personal Archives portion of the toolkit.


[1] “Why Archive Social Media?,” accessed March 24, 2015,

[2] “Features,” Social Feed Manager,

[3] “Development and Community,” Social Feed Manager, See also: Brittney Dunkins, “Social Feed Manager Simplifies 21st-Century Digital Research,” GWToday, October 1, 2014,

[4] “About the Project,” My #HuntLibrary,

[5] “Twarc,” GitHub, accessed March 24, 2015,