Library of Congress Having Major Issues with Twitter Archive
Back in 2010, the Library of Congress announced that it would archive every public tweet posted on Twitter since 2006. How's it going? Not so great, according to a whitepaper released today by the Library of Congress, which details the many technological challenges of searching 170 billion tweets, including searches that take more than 24 hours. Yes, hours.
Here's the update:
In April, 2010, the Library of Congress and Twitter signed an agreement providing the Library the public tweets from the company's inception through the date of the agreement, an archive of tweets from 2006 through April, 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library's first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. To date, the Library has an archive of approximately 170 billion tweets.
The Library's focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way. It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. The Library is now pursuing partnerships with the private sector to allow some limited access capability in our reading rooms. These efforts are ongoing and a priority for the Library.
This document summarizes the Library's work to date and outlines present-day progress and challenges.
Why the Twitter Collection is Important to the Nation's Library
Twitter is a new kind of collection for the Library of Congress, but an important one to its mission of serving both Congress and the public. As society turns to social media as a primary method of communication and creative expression, social media is supplementing and in some cases supplanting letters, journals, serial publications and other sources routinely collected by research libraries. Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today's cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.
The Library of Congress Agreement with Twitter
The Library's agreement with Twitter, announced April 14, 2010, provided that:
• Twitter would donate a collection consisting of all public tweets from the Twitter service from its inception to the date of the agreement, an archive of 21 billion tweets that occurred between 2006 and 2010.
• Any additional materials Twitter provides to the Library would be governed by the terms of the agreement unless both parties agree to different terms in advance of receiving such additional materials.
• The Library could make available any portion of the collection six months after it was originally posted on Twitter to "bona fide" researchers.
• A researcher must sign a "notification" prohibiting commercial use and redistribution of the collection.
• The Library cannot provide a substantial portion of the collection on its web site in a form that can be easily downloaded.
Transfer of Data to the Library
In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library.
Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.
In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.
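To put those figures in perspective, the short calculation below derives the compression ratio and the average uncompressed size per tweet for the 2006-2010 archive. It is only a back-of-the-envelope sketch: the input numbers come from the whitepaper, while the use of decimal terabytes (10^12 bytes) is an assumption, since the whitepaper does not say which definition it uses.

```python
# Figures quoted in the whitepaper for the 2006-2010 archive.
TB = 10**12  # assumption: decimal terabytes; the whitepaper does not specify

compressed_bytes = 2.3 * TB      # three compressed files
uncompressed_bytes = 20 * TB     # size after decompression
tweet_count = 21 * 10**9         # approximately 21 billion tweets

# How much smaller the compressed files are than the raw data.
compression_ratio = uncompressed_bytes / compressed_bytes

# Average space per tweet, including its 50+ metadata fields.
bytes_per_tweet = uncompressed_bytes / tweet_count

print(f"compression ratio: about {compression_ratio:.1f} to 1")
print(f"average uncompressed size: about {bytes_per_tweet:.0f} bytes per tweet")
```

Under these assumptions the archive compresses roughly 8.7 to 1, and each tweet with its metadata averages a little under a kilobyte, far more than the 140 characters of the message itself.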
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Building a Stable, Sustainable Archive
The Library's first and most fundamental activities included developing a stable and sustainable way to acquire, preserve and organize the Twitter collection. Although the Library regularly acquires digital content, the Twitter stream is the first collection coming into the Library in a continuous stream. The Library leveraged the technical infrastructure and workflow established for other digital content in the transfer of Twitter data.
The Library runs a fully automated process for taking in these new files. Gnip, the designated delivery agent for Twitter, receives tweets in a single real-time stream from Twitter. Gnip organizes the stream of tweets into hour-long segments and uploads these files to a secure server throughout the day for retrieval by the Library.
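As an illustration of what organizing a stream into hour-long segments can look like, here is a minimal Python sketch. It is not Gnip's actual implementation, which the whitepaper does not describe. The `created_at` field name, the dictionary representation of a tweet, and the segment key format are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_segment_key(created_at: datetime) -> str:
    """Return a label such as '2012-12-01T10' for the UTC hour of a timestamp.

    Assumes created_at is timezone-aware.
    """
    return created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def group_by_hour(tweets):
    """Group tweet records (dicts with a hypothetical 'created_at' key) by hour."""
    segments = defaultdict(list)
    for tweet in tweets:
        segments[hour_segment_key(tweet["created_at"])].append(tweet)
    return dict(segments)


sample = [
    {"id": 1, "created_at": datetime(2012, 12, 1, 10, 5, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2012, 12, 1, 10, 59, tzinfo=timezone.utc)},
    {"id": 3, "created_at": datetime(2012, 12, 1, 11, 0, tzinfo=timezone.utc)},
]
segments = group_by_hour(sample)
```

Here the first two sample tweets fall into the segment labeled `2012-12-01T10` and the third into `2012-12-01T11`. A real pipeline would write each segment to a file as the hour closes rather than hold the stream in memory.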
When a new file is available, the Library downloads the file to a temporary server space, checks the materials for completeness and transfer corruption, captures statistics about the number of tweets in each file, copies the file to tape, and deletes the file from the temporary server space.
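The intake steps above can be sketched in a few lines of Python. This is a hedged illustration, not the Library's actual system: it assumes each hourly file is gzip-compressed with one tweet per line, that a SHA-256 checksum is supplied alongside the file for the corruption check, and that ordinary directories stand in for the tape archive. The function names `sha256_of` and `ingest_segment` are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_segment(staged_file: Path, expected_sha256: str,
                   archive_dirs: Iterable[Path]) -> int:
    """Verify, count, archive, and remove one downloaded segment file.

    Returns the number of tweets counted in the file.
    """
    # 1. Check for transfer corruption (assumed: checksum comparison).
    if sha256_of(staged_file) != expected_sha256:
        raise ValueError(f"checksum mismatch for {staged_file.name}")

    # 2. Capture statistics: count non-empty lines, one tweet per line (assumed).
    with gzip.open(staged_file, "rt", encoding="utf-8") as f:
        tweet_count = sum(1 for line in f if line.strip())

    # 3. Copy the file to each archive location (directories stand in for tape).
    for archive_dir in archive_dirs:
        archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(staged_file, archive_dir / staged_file.name)

    # 4. Delete the file from the temporary space only after the copies succeed.
    staged_file.unlink()
    return tweet_count
```

Deleting the staged file last means a failure at any earlier step leaves the download in place, so the segment can be retried without fetching it again.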
The technical infrastructure for the Library's Twitter archive follows the same general practices for monitoring and managing other digital collection data at the Library. Tape archives are the Library's standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.