Web Archiving: An Overview
Introduction
In this section, Carl introduces himself and the topic of web archiving. He explains that he will be laying a foundation for those who are new to the practice and providing some necessary background.
- Carl uses Archive-It, a subscription service provided by the Internet Archive, to build and maintain collections.
- The goal is to learn about web archiving and bring that knowledge back to our programs and institutions. Web archiving is the process of collecting, preserving, and enabling access to web-native materials.
- Specific use cases vary but tend to coalesce around one or more general types. These include academic and special collections libraries using web archiving as a strategy to complement their traditional subject-based collecting scopes.
Types of Use Cases
This section covers different types of use cases for Web Archiving.
Academic Libraries
- Academic libraries use Web Archiving as a strategy when publications they're interested in start moving towards an online publishing model.
Government Organizations
- Government organizations find it useful for meeting their requirements to document their institution's histories and preserve their records.
- State and local actors are increasingly getting involved in recording and preserving conversations and events that only exist online.
Preservation
- Regardless of scope, web archiving empowers us to combat link rot, the broken links left in the chain when online resources disappear entirely.
Web Archiving: Why and How
In this section, the speaker discusses the importance of web archiving and how it can be used to preserve important references.
The Importance of Web Archiving
- Studies have shown that many articles with hyperlink citations contain link rot or content drift.
- By collecting and preserving web materials in a stable archive, we can maintain reliable access to important references.
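As a rough illustration of pairing a citation with a stable archived copy, the sketch below builds a replay address in the style used by the Internet Archive's public Wayback Machine, which addresses captures by a 14-digit UTC timestamp followed by the original URL. The cited page and the capture time are hypothetical, and the sketch only formats an address; it does not check that a capture exists.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, captured_at: datetime) -> str:
    """Build a Wayback Machine-style replay URL for a given capture time."""
    # Wayback timestamps are 14 digits in UTC: YYYYMMDDhhmmss.
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Hypothetical citation and capture time, for illustration only.
citation = "http://example.org/report.html"
cited_on = datetime(2015, 1, 15, 12, 0, 0, tzinfo=timezone.utc)

archived_link = wayback_url(citation, cited_on)
print(archived_link)
```

Citing the archived address alongside the live one means a reader still has somewhere to go if the original page moves or changes.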
The Process of Web Archiving
- A piece of software, often a web crawler, is directed to read the source code behind a web resource and download it into WARC, an internationally standardized digital file format.
- WARC files require software that can interpret their contents for human users.
- The Wayback Machine is one example of replay technology that allows archived websites to be viewed as they appeared at the time they were archived.
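To make the WARC step more concrete, here is a minimal sketch of what a single WARC "response" record looks like when written out by hand: a block of named header lines, a blank line, and then the captured HTTP response. The function name, the example URL, and the captured response are all hypothetical. Real crawlers write many such records per file (usually compressed) and add further fields such as payload digests, so this is an illustration of the layout rather than a replacement for a crawler.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes,
                               captured_at: datetime) -> bytes:
    """Serialize one WARC 'response' record: headers, blank line, payload."""
    when = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs so records can be concatenated in a file.
    return head + http_response + b"\r\n\r\n"


# Hypothetical captured HTTP response for a hypothetical page.
response = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
            b"<html><body>Hello, archive</body></html>")
record = build_warc_response_record(
    "http://example.org/", response,
    datetime(2015, 1, 15, 12, 0, 0, tzinfo=timezone.utc))
print(record.decode("utf-8"))
```

Replay software such as the Wayback Machine reads these headers to find the record for a requested URL and date, then serves the stored response back to the browser.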
Challenges in Web Archiving
- Broad crawls of the web tend to be shallow, often not penetrating beyond a website's homepage.
- Different institutions have different goals and needs when it comes to selecting and managing web archives.
- End-to-end services bundle tools together with a single software interface for those who don't have the technical or financial resources for sophisticated web archiving programs.
Participating in Web Archiving
- Joining email lists like SAA's Web Archiving Roundtable can provide opportunities for participation in the community.
- End-to-end services allow more librarians and archivists to participate in web archiving, which can lead to more material being archived.
Web Archiving Programs and Technical Choices
This section discusses survey results on web archiving programs from 2011, 2013, and 2015. It also covers the technical choices made by organizations with web archiving programs, nearly two-thirds of which rely on an end-to-end service.
Member Organizations and Technical Choices
- Member organizations reported on their web archiving programs in 2011, 2013, and 2015.
- Nearly two-thirds of organizations opted to rely on an end-to-end service for their technical needs.
- Relatively few organizations support a whole in-house team for a web archiving program.
Collaborative Web Archive
- Half of the respondents were interested in participating in a collaborative web archive.
- Collaboration helps us move forward together on defining the issues and questions that we will all face individually if we get into web archiving.
Accessing Web Archives
This section discusses how even small web archiving programs can facilitate interesting new forms of access. It also covers how librarians and archivists can communicate to their patrons precisely how they decided what does or doesn't belong in their collections.
New Forms of Access
- Small web archiving programs have enough digital material to facilitate interesting new forms of access.
- Librarians and archivists need to standardize and document decisions about what belongs in their collections.
Ethics and Policy
- The question is whether the deed of gift applies to creators of web-native material in the age of web archiving.
- This is more a question of ethics and policy, and one that can only be answered on a case-by-case basis by anyone collecting beyond the boundaries of their own institution's website.
Case Study: Web Archiving Initiative of NYARC
This section provides an overview of the Web Archiving Initiative of the New York Art Resources Consortium (NYARC).
Overview
- Sumitra Duncan provides an overview of the Web Archiving Initiative of NYARC.
Introduction to NYARC's Web Archiving Program
In this section, the speaker introduces the goals of NYARC and explains why web archiving is important for preserving digital art resources.
Goals of NYARC
- The shared goals of NYARC include improving access to art research resources through technology; advancing the scholarly, educational, and cultural enrichment missions of the three museums; and providing leadership in the development of innovative and model information services programs.
Importance of Web Archiving
- Given the rapid pace at which websites are updated or cease to exist, it is critical to explore solutions for collecting and preserving digital art resources that are significant to art historical scholarship.
- An estimated 80% of material online disappears or changes each year, making preservation a significant challenge for dynamic and ephemeral information sources.
- Establishing a sustainable program of web archiving for the NYARC libraries would require a more focused investigation of publishing trends and of the perceived technical and organizational challenges of building and preserving a web archive collection.
Pilot Study on Web Archiving at the Frick Art Reference Library
In this section, the speaker discusses NYARC's pilot study on web archiving conducted at the Frick Art Reference Library.
Pilot Study on Auction Sale Information Online
- In 2010, NYARC began exploring the efficacy of implementing a consortial web archiving program with a pilot study done at the Frick Art Reference Library.
- The pilot study sought to evaluate the usefulness of capturing auction sale information online.
- While the pilot study demonstrated the many complexities of capturing certain types of web content, it identified web archiving as a means of addressing the capture of ephemeral web resources that were significant to art history.
Implementation Grant for Web Archiving Program
In this section, the speaker discusses NYARC's implementation grant for its consortial web archiving program.
Andrew W. Mellon Foundation Grant
- In 2012, NYARC was awarded a one-year grant from the Andrew W. Mellon Foundation to research publishing trends and investigate web archiving technologies.
- The outcomes from the 2012 study indicated that our libraries needed to take action or risk developing significant gaps in the art historical record.
- In late 2013, NYARC received a two-year grant, again from the Andrew W. Mellon Foundation, in support of implementing our consortial web archiving program.
Consortial Web Archiving Program
- The implementation grant allows the NYARC libraries to actively expand their web archive collecting activities and develop workflows for administering that collection.
- NYARC is primarily utilizing the Archive-It subscription service to build and manage its web archive collections.
Collaborative Collection Development Policy
In this section, the speaker discusses NYARC's collaborative collection development policy for websites.
Selection Process
- Websites of scholarly value are selected and nominated for inclusion in the consortium's ten curated archiving collections.
- Nominated websites are reviewed and revised periodically, given the rapid pace of change in website content, functionality, and features.
- Staff selectors at each of the NYARC libraries nominate websites for inclusion in web archives via an online Google form.
Submission Process
- Selectors can make note of content within nominated websites that warrants item-level cataloging, to create greater access to specific PDFs or a family of files within a site.
- Staff, registered library patrons, and the public can recommend websites for inclusion in our collections. All submissions are reviewed for final approval.
- Prior to archiving a website, NYARC's web archiving staff will contact the owner of the site to seek permission to include its content in the consortium's web archive collections.
NYARC Discovery Tool
The speaker discusses the implementation of the NYARC discovery tool, which allows researchers to search the full text of web archive collections and other holdings.
Archiving Collections
- NYARC is harvesting websites across 10 collections, including art resources and art historical scholarship sites that have shifted online.
- They are also archiving artists' websites, auction house websites and embedded catalogs, and catalogues raisonnés, including born-digital ones.
- They are focused on archiving their own institutional websites and the consortium's web presence. This includes MoMA, the Frick, the Brooklyn Museum, nyarc.org, and MoMA PS1.
- They are also archiving New York City gallery and art dealer websites as well as sites devoted to scholarship for restitution of lost or looted art.
Challenges in Web Archiving
- Curator involvement in web archiving is key to retaining all elements of a digital resource in the collections.
- Quality assurance practices have been developed for the consortium to improve web captures.
- Challenges include scale, rapidly evolving web technologies, the cost limitations of infrastructure and tools, and permissions considerations for intellectual property.
- Web archiving requires more investment than collecting print materials. It requires shifting a print acquisitions budget over to a digital acquisition budget.
- Issues faced by institutions investing in digital preservation can be better addressed collaboratively by the community in a way that avoids duplication of effort.
Future Directions
- Large-scale web archiving requires shared responsibility to shape preservation plans with inclusive access points.
- There is great potential to utilize web archive collections for data and trend analysis by future scholars.
- The speaker wants to engage users to understand the challenges and opportunities of web archived collections in the domain of art history.
Conclusion
The speaker concludes the presentation by providing some resources for attendees who want to learn more about web archiving.
- Attendees can attend the second webinar in this series on February 4th.
Networking and Resource Sharing
This section discusses resources for networking and sharing web archiving information.
Resources for Networking and Learning
- IIPC : The International Internet Preservation Consortium maintains technical tools and standards like the WARC file format and OpenWayback. It includes national libraries from around the world.
- Web Archiving Special Interest Group : A group hosted by Metro, headed up by Sumitra and Columbia University's Alex Thurman, that focuses on web archiving in New York.
- Jill Lepore's Article : Harvard historian Jill Lepore wrote an article about web archiving for the New Yorker magazine that summarizes the conceptual issues well.
Q&A Session
This section covers questions asked during a Q&A session at the end of the presentation.
Reasons for Denying a Proposed Site
- NYARC Collection Development Policy : NYARC requires sites to have perceived scholarly value. Sites are reviewed by a selection committee to determine their scholarly value. The committee also evaluates sites based on how well they can be captured, since NYARC has limited resources in this area.
- Permissions Model : NYARC archives based on a permissions model. If site owners deny permission, the site is not included in the archives.
Archivability
- Challenges with Dynamic Content : Web design is always changing, making it difficult to archive dynamic content. Web crawlers see websites differently than users do, so designers must find ways to make their content user-responsive while still being crawler-friendly.
- Making Websites Archivable : Columbia and Stanford have resources available on making websites archivable.
Availability of Presentation Slides
- Slides Will Be Available : The presentation slides will be available later.
Archiving Websites
In this section, the speaker discusses the challenges of archiving websites and how to overcome them.
Robots.txt File
- A robots.txt file is a standard file that webmasters use to tell crawlers, such as those run by search engines, which parts of their website not to crawl and index.
- This can make a website more difficult to archive.
- Special rules can be made in the file to allow archiving crawlers or it can be bypassed entirely.
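The "special rules" mentioned above can be sketched with Python's standard-library robots.txt parser. The robots.txt content below is a hypothetical example that blocks crawlers in general while allowing one archiving crawler; the user-agent name `archive.org_bot` is assumed here for illustration, as are the other bot name and the page URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archiving crawler is allowed everywhere,
# while every other crawler is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://example.org/exhibitions/2015.html"  # hypothetical page
archiver_allowed = parser.can_fetch("archive.org_bot", page)
other_allowed = parser.can_fetch("SomeSearchBot", page)

print("archiving crawler allowed:", archiver_allowed)
print("other crawler allowed:", other_allowed)
```

A site owner who wants to be archived but not indexed by search engines could publish rules along these lines. Bypassing robots.txt entirely is a setting on the crawler side rather than something the file itself expresses.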
Content Management Systems (CMS)
- CMS platforms like WordPress, Wix, and Squarespace are becoming popular for managing web content.
- These platforms were initially challenging for archiving but have largely been conquered by creating special rules for crawlers to follow.
- Anything generated by a database at the back end can be difficult to predict and requires special attention from personnel for quality assurance.
Storage Considerations
In this section, the speaker discusses WARC files and storage considerations when archiving websites.
WARC Files
- WARC files can be as large as desired, but the software used in the process makes decisions about how large each WARC file will be.
- WARC files may be capped at a certain size, so a single crawl can produce several of them.
- Even a very large website might exist in its archived form across multiple WARC files that work together dynamically at replay time.
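The size cap described above can be pictured with a small sketch that groups captured records into successive WARC files. The function name, the record sizes, and the 1,000-byte cap are all hypothetical, and real crawlers differ in exactly when they roll over to a new file, so treat this as one possible policy rather than a description of any particular tool.

```python
from typing import Iterable, List


def split_into_warc_files(record_sizes: Iterable[int],
                          max_file_bytes: int) -> List[List[int]]:
    """Group record sizes into files, rolling over when the cap would be passed."""
    files: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in record_sizes:
        # Start a new file before writing a record that would exceed the cap.
        # A single oversized record still gets a file of its own.
        if current and current_bytes + size > max_file_bytes:
            files.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        files.append(current)
    return files


# Hypothetical record sizes (in bytes) from one crawl of one website.
record_sizes = [400, 300, 500, 1200, 100, 200]
warc_files = split_into_warc_files(record_sizes, max_file_bytes=1000)

for number, contents in enumerate(warc_files, start=1):
    print(f"file {number}: {len(contents)} records, {sum(contents)} bytes")
```

Because replay software looks records up by URL and capture date rather than by file, it does not matter to the end user that one website's pages are spread over several files.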
Storage
- Anyone who wants to run Heritrix or another web crawler locally can download WARCs right onto their laptops.
- At the Internet Archive, WARCs are stored in a repository based on the "Lots of Copies Keep Stuff Safe" (LOCKSS) model, with primary and backup copies geographically distributed and secured.
- Cloud-based backups are also available through services like DuraCloud.
Long-Term Preservation Needs
In this section, the speaker talks about the long-term preservation needs of institutional archives websites and how they are being addressed.
DuraCloud for Safe File Storage
- The files are safe within DuraCloud.
- They are backed up in Amazon S3 and Glacier.
- Files are checked for security and integrity over time.
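Integrity checking over time is commonly done by recording a checksum when a file is stored and recomputing it later. The sketch below shows that idea with SHA-256 from Python's standard library. It is an assumption-laden illustration, not a description of how any particular storage service performs its checks; the file name and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


with tempfile.TemporaryDirectory() as tmp:
    warc_path = Path(tmp) / "example.warc"  # hypothetical file
    warc_path.write_bytes(b"WARC/1.0\r\nhypothetical record bytes")

    recorded = sha256_of(warc_path)          # stored at deposit time
    intact_before = is_intact(warc_path, recorded)

    # Simulate silent corruption by changing the stored bytes.
    warc_path.write_bytes(b"WARC/1.0\r\nhypothetical record bytez")
    intact_after = is_intact(warc_path, recorded)

print("intact before change:", intact_before)
print("intact after change:", intact_after)
```

When a check fails, a repository that keeps multiple copies can replace the damaged file from a copy whose checksum still matches.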
Local Storage Option
- Partners have the option to download their WARCs and store them locally.
- Options for long-term preservation needs are being investigated.
Conclusion and Next Steps
This section covers the conclusion of the webinar and what to expect in future webinars.
Technical Issues Acknowledgement
- Thanking everyone for joining despite technical issues at the start of the webinar.
Future Webinars
- Three more webinars in the series, with the next one on February 4th.
- More information can be found on Metro's website.
Additional Information
This section provides additional information about upcoming webinars.
Link to Upcoming Webinars
- A link will be sent out with information about other upcoming webinars.
Closing Remarks
This section covers closing remarks from the speakers.
Final Thanks
- Thanking everyone for attending today's webinar.
- Special thanks to Sumitra and Carl for their excellent presentations.
End of Meeting
- Closing up the meeting.