Web Archiving

How do we archive web content?

The formats

You can copy a website with wget. It saves the HTML to a directory and, when asked to fetch page requisites, pulls down the referenced assets (images, style sheets, scripts) as files in that same directory.
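As a rough sketch of what such a wget run might look like, the Python helper below assembles a command line. The names `build_wget_command` and `copy_site` and the `copies` output directory are illustrative assumptions; the flags themselves are standard GNU wget options. Actually running it assumes wget is installed.

```python
import subprocess


def build_wget_command(url, dest_dir="copies"):
    """Return a wget argument list for copying a site and its assets."""
    return [
        "wget",
        "--mirror",            # recurse through the site, with timestamping
        "--page-requisites",   # also fetch images, style sheets, and scripts
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never climb above the starting directory
        f"--directory-prefix={dest_dir}",  # where the copy is written
        url,
    ]


def copy_site(url, dest_dir="copies"):
    """Run wget; requires wget to be installed and on PATH."""
    subprocess.run(build_wget_command(url, dest_dir), check=True)
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before anything is downloaded.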

Hosting copies

Navigating these copies becomes difficult once you have hundreds of them. You may want to link between pages in existing copies and share assets such as style sheets and images. You may also want to copy a subset of a site and add to it over time. Live sites keep changing, however, so it is important to record when each copy was crawled.
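One possible way to record crawl times is to put each copy in a directory named after the host and the moment of the crawl. The sketch below assumes a hypothetical layout of `<root>/<host>/<UTC timestamp>`; the function name `snapshot_dir` and the `copies` root are made up for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlsplit


def snapshot_dir(root, url, crawled_at):
    """Return root/<host>/<UTC timestamp> for one crawl of a site."""
    host = urlsplit(url).hostname or "unknown-host"
    # Normalise to UTC so copies made in different time zones sort together.
    stamp = crawled_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return Path(root) / host / stamp


first = snapshot_dir(
    "copies",
    "https://example.com/about",
    datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(first.as_posix())  # copies/example.com/20200102T030405Z
```

Because the timestamp sorts lexicographically, listing a host's directory shows its copies in crawl order, and later crawls of the same site can be added alongside earlier ones.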

Archive.org crawls an entire site in one pass and then lets you visit the same URL as it appeared at different points in time.
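To illustrate the idea of one URL at many points in time: the Internet Archive's Wayback Machine addresses a capture by placing a 14-digit UTC timestamp in front of the original URL. The helper below is a minimal sketch of that pattern; the name `wayback_url` is an assumption, and the service generally redirects to the capture closest to the requested time rather than guaranteeing an exact match.

```python
from datetime import datetime, timezone


def wayback_url(url, when):
    """Build a Wayback Machine address for `url` near the time `when`."""
    # The timestamp format is YYYYMMDDhhmmss, expressed in UTC.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{url}"


print(wayback_url("https://example.com/",
                  datetime(2015, 6, 1, tzinfo=timezone.utc)))
# https://web.archive.org/web/20150601000000/https://example.com/
```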

Perma.cc hosts WARC copies of individual web pages, suitable for academic citation. To combat link rot, it has a contingency plan: should the project no longer be able to host the copies, the site will continue to operate as a forwarding service.