The National Archives Labs

Web archiving: what we do and why

One of the major innovative preservation activities The National Archives has developed in recent years is the archiving of central government websites.

We do this because government websites contain unique data which would be lost if we did not capture it. Originally we saw websites as a publications medium, but, over the past five years, they have become much more interesting as government has engaged in a dialogue with citizens, and Ministers and others have used social media tools such as blogs to communicate directly with the public.

We also help government departments to ensure that links to documents and other information persist, for example ensuring that users of their websites don’t end up with a ‘Page Not Found’ error message. We’ve achieved this by encouraging them to use redirection technology which takes users to the UK Government Web Archive if the page no longer exists on their live website. Try clicking on this link to see what happens: www.justice.gov.uk/publications/consultation-pandemic-flu.htm
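The redirection idea described above can be sketched in a few lines. This is a minimal illustration, not the set-up departments actually use: the archive prefix, the function name `resolve`, the `live_pages` set and the choice of a 302 status are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# Hypothetical archive prefix, for illustration only; the real
# UK Government Web Archive URL scheme may differ.
ARCHIVE_PREFIX = "https://webarchive.example/latest/"


def resolve(path: str, live_pages: Set[str], site_root: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request to `path` on the live site.

    If the page still exists, serve it as normal (200, no redirect).
    Otherwise, instead of a 'Page Not Found' error, redirect the user
    to the archived copy of the originally requested URL.
    """
    if path in live_pages:
        return 200, None
    # Rebuild the full original URL so the archive can look it up.
    original_url = site_root.rstrip("/") + "/" + path.lstrip("/")
    return 302, ARCHIVE_PREFIX + original_url
```

In this sketch, a request for a page that has been removed from the live site yields a redirect whose target is the archive prefix followed by the original address, so old links keep working.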

This technology was particularly useful immediately after the General Election in May 2010, when lots of content was removed from government websites. We saw a lot of traffic to the web archive, reaching a peak of 146 million ‘hits’ in July 2010.

What is web archiving?

Web archiving began in December 1996 when Brewster Kahle launched his amazing programme to archive the whole of the world wide web (see www.archive.org). Very little of the web from before that date survives, though a few early sites remain. The Internet Archive have collected some of these – you can see WebCrawler, The Well and the famous Cambridge coffee pot (a very early web cam) here: web.archive.org/collections/pioneers.html

The National Archives has been archiving government websites since 2003, when we started working with the Internet Archive. We were able to inherit some older sites that they had captured, so in our collection we have websites going back as far as 1997. There appear to be no government websites before that date – the earliest Public Record Office (PRO) site dates from December 1998. This is not really surprising, since the early sites were created by sending floppy disks to the government telecommunications agency in Norwich, and they could only be updated once a month.

Since 2008, we’ve really expanded our programme to create a comprehensive archive of government’s web presence. The European Archive have been contracted to crawl and host the collection since 2005.

Some facts about the web archive:

  • It is free to use and is accessible online: www.nationalarchives.gov.uk/webarchive
  • It contains approximately 1 billion documents
  • There is a great variety of content, archived over many years
  • It has a wide variety of users (government itself, researchers, journalists, the general public)
  • Approximately 2,000 websites have been catalogued so far

Read more about web archiving.

Watch UK Government Web Archive: A retrospective, a video illustrating some of the changes that government websites have undergone.

Archiving social media

The Web Archiving team at The National Archives is also exploring possible solutions for archiving social media platforms used by government departments. Using The European Archive, we are able to take snapshots of Twitter and Flickr; for example, we did a sweep before the General Election in May. Many government department Twitter feeds are now crawled on a similar basis to their websites. We are continually looking to improve the processes used and are exploring more reliable methods and different technologies. We’re happy to share our knowledge and what we’ve learnt so far, and would also be interested to hear from anyone developing similar solutions. Please email webarchive@nationalarchives.gsi.gov.uk.

Director of Technology and Chief Information Officer – David Thomas

As a senior archivist and records specialist at The National Archives, David’s career has focused on developing access to archives and information in both government and the archive sector.

David is responsible for information technology services at The National Archives, and is leading on the major cross-government project to develop a shared service for preserving digital records.

