The National Archives Labs

Linked data PRONOM

PRONOM is The National Archives’ technical registry – we plan to release the data it holds, in a linked open data format, and make it easier to reuse.

Data Tube

Update – 10/04/2013

Development of Linked Data PRONOM has been on hold for a while, but we will be in a position to restart work in the second half of 2013.

Update – 27/10/11

Following feedback from the community about our initial vocabulary publication, we have now released a revised version, which can be found here.

The vocabulary is now available in RDF and Turtle, and attempts to address a number of issues that were raised with the initial modelling attempt. Special thanks for comments go out to Alexander Dutton of Oxford University Computing Services, Mathieu of the LUCERO Project, Bill Roberts at OPF, Lisa Colvin of UDFR, Dave Tarrant at Southampton University and Chris Rusbridge.

We look forward to any feedback you have on the new release and hope that we are heading in the right direction.

Update – 25/05/11

A draft vocabulary specification for the linked data version of PRONOM is now available.  The document contains specialised information aimed at the linked data community, and we’re putting the vocabulary on Labs as a means of gathering feedback from those with linked data expertise.  If you aren’t familiar with linked data and want to get a deeper understanding of this area of interest, a useful tutorial can be found at: linkeddatatools.com/semantic-web-basics

Also, a reminder of the location of the prototype PRONOM SPARQL Endpoint and API.

Update – 27/01/11

On his visit to The National Archives in December 2010, Professor Nigel Shadbolt, Professor of Artificial Intelligence in the School of Electronics and Computer Science, University of Southampton, and Transparency and Open Data Adviser to UK Government, discussed the Transparency strategy and the impact of open data in supporting social initiatives and in generating economic growth.

Professor Shadbolt talked about linking data to make it easier to uncover. This is exactly the intention of the linked data PRONOM project. The existing data in PRONOM will be accessible through HTTP URIs, allowing users to view the data and follow links to find out more information about that data. This will make it easier to link to the data in PRONOM, promoting the discoverability and reuse of that data, and providing the means to develop the dataset further.
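As a rough illustration of what access through HTTP URIs can look like for a developer, the sketch below builds a request that asks for a Turtle representation of a resource using HTTP content negotiation. The URI is a made-up placeholder, not a real PRONOM address, and whether a given server honours the `Accept` header is an assumption.

```python
from urllib.request import Request, urlopen

# Hypothetical resource URI, used purely for illustration.
EXAMPLE_URI = "http://example.org/id/file-format/1"


def build_rdf_request(uri, media_type="text/turtle"):
    """Build a GET request asking the server for an RDF representation."""
    return Request(uri, headers={"Accept": media_type})


def fetch(request):
    """Send the request and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    req = build_rdf_request(EXAMPLE_URI)
    print(req.full_url, req.get_header("Accept"))
```

Calling `fetch(req)` against a real linked data URI would return the description of that resource, including links to related resources that can be requested in the same way.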

Update – 17/12/2010

Recently we’ve been busy transforming PRONOM data into RDF, experimenting with putting it into a triplestore, and running Puelia, a linked data API maintained by data.gov.uk and Talis, on top of the data.

The API provides access to the data in multiple format representations, including a very handy HTML visualisation. A very early prototype of this work is now available, as well as a SPARQL endpoint.
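For readers who have not used a SPARQL endpoint before, here is a minimal sketch of how a query might be sent to one over HTTP. The endpoint address is a placeholder, and the `pronom:` namespace and the `hasExtension` property are assumptions based on the draft vocabulary, which may change.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; substitute the address of the prototype endpoint.
ENDPOINT = "http://example.org/sparql"

# Assumed namespace and property name, taken from the draft vocabulary.
QUERY = """
PREFIX pronom: <http://reference.data.gov.uk/technical-registry/>
SELECT ?format ?extension
WHERE { ?format pronom:hasExtension ?extension }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Encode the query as a 'query' URL parameter and ask for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    req = build_sparql_request(ENDPOINT, QUERY)
    print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the result format returned depends on what the endpoint supports.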

There is still much to do, such as developing various web services (for example, so that it can interact with DROID), and exploring how we will present data from multiple different sources and express provenance.  While we are a long way from completion, we hope it’ll give you a chance to see how you will be able to use the data from the new version of PRONOM – and to post your comments on how we can build on what we have done to date.

Please remember that, should you receive a message stating page not found, or the site displays a blank page, this is only an early prototype. We are still in the process of modelling the data and making it available, and some of the logic resting behind the linked data API is still to be configured. Just click back in your web browser and continue to browse. Please also be aware that the vocabulary we are using currently is in its draft stages and will change before the project is finally put into production.

Linked data and PRONOM – 6/10/2010

The PRONOM registry contains information about file formats, compression techniques and encoding types. Linked data is about linking up related data on the web, to help expose, share and connect data, information, and knowledge through using URIs and RDF.

Initially we will concentrate on modelling and publishing file format data already stored in PRONOM, using linked data standards. This is the largest core of data within PRONOM, and our first step to transform the data will be to convert existing data to RDF to describe features of each format. The new version of PRONOM will be extensible, so at a later stage we will enhance the data model to improve other areas of information in the database.
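To give a feel for what converting an existing record to RDF involves, the sketch below turns a simple format record into Turtle. It is not our actual transformation: the subject URI and sample record are invented, and the `pronom:` namespace and `hasExtension` property are assumptions based on the draft vocabulary.

```python
# Assumed namespace for the draft vocabulary; the final one may differ.
PRONOM_NS = "http://reference.data.gov.uk/technical-registry/"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def format_to_turtle(subject_uri, name, extensions):
    """Describe one file format as Turtle: a label plus any file extensions."""
    pairs = [("rdfs:label", turtle_literal(name) + "@en")]
    if extensions:
        objects = ", ".join(turtle_literal(ext) for ext in extensions)
        pairs.append(("pronom:hasExtension", objects))
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    return (
        f"@prefix pronom: <{PRONOM_NS}> .\n"
        f"@prefix rdfs: <{RDFS_NS}> .\n\n"
        f"<{subject_uri}>\n{body} .\n"
    )


if __name__ == "__main__":
    # Invented sample record and subject URI.
    print(format_to_turtle("http://example.org/id/file-format/1",
                           "Example Format", ["ex", "exm"]))
```

Because each feature of a format becomes its own statement, new kinds of information can be added later as further predicates without restructuring what is already there.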

Eventually we hope to be able to use linked data to populate PRONOM from other external data sources, transparently showing where the information came from, and in doing so develop a more comprehensive technical registry.

We want the new version of PRONOM to be an open source system with a completely open code base.

We’d like to hear your comments on our plans, or suggestions for improving the PRONOM database, below – your input will inform its development.

Comments (24)

  • Tom

    Suggestions: Use Perl or Ruby, and create a web application (à la LAMP). Save the data in a SQL database, preferably PostgreSQL with the option for all tools to use SQLite, especially tools which will be installed locally by users outside the National Archives. In addition to the usual serialization formats, publish data in SQLite. SQLite is portable to all platforms, is pre-installed on most platforms, and requires no outside software for parsing and exporting data. XML and even CSV require an external parser and lack the rigor and flexibility of SQL.

    Thanks,
    Tom

    The National Archives reply:

    Hi Tom,
    The new version of Pronom will use a triplestore to store the data, rather than an SQL database, as we think this gives the data greater flexibility and the ability to expand the range of data that Pronom provides more easily and quickly. Users of Pronom won’t need to install any external software to use the new version, and the data will be published in an open format, RDF, so it will be possible to export the data to a variety of formats and this hopefully should give better adaptability than at present.

  • Tweets that mention The National Archives Labs » Blog Archive » Linked data and PRONOM -- Topsy.com

    [...] This post was mentioned on Twitter by KLA, KeepIt. KeepIt said: National Archives confirms plan for Linked data and PRONOM http://bit.ly/c0eEE3 [...]

  • Euan Cochrane

    Hi David,

    I have many comments about the structure of PRONOM, but I’ll keep this blog comment short.
    Much greater emphasis on creating application would be the main thing. The “formats” should be identified by a combination of the standard that the file is attempting to match its formatting to and the creating application for the file, rather than just the standard.

    Thanks,

    Euan

    The National Archives reply:

    Hi Euan, thanks for your thoughts on this. We will be using a triplestore to store existing Pronom data, and this will entail structuring existing data in a way that makes the whole Pronom data model more flexible, allowing us to add new types of information about the format further down the line. We will retain the Pronom Unique Identifier (PUID) associated with each format, but will also be improving how we demonstrate the source of information about the format. So, both the structure and identifying features of a format in Pronom should improve fairly early on.

  • Tom

    The RDF triplestore is something we can cope with. It would help us re-purpose the data if there were an existing RDF-to-SQL parser. I’m an evangelist for publishing all data in SQLite due to the flexibility and ease of use. (Beyond SQLite or SQL statements, I try to publish data in several formats so the customer can choose.) Fundamentally, my suggestion is that this remarkable resource be published in a form that is easy for programmers to work with. If the only download format will be RDF, it would be delightful to have Perl and Ruby examples of queries. SQL queries are easier than RDF calls to an API.

  • Tom

    In terms of improvements, is it practical for DROID to optionally return the same information and in the same format as the Linux “file” command? FITS generally returns a conflict status because DROID, Jhove and others return complex descriptions, even when the question is simple. .doc files may have a ream of detailed information, but the mime type is often enough. I have noticed that during ingest of files, we want two pieces of information: common name of a file type, detailed file type information. I think PRONOM helps with the detailed info, but it would be nice to have a standard heuristic that provides the common name (or simplified identity).

    The National Archives reply:

    Hi Tom, well, DROID and Pronom are different systems; DROID only uses a small subset of the Pronom data and links back to Pronom to offer more guidance to users. Linked data will mean all output from Pronom is more flexible, that is, there will be a lot users can do with it once it is released (making available a queryable endpoint and multiple representations such as JSON, RDF, and more). Our primary objective is to recreate a system that fulfils The National Archives and community requirements while opening up that flexibility at the same time. A new version of DROID is being developed at the moment; we do try and understand the needs of the community when developing this tool and hope that it provides a balance for everyone. The current tool is open source and available on SourceForge: http://sourceforge.net/projects/droid/ – as such it is always possible for developers to access the code to develop the functionality they require and feed back that development to the rest of the community. Before that, however, we recommend having a look at DROID 5.0, seeing if a combination of the CSV output and filtering options can’t help you with reducing the complexity as you discuss.

  • Keith

    Hi David,

    Are you able to give any information on what ontology(s), if any, you are using to model and create relationships for your data in the RDF triple store? This would be useful for others who might envisage creating, or aligning, similar data along similar lines? Are you using some form of Government standards from data.gov.uk for this?

    Thanks
    Keith

    The National Archives reply:

    Keith,
    The vocabulary we use for Pronom will be restricted necessarily by the current Pronom database schema. It will be normalized where appropriate and expanded in areas where there is an identified advantage or a business case. Improving our handling of provenance is one example where we are likely to improve the data model. We appreciate there will be an interest in the vocabulary we adopt so this will be made available through various channels while we work on it, however a more publicly available release will only be available much closer to the time this version of Pronom goes live.

  • Dulanjali Adhikari

    I would like to know which archival file formats PRONOM uses for particular types of file. For example, what is the archival format for Word documents, spreadsheets, images, audio, video and databases?

  • Maria

    The “formats” should be identified by a combination of the standard that the file is attempting to match its formatting to and the creating application for the file, rather than just the standard.

    The National Archives reply:

    Hi Maria, there often isn’t a way to tell what application was used to create a file. Sometimes the name of the creating application will be found in the metadata of a file, but this can easily be changed, or be completely absent. In any case, knowing the creating application doesn’t necessarily help in the long term preservation of the file. The National Archives’ preferred approach to identifying file formats is to identify a unique internal sequence or sequences of bytes that are only found within a particular file format. Once a digital object has been categorically identified in this way, preservation plans can be developed and processes put in place to manage those objects. In this way, digital objects which may not adhere to a standard specification, but which are still valid files representative of a format type, can also be managed. Constricting identification solely to a specification standard and the creating application would limit our ability to manage digital objects that fall between the cracks of these two criteria.

  • The PRONOM vocabulary | In a pickle

    [...] The National Archives are currently preparing a vocabulary specification for describing file formats that appear in digital repositories, targeted at the field of digital preservation. You can find the current version linked from their blog post. [...]

  • PRONON and linked data « The LUCERO Project

    [...] the national archive’s technical registry, and is currently being `transformed’ to be exposed as linked data. We can of course only welcome such an initiative and be very enthusiastic about this potentially [...]

  • Ruth Duerr

    Hi,

    I was very happy to see at least the stubs of entries for formats like HDF5 and binary, etc. (in other words, formats often used for scientific data). I would like to know what the plan is for moving those entries forward and adding missing ones (like HDF4, HDF-EOS 2, etc.)

    The National Archives reply:

    The Linked Data version of PRONOM will make it easier for us to link to resources where additional information is held about an entry, and where we do that, to show where the data came from. Where PRONOM is missing an entry, or where there is only an outline entry, such as with HDF5, we would really welcome submissions of data from others to help us create a better populated database. To send us data for inclusion in PRONOM, please send us an e-mail: pronom@nationalarchives.gov.uk. Thanks.

  • Kifah

    good morning
    Can you help by providing me with information about The National Archives Laboratory specification?
    thank you
    Dr/ Kifah

    The National Archives reply:

    Thank you for your enquiry. We’ve passed it to our Collection Care (Conservation) department and we’ll get back to you as soon as possible.

  • Future Proof – Protecting our digital future » Future Perfect: Digital preservation by design

    [...] format information – none of us should have to do it on our own. Excitingly, Ross spoke about a new initiative to make PRONOM available as linked open data. He also mentioned a new DROID suite of tools including a signature development utility. One of the [...]

  • Gary McGath

    The 26 October 2011 PRONOM vocabulary specification has an XML error in its example. A tag is opened as pronom:hasExtension and closed as pronom:extension. If I’m reading the documentation correctly, it should be neither of these but pronom:fileExtension. pronom:hasInternalSignature also looks incorrect, unless I’m misunderstanding something.

    The National Archives reply:

    Hi Gary,
    Thanks for raising this issue, the team who maintain Linked Data PRONOM are currently away but we will endeavour to fix this as soon as they return.

  • Ross Spencer

    Hi Gary,

    I worked on this vocabulary. If you look at the underlying RDF the answer is there for you, in Turtle you will see:

    pronom:hasExtension a rdf:Property ;
    rdfs:label "File Extension"@en ;

    The transformation might perhaps confuse things although you can see what it dereferences to by clicking on it. It will provide a 404 not found but it does follow through to: http://reference.data.gov.uk/technical-registry/hasExtension

    Otherwise, thanks for pointing out the error in the XML example at the beginning. It’s a lesson in not copying and pasting code!

    On the internal signature sample we have a similar issue**:

    pronom:hasInternalSignature a rdf:Property ;
    rdfs:label "Internal Signature"@en ;

    Having said that, if you think it doesn’t look correct or you think the approach isn’t quite following community best-practice then I am sure it is something The National Archives will value hearing and can perhaps roll into a next version of the vocabulary before eventually mapping the PRONOM triple content to the updated predicates.

    I think the team at The National Archives should be able to handle the remaining issues you find and take on board any future comments. Hopefully this helps in the short term.

    Kind Regards,

    Ross

    ** NOTE: The data in the triple store maps to this original version of the vocab: http://labs.nationalarchives.gov.uk/wordpress/wp-content/uploads/2011/06/draft-pronom-vocabulary-specification.pdf (sorry about it being PDF!) – one of the next stages for The National Archives will be setting the new vocabulary standard in concrete and eventually mapping to it, so it is something to bear in mind (for all the community) when working with these prototypes.

    The National Archives reply:

    Thanks very much Ross – and we hope this answers your question, Gary? Please do get in touch with any further issues or comments and we’ll get back to you as soon as possible.
