Wednesday, April 7, 2010

Global Registries Initiative Meeting Tucson


We first went through the submitted use cases and how they could realistically be met (or not) through an achievable model for a global register of registries.

We subsequently thrashed out a common view of how a "Global Registries Service" should be architected, along with a high-level view of the first-class entities and the information elements for each entity.

The consensus was that the record types in the global registry would be "registries", "collections" and "services".

There was some discussion about whether parties/agents should also be a record type available through the service, but in the global context it was felt that there weren't use cases that required it. However, the properties of a registry/collection/service that refer to a party should allow identifier schemes to be used where available.

Below is a diagram prepared by Tim from what was drawn on the flip-chart. Each of the participating registries would expose their collection/registry/service records using the same service architecture as the GRS service architecture. They will not expose resource/dataset level records in the GRS service and they will be required to distinguish "registry" records from "collection" records. This is already part of the model used by ANDS and Ockham, but not by IESR. This will need to be followed up with Vic.
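To make the exposure rule concrete, here is a minimal sketch of how a participating registry might filter its local records before exposing them. The `Record` class, its field names and the `records_to_expose` function are all hypothetical illustrations, not part of anything agreed at the meeting; only the three record types come from the discussion above.

```python
from dataclasses import dataclass

# The three record types agreed for the global registry.
GRS_RECORD_TYPES = {"registry", "collection", "service"}


@dataclass(frozen=True)
class Record:
    # Hypothetical minimal record shape, for illustration only.
    identifier: str
    record_type: str
    title: str


def records_to_expose(records):
    """Keep only registry/collection/service records.

    Resource/dataset level records are dropped, and each record must
    already say explicitly whether it is a registry or a collection.
    """
    return [r for r in records if r.record_type in GRS_RECORD_TYPES]


local_records = [
    Record("r1", "registry", "Example registry"),
    Record("c1", "collection", "Example collection"),
    Record("d1", "dataset", "Example dataset"),
    Record("s1", "service", "Example harvest service"),
]

exposed = records_to_expose(local_records)
print([r.identifier for r in exposed])
```

A registry whose model does not already separate "registry" from "collection" records would need to add that distinction before a filter like this could work.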



We went through each of the entity types listing what we thought their information attributes should be. This is recorded on the following flip chart. We added 'location' after the picture of the flip-chart was taken.

There was discussion about 'subject' and 'resource type', which are usually item level attributes. Although subject is often used in collection descriptions, it was deemed of limited value in a 'registry' description. Registries might be limited to a particular discipline, and registry descriptions should have a place to record subject(s), but there would be no assumption that subjects would be supplied for either registry or collection records.

The notion of resource/item type in a collection description or registry description has limited value. A collection may consist of resources/items of one particular type, but is it worth recording this fact by having an optional resource type attribute for a collection or registry?

Spatial coverage might be relevant information for collections but not for registries. A registry may be administered in a particular location and focus on registering collections whose coverage is in that region, but this should not be recorded as spatial coverage, but rather as spatial 'location' for the registry's administration.

Anyway, spatial coverage is not really useful for search refinement unless it is coded. Apart from coordinates there are also:
  • iso31661: ISO 3166-1 Codes for the representation of names of countries and their subdivisions - Part 1: Country codes
  • iso31662: ISO 3166-2 Codes for the representation of names of countries and their subdivisions - Part 2: Country subdivision codes
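As a rough illustration of why coded values help with search refinement, the sketch below stores spatial coverage as a scheme plus a code and filters collections by country. The class names, field names and the `refine_by_country` function are assumptions made up for this example; the only things taken from the notes are the two scheme labels. It relies on the fact that an ISO 3166-2 subdivision code begins with the ISO 3166-1 alpha-2 country code followed by a hyphen.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CodedCoverage:
    scheme: str  # "iso31661" (country) or "iso31662" (subdivision)
    code: str    # e.g. "AU" or "AU-NSW"


@dataclass
class CollectionRecord:
    # Hypothetical collection record with optional coded spatial coverage.
    identifier: str
    title: str
    spatial_coverage: list = field(default_factory=list)


def country_of(coverage):
    """Return the upper-cased country code for a coded coverage value, or None."""
    if coverage.scheme == "iso31661":
        return coverage.code.upper()
    if coverage.scheme == "iso31662":
        # Subdivision codes look like "AU-NSW": the country part comes first.
        return coverage.code.upper().split("-")[0]
    return None  # unknown or uncoded scheme: cannot be used for refinement


def refine_by_country(collections, country_code):
    """Keep collections with at least one coverage value in the given country."""
    wanted = country_code.upper()
    return [
        c for c in collections
        if any(country_of(cov) == wanted for cov in c.spatial_coverage)
    ]


collections = [
    CollectionRecord("c1", "Country-level example", [CodedCoverage("iso31661", "AU")]),
    CollectionRecord("c2", "Subdivision-level example", [CodedCoverage("iso31662", "AU-NSW")]),
    CollectionRecord("c3", "Other country example", [CodedCoverage("iso31661", "GB")]),
    CollectionRecord("c4", "No coverage supplied"),
]

australian = refine_by_country(collections, "au")
print([c.identifier for c in australian])
```

Records with no coverage, or with free-text coverage under some other scheme, simply drop out of the refined result, which is the practical cost of uncoded values.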

Temporal coverage is in the same position as spatial coverage, but is even less likely to be relevant except for historical collections. It is not relevant for registries: although their description might have a temporal component, that again relates to administrative metadata.




A subsequent task is to review standards such as DCAP, IESR, RIF-CS as candidates for the data model.

There was discussion about the requirements of a participating registry in addition to providing a GRS API, and about whether the 'registration' of a participating registry required a different model from the GRS API itself. The consensus, I think, was that each participating registry should be able to describe itself using the same data model as the GRS API and return this record via a specific method, but that additional information might be needed for operational reasons, which would require an admin interface for the GRS registry.
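One way to picture the self-description idea is sketched below: the registry's own record has exactly the same shape as any other record it serves, and is returned by a dedicated method. Everything here is hypothetical, including the `make_record` helper, the field names and the `describe_self` method name; no API methods were fixed at the meeting.

```python
def make_record(identifier, record_type, title, location=None, subjects=()):
    """Build a record in one shared (hypothetical) shape used for every record type."""
    return {
        "identifier": identifier,
        "record_type": record_type,
        "title": title,
        "location": location,        # administrative location, mainly for registries
        "subjects": list(subjects),  # optional, never assumed to be supplied
    }


class ParticipatingRegistry:
    """Toy stand-in for a participating registry's API, for illustration only."""

    def __init__(self, self_record, records):
        self._self_record = self_record
        self._records = {r["identifier"]: r for r in records}

    def get_record(self, identifier):
        """Ordinary lookup of a collection/registry/service record."""
        return self._records.get(identifier)

    def describe_self(self):
        """The 'specific method': return the registry's own record."""
        return self._self_record


registry = ParticipatingRegistry(
    make_record("reg-1", "registry", "Example registry", location="Example City"),
    [make_record("c1", "collection", "Example collection", subjects=["history"])],
)

own = registry.describe_self()
other = registry.get_record("c1")
print(own["record_type"], sorted(own) == sorted(other))
```

Operational details that the central service needs about each participant would sit outside this record, in whatever the admin interface maintains.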

We also went through what we thought the questions were that the API needed to answer. See the following flip-chart.


Action items were:

  1. Jeremy to draft 1-page brochure content ready for distribution to interested parties at RDAP10 (Phoenix, 9-10 April) and the CNI spring meeting (Baltimore, 12-13 April)
  2. Jeremy to write up report of this meeting.
  3. Comparison of data model candidates against GRS API model methods and functions
  4. Profile details for participating registries - i.e. what information needs to be maintained about them in the inventory so that the operational service can function
  5. Details for human discovery/access portal based on GRS API service
  6. Monica to update GRI task list


Monday, April 5, 2010

San Francisco - Internet Archive

I was picked up at around 10am this morning from my hotel by Kris Carpenter, Director of the Web Archiving group at the Internet Archive.

We dropped by their original Data Centre in the Mission area of downtown San Francisco and saw the latest storage boxes they have designed in collaboration with Capricornia Technologies, which of course have further reduced the amount of power and cooling required overall to manage equivalent data volumes. Over time these will gradually replace the now famous red "petaboxes" which they have relied on for some years. They now have two other sites: the Sun Microsystems "datacentre in a shipping crate" located at Santa Clara, and the ISC.org data center in Redwood City, which also now provides space for them. They intend to relocate most of their data storage to a NASA data center in Mountain View, CA over time.

The IA's web archiving team, along with the other IA activities, has been located at the Presidio for many years. However, IA is now consolidating all of its activities in its own building (a former Christian Science Church) in the Richmond district of San Francisco, and they will all have moved there within the next couple of months.

I caught up on current web archiving initiatives, and in particular talked about the take-up of their Archive-It subscription service, which allows researchers to build thematic collections of content captured from the web. This is very useful to social scientists in particular. The take-up amongst universities and libraries/archives in the US is very substantial.
...
Archive-It, a subscription service from the Internet Archive, allows institutions to build and preserve collections of born digital content. Through our user-friendly web application, Archive-It partners can harvest, scope, catalog, manage, and browse their archived collections. Collections are hosted at the Internet Archive data center and are accessible through Url and full-text search.
Over 125 partners currently use Archive-It, including state archives and libraries, university libraries, federal institutions, museums, and public libraries.
....
The take-up in Australia is still very low: just the University of Melbourne and the NLA. Although the NLA uses its own web archiving infrastructure for archiving Australian web content, it uses the subscription service for collecting Asian web content because of its better support for non-Roman scripts. A number of other Australian universities have expressed interest but have not subscribed yet.