I have been seeing DataCite.org mentioned quite a lot so I thought I’d take a look at what they were up to. They have an OAI-PMH provider so you can simply go to the List Records page and see how many records they have. Try it for yourself now. At the bottom of the page today it says that there are 654,748 records in their repository.
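If you would rather check the count programmatically than read it off the page, here is a minimal PHP sketch of the same check. It assumes the OAI-PMH endpoint is at http://oai.datacite.org/oai and that the provider reports the optional completeListSize attribute on the resumptionToken; neither is guaranteed, so treat it as illustrative rather than as the exact script I used.

<?php
// Ask the OAI-PMH provider for the first page of records and read the
// (optional) completeListSize attribute off the resumptionToken.
$endpoint = 'http://oai.datacite.org/oai'; // assumed endpoint URL
$xml = simplexml_load_file($endpoint . '?verb=ListRecords&metadataPrefix=oai_dc');
$xml->registerXPathNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');
$token = $xml->xpath('//oai:resumptionToken');
if ($token && isset($token[0]['completeListSize'])) {
    echo 'Records in repository: ' . $token[0]['completeListSize'] . "\n";
} else {
    echo "Provider did not report a completeListSize\n";
}
?>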
Then I read on the DataCite.org home page (my emphasis):
We think it is very important that the two largest DOI Registration Agencies work together in order to provide metadata services to DOI names.
This seemed an amazing claim, as I had it in my head that CrossRef had 50+ million DOIs. So if the second biggest registration agency only had just over half a million, that implies there is really only one registration agency of any size. I wonder who claims to be numbers three, four and five in the DOI Registration Agency listings?
So I tweeted:
DataCite 654,748 records – I think this inflated. CrossRef have 50 million. CrossRef is about 100x bigger. Who is the 3rd largest?
And @epentz replied:
@rogerhyam #datacite has registered about 1.3 million DOIs, #CrossRef 54.6 million
Where did those numbers come from? Now I was curious so I went back to the DataCite.org OAI-PMH provider but this time I used a script to have a look. This is the deposition history by month for the entire registry:
The vast majority were created in December last year – a single provider I presume. Then there was another large batch in June this year.
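In case you are curious what that script roughly looked like, here is a sketch of the kind of harvesting loop involved (not the exact code, which is linked at the end of the post). It walks the ListIdentifiers responses, following resumptionTokens, and tallies record datestamps by month. The endpoint URL and metadataPrefix are assumptions, and a full harvest takes a while.

<?php
// Harvest all record headers via ListIdentifiers and count them by the
// year-month part of their datestamp.
$endpoint = 'http://oai.datacite.org/oai'; // assumed endpoint URL
$url = $endpoint . '?verb=ListIdentifiers&metadataPrefix=oai_dc';
$counts = array();
while ($url) {
    $xml = simplexml_load_file($url);
    $xml->registerXPathNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');
    foreach ($xml->xpath('//oai:header/oai:datestamp') as $stamp) {
        $month = substr((string)$stamp, 0, 7); // keep just "YYYY-MM"
        $counts[$month] = isset($counts[$month]) ? $counts[$month] + 1 : 1;
    }
    // Follow the resumptionToken until the provider stops sending one.
    $token = $xml->xpath('//oai:resumptionToken');
    $url = ($token && trim((string)$token[0]) !== '')
        ? $endpoint . '?verb=ListIdentifiers&resumptionToken=' . urlencode((string)$token[0])
        : null;
}
ksort($counts);
foreach ($counts as $month => $n) echo "$month\t$n\n";
?>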
Next I looked at the Sets in the registry. DataCite seem to create a set, and then subsets, for each organisation they create DOIs for. Here is the pie chart showing the number of records per organisation:
The vast majority are TIB (the German National Library of Science and Technology) and CDL (the California Digital Library), who between them have contributed about 90%. There are 15 organisations in total; 11 of those have contributed fewer than 2,500 records each.
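For completeness, this is roughly how the per-organisation breakdown was produced: list the sets with ListSets, then ask for the identifiers in each set and read completeListSize off the resumptionToken. Again, the endpoint URL and the presence of completeListSize are assumptions; where the attribute is absent you would have to walk the whole list and count.

<?php
// List the sets and report how many identifiers each one claims to hold.
$endpoint = 'http://oai.datacite.org/oai'; // assumed endpoint URL
$ns = 'http://www.openarchives.org/OAI/2.0/';
$sets = simplexml_load_file($endpoint . '?verb=ListSets');
$sets->registerXPathNamespace('oai', $ns);
foreach ($sets->xpath('//oai:set/oai:setSpec') as $setSpec) {
    $spec = (string)$setSpec;
    $xml = simplexml_load_file($endpoint . '?verb=ListIdentifiers&metadataPrefix=oai_dc&set=' . urlencode($spec));
    $xml->registerXPathNamespace('oai', $ns);
    $token = $xml->xpath('//oai:resumptionToken');
    $size = ($token && isset($token[0]['completeListSize'])) ? (string)$token[0]['completeListSize'] : '?';
    echo "$spec\t$size\n";
}
?>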
I have no intention of knocking the project, but from looking at the DataCite.org website and reading their promotional material (which talks of providing access to ‘data sets’) you do not get the impression that the contributors are, in fact, mainly libraries providing citations to publications rather than data. The data citations I have seen do not look like they will give permanent access to data. Look at this example, doi:10.5520/SAGECITE-1, which resolves to a website on a free Google service. How long is that going to last? If you read the Google terms and conditions, they give no warranty and may remove the service whenever they like. The site is called sagecitedemorepository. The clue is in the name. The data is hardly going to last longer than the DOI that is used to cite it, so why does it have a DOI? I’m confused. What value have DataCite.org added to this process other than to indicate that something will persist that was clearly never intended to? Where is the quality control? What does it mean for a piece of data to have a DOI?
I will watch with interest to see how this develops and whether it makes the leap to linking to significant quantities of raw scientific data that is being properly curated.
Here is the code and data from my analysis – you can run it again as command-line PHP scripts if you like: datacite
Your comments and corrections are always welcome!
I should have googled the DOI first. There was a bug associated with this DOI in that it was resolving to the wrong publication… but the example I give above is the correct link.
Here I am responding to my own blog again… This video of Jeremy Keith at buildconf.org is very relevant: http://vimeo.com/34269615 – suggested by @charlesroper
@DataCiteTech Tweeted: @rogerhyam @epentz 1.3m is correct, but not for all of them we have metadata in our repository yet.
I Tweeted back: @DataCiteTech If you have DOIs without metadata in a repository how do they work? Example?
I Tweeted back again: @DataCiteTech Put another way. Have you issued 600k + DOIs that don’t resolve?
Hi Roger,
This is Jan Brase from DataCite.
Thank you for the analysis. I know that some things about our DOI numbers are a bit confusing. DataCite has registered over 1.3 million DOI names. The reason that we haven’t yet got describing metadata for all of these records is that some of our members have registered around 500k DOIs but haven’t got round to uploading their metadata yet. I nevertheless understand that this is going to happen now. The URL is of course given for every DOI name registered; it’s only the describing metadata that is missing for these records.
The metadata store only started at the beginning of the year. Before that, the members used the central DOI registration but stored the metadata locally. That is also why most of the metadata uploads are from December last year, when we started to harvest the metadata for the DOI names registered some years ago.
DataCite itself offers a central service (the DOI registration and the central metadata repository) but is run by 16 members who individually work with data centers. These data centers have their own workflows and quality criteria. Usually the data centers agree to certain criteria and preservation strategies. I agree with you that doi:10.5520/SAGECITE-1 is an example of a less-than-ideal solution.
Generally the granularity of the data is up to the data centers and their disciplines; what some disciplines see as cite-worthy data granularity might differ a lot from what others do.
I would be happy to show you many more examples that are actually used and cited.
Best,
Jan
Thanks for this Jan. I’m still a little confused. If a DataCite.org DOI redirects me to an HTTP URI for the data and that returns an HTTP 404 Not Found, who do I contact? DataCite? The DOI Foundation? The data center? The owner of the domain in the failed URI?
In this stack of organisations, who is responsible for the persistence of the data?
Citing example DOIs would always be appreciated.
Roger
If a DOI resolves to a 404, this is the absolute worst case imaginable. If it is a DataCite DOI you should contact DataCite and we will contact the data center.
Best,
Jan
Just to put this to bed: after a couple of Twitter exchanges it looks like the ~400k DOIs not in the OAI-PMH repository do ‘resolve’, but they don’t respond to a request for RDF, so they aren’t good semantic web citizens; the plan is that they will be once their metadata is submitted to DataCite.org.
If you curl a proxied DOI like this, it will print out the headers (sorry Windows users, you need a real computer for this to work):
curl -I -L -H "Accept: application/rdf+xml" http://dx.doi.org/10.1594/WDCC/CCCMA_CGCM2_SRES_A2
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://data.datacite.org/10.1594%2FWDCC%2FCCCMA_CGCM2_SRES_A2
Expires: Fri, 27 Jul 2012 10:47:30 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 208
Date: Thu, 26 Jul 2012 14:11:53 GMT
HTTP/1.1 204 No Content
Date: Thu, 26 Jul 2012 14:11:53 GMT
Server: Apache-Coyote/1.1
Via: 1.1 data.datacite.org
Content-Type: text/plain
I guess 204 is not a 404 so Jan’s worst case is not occurring!