Assessing Open Access of Repositories

This blog post is part of an assignment for the Open Science course offered by the . In this week's assignment, we were asked to assess the openness of the following three repositories:

The Data Basin repository provides environmental information, such as physical locations and qualitative and quantitative measurements. Although the website allows non-registered users to search and visualize datasets, it requires an account to contribute to any dataset. The default license for datasets is the Creative Commons Attribution license (i.e. CC-BY), but any user can choose a more restrictive license. Of the three repositories investigated, this is the one that best meets the principles of open data.

The Cambridge Structural Database contains data about small-molecule organic and metal-organic crystal structures. The data in the database is copyrighted by the . The use of the data is restricted to research and academic purposes, and the data cannot be republished or used for commercial purposes. The license agreement even stipulates that the data must be deleted within 14 days of being downloaded. In other words, access to the data is not very open.

The Life Science Data Repositories provides information about space missions and the experiments that took place during them. Although the website provides a search tool for viewing the descriptions of the datasets, the actual data is protected via the . However, users can request access to the data via the .

What is Open Data?

The summarizes "Open Data" as

A piece of data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.
More specifically, the open data movement aims to make data (e.g. scientific datasets) freely available to people and organizations to re-use and republish as they wish through . Note that it is recommended to include any additional data generated from analysis, etc. as part of the republished dataset.
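
The definition quoted above can be read as a simple test: a license is open if it grants use, reuse, and redistribution, and imposes no condition other than attribution or share-alike. The following is a minimal sketch of that test in Python. The `License` record, its flags, and the "research-only" example are hypothetical simplifications made up for illustration; they are not an official encoding of any real license text.

```python
from dataclasses import dataclass, field

# Conditions the quoted definition tolerates: attribution and share-alike.
ALLOWED_CONDITIONS = {"attribution", "share-alike"}


@dataclass(frozen=True)
class License:
    """Hypothetical, simplified description of a data license."""
    name: str
    can_use: bool
    can_reuse: bool
    can_redistribute: bool
    conditions: frozenset = field(default_factory=frozenset)


def is_open(lic: License) -> bool:
    """Return True if the license fits the definition quoted above."""
    has_freedoms = lic.can_use and lic.can_reuse and lic.can_redistribute
    # Any condition beyond attribution/share-alike makes the data non-open.
    only_allowed = lic.conditions <= ALLOWED_CONDITIONS
    return has_freedoms and only_allowed


# Illustrative records (assumed values, simplified by hand).
cc_by = License("CC-BY", True, True, True, frozenset({"attribution"}))
cc_by_sa = License(
    "CC-BY-SA", True, True, True, frozenset({"attribution", "share-alike"})
)
research_only = License(
    "research-only (illustrative)", True, False, False,
    frozenset({"non-commercial"}),
)

for lic in (cc_by, cc_by_sa, research_only):
    print(f"{lic.name}: {'open' if is_open(lic) else 'not open'}")
```

Under these assumptions, a license that forbids redistribution or adds a non-commercial condition fails the test, which matches the informal assessment of the more restrictive repositories.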

Over the last few years, governments and scientific institutions have published a wide range of datasets. The website published a list of 294 sites across 50 countries, including the United States and Belgium, covering domains such as science, government, and economics. Ben Jones has created a to easily navigate across the different sites.

Re-using Open Access Content

As part of my work in the legal publishing industry, we were asked to gather information about companies (e.g. registered name, address, homepage, and stock exchange ticker symbol). Creating this dataset from scratch was such a daunting task that I started investigating the re-use of existing datasets on the Web.
Due to the requirement to represent the data in , we started looking at open linked data repositories (i.e. , , and the ). DBpedia aims at extracting information in as structured data and is available through the CC Attribution-ShareAlike (i.e. CC-BY-SA) license, while Freebase is another repository of structured data and is used by Google to drive its feature. Note that content from Freebase is available through the CC Attribution (i.e. CC-BY) license (but cannot be used for commercial use). However, analysis of the content in these repositories showed that (i) the information was not expressed consistently, (ii) it was often incomplete, or (iii) required information (e.g. the stock exchange ticker symbol) was missing.
As a result, we used Google to determine whether there were any more suitable datasets for our problem. Out of the hundreds of repositories mentioned, we performed an in-depth analysis of the and the . The CorpWatch API is funded by the . Its dataset is based on company information extracted from 10-K filings submitted to the Securities and Exchange Commission and is provided as structured data through its API. Although the content set does not fall under any particular license, the copyright holders request that contributions to the data be made public. The OpenCorporates repository contains information for more than 55,000,000 companies across the world and is the most complete in terms of the data available. The content is available through the CC Attribution-ShareAlike (i.e. CC-BY-SA) license for non-commercial use.
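
The kind of completeness analysis described above for the candidate repositories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS` are hypothetical labels for the attributes we needed (registered name, address, homepage, ticker symbol), and the sample records are made up rather than taken from any of the repositories.

```python
# Hypothetical field names for the company attributes we were asked to gather.
REQUIRED_FIELDS = ("registered_name", "address", "homepage", "ticker_symbol")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(record.get(name) or "").strip()
    ]


def completeness(records: list) -> float:
    """Return the fraction of records that carry every required field."""
    if not records:
        return 0.0
    complete = sum(1 for record in records if not missing_fields(record))
    return complete / len(records)


# Made-up sample records, for illustration only.
sample = [
    {
        "registered_name": "Example Corp",
        "address": "1 Main St",
        "homepage": "http://example.com",
        "ticker_symbol": "EXM",
    },
    {
        "registered_name": "Sample Ltd",
        "address": "",
        "homepage": "http://example.org",
    },
]

for record in sample:
    print(record["registered_name"], "is missing:", missing_fields(record))
print("share of complete records:", completeness(sample))
```

Running such a check over a sample of records from each repository gives a rough, comparable measure of how much manual work would remain before the data could be re-used.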

Introduction to Open Access

Open Access (OA) is the practice of providing unrestricted access to and use of content (e.g. research data, academic publications, governmental data) via the World Wide Web. For instance, the provides free access to 9948 journals covering a wide range of domains, such as Law and Political Science, Computer Science, and Agriculture.
Although the modern OA movement can be traced to the mid-1960s, its principles can be attributed to Paul Otlet (1868-1944), who began the creation of an open repository (called ) of facts in 1895. The following year, he developed a mail-based question answering service using the 400,000 facts that had been accumulated. Nowadays, the repository contains over 15 million facts and the service is seen as a precursor to online search.
With the advent of the World Wide Web in the mid-1990s, the focus on open access to scholarly material has become more prominent. For instance, launched a website offering a free search service for scientific and academic papers. Although it did not always allow access to a paper, it provided a database of bibliographic information (e.g. citations). In the last few years, the movement has gained even more prominence, with many countries defining manifestos to make governmental data publicly available. In June 2013, the G8 published a to provide open governmental data to their constituents.
From a legal perspective, the OA movement has been made possible through the expiration of copyrights or by copyright holders consenting to make content freely available. The permission to access and re-use content can be expressed via one of the Creative Commons licenses. For instance, the attribution license (i.e. CC-BY) allows third parties to distribute, remix, tweak, and build upon someone's work as long as they credit the original source. Through this licensing model, content made available through open access can be legally shared and re-used. The image below describes the different types of Creative Commons licenses. Note that the image was originally part of an article on how to publish a book under a Creative Commons license.