As part of my work in the legal publishing industry, my team was asked to gather information about companies (e.g. registered name, address, homepage, and stock exchange ticker symbol). Creating this dataset from scratch was such a daunting task that I started investigating the reuse of existing datasets on the Web.
Because the data had to be represented in RDF, we started by looking at open linked data repositories (namely DBpedia, Freebase, and the New York Times). DBpedia extracts information from Wikipedia as structured data and is available under the CC Attribution-ShareAlike (CC-BY-SA) license. Freebase is another repository of structured data, used by Google to drive its Knowledge Graph feature. Note that content from Freebase is available under the CC Attribution (CC-BY) license (but cannot be used commercially). However, analysis of the content in these repositories showed that (i) the information was not expressed consistently, (ii) it was often incomplete, or (iii) required information (e.g. the stock exchange ticker symbol) was missing.
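To make the completeness problem concrete, here is a minimal sketch of how a company record could be checked for the required fields and written out as RDF (N-Triples). The schema.org property names, the example.org subject IRI, and the sample record are all assumptions chosen for illustration; they are not the vocabulary or data we actually used.

```python
# Hypothetical field-to-property mapping; the schema.org terms are an
# assumption, not the vocabulary used in the original project.
SCHEMA = "http://schema.org/"
REQUIRED = {
    "name": SCHEMA + "legalName",
    "address": SCHEMA + "address",
    "homepage": SCHEMA + "url",
    "ticker": SCHEMA + "tickerSymbol",
}


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED if not record.get(field)]


def escape(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(subject_iri, record):
    """Serialize the non-empty required fields as N-Triples lines."""
    lines = []
    for field, prop in REQUIRED.items():
        value = record.get(field)
        if not value:
            continue  # skip fields the source repository did not provide
        if field == "homepage":
            obj = "<%s>" % value  # the homepage is written as an IRI
        else:
            obj = '"%s"' % escape(value)  # everything else as a plain literal
        lines.append("<%s> <%s> %s ." % (subject_iri, prop, obj))
    return lines


# Made-up example record that lacks a ticker symbol.
record = {
    "name": "Example Corp",
    "address": "1 Example Street, Springfield",
    "homepage": "http://www.example.com/",
}
print(missing_fields(record))
for line in to_ntriples("http://example.org/company/example-corp", record):
    print(line)
```

Running a check like this over records pulled from each repository is one way to quantify how often a required field such as the ticker symbol is absent.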
Since none of these repositories met our needs, we used Google to determine whether there were any more suitable datasets for our problem. Out of the hundreds of repositories mentioned, we performed an in-depth analysis of the CorpWatch API and the OpenCorporates repository. The CorpWatch API is funded by the Sunlight Foundation. Its dataset is built by extracting information about companies from the 10-K filings they submit to the Securities and Exchange Commission, and it is provided as structured data through the API. Although the content set does not fall under any particular license, the copyright holders request that contributions to the data be made public. The OpenCorporates repository contains information on more than 55,000,000 companies across the world and is the most complete in terms of the data available. The content is available under the CC Attribution-ShareAlike (CC-BY-SA) license for non-commercial use.