About Transparency International Linked Data

What is this?

The data collected from Transparency International consists of Corruption Perceptions Index data. Both the data and the accompanying metadata were collected by downloading the Excel datasets and PDF reports from their website.

The data can be used to derive statistical information that’s consumable by humans in the form of charts (see also the World Bank Linked Data’s Tools section), or compared to statistics from other organizations.

The purpose of the Transparency International Linked Data is to allow consumers and publishers to merge this data with their own, or to link to it for more information.

Given that this is the initial Linked Data release, these datasets are not extracted, transformed, and loaded on a regular schedule. The last update was made on 2012-10-03.

Who is behind this?

The Transparency International Linked Dataspace was created by Sarven Capadisli.

DERI, NUI Galway graciously offered to host this service on their servers.

Process

The original Excel (XLS) files were transformed to RDF Turtle using Google Refine + RDF Extension.

Apache Jena’s TDB storage system and Fuseki are used to run the SPARQL server. The HTML pages are generated by the Linked Data Pages framework, where Moriarty, Paget, and ARC2 do the heavy lifting.

SPARQL Endpoint

A public SPARQL endpoint is available, which accepts SPARQL 1.1 queries.
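
As a rough illustration of how a client might talk to the endpoint, the sketch below builds a SPARQL 1.1 Protocol GET request using only the Python standard library. The endpoint address http://transparency.270a.info/sparql is an assumption (this page does not spell it out), and the helper names build_query_url and run_select are hypothetical. The sample query assumes that observations are linked to their dataset with qb:dataSet, as is usual with RDF Data Cube, and uses the 2011 dataset URI that follows the URI patterns section.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint location; check the site for the actual address.
ENDPOINT = "http://transparency.270a.info/sparql"

# Lists a few observations from the 2011 Corruption Perceptions Index dataset.
QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?observation
WHERE {
  ?observation a qb:Observation ;
               qb:dataSet <http://transparency.270a.info/dataset/corruption-perceptions-index/2011> .
}
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL Protocol GET URL with the query percent-encoded."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint: str, query: str) -> list:
    """Send the query and return the result bindings (needs network access)."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_select(ENDPOINT, QUERY) should return a list of bindings, where each row's observation URI is found under row["observation"]["value"], provided the assumed endpoint address is correct.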

About the datasets

There is a VoID file which contains metadata for the datasets. It includes, but is not limited to: the locations of the RDF data dumps, the named graphs used in the SPARQL endpoint, the vocabularies used, and the dataset sizes. Statistics for the VoID file are generated using LODStats. The data dumps are available in RDF Turtle format, either uncompressed or gzip-compressed.

RDF/XML, Turtle, and JSON serialization formats are supported for the resolvable URIs on this site. However, the resources in the dataset use generic URIs, i.e., they don't carry a file extension for the serialization format.
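
Since the URIs carry no format extension, a client would typically pick a serialization through HTTP content negotiation. The following is a minimal sketch of that idea, assuming the server honours the Accept header; the media type used for JSON is a guess, and the helper names build_request and fetch are hypothetical.

```python
import urllib.request

# Media types for the serializations mentioned above. The JSON media type is
# an assumption; the server may expect a different one.
MEDIA_TYPES = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "json": "application/json",
}


def build_request(uri: str, serialization: str) -> urllib.request.Request:
    """Build a GET request asking for one serialization of a generic URI."""
    return urllib.request.Request(uri, headers={"Accept": MEDIA_TYPES[serialization]})


def fetch(uri: str, serialization: str = "turtle") -> str:
    """Dereference the URI and return the response body (needs network access)."""
    with urllib.request.urlopen(build_request(uri, serialization), timeout=30) as response:
        return response.read().decode("utf-8")
```

For example, fetch("http://transparency.270a.info/classification/country/CA") would ask for the Turtle description of Canada.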

Currently, the 2009, 2010, and 2011 Corruption Perceptions Indexes are RDFized (note: this is not a word!)
Dataset | Completeness | # of triples
Corruption Perceptions Index | Incomplete | 38641

The Transparency International Metadata consists of 2893 triples.

Decisions on source data

The original data has some limitations and inconsistencies, which introduced an extra layer of problems. In order to arrive at a proper and useful Linked Data representation, some of these problems were solved either with a script or by manual curation, and others were brought to the Transparency International team’s attention for investigation.

Data modeling

The data is primarily composed of observations (i.e., a survey or assessment) modeled using the RDF Data Cube vocabulary. There are also code lists for classifications like countries, sources (i.e., the organizations that originally gathered the data), and so on.

Data interlinking

The countries in the dataset are interlinked (~731 links) with DBpedia, World Bank Linked Data, Eurostat Linked Data, and Geonames using the LInk discovery framework for MEtric Spaces (LIMES). More interesting interlinks coming soon!

Additional interlinking was done by adding links to resources with corresponding homepages on the Transparency International site, as well as links to referenced documents (e.g., definition of concepts).

Vocabularies

Besides RDF, RDFS, XSD, and OWL, the most common vocabularies in these datasets are: RDF Data Cube for modeling statistical observations, SDMX for statistical codes, British reference periods (Year, Gregorian Interval), SKOS, and DC Terms. Where appropriate, new properties and classifications were created to represent Transparency International Linked Data. The URI patterns section gives a further breakdown of this.

In the case of country codes, it should be noted that the two-letter ISO 3166-1 alpha-2 code is used as the primary representation for countries. For example, the URI http://transparency.270a.info/classification/country/CA identifies the country Canada in the datasets.

Blank nodes

Good news everyone! There are no blank nodes in the data itself. The only few that exist are due to the DataStructureDefinition (DSD) from RDF Data Cube.

Normalization

Trimmed whitespace at start and end of strings.
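
The trimming rule can be sketched in a few lines of Python. This is only an illustration of the rule, not the transformation that was actually run (that was done with Google Refine); the function names and the sample row are made up.

```python
def normalize_cell(value):
    """Trim leading and trailing whitespace from strings; leave other values as-is."""
    return value.strip() if isinstance(value, str) else value


def normalize_row(row):
    """Apply the trimming rule to every cell in a row of source data."""
    return [normalize_cell(cell) for cell in row]


# Made-up sample row, not taken from the actual data.
sample = ["  Example country ", 1.5, "XX\t"]
cleaned = normalize_row(sample)  # ["Example country", 1.5, "XX"]
```

Whitespace inside a string is left untouched; only the start and end are trimmed.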

Data provenance

As part of data enrichment, triples pertaining to provenance were added in order to provide some extra metadata about the data. For the datasets and the observations in these datasets, they address the following information:

Provenance in Transparency International Linked Datasets
Type of provenance | Transparency International
Defining source | rdfs:isDefinedBy
License | dcterms:license
Source location | dcterms:source
Related resource | dcterms:hasPart, dcterms:isPartOf
Creator of the data | dcterms:creator
Publisher of the data | dcterms:publisher
Issued date | dcterms:issued
Modified date | dcterms:modified

URI patterns

Classifications
http://transparency.270a.info/classification/{id}, where id is one of: country, source, indicator, attribute.
Properties
http://transparency.270a.info/property/{id}, where id is one of: percentile-90-lower, percentile-90-upper, rank, score, source, indicator, surveys-used.
Data Cube datasets
http://transparency.270a.info/dataset/corruption-perceptions-index/{year} and http://transparency.270a.info/dataset/corruption-perceptions-index/{year}/sources, where year is in yyyy format.
Named graphs in RDF store
http://transparency.270a.info/graph/{id}, where id is one of: meta, corruption-perceptions-index.
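
The patterns above can be expressed as small helper functions, for instance when generating URIs in a script. This is a sketch only: the function names are hypothetical, and appending a code to a classification URI is modelled on the country/CA example and is an assumption for the other classifications.

```python
BASE = "http://transparency.270a.info"

# Allowed identifiers, as listed in the URI patterns.
CLASSIFICATIONS = {"country", "source", "indicator", "attribute"}
PROPERTIES = {
    "percentile-90-lower", "percentile-90-upper", "rank", "score",
    "source", "indicator", "surveys-used",
}
GRAPHS = {"meta", "corruption-perceptions-index"}


def _check(value, allowed, kind):
    """Return value if it is a known identifier, otherwise raise ValueError."""
    if value not in allowed:
        raise ValueError(f"unknown {kind}: {value!r}")
    return value


def classification_uri(scheme, code=None):
    """Classification URI; appending a code mirrors the country/CA example."""
    uri = f"{BASE}/classification/{_check(scheme, CLASSIFICATIONS, 'classification')}"
    return f"{uri}/{code}" if code else uri


def property_uri(name):
    """URI of one of the properties listed above."""
    return f"{BASE}/property/{_check(name, PROPERTIES, 'property')}"


def dataset_uri(year, sources=False):
    """Corruption Perceptions Index dataset URI for a four-digit year."""
    if not (isinstance(year, int) and 1000 <= year <= 9999):
        raise ValueError("year must be a four-digit integer (yyyy)")
    uri = f"{BASE}/dataset/corruption-perceptions-index/{year}"
    return f"{uri}/sources" if sources else uri


def graph_uri(name):
    """URI of one of the named graphs in the RDF store."""
    return f"{BASE}/graph/{_check(name, GRAPHS, 'graph')}"
```

With these helpers, classification_uri("country", "CA") yields the URI for Canada, and dataset_uri(2011, sources=True) yields the sources dataset for 2011.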

Notes

The alternate formats listed (at the bottom of the HTML page) for a given resource are currently generated versions (from a SPARQL query). They may contain additional triples, like labels for the vocabulary terms, that are not in the RDF dumps, so you should keep the difference in mind.

Source Code

The code which retrieves the Transparency International data, transforms it to RDF serializations, and imports it into the TDB triple store can be found on GitHub: csarven/transparency-linked-data. It is licensed under the Apache License 2.0.

Terms of use

The material on this site is not endorsed by Transparency International. The data on this site comes with no warranty. Hence, I am not responsible if chaos ensues on any level at any point in time, in any universe, in any dimension, in any anything. My responsibility is to make sure that the data here is represented using the Linked Data design principles. Unless stated otherwise, the data is not altered during the transformation from the Transparency International data source. You shall assume that this data does contain errors, and you shall resort to the data provided by the original content provider if you have doubts about its validity. Make sure to double-check with Transparency International’s terms and restrictions as well, and everything else that they say. If you spot errors in the data here, let’s fix them. If I’ve done something else wrong, please inform me so I can make it right. If you agree with this paragraph, you may use this data.

Data License

With the exception of Transparency International’s own licensing, the Linked Data version of this data is licensed under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Datasets from Transparency International
This table data is from a SPARQL query.
Name | Source
Corruption Perceptions Index 2009 | http://archive.transparency.org/policy_research/surveys_indices/cpi/2009/cpi_2009_table
Corruption Perceptions Index 2010 | http://archive.transparency.org/content/download/56231/898923/CPI+2010+results_pls_standardized_data.xls
Corruption Perceptions Index 2011 | http://files.transparency.org/content/download/313/1264/file/CPI2011_DataPackage.zip
Page notice
  • Last updated on 2012-10-03.