Getting to the bottom of statistics: Software utilizes data from the Internet for interpreting statistics

Date:
July 16, 2012
Source:
Technische Universität Darmstadt
FULL STORY

Interpreting the results of statistical surveys, such as Transparency International's corruption indices, is not always a simple matter. As Dr. Heiko Paulheim of the Knowledge Engineering Group at TU Darmstadt's Computer Science Department put it, "Although methods that will unearth explanations for statistics are available, they are confined to utilizing data contained in the statistics involved. Background information is not taken into account. That is what led us to the idea of applying data-mining methods that we had been studying here to the semantic web in order to obtain further background information that will allow us to learn more from statistics."

The "Explain-a-LOD" tool that Paulheim developed accesses linked open data (LOD), i.e., enormous compilations of publicly available, semantically linked data accessible on the Internet, and, from that data, automatically formulates hypotheses regarding the interpretation of arbitrary types of statistics. To start off, the statistics to be interpreted are read into Explain-a-LOD. Explain-a-LOD then automatically searches the pools of linked open data for associated records and adds them to the initial set. Paulheim explained that if, for example, the country "Germany" is listed in the corruption-index data, LOD records that contain information on Germany will be identified, and further attributes, such as its population, its membership in the EU and OECD, or the total number of companies domiciled there, will be generated. Attributes that are unlikely to yield useful hypotheses are automatically deleted in order to reduce the volume of the enriched statistics.
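The enrichment and pruning steps described above can be pictured with a minimal Python sketch. It is not Explain-a-LOD's actual code: the `background` dictionary is a hypothetical in-memory stand-in for a linked-open-data lookup, the entity names and all numbers are invented placeholders, and the pruning rule (drop attributes that are constant or mostly missing) is only one plausible assumption about what "unlikely to yield useful hypotheses" might mean.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[float]]

# Toy input statistic: entity -> target value (invented placeholder numbers).
statistic: Dict[str, float] = {
    "Country A": 7.9,
    "Country B": 3.1,
    "Country C": 5.4,
    "Country D": 2.2,
}

# Hypothetical stand-in for records found in linked open data.
# A real system would query LOD sources; here the lookup is a plain dict.
background: Dict[str, Dict[str, float]] = {
    "Country A": {"oecd_member": 1.0, "population_millions": 80.0, "is_country": 1.0},
    "Country B": {"oecd_member": 0.0, "population_millions": 12.0, "is_country": 1.0},
    "Country C": {"oecd_member": 1.0, "is_country": 1.0},
    "Country D": {"oecd_member": 0.0, "population_millions": 45.0,
                  "is_country": 1.0, "rare_attr": 3.0},
}


def enrich(stat: Dict[str, float], lod: Dict[str, Dict[str, float]]) -> Dict[str, Row]:
    """Attach every background attribute to each entity; missing values become None."""
    attributes = sorted({name for record in lod.values() for name in record})
    table: Dict[str, Row] = {}
    for entity, target in stat.items():
        found = lod.get(entity, {})
        row: Row = {name: found.get(name) for name in attributes}
        row["target"] = target
        table[entity] = row
    return table


def prune(table: Dict[str, Row], min_coverage: float = 0.5) -> Dict[str, Row]:
    """Drop attributes that are constant or known for too few entities."""
    if not table:
        return {}
    names = [n for n in next(iter(table.values())) if n != "target"]
    keep = []
    for name in names:
        values = [row[name] for row in table.values() if row[name] is not None]
        coverage = len(values) / len(table)
        if coverage >= min_coverage and len(set(values)) > 1:
            keep.append(name)
    return {e: {k: row[k] for k in keep + ["target"]} for e, row in table.items()}


enriched = prune(enrich(statistic, background))
```

In this toy run, `is_country` is dropped because it is the same for every entity, and `rare_attr` is dropped because it is known for only one of four entities; `oecd_member` and `population_millions` survive alongside the target value.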

Once that preprocessing has been concluded, Explain-a-LOD proceeds to the second stage and automatically formulates hypotheses based on the enriched statistics. The methods employed include simple correlation analyses, as well as other methods for recognizing regularities in statistical data, which allow the formulation of more complex hypotheses covering more than a single attribute. Users are then presented with the resulting hypotheses in the form of phrases such as "OECD-member countries have low corruption indices" if a positive correlation exists between the attribute "OECD membership" and the target attribute, "corruption index," regardless of whether the original statistics contained any reference to countries' OECD membership or lack thereof. That background knowledge is automatically taken into account by Explain-a-LOD.
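As a rough illustration of the simple correlation analysis mentioned above, the following Python sketch computes a Pearson correlation between each attribute and the target and turns strong correlations into a sentence. The function names, the 0.6 threshold, the sentence template, and the toy rows are all assumptions made for this example; the article does not say how Explain-a-LOD scores or phrases its hypotheses, and the more complex multi-attribute methods are not covered here.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def hypotheses(rows: List[Dict[str, Optional[float]]], target: str,
               threshold: float = 0.6) -> List[str]:
    """Return one sentence per attribute whose |r| with the target meets the threshold."""
    if not rows:
        return []
    scored = []
    for name in (k for k in rows[0] if k != target):
        # Use only rows where both the attribute and the target are known.
        pairs = [(r[name], r[target]) for r in rows
                 if r.get(name) is not None and r.get(target) is not None]
        if len(pairs) < 3:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) >= threshold:
            level = "high" if r > 0 else "low"
            text = f"Entities with high {name} tend to have {level} {target} (r = {r:+.2f})"
            scored.append((abs(r), text))
    # Strongest correlations first.
    return [text for _, text in sorted(scored, reverse=True)]


# Invented placeholder rows, standing in for an enriched statistic.
rows = [
    {"oecd_member": 1.0, "radio_stations": 4.0, "score": 8.0},
    {"oecd_member": 1.0, "radio_stations": 1.0, "score": 7.0},
    {"oecd_member": 0.0, "radio_stations": 3.0, "score": 3.0},
    {"oecd_member": 0.0, "radio_stations": 2.0, "score": 2.0},
]

found = hypotheses(rows, target="score")
```

With these toy rows, only `oecd_member` clears the threshold, so `found` holds a single sentence about it. A correlation of this kind says nothing about causation, which fits the article's later remark that readers doubted the trustworthiness of the generated hypotheses.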

Surprising and useful hypotheses

Paulheim and his colleagues have thoroughly tested their approach on various sorts of statistics, including Mercer's standard-of-living study and Transparency International's corruption index. Paulheim noted that, "What one obtains are mixtures of obvious and surprising hypotheses, such as 'cities where temperatures do not exceed 21°C during the month of May have high standards of living,' 'capital cities generally have lower standards of living than other cities,' or 'countries that have few schools and few radio stations have high corruption indices.'" An evaluation of the results by test subjects confirmed that impression. Paulheim added that, "The test persons perceived the resultant hypotheses as largely surprising, as well as nontrivial, and, very frequently, as useful." However, the test subjects had serious doubts regarding the trustworthiness of the resulting hypotheses, which, Paulheim noted, was partly attributable to the unsatisfactory quality of some of the data contained in the open-data cloud.

Explain-a-LOD has been presented at several international conferences over the past few months. The tool received the "Best In-Use Paper" and "Best Demo" awards at the Extended Semantic Web Conference 2012, held on Crete in late May. Several upgrades to Explain-a-LOD are planned, among them the implementation of further attribute-generation algorithms and facilities for accessing additional data pools from the LOD cloud.

Further information: http://www.ke.tu-darmstadt.de/resources/explain-a-lod


Story Source:

Materials provided by Technische Universität Darmstadt. Note: Content may be edited for style and length.


Cite This Page:

Technische Universität Darmstadt. "Getting to the bottom of statistics: Software utilizes data from the Internet for interpreting statistics." ScienceDaily. ScienceDaily, 16 July 2012. <www.sciencedaily.com/releases/2012/07/120716091925.htm>.
