DBpedia entity types in 31 languages

New dataset

The second version of the dataset contains articles extracted from the Wikipedia editions in 31 languages: Albanian, Belarusian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Icelandic, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, and Ukrainian.

Files are saved as RDF triples, using the relation http://airpedia.org/ontology/type_with_conf#N, where N is an integer between 6 and 9 indicating the reliability of the guessed class: the higher the number, the more precise the classification. When N is 9, the accuracy is around 90%. A detailed evaluation of the dataset has been performed on six languages (see below).
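A minimal sketch of filtering these triples by confidence level, assuming plain N-Triples input; the entity and class URIs in the sample lines are illustrative, not taken from the actual dump:

```python
import re

# Hypothetical sample lines in the shape described above (N-Triples).
triples = [
    '<http://dbpedia.org/resource/X> <http://airpedia.org/ontology/type_with_conf#9> <http://dbpedia.org/ontology/Person> .',
    '<http://dbpedia.org/resource/Y> <http://airpedia.org/ontology/type_with_conf#6> <http://dbpedia.org/ontology/Place> .',
]

# The confidence level N is encoded in the predicate URI itself.
PRED = re.compile(r'<http://airpedia\.org/ontology/type_with_conf#(\d)>')

def filter_by_confidence(lines, min_conf=9):
    """Keep only triples whose confidence level N is at least min_conf."""
    kept = []
    for line in lines:
        m = PRED.search(line)
        if m and int(m.group(1)) >= min_conf:
            kept.append(line)
    return kept

high_precision = filter_by_confidence(triples, min_conf=9)
```

Lowering `min_conf` trades precision for coverage, mirroring the reliability levels described above.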


ESWC dataset (old)

The first version of the dataset contains articles extracted from the entire Wikipedias in English, German, Italian, Spanish, French, and Portuguese, automatically mapped to the DBpedia Ontology classes. Files are saved as RDF quads.

  • The first column contains the DBpedia entity (for example <http://dbpedia.org/resource/Barack_Obama>).
  • The second column contains <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
  • The third column contains the DBpedia ontology type (for example <http://dbpedia.org/ontology/Politician>).
  • The fourth column contains the source of the information.

Regarding the last column, there are four possible values, depending on the source file:

  • <http://airpedia.org/extraction/10-all>
    This dataset is extracted using our classifier with all kernels and the bottom-to-top strategy.
  • <http://airpedia.org/extraction/10-all-top>
    This dataset is extracted using our classifier with all kernels and the top-to-bottom strategy.
  • <http://airpedia.org/extraction/10-tpl>
This dataset is extracted using our classifier with the template kernel only and the bottom-to-top strategy.
  • <http://airpedia.org/extraction/dbpedia-cl>
    This dataset is the result of the first step (cross-language links) and contains the deepest class found in DBpedia.

The first three resources contain only pages that are not in DBpedia, while the last one contains the complement (pages already in DBpedia).
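To split the quad files by provenance, one can group lines on the fourth column. A minimal sketch, where the sample quad is illustrative and the naive whitespace split assumes URI-only columns as described above:

```python
from collections import defaultdict

# A hypothetical quad in the four-column shape described above.
quads = [
    '<http://dbpedia.org/resource/Barack_Obama> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://dbpedia.org/ontology/Politician> '
    '<http://airpedia.org/extraction/10-all> .',
]

def group_by_source(lines):
    """Map each source URI (fourth column) to its list of triples."""
    groups = defaultdict(list)
    for line in lines:
        # Drop the trailing ' .' and split; safe because every column is a URI.
        subject, predicate, obj, source = line.rstrip(' .').split(' ')
        groups[source].append((subject, predicate, obj))
    return dict(groups)

by_source = group_by_source(quads)
```

For real-world use an RDF library would be more robust, but this shape is enough to separate, say, the 10-all triples from the dbpedia-cl ones.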


Evaluation of the dataset

The accuracy of the extracted dataset is evaluated on a collection of 400 randomly extracted entities not already present in DBpedia in any language. The collection is split into a development set (100 entities) and a test set (300 entities). All entities have been annotated with the most specific class available in version 3.8 of the DBpedia ontology. The test set is available in the download section.

We used a k-NN algorithm for classification and compared three alternative schemes based on three different strategies.

  • Bottom-up. The classification is performed directly on the leaves of the ontology tree.
  • Top-down. The classification is applied repeatedly, starting from the top-level classes and descending toward the leaves.
  • Hybrid. This variant first trains a k-NN as defined in the bottom-up scheme; then a set of specialized k-NNs is trained for the most populated classes, such as Person, Organisation, Place, and Work.
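The top-down strategy above can be sketched as a recursive descent over the ontology tree. The toy ontology fragment and the feature-overlap classifier below are illustrative stand-ins for the trained k-NNs, not the actual implementation:

```python
# Tiny illustrative fragment of an ontology tree (class -> subclasses).
ONTOLOGY = {
    'Thing': ['Person', 'Place'],
    'Person': ['Politician', 'Athlete'],
    'Place': [],
    'Politician': [],
    'Athlete': [],
}

def classify_among(entity_features, candidates):
    # Stand-in for a trained k-NN restricted to the candidate classes:
    # pick the candidate that overlaps most with the entity's features.
    return max(candidates, key=lambda c: len(entity_features & {c.lower()}))

def top_down(entity_features, node='Thing'):
    """Descend the ontology, re-classifying at each level until a leaf."""
    children = ONTOLOGY[node]
    if not children:
        return node
    return top_down(entity_features, classify_among(entity_features, children))
```

The bottom-up scheme would instead call `classify_among` once over all leaves, and the hybrid scheme would add specialized classifiers only beneath the most populated classes.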

The graph below shows the results (precision/recall). The red line corresponds to our classification algorithm using a subset of features that includes only templates. This can be seen as a baseline, since it uses the same strategy as the DBpedia mappings. The remaining three lines represent the three strategies explained above.

[Figure: precision/recall results for the template-only baseline and the three strategies]