Data Sets, Terminology Repositories, Data Schemes
- The Ontologies of Linguistic Annotation (OLiA) provide formal representations of annotation schemes and linguistic terminology for more than 70 languages
- The OLiA Discourse Extensions provide formal representations of annotations for discourse structure, discourse relations, information structure, information status and coreference
- POWLA, an OWL/DL implementation of a generic data model for linguistic annotations in NLP pipelines and corpora
- Cuneiform resources, including a machine-readable sign list
- Named entity recognition, including an RDF dump of 500,000 names linked to DBpedia
- Linked Old Germanic Dictionaries, a set of automatically constructed word lists and digitized etymological dictionaries for Germanic languages (under construction)
Software, Software Components and Demos
- We maintain an instance of ANNIS, a corpus information system that provides search and visualization facilities for linguistically annotated text.
This is supported by the LOEWE cluster Digital Humanities and the Institute for English Studies.
- access to the ANNIS server (login required for full access)
- access to the Anglistik portal
Bibliography analysis (demo)
- We develop a modular system for the fine-grained analysis of bibliographical information in scientific papers, based on an ensemble architecture of multiple machine learning components. We are currently exploring its integration with state-of-the-art rule-based approaches.
- This line of research is supported by Springer Science+Business Media and conducted in collaboration with Springer and Crest.
Abikwanne, a Machine Reading system
- Abikwanne (an Acholi phrase meaning "I will read it") takes born-digital PDF documents, applies an NLP analysis, extracts machine-readable information and augments it with references to external vocabularies. The resulting database can be queried with SPARQL 1.1 as well as with (experimental) natural language queries. Results are visualized graphically or as tables.
- Use case Contextualizing News Messages (description, a collaboration with the BigData Lab)
- Use case Archeology (presentation,
- A set of Java classes for merging morphosyntactic analyses produced by different NLP tools, using the OLiA ontologies
- SVN access
- Note that this project has been partially superseded by the NLP2RDF project. However, the implementation of multi-tool morphosyntactic annotation for German has not yet been replicated there.
- The Cognate aligner produces and visualizes a phonology-based word alignment of quasi-parallel text in closely related languages to support manual alignment. The alignment is based on phonological-orthographical distance metrics. It is illustrated here for the Old Saxon and Old High German gospel harmonies (Heliand and Tatian), using a Levenshtein distance derivative.
- For the example of the pater noster, matches are highlighted:
quað/quad "(he) spoke", queðad/quedet "(you) speak", thu bist (an) himila/thu bist (in) himile "you are at/in heaven", thîn namo/thin namo "your name", thîn uuilleo/thín uuillo "your will", erðo/erdu "earth", ...
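To illustrate how a Levenshtein distance derivative can capture such matches, here is a minimal Python sketch. It is not the Cognate aligner's actual metric: the set of "close" letter pairs, the 0.5 substitution cost and the function names are all assumptions chosen for illustration.

```python
import unicodedata

# Hypothetical letter pairs treated as phonologically/orthographically close.
# This list is an assumption for illustration, not the aligner's real table.
CLOSE_PAIRS = {
    frozenset("ðd"), frozenset("td"),
    frozenset("ou"), frozenset("ea"), frozenset("ie"),
}


def strip_diacritics(word):
    """Remove combining marks, so that 'thîn' and 'thín' both become 'thin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def substitution_cost(a, b):
    """0 for identical letters, 0.5 for close pairs, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CLOSE_PAIRS:
        return 0.5
    return 1.0


def weighted_levenshtein(source, target):
    """Edit distance with unit insert/delete cost and graded substitutions."""
    s = strip_diacritics(source.lower())
    t = strip_diacritics(target.lower())
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                         # deletion
                current[j - 1] + 1.0,                      # insertion
                previous[j - 1] + substitution_cost(a, b)  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(source, target):
    """Normalize the distance to a score in [0, 1]; 1 means identical."""
    longest = max(len(strip_diacritics(source)), len(strip_diacritics(target)))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(source, target) / longest


if __name__ == "__main__":
    for old_saxon, old_high_german in [("quað", "quad"), ("thîn", "thin"),
                                       ("uuilleo", "uuillo"), ("erðo", "erdu")]:
        print(old_saxon, old_high_german, similarity(old_saxon, old_high_german))
```

Word pairs whose similarity exceeds some threshold could then be proposed as alignment candidates for manual review. The threshold would have to be tuned on real data.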
Niko Schenk and Christian Chiarcos. 2016.
Unsupervised Learning of Prototypical Fillers for Implicit Semantic Role Labeling.
NAACL 2016. Association for Computational Linguistics.
PDF, bibtex, protofiller data for download
Samuel Rönnqvist, Niko Schenk and Christian Chiarcos. 2017.
A Recurrent Neural Model with Attention for the Recognition of Chinese Implicit Discourse Relations.
ACL 2017. Association for Computational Linguistics.
PDF, bibtex, source code
Niko Schenk and Christian Chiarcos. 2017.
Resource-Lean Modeling of Coherence in Commonsense Stories.
EACL LSDSem WS 2017. Association for Computational Linguistics.
PDF, bibtex, source code