The Wikipedia article on Linguistic Linked Open Data gives a very basic overview
that may serve as a first introduction.
[Figure: Linguistic Linked Open Data (LLOD) cloud diagram]
Even though somewhat dated, Hitzler et al.'s introduction to the Semantic Web is still highly readable in both its English and German versions, and it comes with considerable supplemental material. For practical applications of Linked Open Data in NLP, linguistics or philology, I recommend reading chapters 1, 2 and 7, optionally chapter 4 (German edition: chapters 1, 3 and 7, optionally chapter 5).
- Linked Data was introduced by Tim Berners-Lee in a 2006 design note for the (Semantic) Web and was subsequently coupled with the notion of Open Data.
Linguistic Linked Open Data (LLOD) refers to the application of Linked Open Data principles to language resources.
- To facilitate publishing LLOD data, the LIDER project created a number of reference cards.
Converting dictionaries and other lexical resources: OntoLex/lemon
The primary format for representing machine-readable dictionaries in RDF, and potentially as Linked Data, is the Lexicon Model for Ontologies (lemon), developed by the W3C OntoLex Community Group.
- Lexicon Model for Ontologies (W3C OntoLex Community Report, 10 May 2016)
- Guidelines for Linguistic Linked Data Generation (W3C BP-MLOD Community Group Reports, 29 September 2015)
- Use case: Converting an etymological dictionary from XML
Frank Abromeit, Christian Chiarcos, Christian Fäth, Maxim Ionov (2016), Linking the Tower of Babel: Modelling a Massive Set of Etymological Dictionaries as RDF, presented at the 5th Workshop on Linked Data in Linguistics (LDL-2016): Managing, Building and Using Linked Language Resources, Portorož, Slovenia, 24th May 2016. Co-located with LREC 2016.
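To give a flavor of the model, the following is a minimal sketch of an OntoLex-Lemon lexical entry in Turtle syntax. The entry and its URIs are hypothetical illustrations, not taken from any published dataset; only the ontolex vocabulary terms are from the specification.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix ex:      <http://example.org/lexicon/> .

# A lexical entry with its canonical (citation) form.
ex:cat a ontolex:LexicalEntry ;
    ontolex:canonicalForm ex:cat_form .

# The form carries the actual written representation.
ex:cat_form a ontolex:Form ;
    ontolex:writtenRep "cat"@en .
```

Senses, references to ontology concepts (ontolex:sense, ontolex:reference) and morphosyntactic properties can then be attached to the same entry, which is what makes the model attractive for converting full dictionaries.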
Corpora and linguistic annotations: OLiA+NIF/OA/POWLA
Tutorial on Annotation Interoperability with Linked Open Data and the Ontologies of Linguistic Annotation (Christian Chiarcos, EUROLAN-2015 summer school, Sibiu, Romania)
The representation of linguistic corpora in RDF is an area of active research. At the moment, we see a multitude of competing formats focusing on different types of applications, e.g.,
- The NLP Interchange Format (NIF).
Originally designed for NLP pipelines, NIF is well-suited for corpora with "simple" word-level, phrase-level and sentence-level annotations, and for word-level NLP tasks such as entity linking. It lacks support for annotation layers, non-branching syntactic nodes, empty elements (traces), labeled edges, conflicting tokenizations, semantic annotation, etc.
- Open Annotation (OA).
Originally developed for expressing metadata about web content, OA has been applied experimentally to represent philological corpora and biomedical annotations in text. However, these experiments are currently limited to selected collaboration projects. Given the relatively verbose model and the (from a linguistic point of view) counterintuitive terminology (hasTarget, hasBody), it is doubtful whether it will be widely adopted in the NLP and language resource community.
- POWLA is a full-fledged reconstruction of the ISO TC 37/SC4 LAF/GrAF model in RDF/OWL, specifically designed to facilitate corpus querying. Unlike NIF, it can express arbitrarily complex annotations; unlike OA, it is terminologically transparent. Like OA, it suffers from a certain degree of verbosity in comparison to NIF.
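To make the offset-based style of these formats concrete, the following sketch shows a single annotated token in NIF (Turtle syntax). The document URI and text are hypothetical; the nif vocabulary terms are from the NIF 2.0 core ontology.

```turtle
@prefix nif: <http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The full text of a (hypothetical) document, serving as the context.
<http://example.org/doc1#char=0,11> a nif:Context ;
    nif:isString "Cats sleep." .

# One token, anchored in the context by character offsets.
<http://example.org/doc1#char=0,4> a nif:Word ;
    nif:referenceContext <http://example.org/doc1#char=0,11> ;
    nif:anchorOf "Cats" ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "4"^^xsd:nonNegativeInteger .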