Version: 1.0
Long Name: Corpus of Aligned Read speech Including Annotations
Short Name: CARInA
Corpus Date: 2021/05/11

The CARInA uses the speech material of the German Spoken Wikipedia Corpus:
  Arne Köhn, Florian Stegen, Timo Baumann. 2016.
  "Mining the Spoken Wikipedia for Speech Data and Beyond".
  In Proceedings of LREC 2016.
The ISLRN for this resource is 684-927-624-257-3 and you can find the
resource under http://islrn.org/resources/684-927-624-257-3/.
The original authors of each article can be found under
https://de.wikipedia.org/wiki/ [name of the article]
The speech material was divided into sentences and sorted by
speaker. The phonetic alignment was used, and morphosyntactic, canonical
and prosodic annotations were added. The audio files were converted from .ogg to .wav,
and the annotations are provided in different file formats.

The CARInA consists of two parts: the corpus Complete and the corpus WorkInProgress.
The corpus Complete consists of all files that are fully annotated with orthographic, canonical,
morphosyntactic, syllabic, phonetic and prosodic information.

This README file describes the CARInA, which currently consists of 74686 sentences
recorded by 327 speakers.

Table of contents
-----------------
1.) Speech material
2.) Folder structure
3.) Files
4.) Phonemic annotation
5.) Morphosyntactic and canonic annotation
6.) Prosodic annotation
7.) Content Status
8.) Missing Sentences
9.) License
10.) How to cite




Speech material
---------------
The speech material was recorded by the speakers themselves for the online encyclopedia Wikipedia.
The recordings were not made under specific conditions and no specific microphone was used.
Most of the speakers are not professional speakers.




Folder structure
----------------
The CARInA is structured in two main parts with the names 'Complete' and 'WorkInProgress'.
All files in the folder Complete are fully annotated on all levels. The folder WorkInProgress
includes all other files. There is a subfolder for each speaker in both parts. The last
character of the speaker's name encodes the gender of the speaker:
	m	male
	f	female
	u	unknown

Not every speaker has indicated his or her gender. The list below gives the subjective listening
impression of the gender for all speakers with unknown gender and for speakers whose specified
gender differs from the listening impression.
	SpeakerID0023_f	male
	SpeakerID0304_u	male
	SpeakerID0305_u	female
	SpeakerID0306_u	male
	SpeakerID0307_u	male
	SpeakerID0308_u	female
	SpeakerID0309_u	male
	SpeakerID0310_u	male
	SpeakerID0311_u	female
	SpeakerID0312_u	male
	SpeakerID0313_u	male
	SpeakerID0314_u	male
	SpeakerID0315_u	male
	SpeakerID0316_u	female
	SpeakerID0317_u	male
	SpeakerID0318_u	male
	SpeakerID0319_u	male
	SpeakerID0320_u	male
	SpeakerID0321_u	male
	SpeakerID0322_u	male
	SpeakerID0323_u	male
	SpeakerID0324_u	male
	SpeakerID0325_u	male
	SpeakerID0326_u	male
	SpeakerID0327_u	male
	SpeakerID0328_u	male
	SpeakerID0329_u	male
	SpeakerID0330_u	male
	SpeakerID0331_u	male
	SpeakerID0332_u	male
	SpeakerID0333_u	male
	SpeakerID0334_u	male
	SpeakerID0335_u	male
	SpeakerID0336_u	female
	SpeakerID0337_u	male
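
The gender code can be read off a speaker folder name with a few lines of Python.
This is only a minimal sketch: the names GENDER_CODES and gender_from_speaker are
illustrative and not part of the corpus. Note that the code returns the gender
specified in the folder name, which may differ from the listening impression
listed above.

```python
# Mapping of the last character of a speaker folder name to a gender,
# following the codes m/f/u described in this README.
# The names GENDER_CODES and gender_from_speaker are illustrative only.
GENDER_CODES = {"m": "male", "f": "female", "u": "unknown"}


def gender_from_speaker(folder_name):
    """Return the gender encoded in a folder name such as 'SpeakerID0304_u'."""
    # Drop a trailing path separator, then take the last character (if any).
    code = folder_name.rstrip("/\\")[-1:]
    if code not in GENDER_CODES:
        raise ValueError(f"unexpected gender code in {folder_name!r}")
    return GENDER_CODES[code]
```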




Files
-----
Three files exist for each sentence. Their names differ only
in their extensions. The part of the name before the extension follows the structure
'article****_sentence****', where '****' is the number of the article or the
number of the sentence in the article (from 0001 to at most 9999).
The prosodic labels of the Complete corpus are stored in the WorkInProgress corpus
in .snippet files. The file types are described below.
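
The naming scheme can be checked with a regular expression. The following Python
sketch assumes that both numbers always have exactly four digits and treats
everything after the first dot as the extension; the names FILE_NAME and
parse_file_name are illustrative and not part of the corpus.

```python
import re

# Pattern for 'article****_sentence****' followed by an extension.
# Assumption: each number has exactly four digits.
FILE_NAME = re.compile(r"^article(\d{4})_sentence(\d{4})\.(.+)$")


def parse_file_name(name):
    """Split a file name into (article number, sentence number, extension)."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a corpus file name: {name!r}")
    article, sentence, extension = match.groups()
    return int(article), int(sentence), extension
```

For a hypothetical file name such as 'article0012_sentence0003.wav', the function
returns the article number 12, the sentence number 3 and the extension 'wav'.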

	.wav
		The audio file contains the recording at a sampling frequency of 44100 Hz


	.par
		The BAS Partitur Format file contains the annotations in the standard BAS Partitur format
		Labels used:
			ORT: orthographic representation of the prompted sentence*
			KAN: canonical pronunciation (SAMPA) of the sentence
			POS: part of speech**
			WOR: word alignments***
			MAU: phonetic alignments (SAMPA) with MAUS
			PRB: prosodic labels PyToBI
			PRM: prosodic labels Prr system

		* The words in ORT: also contain punctuation marks

		** The set of POS tags differs from the original set. The labels used are:
		abbreviation
		adjective
		adverb
		affix
		article
		letter
		demonstrative pronoun
		proper noun
		indefinite pronoun
		interjection
		interrogative adverb
		interrogative pronoun
		confix
		conjunction
		conjunctive adverb
		contraction
		local adverb
		modal adverb
		number
		onomatopoeia
		particle
		personal pronoun
		possessive pronoun
		preposition
		pronoun
		pronominal adverb
		phrase
		reflexive pronoun
		relative pronoun
		reciprocal pronoun
		characters
		saying
		subjunction
		noun
		symbol
		temporal adverb
		transcription
		verb
		compound
		number
		adposition

		*** The words in WOR: also contain spaces between the syllables


	.TextGrid
		The TextGrid file contains the annotations as a Praat label file with interval tiers
		and a point tier for the prosodic labels


	.snippet
		The par.*.snippet files contain the prosodic information in BAS Partitur format.
		For par.PyToBI.snippet the label used is 'PRB:' and for par.Prr*.snippet the label is 'PRM:'.
		The TextGrid.*.snippet files contain the prosodic information in TextGrid format.
		Labels used for each prosodic system:

		PyToBI
			Tones
				L*+H
				L+H*
				H*+L
				L*
				H*
				!H*
				LH-
				L-H%
				HL-
				H-L%
				L-
				LL-
				L-L%
			Breaks
				1
				2
				3
				4

		PrrSRNC
			Tones
				H*
				H*L
				L*H
				L*
			Breaks
				-
				%
				L%
				H%

		PrrBITSUS
			Tones
				H*
				H*L
				L*
				L*H
			Breaks
				-
				%
				H%

		PrrKCSGrs
			Tones
				H*
				H*L
				L*H
			Breaks
				%
				H%
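
The labels of a .par file or a par.*.snippet file can be collected with a small
Python helper. This is a minimal sketch, assuming that every line starts with a
label followed by a colon (for example 'ORT:' or 'PRB:') and that the remaining
fields are separated by whitespace; the function name group_bpf_lines is
illustrative and not part of the corpus.

```python
from collections import defaultdict


def group_bpf_lines(text):
    """Group the lines of a BAS Partitur file by their label (e.g. 'ORT')."""
    tiers = defaultdict(list)
    for line in text.splitlines():
        # Split at the first colon only, so SAMPA symbols such as 'o:'
        # in the remaining fields are left untouched.
        label, sep, rest = line.partition(":")
        if not sep or not label.strip():
            continue  # skip blank or malformed lines
        tiers[label.strip()].append(rest.split())
    return dict(tiers)
```

The result maps each label to a list of field lists, one per line, so that for
example all 'PRB' entries of a sentence can be inspected together.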




Phonemic annotation
-------------------
The phonemic annotation was automatically created by MAUS.




Morphosyntactic and canonic annotation
--------------------------------------
The morphosyntactic and canonic annotations were generated using dictionaries. The
dictionaries were built from the online dictionary Wiktionary.




Prosodic annotation
-------------------
The prosodic annotation was generated by the programs PyToBI and PRR.




Content Status
--------------
The file ContentStatus.txt contains eight columns, separated by
tab characters. The first line explains the eight columns.
The first column contains the identification name of the speaker and the name of the
corresponding record. Columns 2 to 7 contain information on the
alignment at the word level, the part of speech, the syllabification, the
canonical pronunciation, the phonetic alignment and the word accents.
These columns contain a '0' if the annotation at this level is incomplete and a '1' if
it is complete. The eighth column contains information on the prosodic
annotation. Each annotation system used for the sentence is listed. The different systems
are separated by a '|'. If the sentence has not been prosodically annotated,
the column contains a '0'.

The file ContentStatus.txt can be used to find all files with specific annotations.
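
Such a selection can be sketched in Python as follows. The sketch relies only on
the column layout described above and addresses the columns by position; the
function names are illustrative, and the exact spelling of the system names in
the eighth column (for example 'PyToBI') is an assumption that should be checked
against the file itself.

```python
import csv


def read_content_status(handle):
    """Yield (record, level_flags, prosodic_systems) for each data row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # the first line only explains the columns
    for fields in reader:
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        record = fields[0]
        # Columns 2 to 7: '1' means the level is completely annotated.
        level_flags = [flag.strip() == "1" for flag in fields[1:7]]
        prosody = fields[7].strip()
        systems = [] if prosody == "0" else prosody.split("|")
        yield record, level_flags, systems


def fully_annotated(handle, system):
    """Return records complete on all six levels and annotated with `system`."""
    return [
        record
        for record, flags, systems in read_content_status(handle)
        if all(flags) and system in systems
    ]
```

The functions accept any iterable of lines, for instance a file opened with
open('ContentStatus.txt', encoding='utf-8', newline='').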




Missing Sentences
-----------------
The file MissingSentences.txt lists all sentences that could not be extracted
from the original article.




License
-------
All data is licensed under the Creative Commons Attribution-ShareAlike
3.0 Unported license (CC-BY-SA 3.0).
https://creativecommons.org/licenses/by-sa/3.0/



How to cite
-----------
If you use this resource in your research, please cite
	Hannes Kath. "Aufbau und Evaluation einer Datenbasis von annotierten Sprachdaten des Deutschen". Diploma Thesis.
	Technische Universität Dresden, Germany. 2021
