Lemmatization protocol

Proper noun

Proper nouns containing an article are lemmatized without the article.
Ex.: 'Le Havre' is lemmatized as 'Havre' alone; 'le' is split off and lemmatized separately as a definite article.
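
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the lemma label 'LE' and the category strings are hypothetical and do not come from the protocol or from any lemmatization tool.

```python
# Assumption: only 'le' is listed because it is the one article named in the
# protocol; the lemma label 'LE' and the category strings are illustrative.
ARTICLE_LEMMAS = {"le": "LE"}


def lemmatize_proper_noun(name):
    """Return (form, lemma, category) triples for a proper noun such as 'Le Havre'."""
    triples = []
    for form in name.split():
        article_lemma = ARTICLE_LEMMAS.get(form.lower())
        if article_lemma is not None:
            # The article is split off and lemmatized on its own.
            triples.append((form, article_lemma, "definite article"))
        else:
            # The remaining unit carries the proper-noun lemma.
            triples.append((form, form, "proper noun"))
    return triples
```

Calling lemmatize_proper_noun("Le Havre") yields one triple for 'Le' (definite article) and one for 'Havre' (proper noun).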

Abbreviation

Abbreviations carry the <abbr> tag and are expanded during lemmatization (when the lemma is not recognized, it sometimes has to be spelled out in full).

Partitive

There is a lemma "partitive art.".
Several cases are ambiguous, especially in negative sentences. To decide them, we recast the sentence in the affirmative.

Phrase

We choose not to use the lemmas loc. adv., loc. prep. or loc. conj.; instead, each unit making up a phrase is assigned its own lemma.
For example:
N.B. We take the DMF lemmas as they are; some of them carry two grammatical categories.
Lemmatization requires no further analysis: we assign the DMF lemma, nothing more.

Demonstrative

Compound tense

We lemmatize the auxiliary and the participle separately.
Example: 'a parti' > 'a', verb 'AVOIR', and 'parti', verb 'PARTIR'.
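
As a minimal sketch of this rule, the Python snippet below lemmatizes each token of a compound tense independently. The toy lexicon holds only the two forms from the example above; the function name and the tuple layout are assumptions made for illustration.

```python
# Toy lexicon limited to the forms of the example; a real lexicon is assumed
# to be supplied elsewhere.
VERB_LEMMAS = {"a": "AVOIR", "parti": "PARTIR"}


def lemmatize_compound(tokens):
    """Lemmatize auxiliary and participle as two separate verbs."""
    # Each token gets its own (form, category, lemma) triple; the compound
    # tense as a whole receives no lemma.
    return [(form, "verb", VERB_LEMMAS[form]) for form in tokens]
```

With the input ["a", "parti"], the auxiliary is tagged AVOIR and the participle PARTIR, each as a verb in its own right.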

The pronoun 'on' and its allomorph 'l’on' (or 'l on', 'lon')

The segment 'l' is enclosed in its own <w> element and classed among the "excluded words" during lemmatization: it is not assigned a lemma.
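
One possible way to picture this treatment is the Python sketch below. It is hedged throughout: the regular expression, the function name, the lemma label 'ON' and the use of None to mark an excluded segment are all assumptions, not part of the protocol.

```python
import re

# Matches the spellings listed in the heading: 'l’on', 'l on', 'lon'
# (a straight apostrophe is accepted as well, as an assumption).
ON_ALLOMORPH = re.compile(r"^(l['’]?)\s?(on)$", re.IGNORECASE)


def lemmatize_on(form):
    """Return (segment, lemma) pairs; None marks an excluded, unlemmatized segment."""
    match = ON_ALLOMORPH.match(form)
    if match is not None:
        # The 'l' segment is kept as its own unit but receives no lemma.
        return [(match.group(1), None), (match.group(2), "ON")]
    if form.lower() == "on":
        return [(form, "ON")]
    raise ValueError(f"not a form of 'on': {form!r}")
```

All three spellings then yield the same analysis: an excluded 'l' segment followed by 'on' with its lemma.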

Liaison and euphonic 't'

To complete the transcription protocol, let us add that:

Apostrophe

The apostrophe must stay attached to the elided segment.
Example:
<w>qu’</w><w>il</w>.
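
The segmentation shown in this example can be sketched with Python's standard re module. The function name is hypothetical, and the sketch assumes its input is a single graphic word with no whitespace or other punctuation.

```python
import re

# First alternative: letters followed by an apostrophe (the elided segment).
# Second alternative: a plain run of letters.
# Accepting both the straight and the typographic apostrophe is an assumption.
SEGMENT = re.compile(r"\w+['’]|\w+")


def wrap_word(word):
    """Wrap each segment of a word in <w>, keeping the apostrophe on the elided segment."""
    return "".join(f"<w>{segment}</w>" for segment in SEGMENT.findall(word))
```

Applied to 'qu’il', the sketch reproduces the markup of the example: the apostrophe ends up inside the first <w> element rather than in the second or on its own.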

"Que": pronoun, conjunction and adverb (exceptive / restrictive)


When 'que' can be replaced by 'only', we consider it an adv. There are therefore three 'que': pron., conj. and adv.
Note that some occurrences of 'que' admit both a pron. and a conj. reading. These are lemmatized case by case (a sometimes agonizing choice). Although we have tried to be consistent, we recommend searching for both 'que' pron. and 'que' conj. to be sure of reaching all occurrences.
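
The recommended double search might look like the following Python sketch. The token layout (form, lemma, category), the function name and the sample list are invented purely for illustration; they do not reflect the corpus data or any actual query interface.

```python
# Placeholder tokens, made up for this sketch: (form, lemma, category).
SAMPLE = [
    ("que", "que", "pron."),
    ("que", "que", "conj."),
    ("que", "que", "adv."),
    ("il", "il", "pron."),
]


def find_que(tokens, categories=("pron.", "conj.")):
    """Return every 'que' whose category is one of those given."""
    return [
        token for token in tokens
        if token[1] == "que" and token[2] in categories
    ]
```

Querying both categories at once means an ambiguous occurrence is found whichever reading the annotator settled on; the adverbial 'que' is left out by default.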

Request to create a new lemma


