Lemmatization protocol
Proper noun
Proper nouns containing an article are lemmatized without the article.
Ex.: 'Le Havre' is lemmatized as 'Havre' only; 'le' is split off and lemmatized separately as a definite article.
Abbreviation
Abbreviations carry the <abbr> tag and are resolved during lemmatization (when the lemma is not recognized, it sometimes has to be spelled out in full).
Partitive
There is a lemma 'de' « art. partitif ».
We lemmatize as follows:
- "Je mange du gâteau." 'du' > lemma 'de' « art. partitif »;
- "Je mange de la soupe." 'de' > lemma 'de' « art. partitif », 'la' > lemma 'le' « art. défini »;
- "Je mange des épinards." 'des' > lemma 'de' « art. partitif ».
Several cases are ambiguous, especially in negative sentences. To decide, we convert the sentence to the affirmative:
- "Je ne vois pas de vagues" > affirmative form > 'de' « art. partitif » + 'le' « art. déf. »
- "Je vois des vagues" > lemma 'un' « art. indéf. »
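The unambiguous cases above can be sketched as a small lookup table. This is only an illustration: the name `lemmatize_partitive` and the (lemma, category) tuple representation are assumptions, not part of the corpus tooling, and the table deliberately covers the partitive reading only (a form such as 'des' may also be an indefinite article, which still has to be decided by the annotator).

```python
# Hypothetical lookup for the unambiguous partitive cases listed above.
# Each surface form maps to a list of (lemma, category) pairs.
PARTITIVE = {
    "du": [("de", "art. partitif")],
    "de la": [("de", "art. partitif"), ("le", "art. défini")],
    "des": [("de", "art. partitif")],
}


def lemmatize_partitive(form):
    """Return the (lemma, category) pairs for a partitive form, or None."""
    return PARTITIVE.get(form.lower())
```

Returning None for unknown forms keeps the ambiguous cases out of the table, so they are left to manual analysis.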
Phrase
We choose not to use the lemmas loc. adv., loc. prép. or loc. conj., but to assign a lemma to each unit making up a phrase.
For example:
- "Après qu’il eut fini […]." 'après' > lemma « prép. », 'que' > lemma « conj. »
- 'Du depuis le temps': 'du' > double lemma « prép. » (de) + « art. défini » (le), 'depuis' > lemma « prép. »
- 'Du depuis que': 'depuis' > lemma « prép. », 'que' > lemma « conj. »
- 'Pendant que': 'pendant' > lemma « prép. », 'que' > lemma « conj. »
- 'Bien que': 'bien' > lemma « adv. », 'que' > lemma « conj. »
- 'Autant que': 'autant' > lemma « adv. », 'que' > lemma « conj. »
N.B. We take the DMF lemmas as they are; they sometimes include two grammatical categories.
We do not need to take the analysis any further for lemmatization: we assign the DMF lemma, nothing more.
Demonstrative
- 'Celui-ci' (and related forms, alternating with 'ici'), pronom dém.
We lemmatize 'celui' and 'ci' (or 'ici') separately, even though the TLF does include an entry CELUI-CI pron. dém. - 'Ceci', because we encounter both of the following forms:
- 'celui-ici' > lemma = 'celui' (pron. dém.) + 'ici' (adv.)
- 'celui-ci' > lemma = 'celui' (pron. dém.) + 'ci' (pron. dém.)
These two forms are in free variation in the corpus.
- The feminine form 'celle' is lemmatized 'CELUI' (pron. dém.)
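As a minimal sketch of this segmentation, assuming hyphenated input forms and a hypothetical helper named `lemmatize_demonstrative` (neither is prescribed by the protocol):

```python
def lemmatize_demonstrative(form):
    """Split a compound demonstrative into (lemma, category) pairs.

    Returns None when the form is not one of the patterns described above.
    """
    head, _, tail = form.lower().partition("-")
    if head not in ("celui", "celle"):
        return None
    # The feminine 'celle' is lemmatized under the masculine lemma.
    result = [("celui", "pron. dém.")]
    if tail == "ci":
        result.append(("ci", "pron. dém."))
    elif tail == "ici":
        result.append(("ici", "adv."))
    elif tail:
        # Any other second element is outside the scope of this sketch.
        return None
    return result
```

Both variants yield the same first lemma, so a search on 'celui' (pron. dém.) reaches 'celui-ci' and 'celui-ici' alike.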
Compound tense
We lemmatize the auxiliary and the participle separately.
Example: 'a parti' > 'a', verb 'AVOIR', and 'parti', verb 'PARTIR'.
The pronoun 'on' and its allomorph 'l’on' (or 'l on', 'lon')
The segment 'l' is placed between two <w> tags and classified among the « excluded words » during lemmatization; it is not assigned a lemma.
Liaison and euphonic 't'.
To complete the transcription protocol, let us add that:
- When a liaison consonant is expressed multiple times, it is isolated between two <w> tags for lemmatization and given the « mot exclu » label.
- Example, HCA-30381-FL-1:
"les demoiselle sont tel jolie" = les demoiselle sont <w>t</w><w>el</w> jolie
The segment 't' is excluded from lemmatization.
- In plural noun phrases, on the other hand, the liaison segment can be re-expressed. It is then placed at the beginning of the lexical word, and the form written with the liaison becomes an allomorph of the form without it.
- Example with HCA-32205-Adrienz-1671 :
et <w>a</w><w>ses</w><w>zan fan</w> et <w>a</w><w>tous</w>
<w>ses</w><w>zamis</w><lb/>.
'Zanfan' is an allomorph of 'anfan', and 'zamis' of 'amis'.
- The euphonic 't' is likewise treated as an excluded segment.
- Another example:
"comment à tel suporté" = comment à <w>t</w><w>el</w> suporté; 't' is a word excluded from lemmatization.
The same applies to verb phrases: we exclude liaison consonants from lemmatization.
Example : vous zavest « vous avez » = <w>vous</w> <w>z</w> <w>avest</w>.
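The splitting shown in these examples can be sketched as follows. The function names and the boolean "excluded" flag are assumptions made for illustration; how the « mot exclu » label is actually stored is not specified here. Deciding whether an initial consonant really is a liaison or euphonic segment remains the annotator's call.

```python
def split_liaison(token, consonant):
    """Split a leading liaison consonant off `token`.

    Returns a list of (segment, excluded) pairs; `excluded` is True for
    the segment that should receive the « mot exclu » label.
    """
    if not token.startswith(consonant) or len(token) <= len(consonant):
        # Nothing to split: keep the token whole and lemmatizable.
        return [(token, False)]
    return [(consonant, True), (token[len(consonant):], False)]


def to_markup(segments):
    """Wrap each segment in <w> tags, as in the examples above."""
    return "".join(f"<w>{segment}</w>" for segment, _ in segments)
```

For instance, `to_markup(split_liaison("tel", "t"))` gives `<w>t</w><w>el</w>`. Plural noun-phrase forms such as 'zamis' are not passed through this function, since they are kept whole as allomorphs.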
Apostrophe
The apostrophe must be linked to the elided segment.
Example :
<w>qu’</w><w>il</w>.
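A minimal tokenization sketch for this rule, assuming both the typographic (’) and the straight (') apostrophe may occur; `split_elision` and `wrap` are hypothetical helper names:

```python
import re

# An elided segment ends with its apostrophe; anything else is a plain segment.
_SEGMENT = re.compile(r"[^’']*[’']|[^’']+")


def split_elision(token):
    """Split a token after each apostrophe, keeping it on the elided segment."""
    return _SEGMENT.findall(token)


def wrap(segments):
    """Wrap each segment in <w> tags."""
    return "".join(f"<w>{segment}</w>" for segment in segments)
```

With this sketch, `wrap(split_elision("qu’il"))` produces `<w>qu’</w><w>il</w>`, matching the example above.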
- The phrase "Dieu merci" :
'Dieu' is a masculine noun and 'merci' is an interjection.
"Que": pronoun, conjunction and adverb (exceptive / restrictive)
When it can be replaced by 'only' (French 'seulement'), we consider it an adv. There are therefore three 'que':
- 'que' pron.
- 'que' conj.
- 'que' adv.
Note that some occurrences of 'que' admit both a pron. and a conj. interpretation. These require a (sometimes agonizing) case-by-case lemmatization choice, and although we have tried to be consistent, we recommend searching for both 'que' pron. and 'que' conj. to be sure of reaching all occurrences.
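The recommended double search could look like the sketch below. The token representation (a list of dictionaries with "form", "lemma" and "cat" keys) is an assumption made for illustration, and the sample annotations are invented; the corpus interface itself may expose the data differently.

```python
def find_que(tokens, categories=("pron.", "conj.")):
    """Return tokens lemmatized 'que' whose category is in `categories`."""
    return [t for t in tokens if t["lemma"] == "que" and t["cat"] in categories]


# Invented sample annotations, for illustration only.
sample = [
    {"form": "que", "lemma": "que", "cat": "conj."},
    {"form": "qu’", "lemma": "que", "cat": "pron."},
    {"form": "que", "lemma": "que", "cat": "adv."},
    {"form": "il", "lemma": "il", "cat": "pron."},
]

hits = find_que(sample)
```

Querying both categories at once means an occurrence is found whichever of the two interpretations was chosen at annotation time.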
Request to create a new lemma
- (1) fill in the Lemma box (on the left, just below the lemmas proposed by LGeRM)
with the lemma you want to create;
- (2) click on the Code > box and choose the appropriate grammatical category from the drop-down list;
- (3) check the "absent nomenclature" box;
- (4) add a note indicating that a request to create a new lemma has been made.