Instructions and tips for using the BlackLab tool and its different search levels



Table of Contents


[Back to the Macintosh corpus]




The interface and the different search modes



Simple Search (Simple)

Simple search allows you to perform a quick search by words or lemma. To perform a search, simply enter the word you are looking for in the search bar and press Enter or click on Search.
Please note: in Simple search, searches are case-insensitive, i.e. accents or upper and lower case letters are not taken into account: thus the searches "mère", "mere", "Mère", "Mere" will provide the same results.

Additionally, it is possible to use wildcards to refine the search level:






Extended Search (Extended)


Extended search is an extension of Simple search and allows you to find all occurrences of a token with its specific attributes. A token generally represents a single word and is the smallest unit within the corpus. A token has different attributes which are:

In the Word and Lemma search fields, you can either enter the attribute values ​​or upload your list of searched values. In the Part of Speech search fields, you can select the searched value. Then press Enter or click the Search button to execute the search and display the results.

Please note that there is an important difference between the Word and Lemma search fields. Example: typing the word "aimer" in Word will only give you occurrences of that exact string. When you enter "aimer" in the Lemma search field, you will get - in addition to the lemma "aimer" - all word forms related to that lemma, i.e. its forms in all tenses and moods ("j'aimerai", "nous aimions", etc.) as well as spelling variations, forms incorrectly written by authors (e.g. "j ème", "émé", etc.).

In the Extended search, in order to allow the search of specific and complex word forms, the use of wildcards is always highlighted in the search with the Word and Lemma fields. As a reminder, a wildcard is a symbol used to replace or represent one or more characters. As for the simple search, the wildcards that can be used here are:

For the Word and Lemma search fields, it is also possible to search for a sequence, a series of tokens by entering several values, including wildcards, separated by a space.
Example: in Word, typing 'je n* * pas' means that we are looking for a first token 'je' followed by a second starting with 'n-', a third random token and a last token 'pas'.


On the right side of the Word and Lemma search fields, there is an option to upload a list of values ​​to search for; these values ​​must all be separated by a white space. Note that this function only works for .txt files. (If you are using a text editor like Word, you must first save your file in .txt format.) Each word in the uploaded file will be added to the list of values ​​to search for.

You can start a new search by pressing the Reset button. Doing so will clear the search query and the results found. Your search history will remain unchanged, however.

Advanced Search (Advanced)

Advanced Search allows you to create complex queries without needing to master the CQL (Corpus Query Language) query language. The basic element of Advanced Search and its query builder is the token box. A box represents a token. By clicking on the blue '+' icon, you can add an unlimited number of token boxes, the set thus forming a sequence.


A token box has two tabs: search and options.

The search

tab The search tab of the token box allows you to define all the attributes or values ​​that the token must or can have.


Token attributes

Specifying token attributes allows you to better focus and specify your search.
You can select a token attribute (Word, Lemma, POS) and enter the value it must or must not have. A token box can combine several attributes.



The AND and OR options in the token boxes

By clicking on the white '+' button on the right in the token box, you can add new attributes to your token with the AND (sets mandatory values) and OR (sets alternative values) options.


The AND option creates, in a clause, a new condition, a new attribute essential to your token.

The object searched is all the words 'iron' which are verbs. Here we find 9 occurrences.


The OR option adds a new alternative, a new clause, a value that the token attribute can potentially take.
The object searched is all the words 'fer' or all the tokens that are verbs. Here we will find 9,658 occurrences.

The difference between the white '+' icons in the token box

The difference between the '+' sign on the right of a token and the one at the bottom of the token box is that the '+' sign on the right keeps the newly added attribute in a subclause while the one at the bottom creates a new clause.
Example: Suppose we want to search for all the words 'we' or 'you' used as a 'possessive adjective'. If we only use the '+' on the right, we cannot apply the exclusivity of the search box to possessive adjectives.


Here, the query will search for all the words 'we' (regardless of their POS) and the words 'you' whose POS is a possessive adjective.


If we add the attributes using the '+' sign at the bottom of the token box, we can create a clause.
By clicking the '+' at the bottom of the token box, we have created a clause that the query searches for all words with the value 'we' or 'you', to which is added the argument that the tokens searched for must have 'possessive adjective' for POS.



Expert Search(Expert)

Expert search mode allows you to edit queries using the BlackLab Corpus Query Language, a dialect of the Corpus Query Language (CQL). This language is used to query corpus texts using queries. CQL queries are expressions constructed using sequence operators and block brackets, within which one or more token attributes are specified. In CQL, spaces only affect a search if they are enclosed in quotes.

Some examples of searching with the BCQL language:

For more information on CQL, click here.

Filtering by metadata

The Macintosh corpus has been enriched with a set of metadata providing information on the letters (place of writing, author, year, place of arrival, etc.).



This metadata therefore allows you to filter and refine your search (in Extended, Advanced and Expert modes) with temporal, geographical and nominative requirements. Below is a description of the metadata present on the search tool:

Displaying results

The results can be viewed in two ways:

Results by hit

If you click on one of the results of your search, you can display the properties and values ​​of the token found as well as a text extract from which the token comes.

The result lines are always preceded by a line containing the name of the letter in which the token(s) were detected. If you want, you can always hide the title of the documents by clicking on 'Hide Titles' at the bottom right of your screen at the bottom of your web page.

Show content and metadata of the letter

By clicking on the name of the letter or on the hyperlink above the token in the text extract, you will open a new tab showing you the content of the letter as well as the metadata relating to it.

Sort the results

You can click on the column headers to sort the results according to the values ​​of this column, i.e. the attributes word, lemma or pos :

The headers in question:

You can also sort your results using the 'Sort by ... ' tab at the bottom right of the results table. Clicking on it will open a drop-down menu and allow you to sort the results based on the attributes of Hit, Before hit, After hit, as well as the letters' metadata.

Group results

It is also possible to group your hits. Once your search is done, at the top left of the results table, you can click on 'Group hits by ...'.


Next, you can select a grouping type. The grouping criteria are the same as the sorting criteria. Once you have determined your grouping criteria, the tool will show you groups of results.

Here we have chosen to group the results by lemma.



By clicking on one of the groups obtained, you will be able to access the search tool again with only the letters of the chosen group by clicking on 'View detailed concordances'




By clicking on 'View detailed concordances' we have grouped here the results having 'vous' as lemma.


Among the grouping criteria, you will however notice that there is a new one, Context (advanced). This option allows you to group the results by delimiting a context. You can thus choose the nature of the hits (word, lemma, pos) and especially delimit (from 1 to 5 tokens) the number of hit, before hit and after hit to display.


Export results

Search results can be exported by clicking on the 'Export' or 'Export for Excel' buttons at the bottom right of the results table. The first button exports the results to a .csv file while the second exports them to a .csv file more suitable for Excel software.

Information and viewing of a letter

Once a search is done, you can click on the title of a document or the hyperlink of a token to open a new tab, the Content tab which contains the transcription of the letter.

Content

The results of the current query will be highlighted in underline, bold and red in the document transcript. In case of multiple results, only the current result will also appear grayed out. You can navigate from one result to another by using the arrows of the Hits button.

If you hover your cursor over a specific word in the document a pop-up window will appear with the word's lemma and the Show details option. By clicking Show details , you will see additional information at the word level such as its POS and explanations of its transcription and meaning.

Metadata

In this tab, all the metadata properties of the letter are displayed and provide information.

Statistics

The Statistics tab displays several statistics of the document: the number of tokens, the number of unique word forms, the number of lemmas and the unique word/token ratio. It is possible to print or download these statistics via the drop-down menu to the right of the Distribution of parts of speech in the letter title.

Images

In this tab, you will find images of the transcribed letters with tools to zoom or orient the images.


The Explore

interface

The Explore

tab This tab has three subdivisions:

Documents

'Documents' allows you to group letters according to their metadata (recipient, sending location, etc.). This tab allows you to group letters from the corpus according to metadata (geographic location, dates, etc.).

N-grams

N-grams allows you to list the frequency of N-elements (word, lemma, pos) in the corpus. In a way, it is a simplified version for searching and grouping sequences.
To do this, you have different options:


You can still filter your results using metadata.


Statistics

With Statistics you can create comprehensive frequency lists of all words, lemmas or Part of Speech in the corpus.

Options for creating lists:
Example: It is possible to determine the use of the ten most frequent words by women by searching for the frequency list type 'Word' and filtering the search by 'sex'.