This page introduces some notions of how dydu bots understand natural language.
A sentence is divided into words or compound words. Each of these words is associated with a set of meanings and a weight, because there are many homonyms and polysemic words. When a word carries several meanings, all of them are kept: no meaning is chosen a priori.
Each meaning is associated with a penalty, because the meanings of a word are not all equally likely to be used.
The weight of a word depends on its frequency in the language used.
The overall structure is as follows:
Meaning 1 - Penalty - Meaning 2 - Penalty - ...
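As a minimal sketch, this structure could be modelled as follows. The class and field names (`Meaning`, `Word`, `weight`, `penalty`) and all numeric values are hypothetical, chosen only for illustration; the page does not describe dydu's internal representation or its numeric scales.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Meaning:
    """One possible sense of a word, with its associated penalty."""
    label: str
    penalty: float  # hypothetical scale: 0.0 means "no penalty"


@dataclass
class Word:
    """A word (or compound word) and all of its candidate meanings."""
    text: str
    weight: float  # hypothetical value, assumed to derive from word frequency
    meanings: List[Meaning] = field(default_factory=list)


# A polysemic word keeps every sense; nothing is discarded up front.
bank = Word(
    text="bank",
    weight=0.4,
    meanings=[
        Meaning("bank (financial institution)", penalty=0.0),
        Meaning("bank (side of a river)", penalty=0.3),
    ],
)
```

Keeping the meanings as a list on each word reflects the idea that the choice between senses is deferred until the sentence is compared with the knowledge base.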
Spelling mistakes are common in the sentences handled by an automatic chat, so a correction step is necessary.
The dydu technology uses a library based on the Hunspell spell checker used by OpenOffice and Firefox. This library has been adapted to the specific needs of dydu.
The spelling correction suggests several possible corrections. No choice is made at this stage: all the close corrections are kept, and each one is associated with a penalty.
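The idea of keeping every close correction with a penalty can be sketched as follows. This is not the Hunspell-based library mentioned above: it uses Python's standard `difflib` module as a stand-in, and the dictionary, the `candidate_corrections` function and the penalty formula are all assumptions made for illustration.

```python
import difflib

# Hypothetical mini-dictionary; a real system would rely on a full spelling dictionary.
DICTIONARY = ["card", "care", "cart", "loss", "theft", "password"]


def candidate_corrections(word, max_candidates=3, cutoff=0.6):
    """Return every close correction with a penalty, without picking a winner."""
    if word in DICTIONARY:
        return [(word, 0.0)]  # correctly spelled: one candidate, no penalty
    close = difflib.get_close_matches(word, DICTIONARY, n=max_candidates, cutoff=cutoff)
    candidates = []
    for candidate in close:
        similarity = difflib.SequenceMatcher(None, word, candidate).ratio()
        # Assumed penalty: the less similar the candidate, the higher the penalty.
        candidates.append((candidate, round(1.0 - similarity, 3)))
    return candidates


# The misspelling "carx" is equally close to three dictionary words; all are kept.
corrections = candidate_corrections("carx")
```

The caller receives the whole list and can let the later distance calculation decide which correction fits the knowledge base best.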
Identification of compound words
Possible compound words are identified in the sentence, and the words they are made of are gathered together into a new meaning.
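A simple way to picture this step is a sliding window over the tokens, checked against a lexicon of known compounds. The lexicon contents and the `find_compounds` function below are hypothetical; the page does not say how dydu detects compounds.

```python
# Hypothetical compound lexicon; each entry is the tuple of words it is made of.
COMPOUNDS = {("credit", "card"), ("as", "soon", "as", "possible")}
MAX_LEN = max(len(compound) for compound in COMPOUNDS)


def find_compounds(tokens):
    """Return (start, end, compound) spans for every compound found in the tokens."""
    found = []
    for start in range(len(tokens)):
        for length in range(2, MAX_LEN + 1):
            window = tuple(tokens[start:start + length])
            # Skip truncated windows at the end of the sentence.
            if len(window) == length and window in COMPOUNDS:
                found.append((start, start + length, " ".join(window)))
    return found


spans = find_compounds(["i", "lost", "my", "credit", "card"])
```

Each returned span can then be attached to the sentence as an additional meaning covering the words it groups.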
Identification of lemmas
For each word, all the possible lemmas are looked up.
A lemma is the uninflected and unconjugated base form of a word, such as the infinitive of a verb or the masculine singular of an adjective.
Links to lemmas can be defined for common abbreviations, for example: "asap → as soon as possible".
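The lookup can be pictured as a table from surface forms to sets of lemmas, as in this sketch. The `LEMMAS` table and the `lemmas_of` function are hypothetical; only the "asap" link comes from the text above.

```python
# Hypothetical lemma table: surface form -> set of possible lemmas.
LEMMAS = {
    "cards": {"card"},
    "lost": {"lose"},
    "saw": {"see", "saw"},            # ambiguous form: both lemmas are kept
    "asap": {"as soon as possible"},  # abbreviation linked to its expansion
}


def lemmas_of(word):
    """Return every known lemma of a word; an unknown word is its own lemma."""
    key = word.lower()
    return LEMMAS.get(key, {key})
```

As with meanings and spelling corrections, an ambiguous form such as "saw" keeps all of its lemmas rather than forcing an early choice.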
Identification of synonyms and hyperonyms
The synonyms of the lemmas as well as the hyperonyms are identified and associated with the word.
A hyperonym is a generalization of meaning.
Hyperonyms are essentially used to define a set of products or terms specific to the bot's business logic.
For example, dog and cat are not synonymous but animal is a hyperonym of both.
Synonyms are recorded in the language structure of the users' sentences, but not in that of the matches.
Business ontologies can be defined.
An ontology is composed of the hyperonym that designates it and the hyponyms it contains. Using ontologies in the knowledge base reduces the number of formulations needed and improves the automatic chat's understanding.
For example, the following ontologies can be defined:
- Vital card: vital card, green card, etc.;
- Attending physician: attending physician, city doctor, family doctor, etc.
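The ontologies above can be sketched as a mapping from each hyperonym to its hyponyms, with a reverse index used to generalize a term. The dictionary layout and the `generalize` function are assumptions made for illustration.

```python
# Ontologies taken from the examples above: hyperonym -> hyponyms.
ONTOLOGIES = {
    "vital card": ["vital card", "green card"],
    "attending physician": ["attending physician", "city doctor", "family doctor"],
    "animal": ["dog", "cat"],
}

# Reverse index: hyponym -> set of hyperonyms that contain it.
HYPERONYMS = {}
for hyperonym, hyponyms in ONTOLOGIES.items():
    for hyponym in hyponyms:
        HYPERONYMS.setdefault(hyponym, set()).add(hyperonym)


def generalize(term):
    """Return the term together with every hyperonym that designates it."""
    return {term} | HYPERONYMS.get(term, set())
```

With this index, a single formulation written with "attending physician" can match a user who types "family doctor", which is how ontologies reduce the number of formulations to write.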
The user's sentence is represented flatly with the structure presented above.
The formulations found in the automatic chat knowledge base can also use this flat structure, but increasingly they use a more elaborate structure that significantly reduces the workload required for the automatic chat to understand users.
Between two flat structures
Once we have the linguistic structure of the user's sentence, it can be compared to the sentence structures contained in the knowledge base (called matches).
This distance calculation is inspired by the TF-IDF algorithm (https://en.wikipedia.org/wiki/Tf-idf).
It is a sum of partial scores. Whenever a meaning is identified as being present in both the sentence and match structures, the partial score is updated.
The partial score depends on the weight of the word in the sentence and in the match, and on the penalty applied to the meaning in each of the two structures.
Once the partial scores are calculated, we obtain a matrix containing them: one dimension of the matrix represents the words of the sentence and the other dimension represents the words of the match.
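The following sketch builds such a matrix. The exact formula used by dydu is not given on this page, so the product of weights and penalty factors below is purely an assumption, as are the dictionary layout, the function names and every numeric value.

```python
def partial_score(sentence_word, match_word):
    """Best score over the meanings shared by the two words (0.0 if none)."""
    best = 0.0
    for meaning, sentence_penalty in sentence_word["meanings"].items():
        match_penalty = match_word["meanings"].get(meaning)
        if match_penalty is None:
            continue  # this meaning is absent from the match word
        # Assumed formula: both weights, reduced by both penalties.
        score = (sentence_word["weight"] * match_word["weight"]
                 * (1 - sentence_penalty) * (1 - match_penalty))
        best = max(best, score)
    return best


def score_matrix(sentence, match):
    """Rows are the words of the sentence, columns the words of the match."""
    return [[partial_score(s, m) for m in match] for s in sentence]


# Hypothetical structures: each meaning maps to its penalty (0.0 = no penalty).
sentence = [
    {"weight": 0.5, "meanings": {"loss": 0.0}},
    {"weight": 0.5, "meanings": {"card": 0.0, "map": 0.5}},
]
match = [
    {"weight": 0.5, "meanings": {"loss": 0.0}},
    {"weight": 0.25, "meanings": {"theft": 0.0}},
    {"weight": 0.25, "meanings": {"card": 0.0}},
]
matrix = score_matrix(sentence, match)
```

A cell is zero when the two words share no meaning, which is what later allows the matrix to be split into independent parts.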
We then need to determine the maximum sum of these partial scores while using each word only once in the sum. The search must not get stuck in local maxima that are not optimal for the overall solution. This is an assignment problem, so we use the Hungarian algorithm (https://en.wikipedia.org/wiki/Hungarian_algorithm).
The Hungarian algorithm detects in this matrix the cells that maximize the sum.
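To illustrate, here is a compact pure-Python version of the Hungarian algorithm (the classic variant based on row and column potentials). It is a sketch for small matrices, not dydu's implementation; the function name `hungarian_max` and the example values are hypothetical.

```python
def hungarian_max(scores):
    """Return (total, pairs): the maximum sum using each row and column at most once."""
    original = scores
    transposed = len(scores) > len(scores[0])
    if transposed:  # the routine below needs rows <= columns
        scores = [list(column) for column in zip(*scores)]
    n, m = len(scores), len(scores[0])
    cost = [[-x for x in row] for row in scores]  # maximizing = minimizing the negation
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row (1-based) assigned to column j, 0 if free
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while j0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    pairs = [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]
    if transposed:
        pairs = [(c, r) for r, c in pairs]
    pairs.sort()
    return sum(original[r][c] for r, c in pairs), pairs


total, pairs = hungarian_max([[10, 9], [9, 1]])
```

In this example a greedy choice would take the largest cell (10) first and be left with 1, for a total of 11. The optimal assignment pairs row 0 with column 1 and row 1 with column 0, for a total of 18, which is the local-maximum trap described above.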
Since this algorithm poses performance problems when sentences are long, we subdivide the matrix into disjoint subsets and apply the algorithm to each of them.
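The page does not say how the subsets are chosen. One plausible approach, shown here only as an assumption, is to group the rows and columns that are linked by non-zero cells: two groups that share no non-zero cell cannot influence each other's assignment. The `disjoint_blocks` function is hypothetical.

```python
def disjoint_blocks(scores):
    """Group rows and columns into independent blocks linked by non-zero cells."""
    n_rows, n_cols = len(scores), len(scores[0])
    seen_rows, seen_cols = set(), set()
    blocks = []
    for start in range(n_rows):
        if start in seen_rows:
            continue
        rows, cols = set(), set()
        stack = [("row", start)]
        seen_rows.add(start)
        while stack:  # depth-first walk over linked rows and columns
            kind, index = stack.pop()
            if kind == "row":
                rows.add(index)
                for c in range(n_cols):
                    if scores[index][c] > 0 and c not in seen_cols:
                        seen_cols.add(c)
                        stack.append(("col", c))
            else:
                cols.add(index)
                for r in range(n_rows):
                    if scores[r][index] > 0 and r not in seen_rows:
                        seen_rows.add(r)
                        stack.append(("row", r))
        if cols:  # a row with no non-zero cell contributes nothing to the sum
            blocks.append((sorted(rows), sorted(cols)))
    return blocks


blocks = disjoint_blocks([
    [5, 3, 0, 0],
    [2, 4, 0, 0],
    [0, 0, 7, 0],
    [0, 0, 0, 0],
])
```

Here the 4 x 4 matrix splits into a 2 x 2 block and a 1 x 1 block, so the assignment is solved on two small problems instead of one large one.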
To make the calculation of this score more concrete, consider the following example.
It shows the calculation of a "flat" distance, without considering possible combinations between matching groups.
Consider in this example the match "Loss or theft of my life card" and the user's sentence "loss life card".
For information, the final score is between 0 and 1024. 0 means that there is no common point between the two sentences. 1024 means that the two sentences are identical.
The following image comes from the debugger of this calculation. This tool is accessible only by the dydu team and allows a better understanding of the structure of a score.
The score obtained between these two sentences is 770 out of 1024.
If this score is among the best obtained for the bot's knowledge base, the bot responds with a reword, in which the user can confirm one of the proposed knowledge items or rephrase their sentence.
Before going into the details of the score, note that these formulations are close but considered different by default. To obtain a direct answer, this sentence would therefore have to be added to the formulations associated with the bot's knowledge.
Dark blue bubbles represent the words for which there is a match; lighter blue bubbles represent the words without a match.
The weight of each word in the sentence is expressed as a percentage in the blue bubbles.
The words in the pink bubbles represent the meanings associated with each word of the sentence. Some have a penalty of 1024; some have no penalty. Others, in lighter pink, have a penalty of 829; this penalty is applied to synonyms.
With a tree structure
In many cases, the questions corresponding to a piece of knowledge follow a language-specific structure that can be expressed with a very large number of formulations.
For example, let's take a look at "How to modify my password?"
This sentence is composed of two independent parts which each have a large number of formulations. On one hand "how to modify" and on the other hand "my password".
How to modify
how to modify
I would have to modify
how to update
my confidential code
my secret code
⇒ In this case, if we had wanted to define all the possible combinations in flat structures, it would have been necessary to create 5 * 4 = 20 formulations.
In practice, only the 4 formulations for "my password" need to be created, since the formulations associated with "how to modify" are already defined in the solution.
This is a simple example with only one level; in reality, "How to modify" itself uses the "how" matching group.
how does it happen when
how do I proceed
how to do
how to do in case of
know the procedure to follow
the modalities for
This matching group contains several dozen formulations. If we consider only these 8 formulations, our basic example corresponds to 8 * 20 = 160 formulations in a flat structure.
This fine-grained understanding of the language greatly reduces the workload while ensuring better understanding. It would be almost impossible to define all the possible combinations with flat structures.
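The arithmetic can be checked with a short sketch that expands matching groups into flat formulations. The lists below contain only formulations quoted on this page and are partial; the variable and function names are hypothetical.

```python
from itertools import product
from math import prod

# Partial lists, limited to the formulations quoted in this page.
how_to_modify = ["how to modify", "I would have to modify", "how to update"]
my_password = ["my password", "my confidential code", "my secret code"]

# Flat structure: every combination must be written out explicitly.
flat = [f"{left} {right}" for left, right in product(how_to_modify, my_password)]


def flat_count(*group_sizes):
    """Number of flat formulations needed to cover every combination."""
    return prod(group_sizes)


def tree_count(*group_sizes):
    """Number of formulations to write when each group is defined only once."""
    return sum(group_sizes)
```

With the group sizes given in the text, `flat_count(5, 4)` is 20 and `flat_count(8, 5, 4)` is 160, while `tree_count(8, 5, 4)` is only 17: the flat count grows as a product of the group sizes, the tree count as a sum.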
Enrichment of formulations
For your bot to answer users correctly, it needs a large number of formulations in its knowledge base. Each piece of knowledge is associated with a set of formulations that make it possible to recognize the sentences that must lead to the corresponding answer.
In general, a bot's configuration needs several thousand formulations for its understanding to be correct.
Two tools significantly improve productivity when enriching the formulations. This enrichment is carried out by dydu.
- One tool groups similar misunderstood sentences to identify the most frequent ones, so that they can be handled first;
- Another tool uses the sentences that resulted in a reword and for which the user chose one of the proposed rewords. These associations can then be accepted or refused.
This enrichment is manual: only the suggestions are automated, for more efficiency, and any change to the knowledge base is made by an authorized person.
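The first tool can be approximated with a very small sketch that counts misunderstood sentences after a crude normalization. The real tool presumably groups sentences by similarity rather than by exact match; the `normalize` and `prioritize` functions and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def normalize(sentence):
    """Crude normalization (assumption): lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.findall(r"\w+", sentence.lower()))


def prioritize(misunderstood):
    """Return (normalized sentence, count) pairs, most frequent first."""
    return Counter(normalize(s) for s in misunderstood).most_common()


suggestions = prioritize(["Lost my card!", "lost my card", "Reset password?"])
```

The output is only a ranked list of suggestions; in line with the text above, adding a formulation to the knowledge base remains a manual decision.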
Comparison of different matching algorithms
Other technologies are used by competing bots:
- Syntax Analysis;
- Matching keywords.
Syntax analysis consists of analyzing the sentence and highlighting its structure. It is tied to the language in which the sentence is written (for example SVO, subject-verb-object, in English).
The structure revealed by the language analysis then shows how the syntax rules are combined in the text. This structure can be represented by a syntactic tree whose nodes can carry additional information for a finer analysis.
As a result, the meaning of the sentence is likely to be correctly understood and interpreted by the system, even when the nuance is subtle. On the other hand, this analysis cannot succeed when sentences are grammatically incorrect.
Keyword matching works the same way as a search engine.
The system looks in the user's sentence for the keywords that have been highlighted in the knowledge base. It returns the answer of the knowledge that contains one or more of the keywords found. Some systems implement keyword prioritization, or even keyword exclusion, to manage ambiguities.
The following table lists the advantages and disadvantages of each technology.
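A minimal sketch of keyword matching with optional exclusions is shown below. The knowledge identifiers, the keyword sets and the `keyword_match` function are all hypothetical and only illustrate the general technique.

```python
import re

# Hypothetical knowledge base: answer id -> (keywords, excluded keywords).
KNOWLEDGE = {
    "price_per_month": ({"cost", "month"}, set()),
    "trips_martinique": ({"trip", "martinique"}, set()),
}


def keyword_match(sentence):
    """Return the answer ids whose keywords appear in the sentence, best first."""
    words = set(re.findall(r"\w+", sentence.lower()))
    hits = []
    for answer, (keywords, excluded) in KNOWLEDGE.items():
        if words & excluded:
            continue  # an excluded keyword rules this answer out
        found = len(words & keywords)
        if found:
            hits.append((found, answer))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))  # most keywords first, then by id
    return [answer for _, answer in hits]


result = keyword_match("I want to go on a trip, but not in Martinique")
```

The example returns the Martinique answer even though the user asked to avoid it: the words "but not" carry no weight in a pure keyword approach unless an exclusion rule is written for them.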
|Technology||Advantages||Disadvantages|
|Syntax Analysis||Accurate understanding of the sentence||Complexity of the configuration of the knowledge base; requires the input sentence to be grammatically correct (which is the case for less than 50% of the questions asked to a bot); falls back on keyword matching if there are no results; costly in CPU and memory resources|
|Matching keywords||Easy initial setup; very fast and inexpensive algorithm||The prioritization and exclusion rules can become tedious|
|Distance calculation||Accurate understanding of the sentence; fast algorithm that uses little CPU and memory resources; does not require a grammatically correct input sentence||Learning period required on the first user questions to complete the formulations|
Here are some examples of the possibilities and problems of each technology:
|User's sentence||Syntax Analysis||Matching Keywords||Distance calculation|
|I am looking for a blue card||The distinction is possible||The distinction is not possible, the keywords being search and blue card||The distinction is possible|
|I want to go on a trip, but not in Martinique||The system will not return trips to Martinique||The system will only return trips to Martinique||The knowledge base will have to be configured to take this case into account.|
|How much does it cost per month?||The question is grammatically incorrect, the system will not understand the question||The system will identify the keywords "month" and "cost" and will give the correct answer||The distance calculation will give the correct answer because it will identify the sentence as being very close to the knowledge "how much does it cost per month?"|