
Semantic Processing

  • Francisco M. Couto
Open Access
Chapter
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 1137)

Abstract

In the previous chapter we were able to automatically process text by recognizing a limited set of entities. This chapter will introduce the world of semantics, and present step-by-step examples that retrieve semantic resources and use them to enhance text and data processing. The goal is to equip the reader with the basic set of skills to explore the semantic resources that are nowadays available using simple shell script commands.

Keywords

Ontologies, OWL (Web Ontology Language), Semantic resources, DO (Disease Ontology), ChEBI (Chemical Entities of Biological Interest), Ancestors, Recursion, Lexicons, Entity linking, Semantic similarity

Classes

In the previous chapters we searched for mentions of caffeine and malignant hyperthermia in text. However, we may miss related entities that may also be of interest. These related entities can be found in semantic resources, such as ontologies. The semantics of caffeine and malignant hyperthermia are represented in the ChEBI and DO ontologies, respectively.

OWL Files

Thus, we can start by retrieving both ontologies, i.e. their OWL files.

Open image in new window
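The commands are along these lines — a sketch assuming the OBO PURLs for the two ontologies (the exact release URLs used in the book may differ):

    curl -O http://purl.obolibrary.org/obo/chebi/chebi_lite.owl
    curl -O http://purl.obolibrary.org/obo/doid.owl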

The -O option saves the content to a local file named according to the name of the remote file, usually the last part of the URL. The equivalent long form to the -O option is --remote-name.

The previous commands will create the files chebi_lite.owl and doid.owl, respectively. We should note that these links are for the specific releases used in this book. Using another release may change the output of the examples presented in this chapter.

The links may also change in the future, so we may need to check them on the BioPortal or on the OBO Foundry webpages. Alternatively, we can also get the OWL files from the book file archive.

Class Label

Both OWL files use the XML format syntax. Thus, to check if our entities are represented in the ontology, we can search for ontology elements that contain them using a simple grep command:

Open image in new window
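For example, a possible form of these searches (a sketch; extra lines may be matched if other classes mention the same string):

    grep 'caffeine' chebi_lite.owl
    grep 'malignant hyperthermia' doid.owl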

For each grep the output will be the line that describes the property label (rdfs:label), which is inside the definition of the class that represents the entity:

Open image in new window

Class Definition

To retrieve the full class definition, a more efficient approach is to use the xmllint command, which we already used in previous chapters:

Open image in new window
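A possible form of the command, following the XPath described below (a sketch, not necessarily the exact command used in the book):

    xmllint --xpath "//*[local-name()='label'][text()='malignant hyperthermia']/.." doid.owl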

The XPath query starts by finding the label that contains malignant hyperthermia and then .. gives the parent element, in this case the Class element.

From the output we can see that the semantics of malignant hyperthermia is much more than its label:

Open image in new window

A graphical visualization of this class is depicted in Fig. 5.1.
Fig. 5.1

Class description of malignant hyperthermia in the Human Disease Ontology (Source: http://www.ontobee.org/)

For example, we can check that malignant hyperthermia is a subclass of (a specialization of) the entries 0050736 and 66. We can follow the corresponding link in our browser to know more about this parent disease. We will see that it represents a muscle tissue disease. This means that malignant hyperthermia is a special case of a muscle tissue disease.

We can do the same to retrieve the full class definition of caffeine:

Open image in new window
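For instance, only the label and the OWL file change (sketch):

    xmllint --xpath "//*[local-name()='label'][text()='caffeine']/.." chebi_lite.owl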

From the output we can see that the types of semantics available for caffeine differ from those of malignant hyperthermia, but they still share many important properties, such as the definition of subClassOf:

Open image in new window

A graphical visualization of this class is depicted in Fig. 5.2.
Fig. 5.2

Class description of caffeine in ChEBI (Source: http://www.ontobee.org/)

The class caffeine is a specialization of two other entries: 26385 (purine alkaloid) and 27134 (trimethylxanthine). However, it contains additional subclass relationships that do not represent subsumption (is-a).

Related Classes

Figures 5.3 and 5.4 show other related classes of malignant hyperthermia and caffeine, respectively.
Fig. 5.3

Related classes of malignant hyperthermia in the Human Disease Ontology (Source: http://www.ontobee.org/)

Fig. 5.4

Related classes of caffeine in ChEBI (Source: http://www.ontobee.org/)

For example, the relationship between caffeine and the entry 25435 (mutagen) is defined by the entry 0000087 (has role) of the Relations Ontology. This means that the relationship states that caffeine has the role mutagen.

We can also search in the OWL file for the definition of the type of relation has role:

Open image in new window
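A possible command, assuming the has role property uses the standard OBO URI for RO:0000087 (sketch):

    xmllint --xpath "//*[local-name()='ObjectProperty'][@*[local-name()='about']='http://purl.obolibrary.org/obo/RO_0000087']" chebi_lite.owl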

The XPath query starts by finding the elements ObjectProperty and then selects the ones containing the about attribute with the relation URI as value.

We can check that the relation is neither transitive nor cyclic:

Open image in new window

Open image in new window

Open image in new window

A graphical visualization of this property is depicted in Fig. 5.5.
Fig. 5.5

Description of has role property (Source: http://www.ontobee.org/)

URIs and Labels

In the previous examples, we searched the OWL file using labels and URIs. To standardize the process, we will create two scripts that will convert a label into a URI and vice-versa. The idea is to perform all the internal ontology processing using the URIs and in the end convert them to labels, so we can use them in text processing.

URI of a Label

To get the URI of malignant hyperthermia, we can use the following query:

Open image in new window
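For example (sketch):

    xmllint --xpath "//*[local-name()='label'][text()='malignant hyperthermia']/../@*[local-name()='about']" doid.owl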

We added the @*[local-name()='about'] to extract the URI specified as an attribute of that class.

The output will be the name of the attribute and its value:

Open image in new window

To extract only the value, we can add the string function to the XPath query:

Open image in new window
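That is, wrapping the previous query (sketch):

    xmllint --xpath "string(//*[local-name()='label'][text()='malignant hyperthermia']/../@*[local-name()='about'])" doid.owl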

Unfortunately, the string function returns only one attribute value, even if many are matched. Nonetheless, we use the string function because we assume that malignant hyperthermia is an unambiguous label, i.e. only one class will match.

The output will now be only the attribute value:

Open image in new window

Getting the URI of caffeine requires almost the same command:

Open image in new window

We can now write a script that receives multiple labels given as standard input and the OWL file where to find the URIs as argument. Thus, we can create the script named geturi.sh with the following lines:

Open image in new window
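A minimal sketch of geturi.sh consistent with the description below (the script in the book may differ slightly):

    #!/bin/bash
    # geturi.sh: reads one label per line from standard input;
    # the OWL file is given as the first argument ($1)
    xargs -I {} xmllint --xpath "//*[local-name()='label'][text()='{}']/../@*[local-name()='about']" $1 | tr '"' '\n' | grep 'http'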

Again, we must not forget to save the file in our working directory and add the right permissions using chmod, as we did with our scripts in the previous chapters. The xargs command is used to process each line of the standard input. The tr command was added because xmllint displays all the matches in the same line, so we split the output using the character delimiting the URI, i.e. the double quote ("). Then we use the grep command to keep only the lines with a URI, i.e. the ones that contain the term http.

Now to execute the script we only need to provide the labels as standard input:

Open image in new window

The output should be the URIs of those classes:

Open image in new window

We can also execute the script using multiple labels, one per line:

Open image in new window

The output will be a URI for each label:

Open image in new window

Open image in new window

Label of a URI

To get the label of the disease entry with the identifier 8545, we can also use the xmllint command:

Open image in new window
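For example, assuming the standard OBO URI pattern for DOID:8545 (sketch):

    xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/DOID_8545']/*[local-name()='label']/text()" doid.owl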

We added the *[local-name()='label'] step to select the element within the class that describes the label.

The output should be the label we were expecting:

Open image in new window

We can do the same to get the label of the compound entry with the identifier 27732:

Open image in new window

Again, the output should be the label we were expecting:

Open image in new window

We can now write a script that receives multiple URIs given as standard input and, as argument, the OWL file in which to find the labels. We can create a script named getlabels.sh with the following lines:

Open image in new window
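A minimal sketch of getlabels.sh consistent with the description below:

    #!/bin/bash
    # getlabels.sh: reads one class URI per line from standard input;
    # the OWL file is given as the first argument ($1)
    xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='label']" $1 | tr '<>' '\n' | grep -v -e ':label' -e '^$'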

The xargs command is used to process each line of the standard input. The text function does not add a newline character after each match, so if we have multiple matches it is almost impossible to separate them. This explains why we removed the text function from the XPath. Then we have to split the result into multiple lines using the tr command and filter out the lines that contain the :label keyword or are empty.

Now to execute the script we only need to provide the URIs as standard input:

Open image in new window

The output should be the labels of those classes:

Open image in new window

We can also execute the script with multiple URIs:

Open image in new window

The output will be a label for each URI:

Open image in new window

Open image in new window

To test both scripts, we can feed the output of one as the input of the other, for example:

Open image in new window

Open image in new window

Open image in new window

The output will be the original input, i.e. the labels given as arguments to the echo command:

Open image in new window

Open image in new window

Now we can use the URIs as input:

Open image in new window

Open image in new window

Again the output will be the original input, i.e. the URIs given as arguments to the echo command:

Open image in new window

Open image in new window

Synonyms

Concepts are not always mentioned using their official label. Frequently, we find alternative labels in the text. This is why some classes also specify alternative labels, such as the ones represented by the element hasExactSynonym.

For example, to find all the synonyms of a disease, we can use the same XPath as used before but replacing the keyword label by hasExactSynonym:

Open image in new window

The output will be the two synonyms of malignant hyperthermia:

Open image in new window

We can also get both the primary label and the synonyms. We only need to add an alternative match to the keyword label:

Open image in new window

The output will include now the two synonyms plus the official label:

Open image in new window

Thus, we can now update the script getlabels.sh to include synonyms:

Open image in new window
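A sketch of the updated getlabels.sh (here a single grep -v pattern covering both synonym elements is used, which may differ from the book's exact filter):

    #!/bin/bash
    # getlabels.sh: labels and synonyms of the class URIs given as standard input
    xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='label' or local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym']" $1 | tr '<>' '\n' | grep -v -e ':label' -e 'Synonym' -e '^$'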

We should note that the XPath query and the grep command were modified by adding the hasExactSynonym keyword. We also added the hasRelatedSynonym which is available for some classes.

We can test the script exactly in the same way as before:

Open image in new window

But now the output will display multiple labels for this class:

Open image in new window

Open image in new window

URI of Synonyms

Since the script now returns alternative labels, we may encounter some problems if we send the output to the geturi.sh script:

Open image in new window

The previous command will display XPath warnings for the two synonyms:

Open image in new window

If we do not want to know about these mismatches, we can always redirect them to the null device:

Open image in new window

However, we can update the script geturi.sh to also include synonyms:

Open image in new window
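A sketch of the updated geturi.sh:

    #!/bin/bash
    # geturi.sh: URIs of the classes whose label or synonym is given as standard input
    xargs -I {} xmllint --xpath "//*[local-name()='label' or local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym'][text()='{}']/../@*[local-name()='about']" $1 | tr '"' '\n' | grep 'http'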

Now we can execute the same command:

Open image in new window

Every label should now be matched exactly with the same class:

Open image in new window

If we want to avoid repetitions, we can add the sort command with the -u option to the end of each command, as we did in previous chapters:

Open image in new window

The output should now be only one URI:

Open image in new window

Parent Classes

Parent classes represent generalizations that may also be relevant to recognize in text. To extract all the parent classes of malignant hyperthermia, we can use the following XPath query:

Open image in new window
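A possible form of the query (sketch):

    xmllint --xpath "//*[local-name()='label'][text()='malignant hyperthermia']/../*[local-name()='subClassOf']/@*[local-name()='resource']" doid.owl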

The first part of the XPath is the same as above to get the class element, then *[local-name()='subClassOf'] is used to get the subclass element, and finally @*[local-name()='resource'] is used to get the attribute containing its URI.

The output should be the URIs representing the parents of class 8545:

Open image in new window

We can also execute the same command for caffeine:

Open image in new window

The output will now include two parents:

Open image in new window

We should note that we can no longer use the string function, because ontologies are organized as DAGs with multiple inheritance, i.e. each class can have multiple parents, and the string function only returns the first match. To get only the URIs, we can apply the previous technique of using the tr and grep commands:

Open image in new window

Now the output only contains the URIs:

Open image in new window

We can now create a script that receives multiple URIs given as standard input and the OWL file where to find all the parents as argument. The script named getparents.sh should contain the following lines:

Open image in new window
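A minimal sketch of getparents.sh, following the same pattern as the previous scripts:

    #!/bin/bash
    # getparents.sh: direct parents (subClassOf entries with a resource attribute)
    # of the class URIs given as standard input; OWL file as first argument
    xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='subClassOf']/@*[local-name()='resource']" $1 | tr '"' '\n' | grep 'http'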

To get the parents of malignant hyperthermia, we will only need to give the URI as input and the OWL file as argument:

Open image in new window

The output will include the URIs of the two parents:

Open image in new window

Labels of Parents

But if we need the labels we can redirect the output to the getlabels.sh script:

Open image in new window

The output will now be the label of the parents of malignant hyperthermia:

Open image in new window

Again, the same can be done with caffeine:

Open image in new window

And now the output contains the labels of the parents of caffeine:

Open image in new window

Related Classes

If we are interested in using all the related classes besides the ones that represent a generalization (subClassOf), we have to change our XPath to:

Open image in new window
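For example, something along these lines (sketch):

    xmllint --xpath "//*[local-name()='label'][text()='caffeine']/../*[local-name()='subClassOf']//*[local-name()='someValuesFrom']/@*[local-name()='resource']" chebi_lite.owl | tr '"' '\n' | grep 'http'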

We should note that these related classes are in the attribute resource of someValuesFrom element inside a subClassOf element.

The URIs of the 18 related classes of caffeine are now displayed:

Open image in new window

Labels of Related Classes

To get the labels of these related classes, we only need to add the getlabels.sh script:

Open image in new window

The output is now 18 terms that we could use to expand our text processing:

Open image in new window

Ancestors

Finding all the ancestors of a class requires chaining invocations of getparents.sh until we get no matches. We should also avoid relations that are cyclic; otherwise we will enter an infinite loop. Thus, for identifying the ancestors of a class, we will only consider parent relations, i.e. subsumption relations.

Grandparents

In the previous section we were able to extract the direct parents of a class, but the parents of these parents also represent generalizations of the original class. For example, to get the parents of the parents (grandparents) of malignant hyperthermia we need to invoke getparents.sh twice:

Open image in new window

And we will find the URIs of the grandparents of malignant hyperthermia:

Open image in new window

Or to get their labels we can add the getlabels.sh script:

Open image in new window

And we find the labels of the grandparents of malignant hyperthermia:

Open image in new window

Root Class

However, there are classes that do not have any parent, which are called root classes. In Figs. 5.1 and 5.2, we can see that disease and chemical entity are root classes of DO and ChEBI ontologies, respectively. As we can see these are highly generic terms.

To check that they are indeed root classes, we can ask for their parents:

Open image in new window

In both cases, we will get the warning that no matches were found, confirming that they are root classes.

Open image in new window

Recursion

We can now build a script that receives a list of URIs as standard input, and invokes getparents.sh recursively until it reaches the root class.

The script named getancestors.sh should contain the following lines:

Open image in new window
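A sketch of getancestors.sh whose line numbering matches the description below (the book's script may differ in details):

    #!/bin/bash
    CLASSES=$(cat -)                                  # line 2: save the standard input
    [ -z "$CLASSES" ] && exit                         # line 3: base case, stop on empty input
    PARENTS=$(echo "$CLASSES" | ./getparents.sh $1)   # line 4: direct parents
    echo "$PARENTS"                                   # line 5: output the direct parents
    echo "$PARENTS" | ./getancestors.sh $1            # line 6: recursion step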

The second line of the script saves the standard input in a variable named CLASSES, because we need to use it twice: (i) to check whether the input has any classes or is empty (third line), and (ii) to get the parents of the classes given as input (fourth line). If the input is empty, the script ends; this is the base case of the recursion. It is required so that the recursion stops at a given point. Otherwise, the script would run indefinitely until the user stopped it manually.

The fourth line of the script stores the output in a variable named PARENTS, because we also need to use it twice: (i) to output these direct parents (fifth line), and (ii) to get the ancestors of these parents (sixth line). We should note that we are invoking the getancestors.sh script inside getancestors.sh itself, which defines the recursion step. Since the subsumption relation is acyclic, we expect that at some point we will reach classes without parents (root classes) and then the script will end.

We should note that the echo of the variables CLASSES and PARENTS needs to be inside double quotes, so the newline characters are preserved.

Iteration

Recursion is often computationally expensive, but it is usually possible to replace recursion with iteration to develop a more efficient algorithm. Explaining iteration and how to refactor a recursive script is out of the scope of this book; nevertheless, the following script represents an equivalent way to get all the ancestors without using recursion:

Open image in new window
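A sketch of an iterative version, with the while condition on line 4 and the repeated commands on lines 6–8, as described below:

    #!/bin/bash
    OWLFILE=$1
    CLASSES=$(cat -)
    while [ -n "$CLASSES" ]
    do
      PARENTS=$(echo "$CLASSES" | ./getparents.sh $OWLFILE)
      echo "$PARENTS"
      CLASSES=$PARENTS
    done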

The script uses the while command that basically implements iteration by repeating a set of commands (lines 6–8) while a given condition is satisfied (line 4).

To test the recursive script, we can provide as standard input the label malignant hyperthermia:

Open image in new window

The output will be the URIs of all its ancestors:

Open image in new window

We should note that we will still receive the XPath warning when the script reaches the root class and no parents are found:

Open image in new window

To remove this warning and just get the labels of the ancestors of malignant hyperthermia, we can redirect the warnings to the null device:

Open image in new window
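For example, a possible pipeline (assuming the scripts created above):

    echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getancestors.sh doid.owl 2>/dev/null | ./getlabels.sh doid.owl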

The output will now include the name of all ancestors of malignant hyperthermia:

Open image in new window

We should note that the first two ancestors are the direct parents of malignant hyperthermia, and the last one is the root class. This happens because the recursive script prints the parents before invoking itself to find the ancestors of the direct parents.

We can do the same with caffeine, but be advised that given the higher number of ancestors in ChEBI we may now have to wait a little longer for the script to end.

Open image in new window

The results include repeated classes that were found through different branches, which is why we need to add the sort command with the -u option to eliminate the duplicates.

The script will print the ancestors as they are found:

Open image in new window

My Lexicon

Now that we know how to extract all the labels and related classes from an ontology, we can construct our own lexicon with the list of terms that we want to recognize in text.

Let us start by creating the file do_8545_lexicon.txt representing our lexicon for malignant hyperthermia with all its labels:

Open image in new window
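One possible way, assuming the DOID_8545 URI obtained earlier:

    echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl > do_8545_lexicon.txt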

Ancestors Labels

Now we can add to the lexicon all the labels of the ancestors of malignant hyperthermia by adding the redirection operator:

Open image in new window
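For example (sketch; warnings issued at the root class are sent to the null device):

    echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getancestors.sh doid.owl 2>/dev/null | ./getlabels.sh doid.owl >> do_8545_lexicon.txt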

We should note that now we use >> and not >, this will append more lines to the file instead of creating a new file from scratch.

Now we can check the contents of the file do_8545_lexicon.txt to see the terms we got:

Open image in new window

We should note that we use the sort command with the -u option to eliminate any duplicates that may exist.

We should be able to see the following labels:

Open image in new window

We can also apply the same commands for caffeine to produce its lexicon in the file chebi_27732_lexicon.txt by adding the redirection operator:

Open image in new window

We should note that it may take a while until it gets all labels.

Now let us check the contents of this new lexicon:

Open image in new window

Now we should be able to see that this lexicon is much larger:

Open image in new window

Merging Labels

If we are interested in finding everything related to caffeine or malignant hyperthermia, we may be interested in merging the two lexicons in a file named lexicon.txt:

Open image in new window
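For example, a simple concatenation (sketch):

    cat do_8545_lexicon.txt chebi_27732_lexicon.txt > lexicon.txt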

Using this new lexicon, we can recognize any mention in our previous file named chebi_27732_sentences.txt:

Open image in new window
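A possible form of the command, using the -f option to read the patterns from the lexicon file (sketch; further options may have been used in the original):

    grep -F -f lexicon.txt chebi_27732_sentences.txt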

We added the -F option because our lexicon is a list of fixed strings, i.e. does not include regular expressions. The equivalent long form to the -F option is --fixed-strings.

We now get more sentences, including some that do not contain a direct mention of caffeine or malignant hyperthermia. For example, the following sentence was selected because it mentions molecule, which is an ancestor of caffeine:

Open image in new window

Another example is the following sentence, which was selected because it mentions disease, which is an ancestor of malignant hyperthermia:

Open image in new window

We can also use our script getentities.sh giving this lexicon as argument. However, since we are not using any regular expressions, it would be better to add the -F option to the grep command in the script, so the lexicon is interpreted as a list of fixed strings to be matched. Only then can we execute the script safely:

Open image in new window

Ancestors Matched

Besides these two previous examples, we can check if there are other ancestors being matched by using the grep command with the -o option:

Open image in new window

We can see that besides the terms caffeine and malignant hyperthermia, only one ancestor of each one of them was matched, molecule and disease, respectively:

Open image in new window

This can be explained by our text being somewhat limited and by the fact that we are using the official labels, so we may be missing acronyms and simple variations such as the plural of a term. To cope with this issue, we may use a stemmer, or use all the ancestors besides subsumption. However, if our lexicon is small, it is better to do it manually and maybe add some regular expressions to deal with some of the variations.

Generic Lexicon

Instead of using a customized and limited lexicon, we may be interested in recognizing any of the diseases represented in the ontology. By recognizing all the diseases in our caffeine-related text, we will be able to find all the diseases that may be related to caffeine.

All Labels

To extract all the labels from the disease ontology we can use the same XPath query used before, but now without restricting it to any URI:

Open image in new window

We can create a script named getalllabels.sh, containing the following lines, that receives as argument the OWL file in which to find all the labels:

Open image in new window
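A minimal sketch of getalllabels.sh:

    #!/bin/bash
    # getalllabels.sh: all labels and synonyms in the OWL file given as first argument
    xmllint --xpath "//*[local-name()='Class']/*[local-name()='label' or local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym']" $1 | tr '<>' '\n' | grep -v -e ':label' -e 'Synonym' -e '^$'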

We should note that this script is similar to the getlabels.sh script without the xargs, since it does not receive a list of URIs as standard input.

Now we can execute the script to extract all labels from the OWL file:

Open image in new window

The output will contain the full list of diseases:

Open image in new window

To create the generic lexicon, we can redirect the output to the file diseases.txt:

Open image in new window

We can check how many labels we got by using the wc command:

Open image in new window

The lexicon contains more than 29 thousand labels.

We can now recognize the lexicon entries in the sentences of the file chebi_27732_sentences.txt by using the grep command:

Open image in new window

However, we will get the following error:

Open image in new window

This error happens because our lexicon contains some special characters also used by regular expressions, such as the parentheses.

One way to address this issue is to replace the -E option by the -F option, that treats each lexicon entry as a fixed string to be recognized:

Open image in new window

The output will show the large list of sentences mentioning diseases:

Open image in new window

Open image in new window

Problematic Entries

Despite using the -F option, the lexicon contains some problematic entries. Some entries have expressions enclosed in parentheses or brackets that represent alternatives or a category:

Open image in new window

Other entries have separation characters, such as commas or colons, to represent a specialization. For example:

Open image in new window

The problem is that these characters do not always have the same meaning. A comma may also be part of the term itself. For example:

Open image in new window

Another case is the use of &amp; to represent an ampersand. For example:

Open image in new window

However, most of the time the alternatives are already included in the lexicon on different lines. For example:

Open image in new window

As we can see by these examples, it is not trivial to devise rules that fully solve these issues. Very likely there will be exceptions to any rule we devise and that we are not aware of.

Special Characters Frequency

To check the impact of each of these issues, we can count the number of times they appear in the lexicon:

Open image in new window

We will be able to see that parentheses and commas are the most frequent, with more than one thousand entries.

Completeness

Now let us check if the ATR acronym representing the alpha thalassemia-X-linked intellectual disability syndrome is in the lexicon:

Open image in new window

All the entries include more terms than only the acronym:

Open image in new window

Thus, a single ATR mention will not be recognized.

This is problematic if we need to match sentences mentioning that acronym, such as:

Open image in new window

We will now try to mitigate these issues as simply as we can. We will not try to solve them completely, but at least address the most obvious cases.

Removing Special Characters

The first fix we will do is to remove all the parentheses and brackets by using the tr command, since they will not be found in the text:

Open image in new window

Of course, we may lose the shorter labels, such as Post measles encephalitis, but at least now, the disease Post measles encephalitis disorder will be recognized:

Open image in new window

If we really need these alternatives, we would have to create multiple entries in the lexicon or transform the labels into regular expressions.

Removing Extra Terms

The second fix is to remove all the text after a separation character, by using the sed command:

Open image in new window

We should note that the regular expression enforces a space after the separation character to avoid separation characters that are not really separating two expressions, such as: 46,XY DSD due to LHB deficiency

We can see that now we are able to recognize both ATR and ATR syndrome:

Open image in new window

Removing Extra Spaces

The third fix is to remove any leading or trailing spaces of a label:

Open image in new window

We should note that we added two more replacement expressions to the sed command by separating them with a semicolon.

We can now update the script getalllabels.sh to include the previous tr and sed commands:

Open image in new window
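A sketch of the updated script with the tr and sed fixes appended (the final sort -u deduplicates entries, an assumption consistent with the counts reported below):

    #!/bin/bash
    # getalllabels.sh: all labels and synonyms, with parentheses and brackets removed,
    # text after a separation character dropped, and leading/trailing spaces trimmed
    xmllint --xpath "//*[local-name()='Class']/*[local-name()='label' or local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym']" $1 |
      tr '<>' '\n' |
      grep -v -e ':label' -e 'Synonym' -e '^$' |
      tr -d '()[]' |
      sed 's/, .*//; s/: .*//; s/^ *//; s/ *$//' |
      sort -u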

And we can now generate a fixed lexicon:

Open image in new window

We can check again the number of entries:

Open image in new window

We now have a lexicon with about 28 thousand labels. We have fewer entries because our fixes made some entries equal to others already in the lexicon, and thus the -u option filtered them out.

Disease Recognition

We can now try to recognize lexicon entries in the sentences of file chebi_27732_sentences.txt:

Open image in new window

To obtain the list of labels that were recognized, we can use the grep command:

Open image in new window

We will get a list of 43 unique labels representing diseases that may be related to caffeine:

Open image in new window

Performance

The grep command is quite efficient, but when using large lexicons and texts we may start to notice some performance issues. Its execution time is proportional to the size of the lexicon, since each term of the lexicon corresponds to an independent pattern to match. This means that for large lexicons we may face serious performance issues.

Inverted Recognition

A solution for dealing with large lexicons is to use the inverted recognition technique (Couto et al. 2017; Couto and Lamurias 2018). The inverted recognition uses the words of the input text as patterns to be matched against the lexicon file. When the number of words in the input text is much smaller than the number of terms in the lexicon, grep has far fewer patterns to match. For example, the inverted recognition technique applied to ChEBI has been shown to be more than 100 times faster than the standard technique.

Case Insensitive

Another performance issue arises when we use the -i option to perform a case insensitive matching. For instance, on most computers, if we execute the following command we will have to wait much longer than without the -i option:

Open image in new window

One solution is to convert both the lexicon and text to lowercase (or uppercase), but this may result in more incorrect matches, such as incorrectly matching acronyms in lowercase.

ASCII Encoding

The low performance of case insensitive matching is normally due to the usage of the UTF-8 character encoding instead of the ASCII character encoding. UTF-8 allows us to use special characters, such as the euro symbol, in a standard way, so they are interpreted by every computer around the world in the same way. However, for normal text without special characters, ASCII works fine and is more efficient. In Unix shells we can normally specify the usage of ASCII encoding by adding the expression LC_ALL=C before the command (see man locale for more information).

So, another solution is to execute the following command:

Open image in new window
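For example (sketch):

    LC_ALL=C grep -i -F -f diseases.txt chebi_27732_sentences.txt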

We will be able to watch the significant increase in performance.

To check how many labels are now being recognized we can execute:

Open image in new window

We now have 60 labels being recognized.

To check which new labels were recognized, we can compare the results with and without the -i option:

Open image in new window

We are now able to see that the new labels are:

Open image in new window

Correct Matches

Some important diseases could only be recognized by performing a case insensitive match, such as arthrogryposis. This disease was missing because the lexicon contains the capitalized version of the label, but not the lowercase version. We can check this by using the grep command:

Open image in new window

The output does not include the lowercase version:

Open image in new window

We can also check in the text which versions are used:

Open image in new window

We can see that only the lowercase version is used:

Open image in new window

Another example is dyskinesia:

Open image in new window

The lexicon has only the disease name with the first character in uppercase:

Open image in new window

Incorrect Matches

However, using a case insensitive match may also create other problems, such as the acronym CAN for the disease Crouzon syndrome-acanthosis nigricans syndrome:

Open image in new window

By using a case insensitive grep we will recognize the common word CAN as a disease. For example, we can check how many times CAN is recognized:

Open image in new window

It is recognized 18 times.

And to see which type of matches they are, we can execute the following command:

Open image in new window

We can verify that the matches are incorrect mentions of the disease acronym:

Open image in new window

This means we created at least 18 mismatches by performing a case insensitive match.

Entity Linking

When we are using a generic lexicon, we may be interested in identifying what the recognized labels represent. For example, we may not be aware of what the matched label AD2 represents.

To solve this issue, we can use our script geturi.sh to perform linking (aka entity disambiguation, entity mapping, normalization), i.e. find the classes in the disease ontology that may be represented by the recognized label. For example, to find what AD2 represents, we can execute the following command:

Open image in new window

In this case, the result clearly shows that AD2 represents the Alzheimer disease:

Open image in new window

Modified Labels

However, we may not be so lucky with the labels that were modified by our previous fixes in the lexicon. For example, we can test the case of ATR:

Open image in new window

As expected, we received the warning that no URI was found:

Open image in new window

An approach to address this issue may involve keeping track of the original labels of the lexicon entries in another file.

Ambiguity

We may also have to deal with ambiguity problems where a label may represent multiple terms. For example, if we check how many classes the acronym ATS may represent:

Open image in new window

We can see that it may represent two classes:

Open image in new window

These two classes represent two distinct diseases, namely Andersen-Tawil syndrome and X-linked Alport syndrome, respectively.

We can also obtain their alternative labels by providing the two URIs as standard input to the getlabels.sh script:

Open image in new window

We will get the following two lists, both containing ATS as expected:

Open image in new window

Open image in new window

If we find an ATS mention in the text, the challenge is to identify which of the syndromes the mention refers to. To address this challenge, we may have to use advanced entity linking techniques that analyze the context of the text.

Surrounding Entities

An intuitive solution is to select the class that is closest in meaning to the other classes mentioned in the surrounding text. This assumes that entities present in a piece of text are somehow semantically related to each other, which is normally the case. At least the author assumed some type of relation between them, otherwise the entities would not be in the same sentence.

Let us consider the following sentence about genes and related syndromes from our text file chebi_27732_sentences.txt (on line 436):

Open image in new window

Now assume that the label Andersen-Tawil syndrome has been replaced by the acronym ATS:

Open image in new window

Then, to identify the diseases in the previous sentence, we can execute the following command:

Open image in new window

We have a list of labels that can help us decide which is the right class representing ATS:

Open image in new window

To find their URIs we can use the geturi.sh script:

Open image in new window

Open image in new window

The only ambiguity is for ATS that returns two URIs, one representing the Andersen-Tawil syndrome (DOID:0050434) and the other representing the X-linked Alport syndrome (DOID:0110034):

Open image in new window

To decide which of the two URIs we should select, we can measure how close in meaning they are to the other diseases also found in the text.

Semantic Similarity

Semantic similarity measures have been successfully applied to solve these ambiguity problems (Grego and Couto 2013). Semantic similarity quantifies how close two classes are in terms of semantics encoded in a given ontology (Couto and Lamurias 2019). Using the web tool Semantic Similarity Measures using Disjunctive Shared Information (DiShIn), we can calculate the semantic similarity between our recognized classes. For example, we can calculate the similarity between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434) (see Fig. 5.6), and the similarity between LQT1 and X-linked Alport syndrome (DOID:0110034) (see Fig. 5.7).
Fig. 5.6

Semantic similarity between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434) using the online tool DiShIn

Fig. 5.7

Semantic similarity between LQT1 (DOID:0110644) and X-linked Alport syndrome (DOID:0110034) using the online tool DiShIn

Measures

DiShIn provides the similarity values for three measures, namely Resnik, Lin and Jiang-Conrath (Resnik 1995; Lin et al. 1998; Jiang and Conrath 1997). The last two measures provide values between 0 and 1, and Jiang-Conrath is a distance measure that is converted to similarity.

We can see that for all measures LQT1 is much more similar to Andersen-Tawil syndrome than to X-linked Alport syndrome. Moreover, Jiang-Conrath’s measure gives the only similarity value larger than zero for X-linked Alport syndrome, since it is a converted distance measure. We obtain similar results if we replace LQT1 by LQT2, LQT3, LQT5, or LQT6. This means that by using semantic similarity we can identify Andersen-Tawil syndrome as the correct linked entity for the mention ATS in this text.

DiShIn Installation

To automate this process, we can also execute DiShIn from the command line; however, we may need to install python (or python3) and SQLite.

First, we need to install it locally using the git command line:

Open image in new window
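A possible invocation, assuming the repository is hosted under the lasigeBioTM organization on GitHub:

    git clone https://github.com/lasigeBioTM/DiShIn.git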

The git command automatically retrieves a tool from the GitHub software repository.

If everything works fine, we should be able to see something like this in our display:

Open image in new window

If the git command is not available, we can alternatively download the compressed file (zip), extract its contents and then move to the DiShIn folder:

Open image in new window

The option -L enables the curl command to follow a URL redirection. The equivalent long form to the -L option is --location.

We now have to copy the Human Disease Ontology into the folder using the cp command, and then enter the DiShIn folder:

Open image in new window

Database File

To execute DiShIn, we first need to convert the ontology file named doid.owl into a database (SQLite) file named doid.db:

Open image in new window

If the module rdflib is not installed, the following error will be displayed:

Open image in new window

We can try to install it, but this will still take a few minutes to run.

Alternatively, we can download the latest database version:

Open image in new window

DiShIn Execution

After being installed, we can execute DiShIn by providing the database and two class identifiers:

Open image in new window

The output of the first command will be the semantic similarity values between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434):

Open image in new window

Open image in new window

The output of the second command will be the semantic similarity values between LQT1 (DOID:0110644) and X-linked Alport syndrome (DOID:0110034):

Open image in new window

In the end, we should not forget to return to our parent folder:

Open image in new window

Learning python and SQL is out of the scope of this book, but if we do not intend to make any modifications, the above steps should be quite simple to execute.

Large Lexicons

The online tool MER is based on a shell script, so it can be easily executed from the command line to efficiently recognize and link entities using large lexicons.

MER Installation

First, we need to install it locally using the git command line:

Open image in new window
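Again a sketch, assuming the repository location under the lasigeBioTM organization:

    git clone https://github.com/lasigeBioTM/MER.git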

If everything works fine, we should be able to see something like this in our display:

Open image in new window

If the git command is not available, we can alternatively download the compressed file (zip), and extract its contents:

Open image in new window

We now have to copy the Human Disease Ontology into the data folder of MER, and then enter the MER folder:

Open image in new window

Lexicon Files

To execute MER, we first need to create the lexicon files:

Open image in new window

This may take a few minutes to run. However, we only need to execute it once, each time we want to use a new version of the ontology. If we wait, the output will include the last patterns of each of the lexicon files.

Alternatively, we can download the lexicon files, and extract them into the data folder:

Open image in new window

We can check the contents of the created lexicons by using the tail command:

Open image in new window

These patterns are created according to the number of words of each term.

The output should be something like this:

Open image in new window

Open image in new window

Open image in new window

Open image in new window

MER Execution

Now we are ready to execute MER, by providing each sentence from the file chebi_27732_sentences.txt as argument to its get_entities.sh script.

Open image in new window
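A possible invocation — a sketch that assumes the sentences file sits in the parent folder and that the lexicon produced from doid.owl is named doid:

    cat ../chebi_27732_sentences.txt | tr -d "'" | xargs -I {} ./get_entities.sh '{}' doid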

We removed single quotes from the text, since they are special characters to the xargs command. We should note that this is the get_entities.sh script inside the MER folder, not the one we created before.

Now we will be able to obtain a large number of matches:

Open image in new window

Open image in new window

The first two numbers represent the start and end position of the match in the sentence. They are followed by the name of the disease and its URI in the ontology.

We can also redirect the output to a TSV file named diseases_recognized.tsv:

Open image in new window

We can now open the file in our spreadsheet application, such as LibreOffice Calc or Microsoft Excel (see Fig. 5.8).
Fig. 5.8

The diseases_recognized.tsv file opened in a spreadsheet application

Again, we should not forget to return to our parent folder in the end:

Open image in new window

Further Reading

To know more about biomedical ontologies, the book entitled Introduction to bio-ontologies is an excellent option, covering most of the relevant ontologies and the computational techniques for exploring them (Robinson and Bauer 2011).

Another approach is to read and watch the materials of the training course given by Barry Smith.

References

  1. Couto F, Lamurias A (2018) MER: a shell script and annotation server for minimal named entity recognition and linking. J Cheminform 10(1):58
  2. Couto F, Lamurias A (2019) Semantic similarity definition. In: Ranganathan S, Nakai K, Schönbach C, Gribskov M (eds) Encyclopedia of bioinformatics and computational biology, vol 1. Elsevier, Oxford
  3. Couto FM, Campos LF, Lamurias A (2017) MER: a minimal named-entity recognition tagger and annotation server. Proc BioCreative 5:130–137
  4. Grego T, Couto FM (2013) Enhancement of chemical entity identification in text using semantic similarity validation. PLoS ONE 8(5):e62984
  5. Jiang JJ, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th research on computational linguistics international conference, pp 19–33
  6. Lin D et al (1998) An information-theoretic definition of similarity. In: ICML, vol 98, pp 296–304
  7. Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 1, pp 448–453. Morgan Kaufmann Publishers Inc.
  8. Robinson PN, Bauer S (2011) Introduction to bio-ontologies. Chapman and Hall/CRC, Boca Raton

Copyright information

© The Author(s) 2019

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Francisco M. Couto
  1. LASIGE, Department of Informatics, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
