
Data Retrieval

  • Francisco M. Couto
Open Access
Chapter
Part of the Advances in Experimental Medicine and Biology book series (AEMB, volume 1137)

Abstract

This chapter starts by introducing an example of how we can retrieve text, where every step is done manually. The chapter then describes, step by step, how we can automate each step of the example using shell script commands, which are introduced and explained as they become necessary. The goal is to equip the reader with a basic set of skills to retrieve data from any online database and follow the links to retrieve more information from other sources, such as the literature.

Keywords

Unix shell · Terminal application · Web retrieval · cURL: client uniform resource locator · Data extraction · Data selection · Data filtering · Pattern matching · XML: extensible markup language · XPath: XML path language

Caffeine Example

As our main example, let us consider that we need to retrieve more data and literature about caffeine. If we really do not know anything about caffeine, we may start by opening our favorite internet browser and then searching for caffeine in Wikipedia1 to find out what it really is (see Fig. 3.1). Among all the information available, we can check in the infobox that there are multiple links to external sources. The infobox is normally a table added to the top right-hand part of a web page with structured data about the entity described on that page.
Fig. 3.1

Wikipedia page about caffeine

From the list of identifiers (see Fig. 3.2), let us select the link to one resource hosted by the European Bioinformatics Institute (EBI), the link to CHEBI:27732.
Fig. 3.2

Identifiers section of the Wikipedia page about caffeine

CHEBI represents the acronym of the resource Chemical Entities of Biological Interest (ChEBI)3 and 27732 the identifier of the entry in ChEBI describing caffeine (see Fig. 3.3). ChEBI is a freely available database of molecular entities with a focus on “small” chemical compounds. More than a simple database, ChEBI also includes an ontology that classifies the entities according to their structural and biological properties.
Fig. 3.3

ChEBI entry describing caffeine

By analyzing the CHEBI:27732 web page we can check that ChEBI provides a comprehensive set of information about this chemical compound. But let us focus on the Automatic Xrefs tab4. This tab provides a set of external links to other resources describing entities somehow related to caffeine (see Fig. 3.4).
Fig. 3.4

External references related to caffeine

In the Protein Sequences section, we have 77 proteins (in September of 2018) related to caffeine. If we click on show all we will get the complete list5 (see Fig. 3.5). These links point to another resource hosted by the EBI, UniProt, a database of protein sequences and annotation data.
Fig. 3.5

Proteins related to caffeine

The list includes the identifier of each protein, with a direct link to the respective entry in UniProt, the name of the protein, and some topics from the description of the protein. For example, DISRUPTION PHENOTYPE means that some effects caused by the disruption of the gene coding for the protein are known6.

We should note that at the bottom right of the page there are Export options that enable us to download the full list of protein references in a single file. These options include:
  1. CSV:

    Comma-Separated Values, an open file format that enables us to store data as a single table (columns and rows).

  2. Excel:

    a proprietary format designed to store and access the data using the software Microsoft Excel.

  3. XML:

    eXtensible Markup Language, an open file format that enables us to store data using a hierarchy of markup tags.
We start by downloading the CSV, Excel and XML files. We can now open the files and check their contents in a regular text editor7 installed on our computer, such as Notepad (Windows), TextEdit (macOS) or gedit (Linux).

The first lines of the chebi_27732_xrefs_UniProt.csv file should look like this:

Open image in new window

The first lines of the chebi_27732_xrefs_UniProt.xls file should look like this:

Open image in new window

As we can see, this is not the proprietary XLS format but instead a TSV (tab-separated values) format. Nevertheless, the file can still be opened directly in Microsoft Excel.

The first lines of the chebi_27732_xrefs_UniProt.xml file should look like this:

Open image in new window

Open image in new window

We should note that all the files contain the same data; they only use different formats.

If, for any reason, we are not able to download the previous files from UniProt, we can get them from the book file archive8.

In the following sections we will use these files to automate this process, but for now let us continue our manual exercise using the internet browser. Let us select the Ryanodine receptor 1 with the identifier P21817 and click on the link9 (see Fig. 3.6). We can now see that UniProt is much more than just a sequence database. The sequence is just a tiny fraction of all the information describing the protein. All this information can also be downloaded as a single file by clicking on Format and then on XML. Then, save the result as an XML file on our computer.
Fig. 3.6

UniProt entry describing the Ryanodine receptor 1

Again, we can use our text editor to open the downloaded file named P21817.xml, whose first lines should look like this:

Open image in new window

We can check that this entry represents a Homo sapiens (Human) protein, so if we are interested only in human proteins, we will have to filter them. For example, the entry E9PZQ0 in the ChEBI list also represents a Ryanodine receptor 1 protein, but for Mus musculus (Mouse).

Going back to the browser, on the top-left side of the UniProt entry we have a link to publications11. If we click on it, we will see a list of publications somehow related to the protein (see Fig. 3.7).
Fig. 3.7

Publications related to Ryanodine receptor 1

Let us assume that we are interested in finding phenotypic information; the first title that may attract our attention is: Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia. To know more about the publication, we can use the UniProt citations service by clicking on the Abstract link12 (see Fig. 3.8).
Fig. 3.8

Abstract of the publication entitled Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia

To check if the abstract mentions any disease, we can use an online text mining tool, for example the Minimal Named-Entity Recognizer (MER)13. We can copy and paste the abstract of the publication into MER and select DO – Human Disease Ontology as the lexicon (see Fig. 3.9).
Fig. 3.9

Diseases recognized by the online tool MER in an abstract

We will see that MER detects three mentions of malignant hyperthermia, giving us another link14 about the disease found (see Fig. 3.10).
Fig. 3.10

Ontobee entry for the class malignant hyperthermia

Thus, in summary, we started from a generic definition of caffeine and ended with an abstract about hyperthermia by following the links in different databases. Of course, this does not mean that by taking caffeine we will get hyperthermia, or that we will treat hyperthermia by taking caffeine (maybe as a cold drink ☺15). However, this relation has a context, a protein and a publication, that needs to be further analyzed before drawing any conclusions.

We should note that we only analyzed one protein and one publication; we now need to repeat all the steps for all the proteins and for all the publications related to each protein. And this could be even more complicated if we were interested in other central nervous system stimulants, for example by looking in the ChEBI ontology16. This is of course the motivation to automate the process, since it is not humanly feasible to deal with such a large amount of data, which keeps evolving every day.

However, if the goal was to find a relation between caffeine and hyperthermia, we could simply have searched for these two terms in PubMed. We did not do that for two reasons. The first is that some relations are not explicitly mentioned in the text, so we have to navigate through database links. The second is that we needed an example using different resources and multiple entries to explain how we can automate most of these steps using shell scripting. The automation of the example will introduce a comprehensive set of techniques and commands, which, with some adaptation, Life and Health specialists can use to address many of their text and data processing challenges.

Unix Shell

The first step is to open a shell in our personal computer. A shell is a software program that interprets and executes command lines given by the user in consecutive lines of text. A shell script is a list of such command lines. A command line usually starts by invoking a command line tool. This manuscript will introduce a few command line tools, which will allow us to automate the previous example. The Unix shell was developed to manage Unix-like operating systems, but due to its usefulness it is nowadays available in most personal computers using Linux, macOS or Windows operating systems. There are many types of Unix shells with minor differences between them (e.g. sh, ksh, csh, tcsh and bash), but the most widely available is the Bourne-Again shell (bash17). The examples in this manuscript were tested using bash.

So, the first step is to open a shell in our personal computer using a terminal application (see Fig. 3.11). If we are using Linux or macOS, this is usually not new for us, since most probably we already have a terminal application installed that opens a shell for us. In case we are using a Microsoft Windows operating system, we have several options to consider. If we are using Windows 10, we can install the Windows Subsystem for Linux18 or just install a third-party application, such as MobaXterm19. No matter which terminal application we end up using, the shell will always have a common look: a text window with a blinking cursor waiting for our first command line. We should note that most terminal applications allow the usage of the up and down cursor keys to select, edit, and execute previous commands, and of the tab key to complete the name of a command or a file.
Fig. 3.11

Screenshot of a Terminal application (Source: https://en.wikipedia.org/wiki/Unix)

Current Directory

As our first command line, we can type:

$ pwd

After hitting enter, the command will show the full path of the directory (folder) in which the shell is currently working. The dollar sign on the left only indicates that this is a command to be executed directly in the shell.

To understand a command line tool, such as pwd, we can type man followed by the name of the tool. For example, we can type man pwd to learn more about pwd (do not forget to hit enter, and press q to quit). We can also learn more about man by typing man man. A shorter alternative to man is to add the --help option after any command line tool. For example, we can type pwd --help to get a more concise description of pwd.

As our second command line, we can type ls and hit enter. It will show the list of files in the current directory. For example, we can type ls --help to get a concise description of ls. Since we will work with files that we need to open with a text editor or a spreadsheet application20, such as LibreOffice Calc or Microsoft Excel, we should select a current directory that we can easily open in our file explorer application. A good idea is to open our favorite file explorer application, select a directory, and then check its full path21.

Windows Directories

Notice that in a Windows full path each directory name is separated by a backslash (\), while in a Unix shell a forward slash (/) is used.

For example, a Windows path to the Documents folder may look like:

C:\Users\username\Documents

If we are using the Windows Subsystem for Linux22, the previous folder must be accessed using the path:

/mnt/c/Users/username/Documents

If we are using MobaXterm23, the following path should be used instead:

/drives/c/Users/username/Documents

Change Directory

To change the directory, we can use another command line tool, cd (change directory), followed by the new path. In a Linux system we may want to use the Documents directory. If the Documents directory is inside our current directory (shown using ls), we only need to type:

$ cd Documents

Now we can type pwd to see what changed.

And if we want to return to the parent directory, we only need to use the two dots ..:

$ cd ..

And if we want to return to the home directory, we only need to use the tilde character (∼):

$ cd ~

Again, we should type pwd to double check if we are in the directory we really want.

In Windows we may need to use the full path, for example:

$ cd /mnt/c/Users/username/Documents

We should note that we need to enclose the path within single (or double) quotes in case it contains spaces:

$ cd 'My Documents'

Later on, we will know more about the difference between using single or double quotes. For now, we may assume that they are equivalent. To know more about cd, we can type cd --help.

Useful Key Combinations

Every time the terminal is blocked for any reason, we can press the control and C keys at the same time24. This usually cancels the tool currently being executed. For example, try using the cd command with only one single quote:

$ cd 'My Documents

This will block the terminal, because it is still waiting for a second single quote that closes the argument. Now press control-C, and the command will be aborted.

Now we can type the previous command again, but instead of pressing control-C we may press control-D25. The combination control-D indicates to the terminal that it is the end of input. So, in this case, the cd command will not be canceled; instead it is executed without the second single quote and therefore a syntax error will be shown on our display.

Other useful key combinations are control-L, which when pressed cleans the terminal display, and control-insert and shift-insert, which when pressed copy and paste the selected text, respectively.

Shell Version

The following examples will probably work in any Unix shell, but if we want to be certain that we are using bash we can type the following command, and check if the output says bash.

$ ps -p $$

ps is a command line tool that shows information about active processes running in our computer. The -p option selects a given process, and in this case $$ represents the process running in our terminal application. In most terminal applications bash is the default shell. If this is not our case, we may need to type bash, hit enter, and then we are using bash.

Now that we know how to use a shell, we can start writing and running a very simple script that reverses the order of the lines in a text file.

Data File

We start by creating a file named myfile.txt using any text editor, and adding the following lines:

Open image in new window

We cannot forget to save it in our working directory, and check if it has the proper filename extension.

File Contents

To check if the file is really on our working directory, we can type:

$ cat myfile.txt

The contents of the file should appear in our terminal. cat is a simple command line tool that receives a filename as argument and displays its contents on the screen. We can type man cat or cat --help to know more about this command line tool.

Reverse File Contents

An alternative to cat tool is the tac tool. To try it, we only need to type:

$ tac myfile.txt

The contents of the file should also appear in our terminal, but now in the reverse order. We can type man tac or tac --help to know more about this command line tool.

My First Script

Now we can create a script file named reversemyfile.sh by using the text editor, and add the following lines:

1 tac $1

We cannot forget to save the file in our working directory. $1 represents the first argument after the script filename when invoking it. Each script file presented in this manuscript will include the line numbers on the left. This will help us not only to identify how many lines the script contains, but also to distinguish a script file from the commands to be executed directly in the shell.

Line Breaks

A Unix file represents a single line break by a line feed character, instead of the two characters (carriage return and line feed) used by Windows26. So, if we are using a text editor in Windows, we must be careful to use one that lets us save the file as a Unix file, for example the open source Notepad++27.

In case we do not have such a text editor, we can also remove the extra carriage returns by using the command line tool tr, which replaces and deletes characters:

$ tr -d '\r' < reversemyfile.sh > reversemyfilenew.sh

The -d option of tr is used to remove a given character from the input; in this case tr will delete all carriage returns (\r). Many command line options can be used in a short form using a single dash (-), or in a long form using two dashes (--). In this tool, using the --delete option is equivalent to the -d option. Long forms are more self-explanatory, but they take longer to type and occupy more space. We can type man tr or tr --help to know more about this command line tool.
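As a quick illustration, we can create a small file with Windows-style line breaks and strip the carriage returns; the file names here (crlf_demo.txt, lf_demo.txt) are just for this sketch:

```shell
# create a demo file with Windows-style CRLF line endings
printf 'first line\r\nsecond line\r\n' > crlf_demo.txt
# delete every carriage return, producing a Unix-style file
tr -d '\r' < crlf_demo.txt > lf_demo.txt
cat lf_demo.txt
```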

Redirection Operator

The > character represents a redirection operator28 that moves the results being displayed at the standard output (our terminal) to a given file. The < character represents a redirection operator that works in the opposite direction, i.e. it opens a given file and uses it as the standard input.
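For example, assuming a scratch file named today.txt, we can see both operators at work:

```shell
# > sends the standard output of date into a file
date > today.txt
# < opens the file and feeds it to the standard input of wc,
# which here counts its lines (one)
wc -l < today.txt
```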

We should note that cat received the filename as an input argument, while tr can only receive the contents of the file through the standard input. Instead of providing the filename as argument, the cat command can also receive the contents of a file through the standard input, and produce the same output:

$ cat < myfile.txt

The previous tr command used a new file for the standard output, because we cannot use the same file to read and write at the same time. To keep the same filename, we have to move the new file by using the mv command:

$ mv reversemyfilenew.sh reversemyfile.sh

We can type man mv or mv --help to know more about this command line tool.

Installing Tools

These two last commands could be replaced by the dos2unix tool:

$ dos2unix reversemyfile.sh

If not available, we have to install the dos2unix tool. For example, in the Ubuntu Windows Subsystem we need to execute:

$ sudo apt install dos2unix

The apt (Advanced Package Tool) command is used to install packages in many Linux systems29. Another popular alternative is the yum (Yellowdog Updater, Modified) command30.

To avoid fixing line breaks each time we update our file when using Windows, a clearly better solution is to use a Unix friendly text editor.

When we are not using Windows, or we are using a Unix-friendly text editor, the previous commands will execute but nothing will happen to the contents of reversemyfile.sh, since the tr command will not remove any character. To see the command working, replace '\r' by '$' and check what happens.

Permissions

A script also needs permission to be executed, so every time we create a new script file we need to type:

$ chmod u+x reversemyfile.sh

The command line tool chmod just gave the user (u) permissions to execute (+x). We can type man chmod or chmod --help to know more about this command line tool.

Finally, we can execute the script by providing the myfile.txt as argument:

$ ./reversemyfile.sh myfile.txt

The contents of the file should appear in our terminal in the reverse order:

Open image in new window

Congratulations, we made our first script work! ☺

If we give more arguments, they will be ignored:

$ ./reversemyfile.sh myfile.txt myotherfile.txt 'my other file.txt'

The output will be exactly the same because our script does not use $2 and $3, which in this case represent myotherfile.txt and my other file.txt, respectively. We should note that when it contains spaces, the argument must be enclosed in single quotes.
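To see the positional arguments in action, here is a minimal sketch, assuming a hypothetical script named showargs.sh that simply echoes its first three arguments:

```shell
# create a hypothetical script that prints its first three arguments
printf 'echo "$1"\necho "$2"\necho "$3"\n' > showargs.sh
chmod u+x showargs.sh
# the quoted last argument, spaces included, arrives as a single $3
./showargs.sh myfile.txt myotherfile.txt 'my other file.txt'
```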

Debug

If something is not working well, we can debug the entire script by typing:

$ bash -x reversemyfile.sh myfile.txt

Our terminal will not only display the resulting text, but also the command lines executed, preceded by the plus character (+):

Open image in new window

Alternatively, we can add the set -x command line in our script to start the debugging mode, and set +x to stop it.
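A minimal sketch of the debugging mode, using set -x and set +x directly:

```shell
# set -x makes the shell display each command, preceded by +,
# on the standard error output; set +x stops the debugging mode
set -x
echo 'debugging'
set +x
```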

Save Output

We can now save the output into another file named mynewfile.txt by typing:

$ ./reversemyfile.sh myfile.txt > mynewfile.txt

Again, to check if the file was really created, we can use the cat tool:

$ cat mynewfile.txt

Or, we can reverse it again by typing:

$ ./reversemyfile.sh mynewfile.txt

Of course, the result should be exactly the original contents of myfile.txt.

Web Identifiers

The input argument(s) of our retrieval task is the chemical compound(s) about which we want to retrieve more information. For the sake of simplicity, we will start by assuming that the user knows the ChEBI identifier(s), i.e. the script does not have to search by the name of the compounds. Nevertheless, finding the identifier of a compound by its name is also possible, and this manuscript will describe how to do it later on.

So, the first step is to automatically retrieve all proteins associated with the given input chemical compound, which in our example was caffeine (CHEBI:27732). In the manual process, we downloaded the files by manually clicking on the links shown as Export options, namely the URLs:

Open image in new window

for downloading a CSV, Excel, or XML file, respectively.

We should note that the only difference between the three URLs is a single numerical digit (1, 2, and 3) after the first equals character (=), which means that this digit can be used as an argument to select the type of file. Another parameter that is easily observable is the ChEBI identifier (27732). Try to replace 27732 by 17245 in any of those URLs using a text editor, for example:

Open image in new window

Now we can use this new URL in the internet browser, and check what happens. If we did it correctly, our browser downloaded a file with more than seven hundred proteins, since 17245 is the ChEBI identifier of a chemical compound popular in life systems, carbon monoxide.

In this case, we are not using a fully RESTful web service, but the data path is pretty modular and self-explanatory. The path is clearly composed of:
  • the name of the database (chebi);

  • the method (viewDbAutoXrefs.do);

  • and a list of parameters and their value (arguments) after the question mark character (?).

The order of the parameters in the URL is normally not relevant. They are separated by the ampersand character (&) and the equals character (=) is used to assign a value to each parameter (argument). The modular structure of these URLs allows us to use them as data pipelines to fill our local files with data, like pipelines that transport oil or gas from one container to another.
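This modularity means that a URL can be assembled from variables in a script. A minimal sketch follows, in which the parameter names (chebiId, dbName) are illustrative and not necessarily the exact ones used in the export links:

```shell
# assemble a URL from its modular parts
database='chebi'
method='viewDbAutoXrefs.do'
identifier=27732
echo "https://www.ebi.ac.uk/${database}/${method}?chebiId=${identifier}&dbName=UniProt"
```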

Single and Double Quotes

To construct the URL for a given ChEBI identifier, let us first understand the difference between single quotes and double quotes in a string (a sequence of characters). We can create a script file named getproteins.sh by using a text editor to add the following lines:

1 echo '$1'
2 echo "$1"

The command line tool echo displays the string received as argument. Do not forget to save the file in our working directory and add the right permissions with chmod, as we did previously with our first script.

Now to execute the script we will only need to type:

$ ./getproteins.sh

The output on the terminal should be:

$1


This means that when using single quotes, the string is interpreted literally as it is, whereas a string within double quotes is analyzed, and if there is a special character, such as the dollar sign ($), the script translates it to what it represents. In this case, $1 represents the first input argument. Since no argument was given, the double-quoted string displays nothing.

To execute the script with an argument, we can type:

$ ./getproteins.sh 27732

The output on our terminal should be:

$1
27732

We can check now that when using double quotes, $1 is translated to the string given as argument.
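The same behavior can be checked directly with any shell variable:

```shell
identifier=27732
# single quotes: the string is taken literally
echo 'The id is $identifier'
# double quotes: $identifier is replaced by its value
echo "The id is $identifier"
```

The first echo prints the dollar sign and the variable name as they are, whereas the second prints The id is 27732.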

Now we can update our script file named getproteins.sh to contain only the following line:

Open image in new window

Comments

Instead of removing the previous lines, we can transform them into comments by adding the hash character (#) at the beginning of each line:

Open image in new window

Commented lines are ignored by the computer when executing the script.

Now, we can execute the script giving the ChEBI identifier as argument:

$ ./getproteins.sh 27732

The output on our terminal should be the link that returns the CSV file containing the proteins associated with caffeine.

Data Retrieval

After having the link, we need a web retrieval tool that works like our internet browser, i.e. receives as input a URL for programmatic access and retrieves its contents from the internet. We will use the Client Uniform Resource Locator (cURL) tool, which is available as a command line tool, and allows us to download the result of opening a URL directly into a file (man curl or curl --help for more information).

For example, to display on our screen the list of proteins related to caffeine, we just need to add the respective URL as the input argument:

Open image in new window

In some systems the curl command needs to be installed31. Since we are using a secure connection https, we may also need to install the ca-certificates package32.

An alternative to curl is the command wget, which also receives a URL as argument, but by default wget writes the contents to a file instead of displaying them on the screen (man wget or wget --help for more information). So, for the equivalent command, we add the -O- option to select where the contents are placed:

Open image in new window

We should note that the dash (-) character after -O represents the standard output. The equivalent long form of the -O option is --output-document=file.

The output on our terminal should be the long list of proteins:

Open image in new window

Instead of using a fixed URL, we can update the script named getproteins.sh to contain only the following line:

Open image in new window

We should note that now we are using double quotes, since we replaced the caffeine identifier by $1.

Now to execute the script we only need to provide a ChEBI identifier as input argument:

$ ./getproteins.sh 27732

The output on our terminal should be the long list of proteins:

Open image in new window

Or, if we want the proteins related to carbon monoxide, we only need to replace the argument:

$ ./getproteins.sh 17245

And the output on our terminal should be an even longer list of proteins:

Open image in new window

If we want to analyze all the lines, we can redirect the output to the command line tool less, which allows us to navigate through the output using the arrow keys. To do that, we can add the bar character (|) between the two commands, which will transfer the output of the first command to the input of the second:

$ ./getproteins.sh 27732 | less

To exit from less just press q.
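A self-contained sketch of the pipe, with three sample compound names fed through sort and head:

```shell
# | feeds the output of printf into sort, and the sorted
# list into head, which keeps only the first two lines
printf 'water\ncaffeine\ngold\n' | sort | head -n 2
```

The output is caffeine followed by gold.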

However, what we really want is to save the output as a file, not just print some characters on the screen. Thus, what we should do is redirect the output to a CSV file. This can be done by adding the redirection operator > and the filename, as described previously:

$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv

We should note that curl still prints some progress information into the terminal.

Standard Error Output

This happens because curl displays that information on the standard error output, which was not redirected to the file33. The > character without any preceding number redirects the standard output by default. The same happens if we precede it with the number 1. If we do not want to see that information, we can also redirect the standard error output (2), but in this case to the null device (/dev/null):

$ ./getproteins.sh 27732 2> /dev/null > chebi_27732_xrefs_UniProt.csv
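To see the two streams separated, we can use a command that fails, such as listing a file that does not exist (the file names here are hypothetical):

```shell
# the complaint of ls goes to the standard error output (2),
# so redirecting stream 2 captures it in errors.txt;
# || true just ignores the failure of ls
ls no_such_file.txt 2> errors.txt || true
cat errors.txt
```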

We can also use the -s option of curl in order to suppress the progress information, by adding it to our script file named getproteins.sh:

Open image in new window

The equivalent long form to the -s option is --silent.

Now when executing the script, no progress information is shown:

$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv

To check if the file was really created and to analyze its contents, we can use the less command:

$ less chebi_27732_xrefs_UniProt.csv

We can also open the file in our spreadsheet application, such as LibreOffice Calc or Microsoft Excel.

As an exercise, execute the script to get the CSV files with the associated proteins of water34 and gold35.

Data Extraction

Some data in the CSV file may not be relevant to our information need, i.e. we may need to identify and extract the relevant data. In our case, we will first select the relevant proteins (lines) using the command line tool grep, and second, select the column we need using the command line tool gawk, which is the GNU implementation of awk36. We should note that if we are using MobaXterm we may need to install the gawk package37. We can also replace gawk by awk in case another implementation is available38.

Since our information need is about diseases related to caffeine, we may assume that we are only interested in proteins that have one of these topics in the third column:

Open image in new window

Extracting lines from a text file is the main function of grep. The selection is performed by giving as input a pattern that grep tries to find in each line, presenting only the lines where it was able to find a match. The pattern is the same as the one we normally use when searching for a word in our text editor. The grep command also works with more complex patterns, such as regular expressions, which we will describe later on.
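A self-contained sketch of grep, using a two-line demo file invented for illustration:

```shell
# create a small demo file (the second entry is fictitious)
printf 'P21817,Ryanodine receptor 1\nQ00000,Some other protein\n' > grep_demo.csv
# keep only the lines that match the pattern Ryanodine
grep 'Ryanodine' grep_demo.csv
```

Only the first line is displayed, since it is the only one containing the pattern.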

Single and Multiple Patterns

We can execute the following command that selects the proteins with the topic CC - MISCELLANEOUS, our pattern, in our CSV file:

$ grep 'CC - MISCELLANEOUS' chebi_27732_xrefs_UniProt.csv

The output will be a shorter list of proteins, all with CC - MISCELLANEOUS as topic:

Open image in new window

To use multiple patterns, we must precede each pattern with the -e option:

Open image in new window

The equivalent long form to the -e option is --regexp=PATTERN.

The output on our terminal should be a longer list of proteins:

Open image in new window

We should note that, as previously, we can add | less to check all of them more carefully. The less command also gives us the opportunity to find lines based on a pattern. We only need to type / and then a pattern.

We can now update our script file named getproteins.sh to contain the following lines:

Open image in new window

We should note that we added the -s option to suppress the progress information of curl, and the characters | \ at the end of the line to redirect the output of that line as input of the next line, in this case the grep command. We need to be careful to ensure that \ is the last character on the line, i.e. spaces at the end of the line may cause problems.

We can now execute the script again:

$ ./getproteins.sh 27732

The output should be similar to what we got previously, but now the script downloads the data and filters it immediately.

To save the file with the relevant proteins, we only need to add the redirection operator:

Open image in new window

Data Elements Selection

Now we need to select just the first column, the one that contains the protein identifiers. Selecting columns from a tabular file is an easy task for gawk, which besides performing pattern scanning also provides a complex processing language (AWK39). This processing language can be highly complex40 and is out of the scope of this introductory manuscript. The gawk command can receive as arguments the character that divides each data element (column) in a line, using the -F option, and an instruction of what to do with it, enclosed by single quotes and curly brackets. The equivalent long form of the -F option is --field-separator=fs.

For example, we can get the first column of our CSV file:

Open image in new window

We should note that the comma (,) is the character that separates data elements in a CSV file, that print is equivalent to echo, and that $1 represents the first data element.

The command will display only the first column of the file, i.e. the protein identifiers:

Open image in new window

For example, we can get the first and third columns separated by a comma:

Open image in new window

Now, the output contains both the first and third column of the file:

Open image in new window
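Both column selections can be sketched on a small stand-in file; the rows below are hypothetical, and plain awk accepts the same -F option as gawk:

```shell
# proteins.csv stands in for the chapter's CSV file (rows are made up)
printf '%s\n' \
  'P21817,RYR1_HUMAN,Homo sapiens' \
  'Q9Y6N1,EXAMPLE_HUMAN,Homo sapiens' > proteins.csv

# First column only
awk -F ',' '{ print $1 }' proteins.csv > col1.txt

# First and third columns, separated by a comma
awk -F ',' '{ print $1 "," $3 }' proteins.csv > col13.txt
```

col1.txt holds just the identifiers, while col13.txt pairs each identifier with its organism.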

We can update our script file named getproteins.sh to contain the following lines:

Open image in new window

The last line is the only one that changed, apart from the | \ added to the end of the previous line to redirect its output.

To execute the script, we can type again:

Open image in new window

The output should be similar to what we got previously, but now only the protein identifiers are displayed.

To save the output as a file with the relevant proteins’ identifiers, we only need to add the redirection operator:

Open image in new window

Task Repetition

Given a protein identifier, we can construct the URL that enables us to download its information from UniProt. We can use the RESTful web services provided by UniProt41, more specifically the one that allows us to retrieve a specific entry42. The construction of the URL is simple: it always starts with https://www.uniprot.org/uniprot/, followed by the protein identifier, and ends with a dot and the data format. For example, the link for protein P21817 in XML format is: https://www.uniprot.org/uniprot/P21817.xml
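The URL construction can be sketched directly in the shell; the identifier and format are simply the example values from the text:

```shell
# Build the UniProt entry URL from an identifier and a data format,
# following the pattern described above
ID='P21817'
FORMAT='xml'
URL="https://www.uniprot.org/uniprot/${ID}.${FORMAT}"
echo "$URL"
```

Swapping FORMAT for txt or fasta would give the other download formats offered by the service.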

Assembly Line

However, we need to construct one URL for each protein in the list we previously retrieved. The list can be large (hundreds of proteins), varies between compounds, and evolves over time. Thus, we need an assembly line in which a list of protein identifiers, regardless of its size, is fed as input to commands that construct one URL per protein and retrieve the respective file.

The xargs command-line tool works as an assembly line: it executes a command for each line given as input. We should note that if we are using MobaXterm we may need to install the findutils package43, since the default xargs only has minimal options44.

We can start experimenting with the xargs command by giving it as input the list of protein identifiers in the file chebi_27732_xrefs_UniProt_relevant_identifiers.csv, and displaying each identifier on the screen in the middle of a text message by providing the echo command as argument:

Open image in new window

The xargs command received as input the contents of our CSV file and, for each line, displayed a message including the identifier on that line. The -I option tells xargs to replace {} in the command given as argument with the value of the line being processed. The equivalent long form to the -I option is --replace=R.

The output should be something like this:

Open image in new window

Open image in new window
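The substitution performed by -I can be sketched with a tiny identifier list standing in for the chapter's CSV file:

```shell
# A small identifier list standing in for the chapter's CSV file
printf '%s\n' P21817 Q9Y6N1 > ids.txt

# -I {} substitutes each input line for {} in the echo command
xargs -I {} echo 'Processing protein {}' < ids.txt > messages.txt
```

Each input line yields one message, so messages.txt has exactly as many lines as ids.txt.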

Instead of creating inconsequential text messages, we can use xargs to create the URLs:

Open image in new window

The output should be something like this:

Open image in new window

We can open these URLs in our internet browser to check that they work correctly.

Now that we have the URLs, we can automatically download the files using the curl command instead of echo:

Open image in new window

We should note that we now use the -o option to save the output to a given file, named after each protein identifier. The equivalent long form to the -o option is --output <file>.
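The download step has the following shape; the curl line is kept as a comment to avoid network access, with a harmless stand-in command creating the files locally:

```shell
printf '%s\n' P21817 Q9Y6N1 > ids.txt

# The chapter's download step looks like this (commented out here to
# avoid network access; URL pattern as described in the text):
#
#   xargs -I {} curl -s -o chebi_27732_{}.xml \
#     'https://www.uniprot.org/uniprot/{}.xml' < ids.txt
#
# Same shape with a stand-in command that creates the files locally:
xargs -I {} sh -c 'echo "<entry/>" > chebi_27732_{}.xml' < ids.txt
```

One file per identifier is created, named after the protein it would contain.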

To check if everything worked as expected we can use the ls command to view which files were created:

Open image in new window

The asterisk character (*) is used here to represent any file whose name starts with chebi_27732_ and ends with .xml.

To check the contents of any of them, we can use the less command:

Open image in new window

File Header

We should note that the content of every file has to start with <?xml; otherwise, there was a download error, and we have to run curl again for those entries. To check the header of each file, we can use the head command together with less.

Open image in new window

The -n option specifies how many lines to print, in the previous command just one.
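The header check can be sketched on two mock files, one valid and one simulating a failed download:

```shell
# Two mock downloads: one valid XML file, one failed download
printf '<?xml version="1.0"?>\n<entry/>\n' > ok.xml
printf '404 page not found\n' > broken.xml

# A file downloaded correctly only if its first line starts with <?xml
check() { head -n 1 "$1" | grep -q '^<?xml' && echo "$1: ok" || echo "$1: retry"; }
check ok.xml
check broken.xml
```

Files flagged with "retry" would be the ones to download again with curl.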

If for any reason, we are not able to download the files from UniProt, we can get them from the book file archive45.

Variable

We can now update our script file named getproteins.sh to contain the following lines:

Open image in new window

We should note that the last line now includes the xargs and curl commands, and the $ID variable. This new variable is created in the first line to contain the first value given as argument ($1). So, every time we mention $ID in the script we are mentioning the first value given as argument. This avoids ambiguity in cases where $1 is used for other purposes, as in the gawk command. Since the character that follows $ID is an underscore (_), we have to add a backslash (\) before it, so that the shell does not read the underscore as part of the variable name. The second line uses the rm command to remove any files downloaded in a previous execution. We also added two comments after the hash character (#), so that we humans do not forget why these commands are needed.
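How the $ID variable and the rm cleanup fit together can be sketched as follows; the script name is hypothetical and the download pipeline is reduced to a comment:

```shell
# Minimal sketch of the updated script (not the chapter's exact code)
cat > getproteins_sketch.sh <<'EOF'
#!/bin/bash
ID=$1                       # first command-line argument
rm -f chebi_"$ID"_*.xml     # remove files from a previous execution
# curl ... | grep ... | gawk ... | \
#   xargs -I {} curl -o chebi_$ID\_{}.xml ...
echo "ChEBI id: $ID"
EOF
chmod +x getproteins_sketch.sh
./getproteins_sketch.sh 27732 > out.txt
```

Running the sketch with 27732 shows the argument flowing into $ID before any download would start.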

To execute the script once more:

Open image in new window

And again, to check the results:

Open image in new window

XML Processing

Assuming that our information need only concerns human diseases, we have to process the XML file of each protein to check if it represents a Homo sapiens (Human) protein.

Human Proteins

To perform this filtering, we can again use the grep command to select only the lines of any XML file that specify the organism as Homo sapiens:

Open image in new window

We should get in our display the filenames that represent a human protein, i.e. something like this:

Open image in new window

Open image in new window

We should note that since the asterisk character (*) provides multiple files as arguments to grep, the ones whose names start with chebi_27732_ and end with .xml, the output now includes the filename (followed by a colon) where each line was matched.

We can use the gawk command to extract only the filename, but grep has the -l option to just print the filename:

Open image in new window

The equivalent long form to the -l option is --files-with-matches.

The output will now show only the filenames:

Open image in new window

These four files represent the four Human proteins related to caffeine.
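The effect of -l can be sketched with two mock protein files, one human and one not (the contents are simplified):

```shell
# Two mock protein files: one human, one mouse (contents simplified)
printf '<name type="scientific">Homo sapiens</name>\n' > chebi_27732_P21817.xml
printf '<name type="scientific">Mus musculus</name>\n' > chebi_27732_Q00000.xml

# -l prints only the names of the files that contain a match
grep -l 'Homo sapiens' chebi_27732_*.xml > human.txt
```

Only the filename of the human entry survives the filter.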

PubMed Identifiers

Now we need to extract the PubMed identifiers from these files to retrieve the related publications. For example, if we execute the following command:

Open image in new window

The output is a long list of publications related to protein P21817:

Open image in new window

Open image in new window

To extract just the identifier, we can again use the gawk command:

Open image in new window

We should note that " is used as the separation character and, since the PubMed identifier appears after the third ", $4 represents the identifier.

Now the output should be something like this:

Open image in new window
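The field positions can be verified on one citation line of the kind grep returns (the identifier value is illustrative):

```shell
# A citation line like the ones grep returns (identifier is illustrative)
line='<dbReference type="PubMed" id="1354642"/>'

# With " as field separator, the identifier is the fourth field:
#   $1 = <dbReference type=   $2 = PubMed   $3 =  id=   $4 = 1354642
pmid=$(echo "$line" | awk -F '"' '{ print $4 }')
echo "$pmid"
```

Counting the quotes from the left confirms why the identifier lands in $4.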

PubMed Identifiers Extraction

Now, to apply this to every protein, we may again use the xargs command:

Open image in new window

This may provide a long list of PubMed identifiers, including repetitions since the same publication can be cited in different entries.

Duplicate Removal

To help us identify the repetitions, we can add the sort command (man sort or sort --help for more information), which will display repeated identifiers on consecutive lines (by sorting all identifiers):

Open image in new window

For example, some repeated PubMed identifiers should now be easy to spot:

Open image in new window

Fortunately, sort also has the -u option, which removes all these duplicates:

Open image in new window

To check how many duplicates were removed, we can use the word-count command wc with and without the -u option:

Open image in new window

In case we have in our folder any auxiliary file, such as chebi_27732_P21817_entry.xml, we should add the option --exclude '*entry.xml' to the first grep command.

The output should be something like:

Open image in new window

wc prints the number of lines, words, and bytes; in our case we are interested in the first number (man wc or wc --help for more information). We can see that we removed 255 − 129 = 126 duplicates.
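The counting step can be sketched on a short list of hypothetical identifiers with repetitions:

```shell
# A list of PubMed identifiers with repetitions (hypothetical values)
printf '%s\n' 1111 2222 1111 3333 2222 > pmids.txt

total=$(wc -l < pmids.txt)           # 5 lines, duplicates included
unique=$(sort -u pmids.txt | wc -l)  # 3 lines after -u removes duplicates
echo "$total $unique"
```

The difference between the two counts is the number of duplicates removed.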

Just out of curiosity, we can also use the shell to perform simple mathematical calculations with the expr command:

Open image in new window
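For example, the subtraction above can be done directly; note that expr requires spaces around operands and operators:

```shell
# expr evaluates integer expressions; operands and operators
# must be separated by spaces
expr 255 - 129
```

Without the spaces, expr would treat 255-129 as a single string rather than a subtraction.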

Now let us create a script file named getpublications.sh by using a text editor to add the following lines:

Open image in new window

Again, do not forget to save it in our working directory, and add the right permissions with chmod as we did previously with the other scripts.

To execute the script again:

Open image in new window

We can verify how many unique publications were obtained by using the -l option of wc, which prints only the number of lines:

Open image in new window

The output will be 129 as expected.

Complex Elements

XML elements are not always on a single line, as was fortunately the case with the PubMed identifiers. In those cases, we may have to use the xmllint command, a parser that can extract data through the specification of an XPath query, instead of a single-line pattern as in grep.

XPath

XPath (XML Path Language) is a powerful tool to extract information from XML and HTML documents by following their hierarchical structure. Check W3C for more about the XPath syntax46. We should note that xmllint may not be installed by default, depending on our operating system, but it should be very easy to install47. If we are using MobaXterm, we need to install the xmllint plugin48.

Namespace Problems

In the case of our protein XML files, we can see that their second line defines a specific namespace using the xmlns attribute49:

Open image in new window

This complicates our XPath queries, since we need to explicitly specify that we are using the local name of every element in an XPath query. For example, to get the data in each reference element:

Open image in new window

We should note that // means any path in the XML file until a reference element is reached. The square brackets in XPath queries normally enclose conditions that need to be verified.

Only Local Names

If we are only interested in using local names, there is a way to avoid using local-name() for every element in an XPath query. We can identify the top-level element, in our case entry, and extract all the data it encloses using an XPath query. For example, we can create the auxiliary file chebi_27732_P21817_entry.xml by adding the redirection operator:

Open image in new window

The new XML file now starts and ends with the entry element without any namespace definition:

Open image in new window

Now we can apply any XPath query, for example //reference, on the auxiliary file without the need to explicitly say that it represents a local name:

Open image in new window

The output should contain only the data inside of each reference element:

Open image in new window

Queries

The XPath syntax allows us to create many useful queries, such as:
  • //dbReference – elements of type dbReference that are descendants of something; Result:

    Open image in new window

  • /entry//dbReference – equivalent to the previous query but specifying that the dbReference elements are descendants of the entry element;

  • /entry/reference/citation/dbReference – equivalent to the previous query but specifying the full path in the XML file;

  • //dbReference/* – any child elements of a dbReference element; Result:

    Open image in new window

  • //dbReference/property[1] – first property element of each dbReference element; Result:

    Open image in new window

  • //dbReference/property[2] – second property element of each dbReference element; Result:

    Open image in new window

  • //dbReference/property[3] – third property element of each dbReference element; Result:

    Open image in new window

  • //dbReference/property/@type – all type attributes of the property elements; Result:

    Open image in new window

  • //dbReference/property[@type="protein sequence ID"] – the previous property elements that have an attribute type equal to protein sequence ID; Result:

    Open image in new window

  • //dbReference/property[@type="protein sequence ID"]/@value – the string assigned to each attribute value of the previous property elements; Result:

    Open image in new window

  • //sequence/text() – the contents inside the sequence elements; Result:

    Open image in new window

We should note that to try the previous queries we only need to replace the string after the --xpath option of the previous xmllint command, such as:

Open image in new window

Thus, an alternative way to extract the PubMed identifiers using xmllint instead of grep, would be something like this:

Open image in new window

Open image in new window

However, the output contains all identifiers on the same line, each with the id label:

Open image in new window

Extracting XPath Results

To extract the identifiers, we need to apply the tr command to split the output into multiple lines (one per identifier), and then the gawk command:

Open image in new window

The tr command replaces each space with a newline character, and the gawk command extracts the value inside the double quotes. We should note that NF > 0 is used to select only non-empty lines, i.e. in our case it ignores the empty lines produced by the split.
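The split-and-extract step can be sketched on one line shaped like xmllint's attribute output; the values below are made up:

```shell
# A single line shaped like xmllint's attribute output (values made up)
line=' id="1354642" id="7515481"'

# tr splits on spaces; awk keeps non-empty lines (NF > 0) and prints
# the value found between the double quotes
echo "$line" | tr ' ' '\n' | awk -F '"' 'NF > 0 { print $2 }' > pmids.txt
```

The leading space produces an empty first line, which the NF > 0 condition silently discards.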

Text Retrieval

Now that we have all the PubMed identifiers, we need to download the text included in the titles and abstracts of each publication.

Publication URL

To retrieve from the UniProt citations service the publication entry of a given identifier, we can again use the curl command and a link to the publication entry. For example, if we click on the Format button of the UniProt citations service entry50, we can get the link to the RDF/XML version. RDF51 is a standard data model that can be serialized in an XML format. Thus, in our case, we can handle this format as we did with XML.

We can retrieve the publication entry by executing the following command:

Open image in new window

Thus, we can now update the script getpublications.sh to have the following commands:

Open image in new window

We should note that only the second and last lines were updated to remove and retrieve the files, respectively.

Now let us execute the script:

Open image in new window

It may take a while to download all the entries, but probably no more than one minute with a standard internet connection.

To check if everything worked as expected we can use the ls command to view which files were created:

Open image in new window

If for any reason, we are not able to download the abstracts from UniProt, we can get them from the book file archive52.

Title and Abstract

Each file has the title and abstract of the publication as values of the title and rdfs:comment elements, respectively. To extract them we can again use the grep command:

Open image in new window

The output should be something like these two lines:

Open image in new window

To remove the XML elements, we can again use gawk:

Open image in new window

We should note that we now use two characters, < and >, as field separators, to get the text between the first > and the second <. The first field separator is <, so $2 contains the string title or rdfs:comment while $1 is empty. The second field separator is >, so $3 contains the string we want to keep.

The output should now be free of XML elements:

Open image in new window
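The field positions can be checked on one RDF line of the kind grep returns; the title text is invented:

```shell
# An RDF line like the ones grep returns (title text is invented)
printf '<title>Caffeine and muscle contraction.</title>\n' > pub.txt

# [<>] makes both < and > field separators, so the text between the
# first > and the second < lands in the third field
awk -F '[<>]' '{ print $3 }' pub.txt
```

The same one-liner strips the rdfs:comment tags from the abstract lines.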

Thus, let us create the script gettext.sh to have the following commands:

Open image in new window

Open image in new window

Again do not forget to save it in our working directory, and add the right permissions.

Now to execute the script and see the retrieved text:

Open image in new window

We can save the resulting text in a file named chebi_27732.txt that we may share or read using our favorite text editor, by adding the redirection operator:

Open image in new window

Disease Recognition

Instead of reading all that text to find diseases related to caffeine, we can try to find sentences about a given disease by using grep:

Open image in new window

To save the filtered text in a file named chebi_27732_hyperthermia.txt, we only need to add the redirection operator:

Open image in new window
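This kind of filtering can be sketched on a tiny stand-in for chebi_27732.txt; the sentences are invented, and the -i flag (case-insensitive matching) is an assumption about how the search is best run:

```shell
# A tiny stand-in for chebi_27732.txt (sentences are invented)
printf '%s\n' \
  'Caffeine contracture testing in malignant hyperthermia.' \
  'Caffeine is a widely consumed stimulant.' > sample.txt

# -i makes the match case-insensitive (disease name as an example)
grep -i 'Hyperthermia' sample.txt > hits.txt
```

Only the sentence mentioning the disease survives, ready to be saved with a redirection operator.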

This is a very simple way of recognizing a disease in text. The next chapters will describe how to perform more complex text processing tasks.

Further Reading

If we really want to become experts in shell scripting, we may be interested in reading a book specialized in the subject, such as The Linux Command Line: A Complete Introduction (Shotts Jr 2012).

A more pragmatic approach is to explore the vast number of online tutorials about shell scripting and web technologies, such as the ones provided by W3Schools53.

Footnotes

  31. apt install curl

  32. apt install ca-certificates

  37. apt install gawk

  43. apt install findutils

  44. In some versions the scripts may have to use xargs.exe to invoke the new version, or rename the xargs shortcut in the bin folder to another name; that way the right version will always be invoked.

  47. apt install libxml2-utils

References

  1. Shotts WE Jr (2012) The Linux command line: a complete introduction. No Starch Press, San FranciscoGoogle Scholar

Copyright information

© The Author(s) 2019

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Francisco M. Couto
    • 1
  1. 1.LASIGE, Department of InformaticsFaculdade de Ciências, Universidade de LisboaLisbonPortugal
