1 Introduction

Native speakers rarely master the complete lexicon (all the words used in a particular language or subject), and this limitation can hinder communication among individuals. An individual's lexicon, however, is normally improved continuously through educational processes, and students are expected to develop their lexicon according to the contents to which they are exposed. Unfortunately, acquiring new words does not guarantee a corresponding understanding of the concepts involved. To determine which concepts an individual manages best, the available lexicon can be used, considering the first words an individual remembers when asked to think about a particular field of knowledge. These opening words represent the concepts the individual understands well.

The development of Lexmath [5, 6] allows teachers to study students' lexicons as a mechanism for identifying the topics that were not well understood. The platform helps to describe and quantify the students' lexicon in different mathematical subjects (arithmetic, algebra, statistics, geometry, and probability). The platform is easily extended to deal with other topics such as home, transport, city, and so on.

The objective of the present proposal is to develop a software system that uses the lexicon information present in the platform to determine, through Bayesian networks, the missing lexicon of a specific subset of students. In doing so, the system generates a probabilistic index for every word in the lexicon that a student does not know. The resulting set of unknown (or forgotten) words helps teachers identify the concepts that require additional work.

Lexmath is a platform for studying the lexicon through a set of diverse tools. The platform collects data from students using lexical availability tests. These tests determine the words that a specific group of students can remember, within a short period of time, when a particular interest center (the topic under consideration) is mentioned.

This paper is structured as follows: Sect. 1 is the present introduction, Sect. 2 presents the theoretical frame, and Sect. 3 describes the problem to be solved. Experiments and results are detailed in Sect. 4 and, finally, Sect. 5 highlights the conclusions derived from the work.

2 Theoretical Frame

This work rests on two main elements: lexical availability, as a measure for detecting weaknesses in the group of students under study, and Bayesian networks [2], as the tool for obtaining an answer to the problem.

2.1 Available Lexicon and Lexical Availability

The lexicon is usually defined as the way of expression in a language bounded to a specific social group. However, a more specific meaning is used in this work: the lexicon is the vocabulary associated with a field of knowledge (interest center). It is generally accepted that the larger the lexicon in a particular domain, the higher the understanding of that domain.

A fundamental concept is the available lexicon, which refers to the set of words that speakers have learned during their interaction with other speakers and may use in the context of a specific topic with which they are dealing. It differs from the basic lexicon in that the latter is composed of the more usual words in a language and does not depend on a particular topic.

A lexical availability study identifies the lexicon used by a group of people about a specific topic, helping to understand the way in which concepts and issues are related.

Studies on lexical availability include the work of Echeverría [1], which presents a quantitative and a qualitative approach to lexical availability in mathematics for a specific group of students.

2.2 Bayesian Networks

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph [3], [4]. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

In this work, the Bayesian model is used to compute the probability of the presence of a particular word of the lexicon of a specific population, given different variables: the words previously said, the type of school of the student, and the sex of the individual.

For testing the model, a test was applied to 1500 students in three different types of school in the city of Concepción, Chile. During the tests, students were asked to write, in two minutes, all the words they could remember associated with a particular interest center. The data was then computationally processed to discard non-relevant terms (nodes in a graph).

3 The Problem

Lexmath contains different lexical availability tests, each corresponding to the individual test of a student for a specific interest center. By using Lexmath, it is possible to infer the probabilistic order in which different words should appear in the student's lexicon. In doing so, it is necessary to take into account the frequency with which words appear and their semantic relationship with the other words.

For modeling the set of lexical availability tests, it is necessary to manage some indexes that correctly relate the group of students under study to the different Bayesian networks to be generated. The first indexes are NDW (number of different words) and RF (relative frequency), which describes the frequency index of a word. RF is computed by dividing the number of times a word appears by the number of students, as indicated in Eq. 1, where \(n_{i}\) is the number of times the same word appears in the sample and N is the size of the sample (number of students considered).

$$\begin{aligned} RF = \frac{n_{i}}{N} \end{aligned}$$
(1)
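
As an illustration of how these indexes can be computed from raw test data, the following Python sketch derives NDW and the per-word RF of Eq. 1 from a list of tests. The data layout and function names are our own assumptions, not Lexmath's; each word is counted at most once per student.

```python
from collections import Counter

def lexical_indexes(tests):
    """Compute NDW and the per-word relative frequency RF (Eq. 1).

    `tests` is a list of tests; each test is the ordered list of words
    one student wrote (assumed layout, for illustration only).
    """
    N = len(tests)                              # sample size: number of students
    counts = Counter(w for test in tests for w in set(test))
    ndw = len(counts)                           # NDW: number of different words
    rf = {w: n / N for w, n in counts.items()}  # RF = n_i / N
    return ndw, rf

# Example with four toy geometry tests:
tests = [["square", "circle", "triangle"],
         ["circle", "square"],
         ["angle", "square", "line"],
         ["circle", "line"]]
ndw, rf = lexical_indexes(tests)
print(ndw)           # 5 different words
print(rf["square"])  # 3 / 4 = 0.75
```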

Another index is NUW (number of unknown words), i.e., the number of words the student did not remember during the test; in other words, the number of different words in the group of students minus the number of different words for the student.

Another important index is TCP, which denotes the size of the tables of conditional probability. We use the term father for every word that precedes another word, taking into account the complete set of tests in a specific class; a word can have many fathers or none at all. These tables are required to generate the a priori probabilities, which depend on the parents of each word. The TCP index makes it possible to know in advance the number of conditional probability entries to assign to every word, and is computed as shown in Eq. 2, where n is the number of fathers of \(node_{i}\).

$$\begin{aligned} TCP (node_{i}) = 2^{n+1} \end{aligned}$$
(2)
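
For a node that takes the two values present / not present, Eq. 2 gives the table size once its fathers are counted. A minimal sketch (the helper is ours, for illustration):

```python
def tcp(num_parents):
    """Entries in the conditional probability table of a binary
    (present / not present) node with `num_parents` fathers,
    following Eq. 2: TCP(node_i) = 2^(n + 1)."""
    return 2 ** (num_parents + 1)

print(tcp(0))  # 2: a priori table of a word with no father
print(tcp(2))  # 8: table of a word preceded by two different words
```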

3.1 Global Graph Generation

The collected data is represented as a graph that takes into account every word a student in a class has remembered for a specific interest center. A graph is a set of nodes connected through edges that represent relationships among the nodes. Every node represents a word, and every edge represents a sequence of two words, the first of them mentioned in position i and the other mentioned in position \(i+1\).

The global graph is the graph generated by considering every word mentioned in every lexical availability test of a specific class for a particular interest center. The global graph is the basis for building a Bayesian network for every word a student does not know. Figure 1 shows a global graph generated from a set of four tests.

Fig. 1. Example of global graph
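
A possible construction of the global graph is sketched below in Python, following the description above: consecutive words in a test produce a directed edge, and edge weights count how often each sequence occurs. The representation is an assumption for illustration, not the actual Lexmath data model.

```python
from collections import defaultdict

def build_global_graph(tests):
    """Build the global graph of one class and one interest center.

    Nodes are words; a directed edge (u, v) records that some student
    mentioned v in position i+1 right after u in position i, weighted
    by how many times the sequence occurs across all tests.
    """
    nodes, edges = set(), defaultdict(int)
    for test in tests:
        nodes.update(test)
        for u, v in zip(test, test[1:]):   # positions i and i+1
            edges[(u, v)] += 1
    return nodes, dict(edges)

# The four toy tests from the earlier sketch:
tests = [["square", "circle", "triangle"],
         ["circle", "square"],
         ["angle", "square", "line"],
         ["circle", "line"]]
nodes, edges = build_global_graph(tests)
print(edges[("circle", "square")])  # 1: "square" followed "circle" once
```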

3.2 Bayesian Network Generation

The Bayesian network for every word a student does not know is based on the global graph. Each such word is a node in the graph, and its parents are iteratively considered, producing a subgraph from which cycles are removed (according to their relevance and order). Figure 2 shows an example of a Bayesian network for the word square.

Fig. 2. Bayesian network for the word Square
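
The extraction of such a network from the global graph can be sketched as follows. The paper removes cycle edges according to their relevance and order; in this illustrative version relevance is approximated by edge frequency, so the most frequent edges are re-inserted first and any edge that would close a cycle is skipped.

```python
def bayesian_subgraph(edges, target):
    """Ancestor subgraph of `target` with cycles broken.

    `edges` maps (u, v) -> sequence frequency, as in the global-graph
    sketch; relevance is approximated here by that frequency.
    """
    # 1. Collect the target and all of its iterated fathers (ancestors).
    parents = {}
    for (u, v) in edges:
        parents.setdefault(v, []).append(u)
    keep, stack = {target}, [target]
    while stack:
        node = stack.pop()
        for p in parents.get(node, []):
            if p not in keep:
                keep.add(p)
                stack.append(p)

    # 2. Re-insert edges by decreasing frequency, skipping any edge
    #    whose addition would close a cycle (v must not reach u already).
    def reaches(adj, src, dst):
        seen, frontier = set(), [src]
        while frontier:
            n = frontier.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                frontier.extend(adj.get(n, ()))
        return False

    dag, adj = {}, {}
    candidates = sorted(((w, u, v) for (u, v), w in edges.items()
                         if u in keep and v in keep), reverse=True)
    for w, u, v in candidates:
        if not reaches(adj, v, u):          # keeps the graph acyclic
            dag[(u, v)] = w
            adj.setdefault(u, []).append(v)
    return dag
```

The returned acyclic graph plays the role of the structure of the a priori Bayesian network for the target word.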

Every node holds two values: present, which indicates the probability that the corresponding word is mentioned, and not present, which indicates the probability that it is not. Based on these two values and the parents of every node, the probability tables are filled in as a previous step towards the a priori Bayesian network.

To compute the a priori probability of a node without parents, it is necessary to know the relative frequency of the node in the set containing all the words mentioned in the class. To compute the conditional probability of a node with a parent, however, it is additionally necessary to know how many times the parent appears. For example, if A is the father of B, the probability that B occurs is the number of (A, B) relationships divided by the number of times that B is mentioned in the whole sample (Table 1).

Table 1. Conditional probability table for node B with parent A
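
Under the paper's definitions, the two kinds of probabilities can be sketched as below. `tests` and `edges` follow the earlier sketches, and the function names are ours.

```python
def a_priori_probability(word, tests):
    """A priori P(word present) of a fatherless node: its relative
    frequency in the class sample (Eq. 1)."""
    return sum(1 for t in tests if word in t) / len(tests)

def conditional_probability(word, parent, tests, edges):
    """P(word present | parent present), following the paper's text:
    the number of (parent, word) relationships divided by the number
    of times `word` is mentioned in the whole sample."""
    mentions = sum(1 for t in tests if word in t)
    pairs = edges.get((parent, word), 0)
    return pairs / mentions if mentions else 0.0
```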

To build the a priori Bayesian network, it is necessary to add evidence. The evidence is the set of words the student mentioned; hence every test is a set of evidence, and every student will probably have different pieces of evidence. Algorithm 1 shows the general structure of the procedure.

Algorithm 1. General structure of the procedure

GlobalGraph. For every class, a global graph is created, taking into account all the words mentioned in the class.

UnknownWords. For every student, a list is created containing all the words the student did not mention. It corresponds to the difference between the words in the global graph and the words mentioned by the student.

SubgraphCreation. A subgraph is built, based on the global graph, for each word the student did not mention.

Transformation. Every subgraph is converted into a network by removing the edges that belong to cycles, taking into account the frequency and order of each node.

AddProbabilities. A priori probabilities are assigned to every network, obtaining an a priori Bayesian network.

AddEvidence. Evidence is added to every a priori Bayesian network. Each piece of evidence corresponds to a word the student mentioned during the test.

NodeAnalysis. The specific node is analyzed: if the network was created starting with node B, then node B is analyzed and the probability that B occurs is stored. In the end, the nodes are ordered by decreasing probability of appearance, and the result is reported, grouped by the ID of the student.
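
Putting the steps together, a compact end-to-end sketch of Algorithm 1 could look as follows. Function and variable names are ours, and the actual Bayesian inference step (building one network per missing word and propagating evidence through it) is replaced here by a crude stand-in, the strongest conditional link from a mentioned word, simply to show the data flow.

```python
from collections import defaultdict

def predict_missing_words(tests, student_words):
    # GlobalGraph: sequence counts over the whole class.
    edges, vocabulary = defaultdict(int), set()
    for test in tests:
        vocabulary.update(test)
        for u, v in zip(test, test[1:]):
            edges[(u, v)] += 1

    # UnknownWords: words of the class minus the student's own words.
    missing = vocabulary - set(student_words)

    scores, N = {}, len(tests)
    for word in missing:
        mentions = sum(1 for t in tests if word in t)
        prob = mentions / N                    # a priori: RF (Eq. 1)
        for parent in student_words:           # AddEvidence stand-in
            pairs = edges.get((parent, word), 0)
            if mentions:
                prob = max(prob, pairs / mentions)
        scores[word] = prob

    # NodeAnalysis: report in decreasing probability of appearance.
    return sorted(scores.items(), key=lambda kv: -kv[1])

tests = [["square", "circle", "triangle"],
         ["circle", "square"],
         ["angle", "square", "line"],
         ["circle", "line"]]
print(predict_missing_words(tests, ["circle", "line"]))
```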

4 Experiments and Results

Different experiments were conducted to test the prediction capability of the system, that is, its ability to predict which terms (concepts) are not familiar to the students. A set of 60 tests was carried out, covering fifteen different schools and considering the last four years of high school. The selected interest center is Geometry, and no gender classification was applied.

The software uses a three-layer structure because it simplifies the understanding and organization of complex systems. The term layer refers to the strategy of segmenting a solution from a logical point of view. Each of the three layers into which this architectural pattern is divided exposes a set of clearly defined interfaces (see Fig. 3).

Fig. 3. Three-layer architecture
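
A minimal sketch of the pattern, with stub layers and interfaces of our own invention, purely to illustrate the separation of concerns:

```python
class DataLayer:
    """Data access: reads lexical availability tests
    (a stub standing in for the Lexmath database)."""
    def load_tests(self, interest_center):
        return [["square", "circle"], ["circle", "angle"]]

class LogicLayer:
    """Business logic: graph construction and Bayesian inference."""
    def __init__(self, data):
        self.data = data
    def missing_words(self, interest_center, student_words):
        tests = self.data.load_tests(interest_center)
        vocab = {w for t in tests for w in t}
        return sorted(vocab - set(student_words))

class PresentationLayer:
    """User-facing layer: formats results for the teacher."""
    def __init__(self, logic):
        self.logic = logic
    def report(self, interest_center, student_words):
        for w in self.logic.missing_words(interest_center, student_words):
            print(w)

PresentationLayer(LogicLayer(DataLayer())).report("geometry", ["square"])
```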

Execution time depends, as expected, on the number of words and students involved; this is natural considering that every word mentioned implies the creation and processing of a Bayesian network. When execution time grows too large, it is possible to pre-process the data set by removing the words that appear in less than 1% of the cases.
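
That pre-processing step might be sketched as follows (the function and its threshold parameter are ours, assuming "cases" means students):

```python
def prune_rare_words(tests, threshold=0.01):
    """Drop words mentioned by fewer than `threshold` (1%) of the
    students before building the Bayesian networks."""
    N = len(tests)
    vocab = {w for t in tests for w in t}
    keep = {w for w in vocab
            if sum(1 for t in tests if w in t) / N >= threshold}
    return [[w for w in t if w in keep] for t in tests]
```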

The experiments are grouped into the following families:

  • Classes with a high number of students. This test family considered classes of 35 to 45 students. This group is further divided into two subgroups, one containing students who answered from one to six words and the other containing students who answered seven or more words.

  • Tests with a high number of words. This test family highlights the fact that students with a high number of answers generally matched a reduced set of well-known concepts; i.e., the words predicted by the software correspond to the same concepts to which the students' answers belong.

  • Tests with a reduced number of words. In this case, the results have a low probability of occurrence and, therefore, coherence is low (weak correspondence between words and a specific concept).

Table 2 summarizes the results obtained.

Table 2. Relationship between the number of students and the number of words

5 Conclusions

The software developed works as planned, allowing the teacher to obtain the set of words a student should know in a specific interest center. By using this software, it is possible to detect which concepts the student knows and which topics require reinforcement.

Although the software was conceived as a product to help students improve their performance, it can be used in other fields. For example, a company hiring personnel could use it to measure how much an applicant knows about the central topics of the position to be filled.