Data Mining for Educational Management
Based on computer information systems, data mining (DM) is a technique designed to scan huge data repositories, generate information, and discover knowledge (Vlahos et al. 2004). By applying different tools, DM seeks hidden relationships in raw data in order to discover data patterns. Therefore, DM can play an important role in unveiling a broad set of findings and, consequently, offers valuable support in decision-making. The incorporation of DM into the educational arena has given rise to a new research field called educational data mining (EDM) (Anjewierden et al. 2011). In this case, the aim is to design models, tasks, methods, and algorithms for exploring data from educational settings (Peña-Ayala 2014). Altogether, they can help to improve management activities in educational institutions, thus empowering the performance of educational managers.
Considerable evidence supports the statement that knowledge is among the main assets of organizations (Nonaka et al. 2008). Part of the organizational knowledge resides in the minds of employees, as intrinsic knowledge, and another part is stored as data in companies’ repositories. Both constitute hidden knowledge, and, as with any other resource, organizations cannot afford to misuse it. Just as the knowledge management approach has proven to be an effective means of gathering the intrinsic knowledge of the organization’s personnel and converting it into explicit knowledge, it is also important to explore the hidden knowledge in an organization’s data and transform it into explicit knowledge. All these efforts can greatly contribute toward improving the decision-making process. DM arises as a powerful approach to aid in the accomplishment of this goal. Consequently, managers would benefit from being knowledgeable about this technique.
The starting point of a DM approach is the availability of data repositories. The development and ubiquitous use of information technology have led all kinds of organizations, almost inadvertently, to have at their disposal datasets resulting from their core activities. Information about customers, personnel, transactions, etc. is stored in electronic records. In some cases, repositories comprise heterogeneous datasets; sometimes they are so large that it is difficult to extract useful and comprehensible knowledge from them. Thus, the raw material, the data, can be likened to a “mine” with potential treasures within. The challenge is to develop and apply the appropriate tools needed “to extract” the hidden wealth as useful and comprehensible knowledge. In turn, this knowledge constitutes the basis for improved decision-making. The main difference from statistical analysis lies in the fact that DM involves methods that search for new and generalizable relationships and findings, rather than attempting to test prior hypotheses. This difference in approach is the reason why data mining is also referred to as “knowledge discovery” in databases (Collins et al. 2004).
The DM technique originated within this context, with the aim of discovering hidden and nontrivial relationships in information of various types extracted from large amounts of data (Campagni et al. 2015). Many areas have benefited from adapting the DM technique to solve their problems, among them finance, healthcare systems, marketing, stock markets, telecommunication, manufacturing, and customer relations. In fact, it can be recognized as a contemporary tool for building knowledge management systems (Jashapara 2011). DM is a broad concept grounded in a set of disciplines, such as statistics, artificial intelligence, and computer science. Different subfields contribute to the development of DM, among them probability, machine learning, natural language, neural networks, and database management systems.
The literature offers a broad set of experiences demonstrating that statistical methods are well-established tools for analyzing data and extracting useful information. However, DM emerges as a fresh approach to uncovering hidden patterns and making predictions from data. One reason why DM has become so popular among researchers is the number of standalone or desktop data mining tools available on the market. Examples include Microsoft Excel, SPSS, Weka, Protégé (a knowledge acquisition system), and RapidMiner. Some of them (e.g., the MS Excel mining tool) are normally available to instructors and educational managers, who can benefit greatly from their existing knowledge of Excel.
The development of the DM technique usually follows one of two main approaches. The descriptive approach focuses on producing patterns that explain or generalize the intrinsic structure, relations, and interconnections of available data (Peng et al. 2008). The predictive approach centers on estimating unknown or future values of dependent variables based on the values of related independent variables (Hand et al. 2001). Either approach can be pursued with a number of methods and techniques, including Markov models, Bayes’ theorem, decision trees, linear regression, frequencies, and hierarchical clustering.
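The contrast between the two approaches can be sketched in a few lines. The example below is only an illustration under stated assumptions: the student records, field meanings, and function names are hypothetical and do not come from any of the cited studies. The descriptive step tabulates frequencies, while the predictive step fits a simple linear regression by ordinary least squares.

```python
from collections import Counter

# Hypothetical records (invented for illustration):
# (attendance rate in %, final mark, grade band).
records = [
    (60, 55, "pass"),
    (70, 60, "pass"),
    (80, 65, "pass"),
    (90, 70, "merit"),
    (100, 75, "merit"),
]

# Descriptive approach: summarize the structure of the available data
# by counting how often each grade band occurs.
band_frequencies = Counter(band for _, _, band in records)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Predictive approach: estimate a dependent variable (final mark)
# from a related independent variable (attendance).
attendance = [r[0] for r in records]
marks = [r[1] for r in records]
slope, intercept = fit_line(attendance, marks)


def predict_mark(attendance_rate):
    """Estimate the final mark for an unseen attendance rate."""
    return intercept + slope * attendance_rate
```

On this toy data the fitted line has a slope of 0.5 and an intercept of 25, so `predict_mark(85)` gives 67.5. In practice, the tools named in this entry provide these methods ready-made.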
Educational Data Mining
The International Educational Data Mining Society (2018) defines educational data mining (EDM) as an “emerging discipline, concerned with developing methods for exploring the unique and increasingly large-scale data that comes from educational settings and using those methods to better understand students and the settings which they learn in.” Thus, EDM is a relatively recent research area that explores and analyzes the information stored in student and institutional databases in order to understand and improve the performance of both the student learning process and educational institutions. Data is analyzed by using statistical methods and algorithms with the aim of resolving problems of an educational nature and improving the entire educational process. EDM is a growing research area that involves researchers from all over the world, working in both different and related areas (Campagni et al. 2015).
According to Peña-Ayala (2014), close to 98% of the published works about EDM have appeared since 2000. EDM has shifted from isolated papers published in conferences and journals to dedicated workshops, an annual international conference on educational data mining (http://www.educationaldatamining.org), a specialized journal on EDM (Journal of Educational Data Mining), as well as books and handbooks. Although focused on a specific scope, the educational setting, EDM shares its basic principles with other disciplines. Therefore, most of the processes that format and refine data, as well as the tools adopted to handle the datasets, are the same as those found in other instances where the DM technique is applied.
Process and Tools in EDM
The data used during the EDM process should respond to the objective of the research; thus, the correct choice of data source is of great importance, especially in environments with several databases, which can misdirect the focus of the researcher. In educational settings, data is stored in repositories in forms and formats that may not allow it to be analyzed directly. Often, educational researchers and practitioners work with data recorded in forms that are not immediately amenable to analysis, as can be the case with data retrieved from log files or learning management systems (LMS). Such educational data may be messy; sometimes incomplete; fragmented into several parts that must be merged; and occasionally in unfamiliar, inconvenient, or highly unusual formats. Before an analysis is performed, the data should be put into a meaningful format. In addition, the data needs to be cleaned, and cases and values that are not simply outliers but clearly incorrect (e.g., cases where birth dates have impossible values) need to be removed. Examples of tools well suited to the manipulation, cleaning, and formatting of data are Microsoft Excel, Google Sheets, and the EDM Workbench.
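The merging and cleaning steps described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the two data fragments, their field names, and the helper functions are hypothetical, and the choice to drop records whose birth date is missing as well as those whose date is impossible is an assumption made for the sketch.

```python
from datetime import date

# Hypothetical fragments (invented for illustration), e.g. one export
# from an LMS log and one from a student registry.
lms_rows = [
    {"student_id": "s1", "logins": 42},
    {"student_id": "s2", "logins": 7},
    {"student_id": "s3", "logins": 19},
]
registry_rows = [
    {"student_id": "s1", "birth_date": "2004-03-15"},
    {"student_id": "s2", "birth_date": "2004-02-30"},  # no such calendar day
    {"student_id": "s3", "birth_date": "2199-01-01"},  # lies in the future
]


def parse_birth_date(text, today):
    """Return a date, or None when the value is clearly incorrect."""
    try:
        parsed = date.fromisoformat(text)
    except ValueError:
        return None  # not a real calendar date
    return parsed if parsed <= today else None


def merge_and_clean(lms, registry, today):
    """Merge the two fragments on student_id and drop invalid cases."""
    births = {row["student_id"]: row["birth_date"] for row in registry}
    cleaned = []
    for row in lms:
        raw = births.get(row["student_id"])
        birth = parse_birth_date(raw, today) if raw is not None else None
        if birth is None:
            # Assumption of this sketch: missing and impossible birth
            # dates are both removed rather than repaired.
            continue
        cleaned.append({**row, "birth_date": birth})
    return cleaned


cleaned_rows = merge_and_clean(lms_rows, registry_rows, today=date(2018, 5, 1))
```

Only the record for `s1` survives here. The other two are dropped because their birth dates are impossible values rather than outliers, which is the distinction drawn in the text.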
The next step, once the data has been adapted to a workable format, is the analysis and modelling of the dataset and the validation of the resulting models. A full range of DM algorithms is available, and the algorithm adopted depends on the objectives pursued. The knowledge of the data miner is crucial for choosing the correct approach. Furthermore, a high degree of knowledge of the application domain is required to interpret results and evaluate whether further exploration is needed. Additionally, data expertise is required to explain strange patterns that may be due to data pollution or other causes, such as data conversions. Therefore, the analysis step may be subject to a number of uncertainties, and, consequently, the data miner is advised to follow a holistic approach. Tools appropriate for this task include RapidMiner, Waikato Environment for Knowledge Analysis (WEKA), KEEL, KoNstanz Information MinEr (KNIME), Orange, and SPSS.
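To make the idea of validating a model concrete, the sketch below applies leave-one-out cross-validation to a very simple threshold classifier. Both are illustrative choices rather than methods prescribed by the sources cited here, and the data and function names are hypothetical.

```python
# Hypothetical data (invented for illustration):
# (attendance rate in %, whether the student passed the course).
samples = [
    (35, False), (45, False), (50, False), (55, False),
    (65, True), (70, True), (80, True), (90, True),
]


def fit_threshold(train):
    """Pick the attendance cut-off that misclassifies the fewest cases.

    The model predicts "pass" when attendance >= cut-off. Ties are
    resolved in favor of the lowest candidate cut-off.
    """
    candidates = sorted({x for x, _ in train})

    def errors(cut):
        return sum((x >= cut) != label for x, label in train)

    return min(candidates, key=errors)


def leave_one_out_accuracy(data):
    """Train on all cases but one, test on the held-out case, repeat."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        cut = fit_threshold(train)
        hits += (x >= cut) == label
    return hits / len(data)


accuracy = leave_one_out_accuracy(samples)
```

On this toy data the model fitted to all cases uses a cut-off of 65, and the leave-one-out accuracy is 0.875. The single error occurs when the case at 65% attendance is held out, because the cut-off learned from the remaining cases moves up to 70. The tools listed above offer comparable validation procedures without hand-written code.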
A final step, once the analysis has been conducted and the results validated, is to disseminate the output in legible and informative visualizations. Good visualization schemes are sometimes crucial for deriving meaning from data. Therefore, the adoption of appropriate tools and methods for visual analytics can effectively support academics and practitioners in gaining knowledge and insight from data, as well as in communicating its implications. Although many of the aforementioned tools come equipped with graphical data displays, researchers and practitioners can also make use of tools designed specifically to create polished and informative graphs, charts, models, networks, diagrams, and other forms of visualized information. Examples of this kind of tool are Tableau, D3.js, and InfoVis. A final consideration for researchers and practitioners of EDM, as pointed out by Slater et al. (2017), is that no one tool is ideally suited to conducting the entire process of analyzing datasets from start to finish. As different tools are uniquely suited to different tasks, they must be carefully chosen in order to take advantage of their potential.
Limitations of EDM
Most data mining techniques work best with very large samples. In this regard, Andonie (2010) pointed out that some data mining tools, such as neural networks, may not be able to accomplish the goal of understanding hidden patterns, since small datasets cannot provide enough data to fill the gaps. Several authors have concluded that small datasets limit the scope of the DM technique (Yuan and Fine 1998). However, in the daily life of academic institutions, small datasets are often all that is available. The data collected on the courses that students take is a good example; even if a relatively large group of students attends the course, the relevant data is usually considered a small dataset. This constraint raises an important barrier to the adoption of the DM technique in the educational arena. A likely scenario is therefore one in which contemporary DM tools are widely available but cannot be used reliably because the available data is limited and clearly falls into the category of small datasets. In contrast, some researchers maintain that DM is not generally limited to large datasets and that the specific use of DM tools on structured small datasets can also offer reliable results (Nooraei et al. 2011; Natek and Zwilling 2014). Another limiting factor concerns privacy; in data mining projects where personal data is used, it is important for the educational management team to be aware of the applicable legislation, all the more so because the data may belong to underage students.
Classification of EDM Functionalities
Analysis of student modelling. This group includes the highest number of works in EDM. Within this group, all kinds of student traits, actions, and achievements are considered as part of this functionality. The issues addressed by works in this category include instruction and learning styles, resource usage, analysis and prediction of academic achievements, student success factors, students’ mental states, domain knowledge, learning trajectories, knowledge tracing, and skills.
Analysis of student behavior modelling. This group includes research focused on the analysis, description, and evaluation of student behavior. The issues studied and characterized include students’ contributions, persistence in online activity, careless attitudes, user-system interaction, self-adaptation, collaborative activities, solving styles, ability, outcomes, understanding, behavior, task completion, and final marks.
Analysis of student performance modelling. This functionality covers research and practical works oriented toward dealing with failure, success, students’ response times, time needed to solve a problem, preparation for future learning, knowledge mastered, learning progression, response patterns, and learning achievements. The goal is to help educational staff supervise and assess students in a timely manner, with the aim of anticipating adjustments. Educational management teams could find it of interest to promote this facet of DM among teaching staff to improve student performance.
Analysis of assessment. EDM applications centered on assessment functionality have as their main goal the evaluation and control of the efficacy, efficiency, and quality of evaluation systems, as well as the inclusion of instances aimed at assessing the degree of satisfaction of users of all kinds. The issues covered in this group include, among other things, the inquiry process, learned skills, discovered relationships among responses, difficulty levels of problems, student accuracy, learning activities, misunderstandings, and the merits and pitfalls of standardized tests.
Analysis of student support and feedback. This functionality aims to develop enhanced computer educational systems by personalizing and customizing them to meet students’ demands. Some of the instances considered under this grouping include dialogue analysis, generation of hints, decision-making, customized feedback, reinforcement, recommendations, opinion about teaching behaviors, advice content, student annotations, dealing with emotions, and stimulation of competences.
Analysis of curriculum, domain knowledge, sequencing, and teacher support. This functionality represents heterogeneous tasks and components. This group includes those instances that do not fit into the previous categories, although they cover important facets of the educational process. The topics addressed include content authoring, knowledge description, teachers’ collaboration in tailoring curricula, personalized searching of educational content, user-tool interaction, curriculum analysis, scheduling of learning activities, design of hierarchical content structures, and teacher mentoring behaviors. Peña-Ayala (2014) expects that this sort of functionality will attract increasing academic interest as it evolves.
Implications for Educational Management
An educational institution maintains and stores various types of student data, ranging from academic data to personal records, including parents’ incomes, qualifications, etc. This repository can potentially allow managers and teachers to extract useful information in order to make management-level decisions. Although the scope and implications of such decisions differ depending on the decision-maker (e.g., educational manager or teacher), both actors can greatly influence the quality of the teaching-learning process. Decisions at the educational management level can lead to actions such as the construction of new facilities to respond to social and educational demands, hiring new staff with specific expertise or skills, developing programs for the deeper involvement of parents, etc. On the other hand, decisions at the teacher/instructor level are more limited in scope but can equally contribute to improving the quality of the teaching-learning process in the classroom (e.g., rapid response to signs of abandonment, student success predictions, etc.). Therefore, EDM should not be conceived of as a technique aimed at a specific actor within the educational arena; it is of interest to all those who can, at their own level of influence, contribute to a better performance of the educational process.
The functionalities of EDM described in the previous section correspond to actions that can be promoted either at the teacher/instructor level or at the managerial level. The management of schools and colleges and, of course, the management of government bodies have a wider capacity to influence and convince researchers and practitioners to learn about and adopt the EDM technique. The literature shows that the adoption of EDM has contributed to improving the educational process by, for example, (i) predicting students’ performances using a dataset consisting of students’ gender, parental education, their financial background, etc.; (ii) predicting student learning outcomes based on attributes such as attendance and performance in class tests and assignments; (iii) student modelling using the educational history of students; and (iv) predicting the academic dismissal of students and the GPA of graduated students in e-learning, using regression analysis and classification (Dutt et al. 2017). This evidence shows that EDM can, in a broad sense, serve as an important tool for improving the work of educational managers.
EDM has emerged as a paradigm oriented toward designing models, tasks, methods, and algorithms for exploring data from educational settings. The opportunities provided by EDM for enhancing the scope, quality, efficiency, and achievements of educational systems are promising, given that education is a high priority for global society. In recent years, a wide array of tools has emerged for the purpose of conducting EDM. These tools have proven their utility with respect to common data pre-processing and analysis steps in a typical EDM project. Educational managers should consider EDM as an opportunity for improving the quality of decision-making, both at their level and at the teacher/instructor level, thanks to its potential for discovering hidden patterns of information; their involvement is therefore crucial in order to take advantage of this paradigm. However, the technique has some limitations that must be taken into consideration in order to avoid pitfalls that, in the long term, may end up causing mistrust and doubt regarding the usefulness of EDM.
- Anjewierden A, Gijlers H, Saab N, De-Hoog R (2011) Brick: mining pedagogically interesting sequential patterns. In: Proceedings of the 4th international conference on educational data mining, pp 341–342
- Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT Press, Cambridge, MA
- International Educational Data Mining Society (2018) Information retrieved on May 2018 [online]. http://www.educationaldatamining.org/
- Jashapara A (2011) Knowledge management, an integrated approach, 2nd edn. FT Prentice Hall, Harlow
- Nonaka I, Toyama R, Hirata T (2008) Managing flow, a process theory of knowledge-based firm. Palgrave Macmillan, New York
- Nooraei B, Pardos ZA, Heffernan NT, Baker RSJ (2011) Less is more: improving the speed and prediction power of knowledge tracing by using less data. In: Proceedings of the 4th international conference on educational data mining, pp 101–109