1 Introduction

Data mining is the computational process of finding patterns in large data sets. It is often confused with Knowledge Discovery in Databases (KDD), but it is in fact a crucial part of KDD. The ability to mine data so as to extract useful knowledge is currently one of the most significant challenges facing scientific communities and governments. We have learned much from processing data that represents a set of separate, independent entities and their attributes, but there is still interesting knowledge to be discovered in the relationships between such entities. This relational knowledge takes many forms, ranging from recurring patterns of transactions to complicated structural patterns of interlinked transactions. Extracting it requires that the data be represented in a way that captures the relational knowledge, supports effective and efficient data mining, and aids the comprehensibility of the resulting knowledge [1]. Data mining encompasses many techniques, such as classification and clustering [2, 3].

Graph-structured data is fairly common in many practical fields: chemical compounds, Internet information flow, social networks, and citation networks are all naturally represented as graphs. Graph data mining has already been applied to various domains such as link or citation analysis [4], chemical compound analysis [5], and Web searching [6]. This ubiquity of graphs provides the opportunity to extract novel information from them. One of the main reasons graphs are popular is that they are easy to visualize and comprehend, which allows them to convey large, messy data in a simple format. Graph mining is of immense importance in fields such as medical research and business analysis, and it is quickly broadening its scope to many others, including social networks, big data analysis, and even cloud computing [7, 8]. The graph representation also applies to transportation networks and, in the context of this paper, to airport networks, in which airports are the vertices and airplane routes are the edges. Mining airport data is useful for analyzing routes and city connections, which can improve the quality of service provided by airlines on those routes as well as increase safety precautions. It also gives airport authorities information on which routes are the most important, which helps in flight delay management.

2 Theoretical Work

2.1 Types of Data Mining

2.1.1 Linked Data Mining

Most traditional data mining tasks find patterns in data sets that contain a group of instances of a single relation. Mining richly structured, heterogeneous data sets is a key challenge for data mining. What such data sets have in common is that the data consists of a large variety of objects and object types that can be connected in some way. The link between objects may, for example, be a URL or an operation between the tables of a database. A URL is an explicit link, while an operation represents a constructed link.

Traditional inference procedures assume instances are independent and therefore cannot be applied directly to these data sets; doing so can lead to false conclusions and incorrect results. The correlations introduced by links must be handled carefully to avoid this. In fact, a link is itself information that can be used to improve the prediction accuracy of the learned models because, usually, the attributes of linked objects are correlated, and links commonly exist between objects that have some common factors [9].

2.1.2 Web Data Mining

Web mining is, simply, the application of data mining techniques to the World Wide Web (WWW) to discover patterns. It comprises Web content mining, Web structure mining, and Web usage mining. Web usage mining discovers usage patterns for Web-based applications. It collects information from users such as identity, origin, browsing behavior, and any other relevant details. It is mostly used on e-commerce Websites to suggest products that a user has searched for in the past and may buy in the future, which allows companies to target customers and increase profits [10]. Web content mining mines the content of Web pages to extract useful data and information; it can be further divided into the Information Retrieval view and the Database view. Web structure mining uses graph theory to analyze the links and connections of a Website: the structure of the Website is the relevant data, and patterns in its links and connections are mined.

Web mining can be divided into four subtasks: resource finding; information selection and preprocessing; generalization; and analysis.

2.1.3 Graph Data Mining

The extraction of useful, novel information from graph representations of data is called graph data mining. Graphs are sets of nodes and edges, where the edges can be directed or undirected. While data can take many forms of varying complexity, graph data is used to represent the relationships crucial to the domain, and the patterns discovered by mining graph data are often themselves graphs. Graph data mining is used to mine structured data and find the frequently appearing substructures present in it. Its most common uses are in cheminformatics, bioinformatics, and social networking, but it has also been applied to citation analysis and to fields such as privacy preservation [11] and cloud computing [7].

2.2 Graph Mining Approaches

2.2.1 Inductive Logic Programming (ILP)

Inductive Logic Programming is used to create predicate descriptions, or hypotheses, from background knowledge and examples. It is a subfield of machine learning and uses logic programming to obtain results. There have been several applications of ILP in data mining, but it has mainly been used to mine databases of chemical compounds, where it finds frequent substructures. For example, ILP underlies the data mining algorithm WARMR. WARMR was built to mine structural chemical data and was used to discover the frequently appearing substructures in a database of chemical compounds. These frequent substructures were used to create prediction rules relating compound descriptions to carcinogenesis. The rules were fairly accurate and provided insight into the relationships present in the database. WARMR is thus a useful data mining tool for analyzing chemical databases, as it can provide accurate probabilistic prediction rules along with knowledge about the relationships in the database [12].

2.2.2 Incomplete Beam Search

The beam search algorithm expands the best, or most promising, node of the graph first. It is a type of best-first search, which orders partial solutions according to some criterion and attempts to predict how close a partial solution is to a complete one. Subdue is a greedy relational learning system based on incomplete beam search; it discovers substructures that are both frequent and compress the data set. It starts with a single vertex of the graph and then expands the best substructure present in the graph by one edge at a time. It limits the number of best substructures kept at each step and evaluates them on the basis of their ability to compress the input graph, measured by the minimum description length (MDL). It terminates when unique substructures are no longer discoverable. The search is called incomplete because it limits the number of best or most promising substructures retained [13].
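A minimal, generic sketch of this beam-limiting idea is shown below, written in R to match the tools used later in this paper. It is not Subdue itself: the `initial`, `expand`, and `score` arguments are placeholders for the seed substructures, the one-edge expansion step, and the MDL compression score, and the toy usage grows bit strings with a count-of-ones score standing in for compression.

```r
# Generic incomplete beam search skeleton (a sketch, not Subdue's code):
# expand every candidate, rank by score, keep only `beam_width` of them.
beam_search <- function(initial, expand, score, beam_width = 4, max_iter = 10) {
  beam <- initial   # current list of candidate substructures
  best <- NULL
  for (i in seq_len(max_iter)) {
    candidates <- unlist(lapply(beam, expand), recursive = FALSE)
    if (length(candidates) == 0) break          # no unique substructures left
    candidates <- candidates[order(sapply(candidates, score), decreasing = TRUE)]
    beam <- head(candidates, beam_width)        # the "incomplete" pruning step
    if (is.null(best) || score(beam[[1]]) > score(best)) best <- beam[[1]]
  }
  best
}

# Toy usage: grow bit strings; the score (number of 1s) stands in for MDL.
beam_search(initial = list(""),
            expand  = function(s) list(paste0(s, "0"), paste0(s, "1")),
            score   = function(s) sum(strsplit(s, "")[[1]] == "1"),
            beam_width = 2, max_iter = 5)   # returns "11111"
```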

2.2.3 Graph Theory Based Approaches

  • Apriori Algorithm

The Apriori algorithm operates on transactional databases to discover frequently appearing items and item sets, and it extends these to larger item sets as long as the items appear sufficiently often [14]. The algorithm finds association rules that show the general trends present in the database [15]. In the context of graph data mining, item sets can be considered graphs and items can be considered the nodes of a graph. Two important parameters can be controlled in the Apriori algorithm: the support threshold and the confidence. The support of an item set is its number of occurrences, and the support threshold is the minimum number of occurrences an item set must have to be kept. The confidence is how often the left-hand side of a rule implies the right-hand side. To apply the Apriori algorithm to graphs, we first discover all the frequently appearing subgraphs with k edges. We then generate all candidates with k + 1 edges by joining pairs of frequent subgraphs with k edges; to be joinable, the pair must share a common subgraph of k − 1 edges (so that their union has exactly k + 1 edges).
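The level-wise idea is easiest to see on plain item sets. The following toy R sketch (illustrative only, not the paper's code) counts single-item supports, keeps the frequent ones, and joins them into candidate pairs that are then filtered by the same support threshold.

```r
# Toy Apriori illustration on item sets: frequent 1-sets, then joined 2-sets.
baskets <- list(c("a","b","c"), c("a","b"), c("a","c"), c("b","c"), c("a","b","c"))
min_support <- 0.4   # minimum fraction of baskets containing the item set

# Support of an item set = fraction of baskets containing all of its items.
support <- function(items)
  mean(sapply(baskets, function(b) all(items %in% b)))

# Level 1: frequent single items.
items <- sort(unique(unlist(baskets)))
L1 <- items[sapply(items, support) >= min_support]

# Join step: candidate 2-sets from frequent 1-sets, filtered by support.
C2 <- combn(L1, 2, simplify = FALSE)
L2 <- Filter(function(s) support(s) >= min_support, C2)
L2   # {a,b}, {a,c}, {b,c} all survive in this toy data
```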

  • Pattern Growth

This algorithm is also known as the FP-growth algorithm, where FP stands for frequent pattern. It uses a depth-first approach, recursively growing frequent subgraphs to find frequent item sets. The algorithm uses an extended prefix tree, called the frequent pattern tree, to store crucial information in compressed form. It is both efficient and scalable and has been shown to be more effective than other algorithms at mining frequent patterns. The algorithm works by compressing the database into an FP-tree and then dividing the FP-tree into a set of conditional databases, one for each frequently appearing pattern. These divided databases are then mined separately, which avoids the cost of repeatedly searching for smaller patterns; the results are concatenated to form longer frequently appearing patterns [5].
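The compression step is the heart of the method: transactions that share a prefix of frequency-ordered items share a path in the tree. The R sketch below is a toy illustration of that step under the description above, not a full FP-growth implementation.

```r
# Toy FP-tree construction: order items by global frequency, then insert
# each transaction so that shared prefixes share a counted path.
transactions <- list(c("a","b","c"), c("a","b"), c("a","c"), c("b","c"))

freq <- sort(table(unlist(transactions)), decreasing = TRUE)
order_tx <- function(tx) tx[order(match(tx, names(freq)))]

insert_tx <- function(node, tx) {
  if (length(tx) == 0) return(node)
  first <- tx[[1]]
  if (is.null(node$children[[first]]))
    node$children[[first]] <- list(count = 0, children = list())
  node$children[[first]]$count <- node$children[[first]]$count + 1
  node$children[[first]] <- insert_tx(node$children[[first]], tx[-1])
  node
}

tree <- list(count = 0, children = list())
for (tx in transactions) tree <- insert_tx(tree, order_tx(tx))
str(tree, max.level = 5)   # shared prefixes appear once, with counts
```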

2.3 Previous Research

A large amount of research work has been done on graph mining; some of it is summarized in Table 1.

Table 1 Selected research work done in the field of graph mining

The bar charts in Figs. 1 and 2 are extracted from Table 1 and show which fields of graph data mining have been researched the most. Figure 1 shows that frequent pattern growth is considered the most efficient and effective method for mining graph data, followed closely by Apriori-based approaches, which build on the original Apriori algorithm. Figure 2 shows popular applications of graph mining. Cheminformatics and bioinformatics are two closely related fields that use graph mining on chemical compound databases to find frequent patterns. Social network analysis is a resurgent field in which graph mining is used to find patterns among the users of social networking Websites and the relationships between them. These are, among many others, the most popular applications of graph mining techniques.

Fig. 1 Graph showing the most popular graph mining techniques

Fig. 2 Graph showing the most popular applications of graph data mining

3 Methodology

The initial objective in a data analysis or mining project is to search for and collect relevant data. The data was collected from the U.S. Bureau of Transportation Statistics Website. One of the datasets is a collection of on-time flights during January 2016. A few of its attributes have missing values; these attributes, and the remaining missing values, were removed from the dataset. The data was then processed using Rattle to find frequent item sets. Rattle uses a modified form of the Apriori algorithm to find frequently appearing item sets and generate association rules for the dataset. The results are plotted on a graph in which the x-axis represents the items and the y-axis represents their relative frequencies. Because of the direct correlation between a city and its state (for example, Los Angeles and California), the most frequent item sets are states. However, relevant statistics can still be mined from the dataset by comparing the correct attributes (for example, ignoring states when processing cities and vice versa). We then constructed a weighted graph from the T-100 Market All Carriers dataset, weighted by the number of passengers on each route, and applied network analysis techniques to it to find the betweenness, degree, and closeness of airports. The R packages tnet and igraph were used to perform the network analysis.
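The preprocessing and graph-construction steps can be sketched in R as follows. The file names (ONTIME_JAN2016.csv, T100_MARKET_2016.csv) are placeholders, and the ORIGIN, DEST, and PASSENGERS column names are assumed from the attribute names used in Sects. 4 and 5; the actual BTS exports may differ.

```r
library(igraph)

# On-time flights: drop attributes with missing values, then incomplete rows.
flights <- read.csv("ONTIME_JAN2016.csv")          # hypothetical file name
flights <- flights[, colSums(is.na(flights)) == 0] # remove attributes with NAs
flights <- na.omit(flights)                        # remove remaining missing values

# T-100 Market data: aggregate passengers per route, then build a weighted,
# directed graph with airports as vertices and routes as edges.
market <- read.csv("T100_MARKET_2016.csv")         # hypothetical file name
market <- aggregate(PASSENGERS ~ ORIGIN + DEST, data = market, FUN = sum)
g <- graph_from_data_frame(market, directed = TRUE)
E(g)$weight <- E(g)$PASSENGERS                     # passenger counts as weights
```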

4 Experimental Setup

4.1 Data Set

The dataset used is taken from the Website of the United States Bureau of Transportation Statistics (BTS), which is part of the Research and Innovative Technology Administration (RITA) of the United States Department of Transportation. The dataset consists of around 440,000 instances of on-time flights during the month of January 2016. There are about 100 provided attributes, ranging from the date, origin city and state, destination city and state, market IDs, airport names, and airport IDs to delay times, delay causes, and diverted-airport information. The dataset contains 290 unique cities and 294 unique airports. We also use the BTS Master Coordinate database from January 2016 to May 2016 and the BTS Air Carrier Statistics T-100 Market All Carriers dataset for network analysis [16] (Fig. 3).

Fig. 3 RITA/BTS January 2016 on-time flights dataset

4.2 Attribute Selection

We only consider the attributes that represent the date, number of flights, distance, airport ID, and the origin and destination city and state of each flight. We also used the PASSENGERS attribute to create the weighted network.

No. of instances/rows—445,829

No. of attributes—17

4.3 Tools Used

R is a programming language used for data analysis and statistical computing. RStudio is an integrated development environment (IDE) for R that allows the user to create and load R projects easily. R contains many packages that aid data mining; while they can be used directly, user-developed GUIs are available for ease of use. Rattle is an open-source graphical user interface (GUI) written in R for data mining. It allows the user to easily load a dataset, perform data analysis and mining, and create models, as well as evaluate, associate, cluster, and transform the data in many ways. We use Rattle to find frequent item sets and the R packages igraph and tnet to perform network analysis.
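For reference, the tool chain can be set up as below; rattle() is the package's documented entry point, and association mining is run from the GUI's Associate tab.

```r
install.packages(c("rattle", "arules", "igraph", "tnet"))  # one-time setup
library(rattle)
rattle()   # opens the GUI; load the dataset, then mine rules via the Associate tab
```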

4.4 Measures

The dataset used contains all the on-time flights in January 2016. This means, however, that the support of each item set is quite low: individual flights have support below 0.07. The support threshold used is therefore 0.0300 and the confidence threshold is 0.4000.
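With those thresholds, the association step corresponds roughly to an arules call like the sketch below (Rattle wraps the arules implementation of Apriori). The attribute names are those used in Sect. 5; the file name and the `flights` preprocessing are the assumed ones from the Sect. 3 sketch.

```r
library(arules)

flights <- read.csv("ONTIME_JAN2016.csv")   # hypothetical file name, as in Sect. 3
cols <- c("ORIGIN_STATE_NAME", "DEST_STATE_NAME",
          "ORIGIN_CITY_NAME", "DEST_CITY_NAME")
# arules needs factor columns to coerce a data frame into transactions.
tx <- as(data.frame(lapply(flights[, cols], factor)), "transactions")

rules <- apriori(tx, parameter = list(support = 0.03, confidence = 0.40))
inspect(head(sort(rules, by = "support"), 10))  # strongest rules by support
```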

5 Result and Analysis

Case Study 1––Graph mining using the Apriori algorithm: Processing all the attributes concurrently, we get the following results from the frequent item plot. We can see that the most frequently found item set is California as both the origin and the destination state, followed closely by Texas and then Florida. To make the frequently appearing items clearer, we selected the corresponding attributes in the absence of all others. It is significant to note that graphs like the U.S. airport network are highly symmetric in nature.

Ignoring all the attributes except ORIGIN_STATE_NAME and DEST_STATE_NAME, we get the following graph. It confirms that the most commonly found state during January 2016 was California, and it also gives us insight into the other most commonly found states during that period. This can be attributed to a large number of interstate flights (Figs. 4, 5 and 6).
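Restricting the transactions to the state attributes alone corresponds to a plot like the following, continuing the arules sketch above (same assumed `flights` data frame):

```r
# Transactions over state attributes only, then a relative-frequency plot.
state_tx <- as(data.frame(lapply(
  flights[, c("ORIGIN_STATE_NAME", "DEST_STATE_NAME")], factor)), "transactions")
itemFrequencyPlot(state_tx, topN = 10)   # California dominates, per Fig. 4
```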

Fig. 4 Graph of the most frequently found states

Fig. 5 Graph of most frequently found cities

Fig. 6 Graph of most frequently found airports

To find the most frequently appearing cities, we ignore all attributes except ORIGIN_CITY_NAME and DEST_CITY_NAME. From the graph, we see that the most commonly found city is Atlanta, followed by Chicago and Denver. Notably, even though California is the most commonly found state, none of the three most commonly found cities are in California. Atlanta lies at the heart of the network and is therefore often found on cross-country routes, such as those between the western and eastern coasts.

Similarly, we can find the most frequently appearing airports by selecting only the ORIGIN and DEST attributes. From this graph, we can see that the most commonly found airport is Hartsfield–Jackson Atlanta International Airport in Atlanta, Georgia, followed by O’Hare International Airport in Chicago, Illinois; Dallas/Fort Worth International Airport; and Los Angeles International Airport (Figs. 7, 8 and 9).

Fig. 7 U.S. airports with the highest closeness

Fig. 8 U.S. airports with the highest degree

Fig. 9 Binary and weighted analysis of betweenness in U.S. airport network

Case Study 2––Network analysis of the airport network: Using link analysis techniques, we also perform a network analysis of the airport network. With the R tnet package, we convert the dataset into a network and calculate several important measures.

We calculate the airports in the network with the highest closeness, betweenness, and degree. The degree identifies the airports with the most connections; Hartsfield–Jackson Atlanta International Airport has the most connections, which can be attributed to its central position in the U.S. airport network. The closeness score shows which airports are most easily accessible from other airports; LAX has the highest closeness score and thus is, on average, closest to the other airports. The betweenness score shows which airports most often lie on the shortest paths between other airports. Binary analysis shows that Ted Stevens Anchorage International Airport (ANC) has the highest betweenness, but this does not take into account the weight along each route. Weighted analysis shows that LAX has the highest betweenness, meaning that for most routes LAX acts as an intermediary airport; it is followed closely by ATL and SEA.
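These measures can be computed roughly as follows, again under the assumed file and column names from the Sect. 3 sketch; tnet expects a numeric edge list with sender, receiver, and weight columns.

```r
library(igraph)
library(tnet)

# Rebuild the weighted airport graph (assumed names, as in Sect. 3).
market <- read.csv("T100_MARKET_2016.csv")
market <- aggregate(PASSENGERS ~ ORIGIN + DEST, data = market, FUN = sum)
g <- graph_from_data_frame(market, directed = TRUE)
E(g)$weight <- E(g)$PASSENGERS

# Convert to tnet's weighted edge-list format and compute centralities.
el  <- cbind(as_edgelist(g, names = FALSE), E(g)$weight)
net <- as.tnet(el, type = "weighted one-mode tnet")
deg <- degree_w(net)                          # per-airport degree and strength
clo <- closeness_w(net)                       # weighted closeness
btw_weighted <- betweenness_w(net)            # weighted betweenness
btw_binary   <- betweenness(g, weights = NA)  # binary betweenness via igraph

V(g)$name[which.max(deg[, "degree"])]  # airport with the most connections
```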

6 Conclusions

The results from mining the dataset of on-time flights during January 2016 reveal which airports were the busiest during that period and which cities and states were visited most frequently. Hartsfield–Jackson Atlanta International Airport was the busiest airport, the most frequently visited city was Atlanta, Georgia, and the most frequently visited state was California. The ATL airport also has the highest degree, meaning it is the most connected airport, so it is important to ensure that it functions properly. The analysis also helps identify the routes whose disruption should most be avoided; routes with LAX as one of their endpoints are examples. Because flights are seasonal and change from month to month, it is important to determine which airports, cities, and states are frequently visited and when, in order to improve quality of service and safety on the corresponding routes. It is also crucial to determine which airport's loss would cause the most disconnections in the airport network, in case of emergency or attack.