1 Introduction

Data mining is the computational process of finding patterns in large data sets. It is often confused with Knowledge Discovery in Databases (KDD), but it is in fact a crucial part of KDD. The ability to mine data so as to extract useful knowledge is currently one of the most significant challenges facing scientific communities and governments. We have learned much from processing data that represents a set of separate, independent entities and their attributes, but there is still interesting knowledge to be discovered in the relationships between such entities. This relational knowledge takes many forms, ranging from recurring patterns of transactions to complicated structural patterns of interlinked transactions. Extracting it requires that the data be represented in a way that captures the relational knowledge, supports effective and efficient data mining, and aids the comprehensibility of the resulting knowledge [1]. Data mining encompasses many techniques, such as classification and clustering [2, 3].

Graph-structured data is fairly common in many practical fields: chemical compounds, Internet information flow, social networks, and citation networks are all naturally represented as graphs. Graph data mining has already been applied to various domains such as link or citation analysis [4], chemical compound analysis [5], and Web searching [6]. This ubiquity of graphs provides the opportunity to extract novel information from them. One of the main reasons graphs are popular is that they are easy to visualize and comprehend, which allows them to convey large, messy data in a simple format. Graph mining is of immense importance in fields such as medical research and business analysis, and it is quickly broadening its scope to many others, including social networks, big data analysis, and even cloud computing [7, 8]. The graph representation also applies to transportation networks and, in the context of this paper, to airport networks, in which airports are the vertices and airplane routes are the edges. Mining airport data is useful for analyzing routes and city connections, which can improve the quality of service provided by airlines on those routes as well as increase safety precautions. It also gives airport authorities information on which routes are the most important, which helps in flight delay management.

2 Theoretical Work

2.1 Types of Data Mining

2.1.1 Linked Data Mining

Most traditional data mining tasks find patterns in data sets that contain a group of instances of a single relation. Mining richly structured, heterogeneous data sets is a key challenge for data mining. What such data sets have in common is that the data consists of a large variety of objects and object types that can be connected in some way. The link between objects may, for example, be a URL or an operation between the tables of a database. A URL is an explicit link, while an operation represents a constructed link.

Traditional inference procedures assume instances are independent and therefore cannot be applied directly to these data sets; doing so can lead to false conclusions and incorrect results. The correlations introduced by links must be handled carefully to avoid this. In fact, a link is itself information that can be used to improve the prediction accuracy of the learned models because, usually, the attributes of linked objects are correlated, and links commonly exist between objects that have some common factors [9].

2.1.2 Web Data Mining

Web mining is, simply, the application of data mining techniques to the World Wide Web (WWW) to discover patterns. It comprises Web content mining, Web structure mining, and Web usage mining. Web usage mining discovers usage patterns for Web-based applications. It collects information from users such as identity, origin, browsing behavior, and any other relevant details. It is mostly used on e-commerce Websites to suggest products that a user has searched for in the past and may buy in the future, which allows companies to target customers and increase profits [10]. Web content mining mines the content of Web pages to extract useful data and information; it can be further divided into the Information Retrieval view and the Database view. Web structure mining uses graph theory to analyze the links and connections of a Website: the structure of the Website is the relevant data, and patterns in its links and connections are mined.

Web mining can be divided into four subtasks: resource finding; information selection and preprocessing; generalization; and analysis.

2.1.3 Graph Data Mining

The extraction of useful, novel information from graph representations of data is called graph data mining. Graphs are sets of nodes and edges, where the edges can be directed or undirected. While data can take many forms of varying complexity, graph data is used to represent the relationships crucial to the domain, and the patterns discovered by mining graph data are often themselves graphs. Graph data mining is used to mine structured data and find the frequently appearing substructures present in it. Its most common uses are in cheminformatics, bioinformatics, and social networking, but it has also been applied to citation analysis and to fields such as privacy preservation [11] and cloud computing [7].

2.2 Graph Mining Approaches

2.2.1 Inductive Logic Programming (ILP)

Inductive Logic Programming is used to create predicate descriptions, or hypotheses, from background knowledge and examples. It is a subfield of machine learning and uses logic programming to obtain results. There have been several applications of ILP in data mining, but it has mainly been used to mine databases of chemical compounds, where it finds frequent substructures. For example, ILP underlies the data mining algorithm WARMR. WARMR was built to mine structural chemical data and was used to discover the frequently appearing substructures in a database of chemical compounds. These frequent substructures were used to create prediction rules relating compound descriptions to carcinogenesis. The rules were fairly accurate and provided insight into the relationships present in the database. WARMR is thus a useful data mining tool for analyzing chemical databases, as it can provide accurate probabilistic prediction rules along with knowledge about the relationships in the database [12].

2.2.2 Incomplete Beam Search

The beam search algorithm expands the best, or most promising, node of the graph first. It is a type of best-first search, which orders partial solutions according to some criterion and attempts to predict how close a partial solution is to a complete one. Subdue is a greedy relational learning system based on incomplete beam search; it discovers substructures that are both frequent and compress the data set. It starts with a single vertex of the graph and then expands the best substructure present in the graph by one edge at a time. It limits the number of best substructures kept at each step and evaluates them on the basis of their ability to compress the input graph, measured by the minimum description length (MDL). It terminates when unique substructures are no longer discoverable. The search is called incomplete because it limits the number of best or most promising substructures retained [13].
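A minimal, generic sketch of this beam-limiting idea is shown below, written in R to match the tools used later in this paper. It is not Subdue itself: the `initial`, `expand`, and `score` arguments are placeholders for the seed substructures, the one-edge expansion step, and the MDL compression score, and the toy usage grows bit strings with a count-of-ones score standing in for compression.

```r
# Generic incomplete beam search skeleton (a sketch, not Subdue's code):
# expand every candidate, rank by score, keep only `beam_width` of them.
beam_search <- function(initial, expand, score, beam_width = 4, max_iter = 10) {
  beam <- initial   # current list of candidate substructures
  best <- NULL
  for (i in seq_len(max_iter)) {
    candidates <- unlist(lapply(beam, expand), recursive = FALSE)
    if (length(candidates) == 0) break          # no unique substructures left
    candidates <- candidates[order(sapply(candidates, score), decreasing = TRUE)]
    beam <- head(candidates, beam_width)        # the "incomplete" pruning step
    if (is.null(best) || score(beam[[1]]) > score(best)) best <- beam[[1]]
  }
  best
}

# Toy usage: grow bit strings; the score (number of 1s) stands in for MDL.
beam_search(initial = list(""),
            expand  = function(s) list(paste0(s, "0"), paste0(s, "1")),
            score   = function(s) sum(strsplit(s, "")[[1]] == "1"),
            beam_width = 2, max_iter = 5)   # returns "11111"
```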

2.2.3 Graph Theory Based Approaches

  • Apriori Algorithm

The Apriori algorithm operates on transactional databases to discover frequently appearing items and item sets, and it extends these to larger item sets as long as the items appear sufficiently often [14]. The algorithm finds association rules that show the general trends present in the database [15]. In the context of graph data mining, item sets can be considered graphs and items can be considered the nodes of a graph. Two important parameters can be controlled in the Apriori algorithm: the support threshold and the confidence. The support of an item set is its number of occurrences, and the support threshold is the minimum number of occurrences an item set must have to be kept. The confidence is how often the left-hand side of a rule implies the right-hand side. To apply the Apriori algorithm to graphs, we first discover all the frequently appearing subgraphs with k edges. We then generate all candidates with k + 1 edges by joining pairs of frequent subgraphs with k edges; to be joinable, the pair must share a common subgraph of k − 1 edges (so that their union has exactly k + 1 edges).
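The level-wise idea is easiest to see on plain item sets. The following toy R sketch (illustrative only, not the paper's code) counts single-item supports, keeps the frequent ones, and joins them into candidate pairs that are then filtered by the same support threshold.

```r
# Toy Apriori illustration on item sets: frequent 1-sets, then joined 2-sets.
baskets <- list(c("a","b","c"), c("a","b"), c("a","c"), c("b","c"), c("a","b","c"))
min_support <- 0.4   # minimum fraction of baskets containing the item set

# Support of an item set = fraction of baskets containing all of its items.
support <- function(items)
  mean(sapply(baskets, function(b) all(items %in% b)))

# Level 1: frequent single items.
items <- sort(unique(unlist(baskets)))
L1 <- items[sapply(items, support) >= min_support]

# Join step: candidate 2-sets from frequent 1-sets, filtered by support.
C2 <- combn(L1, 2, simplify = FALSE)
L2 <- Filter(function(s) support(s) >= min_support, C2)
L2   # {a,b}, {a,c}, {b,c} all survive in this toy data
```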

  • Pattern Growth

This algorithm is also known as the FP-growth algorithm, where FP stands for frequent pattern. It uses a depth-first approach, recursively growing frequent subgraphs to find frequent item sets. The algorithm uses an extended prefix tree, called the frequent pattern tree, to store crucial information in compressed form. It is both efficient and scalable and has been shown to be more effective than other algorithms at mining frequent patterns. The algorithm works by compressing the database into an FP-tree and then dividing the FP-tree into a set of conditional databases, one for each frequently appearing pattern. These divided databases are then mined separately, which avoids the cost of repeatedly searching for smaller patterns; the results are concatenated to form longer frequently appearing patterns [5].
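The compression step is the heart of the method: transactions that share a prefix of frequency-ordered items share a path in the tree. The R sketch below is a toy illustration of that step under the description above, not a full FP-growth implementation.

```r
# Toy FP-tree construction: order items by global frequency, then insert
# each transaction so that shared prefixes share a counted path.
transactions <- list(c("a","b","c"), c("a","b"), c("a","c"), c("b","c"))

freq <- sort(table(unlist(transactions)), decreasing = TRUE)
order_tx <- function(tx) tx[order(match(tx, names(freq)))]

insert_tx <- function(node, tx) {
  if (length(tx) == 0) return(node)
  first <- tx[[1]]
  if (is.null(node$children[[first]]))
    node$children[[first]] <- list(count = 0, children = list())
  node$children[[first]]$count <- node$children[[first]]$count + 1
  node$children[[first]] <- insert_tx(node$children[[first]], tx[-1])
  node
}

tree <- list(count = 0, children = list())
for (tx in transactions) tree <- insert_tx(tree, order_tx(tx))
str(tree, max.level = 5)   # shared prefixes appear once, with counts
```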

2.3 Previous Research

A large amount of research work has been done on graph mining; some of it is summarized in Table 1.

Table 1 Selected research work done in the field of graph mining

The bar charts in Figs. 1 and 2 are extracted from Table 1 and show which fields of graph data mining have been researched the most. Figure 1 shows that frequent pattern growth is considered the most efficient and effective method for mining graph data, followed closely by Apriori-based approaches, which build on the original Apriori algorithm. Figure 2 shows popular applications of graph mining. Cheminformatics and bioinformatics are two closely related fields that use graph mining on chemical compound databases to find frequent patterns. Social network analysis is a resurgent field in which graph mining is used to find patterns among the users of social networking Websites and the relationships between them. These are, among many others, the most popular applications of graph mining techniques.

Fig. 1 Graph showing the most popular graph mining techniques

Fig. 2 Graph showing the most popular applications of graph data mining

3 Methodology

The initial objective in a data analysis or mining project is to search for and collect relevant data. The data was collected from the U.S. Bureau of Transportation Statistics Website. One of the datasets is a collection of on-time flights during January 2016. A few of its attributes have missing values; these attributes, and the remaining missing values, were removed from the dataset. The data was then processed using Rattle to find frequent item sets. Rattle uses a modified form of the Apriori algorithm to find frequently appearing item sets and generate association rules for the dataset. The results are plotted on a graph in which the x-axis represents the items and the y-axis represents their relative frequencies. Because of the direct correlation between a city and its state (for example, Los Angeles and California), the most frequent item sets are states. However, relevant statistics can still be mined from the dataset by comparing the correct attributes (for example, ignoring states when processing cities and vice versa). We then constructed a weighted graph from the T-100 Market All Carriers dataset, weighted by the number of passengers on each route, and applied network analysis techniques to it to find the betweenness, degree, and closeness of airports. The R packages tnet and igraph were used to perform the network analysis.
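The preprocessing and graph-construction steps can be sketched in R as follows. The file names (ONTIME_JAN2016.csv, T100_MARKET_2016.csv) are placeholders, and the ORIGIN, DEST, and PASSENGERS column names are assumed from the attribute names used in Sects. 4 and 5; the actual BTS exports may differ.

```r
library(igraph)

# On-time flights: drop attributes with missing values, then incomplete rows.
flights <- read.csv("ONTIME_JAN2016.csv")          # hypothetical file name
flights <- flights[, colSums(is.na(flights)) == 0] # remove attributes with NAs
flights <- na.omit(flights)                        # remove remaining missing values

# T-100 Market data: aggregate passengers per route, then build a weighted,
# directed graph with airports as vertices and routes as edges.
market <- read.csv("T100_MARKET_2016.csv")         # hypothetical file name
market <- aggregate(PASSENGERS ~ ORIGIN + DEST, data = market, FUN = sum)
g <- graph_from_data_frame(market, directed = TRUE)
E(g)$weight <- E(g)$PASSENGERS                     # passenger counts as weights
```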

4 Experimental Setup

4.1 Data Set

The dataset used is taken from the Website of the United States Bureau of Transportation Statistics (BTS), which is part of the Research and Innovative Technology Administration (RITA) of the United States Department of Transportation. The dataset consists of around 440,000 instances of on-time flights during the month of January 2016. There are about 100 provided attributes, ranging from the date, origin city and state, destination city and state, market IDs, airport names, and airport IDs to delay times, delay causes, and diverted-airport information. The dataset contains 290 unique cities and 294 unique airports. We also use the BTS Master Coordinate database from January 2016 to May 2016 and the BTS Air Carrier Statistics T-100 Market All Carriers dataset for network analysis [16] (Fig. 3).

Fig. 3 RITA/BTS January 2016 on-time flights dataset

4.2 Attribute Selection

We only consider the attributes that represent the date, number of flights, distance, airport ID, and the origin and destination city and state of each flight. We also used the PASSENGERS attribute to create the weighted network.

No. of instances/rows—445,829

No. of attributes—17

4.3 Tools Used

R is a programming language used for data analysis and statistical computing. RStudio is an integrated development environment (IDE) for R that allows the user to create and load R projects easily. R contains many packages that aid data mining; while they can be used directly, user-developed GUIs are available for ease of use. Rattle is an open-source graphical user interface (GUI) written in R for data mining. It allows the user to easily load a dataset, perform data analysis and mining, and create models, as well as evaluate, associate, cluster, and transform the data in many ways. We use Rattle to find frequent item sets and the R packages igraph and tnet to perform network analysis.
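For reference, the tool chain can be set up as below; rattle() is the package's documented entry point, and association mining is run from the GUI's Associate tab.

```r
install.packages(c("rattle", "arules", "igraph", "tnet"))  # one-time setup
library(rattle)
rattle()   # opens the GUI; load the dataset, then mine rules via the Associate tab
```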

4.4 Measures

The dataset used contains all the on-time flights in January 2016. This means, however, that the support of each item set is quite low: individual flights have support below 0.07. The support threshold used is therefore 0.0300 and the confidence threshold is 0.4000.
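With those thresholds, the association step corresponds roughly to an arules call like the sketch below (Rattle wraps the arules implementation of Apriori). The attribute names are those used in Sect. 5; the file name and the `flights` preprocessing are the assumed ones from the Sect. 3 sketch.

```r
library(arules)

flights <- read.csv("ONTIME_JAN2016.csv")   # hypothetical file name, as in Sect. 3
cols <- c("ORIGIN_STATE_NAME", "DEST_STATE_NAME",
          "ORIGIN_CITY_NAME", "DEST_CITY_NAME")
# arules needs factor columns to coerce a data frame into transactions.
tx <- as(data.frame(lapply(flights[, cols], factor)), "transactions")

rules <- apriori(tx, parameter = list(support = 0.03, confidence = 0.40))
inspect(head(sort(rules, by = "support"), 10))  # strongest rules by support
```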

5 Result and Analysis

Case Study 1––Graph mining using the Apriori algorithm: Processing all the attributes concurrently, we get the following results from the frequent item plot. We can see that the most frequently found item set is California as both the origin and the destination state, followed closely by Texas and then Florida. To make the frequently appearing items clearer, we selected the corresponding attributes in the absence of all others. It is significant to note that graphs like the U.S. airport network are highly symmetric in nature.

Ignoring all the attributes except ORIGIN_STATE_NAME and DEST_STATE_NAME, we get the following graph. It confirms that the most commonly found state during January 2016 was California, and it also gives us insight into the other most commonly found states during that period. This can be attributed to a large number of interstate flights (Figs. 4, 5 and 6).
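Restricting the transactions to the state attributes alone corresponds to a plot like the following, continuing the arules sketch above (same assumed `flights` data frame):

```r
# Transactions over state attributes only, then a relative-frequency plot.
state_tx <- as(data.frame(lapply(
  flights[, c("ORIGIN_STATE_NAME", "DEST_STATE_NAME")], factor)), "transactions")
itemFrequencyPlot(state_tx, topN = 10)   # California dominates, per Fig. 4
```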

Fig. 4 Graph of the most frequently found states

Fig. 5 Graph of most frequently found cities

Fig. 6 Graph of most frequently found airports

To find the most frequently appearing cities, we ignore all attributes except ORIGIN_CITY_NAME and DEST_CITY_NAME. From the graph, we see that the most commonly found city is Atlanta, followed by Chicago and Denver. Notably, even though California is the most commonly found state, none of the three most commonly found cities are in California. Atlanta lies at the heart of the network and is therefore often found on cross-country routes, such as those between the western and eastern coasts.

Similarly, we can find the most frequently appearing airports by selecting only the ORIGIN and DEST attributes. From this graph, we can see that the most commonly found airport is Hartsfield–Jackson Atlanta International Airport in Atlanta, Georgia, followed by O’Hare International Airport in Chicago, Illinois; Dallas/Fort Worth International Airport; and Los Angeles International Airport (Figs. 7, 8 and 9).

Fig. 7 U.S. airports with the highest closeness

Fig. 8 U.S. airports with the highest degree

Fig. 9 Binary and weighted analysis of betweenness in U.S. airport network

Case Study 2––Network analysis of the airport network: Using link analysis techniques, we also perform a network analysis of the airport network. With the R tnet package, we convert the dataset into a network and calculate several important measures.

We calculate the airports in the network with the highest closeness, betweenness, and degree. The degree identifies the airports with the most connections; Hartsfield–Jackson Atlanta International Airport has the most connections, which can be attributed to its central position in the U.S. airport network. The closeness score shows which airports are most easily accessible from other airports; LAX has the highest closeness score and thus is, on average, closest to the other airports. The betweenness score shows which airports most often lie on the shortest paths between other airports. Binary analysis shows that Ted Stevens Anchorage International Airport (ANC) has the highest betweenness, but this does not take into account the weight along each route. Weighted analysis shows that LAX has the highest betweenness, meaning that for most routes LAX acts as an intermediary airport; it is followed closely by ATL and SEA.
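These measures can be computed roughly as follows, again under the assumed file and column names from the Sect. 3 sketch; tnet expects a numeric edge list with sender, receiver, and weight columns.

```r
library(igraph)
library(tnet)

# Rebuild the weighted airport graph (assumed names, as in Sect. 3).
market <- read.csv("T100_MARKET_2016.csv")
market <- aggregate(PASSENGERS ~ ORIGIN + DEST, data = market, FUN = sum)
g <- graph_from_data_frame(market, directed = TRUE)
E(g)$weight <- E(g)$PASSENGERS

# Convert to tnet's weighted edge-list format and compute centralities.
el  <- cbind(as_edgelist(g, names = FALSE), E(g)$weight)
net <- as.tnet(el, type = "weighted one-mode tnet")
deg <- degree_w(net)                          # per-airport degree and strength
clo <- closeness_w(net)                       # weighted closeness
btw_weighted <- betweenness_w(net)            # weighted betweenness
btw_binary   <- betweenness(g, weights = NA)  # binary betweenness via igraph

V(g)$name[which.max(deg[, "degree"])]  # airport with the most connections
```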

6 Conclusions

The results from mining the dataset of on-time flights during January 2016 reveal which airports were the busiest during that period and which cities and states were visited most frequently. Hartsfield–Jackson Atlanta International Airport was the busiest airport, the most frequently visited city was Atlanta, Georgia, and the most frequently visited state was California. The ATL airport also has the highest degree, meaning it is the most connected airport, so it is important to ensure that it functions properly. The analysis also helps identify the routes whose disruption should most be avoided; routes with LAX as one of their endpoints are examples. Because flights are seasonal and change from month to month, it is important to determine which airports, cities, and states are frequently visited and when, in order to improve quality of service and safety on the corresponding routes. It is also crucial to determine which airport's loss would cause the most disconnections in the airport network, in case of emergency or attack.