Cyber Attribution from Topological Patterns
We developed a crawler to collect live malware distribution network data from publicly available sources including Google Safe Browsing and VirusTotal. We then generated a dynamic graph with our visualization tool and performed malware attribution analysis. We found: 1) malware distribution networks form clusters rather than a single network; 2) the cluster sizes follow a power law; 3) there is a correlation between cluster size and the number of malware species in the cluster; 4) there is a correlation between the number of malware species and cyber events; and finally, 5) infrastructure components such as bridges, hubs, and persistent links play significant roles in malware distribution dynamics.
Keywords: Cyber attribution, Malware, Malware distribution network, MDN, Dynamics, Graph, Security, Computer virus, Malicious software, Topology
Similar to the spread of an epidemic virus, malicious files infect computer systems over a set of globally connected domains or IP addresses, which we call a malware distribution network (MDN) [4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15]. In this paper, we study the temporal topological structures of an MDN, treating each subset of connected domains as a malicious cluster (M-Cluster). We created a novel dataset over an eight-month period by crawling the transparency report repository of Google Safe Browsing and by collecting URL and malware file hash scanning results from VirusTotal [8, 17]. We analyzed the evolution of the topological structure of the three largest M-Clusters, and the malware hosted on their domain servers, over that period. Our analysis revealed that an M-Cluster is laid out as a hub and bridge structure. We further observed that an increase in the size of an M-Cluster occurred in parallel with an increase in discovered malware on the domain servers. One scenario in which an M-Cluster may emerge is in conjunction with global events, for example, the 2017 Presidential Inauguration of the United States of America. Our M-Cluster analysis also revealed a consistent presence of multiple layers of URL redirection services, which, we believe, serve to obfuscate the servers hosting malware. The contributions of this paper are: 1) observation and analysis of malware distribution networks as clusters with a bridge and hub construction; 2) correlation between size increases of M-Clusters and the presence of hosted malware; 3) the significant roles of persistent bridges and hubs in malware distribution dynamics; and 4) development of algorithms to identify hubs and bridges.
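To make the hub and bridge terminology concrete, the following is a minimal Python sketch, not the algorithms developed in this paper. It assumes a hub is a node whose undirected degree reaches a chosen threshold, and a bridge is an edge whose removal disconnects its cluster (the standard graph-theoretic definition). The function names, the `min_degree` parameter, and the toy edge list are all hypothetical.

```python
from collections import defaultdict


def find_hubs(edges, min_degree=3):
    """Return nodes whose undirected degree is at least min_degree.

    The degree threshold is an assumed hub criterion, not the paper's.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(n for n, d in degree.items() if d >= min_degree)


def find_bridges(edges):
    """Return edges whose removal disconnects the undirected graph.

    Uses the classic depth-first search with discovery and low-link times.
    Recursion is fine for small clusters; very large graphs would need an
    iterative version.
    """
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))

    disc, low, bridges = {}, {}, []
    counter = 0

    def dfs(node, parent_edge):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        for nxt, idx in adj[node]:
            if idx == parent_edge:
                continue  # do not walk back over the edge we arrived on
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, idx)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > disc[node]:
                    bridges.append(edges[idx])

    for n in list(adj):
        if n not in disc:
            dfs(n, None)
    return bridges


# Toy cluster: a triangle (a, b, c) with a tail hanging off c.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),
             ("c", "d"), ("d", "e"), ("d", "f")]
hubs = find_hubs(toy_edges)
bridges = sorted(find_bridges(toy_edges))
```

In the toy cluster, the triangle edges are not bridges because each lies on a cycle, while every edge in the tail is.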
2 Literature Review
Dynamic graphs have been used in software engineering and operations research. Schiller and Strufe developed a framework for the analysis of dynamic graphs called DNA (Dynamic Network Analyzer). The topological properties of a dynamic graph include the metrics of degree distribution (DD), connected components (C), assortativity (ASS), clustering coefficient (CC), rich-club connectivity (RCC), all-pairs shortest paths (SP), and betweenness centrality (BC). Yu et al. studied the propagation dynamics of a single malware, the Conficker botnet. The authors tracked three top-domain layers and the growth of total hosts compromised by Android malware. They used an epidemic dynamics model to interpolate the malware distribution process. They discovered a power law distribution of the Conficker botnet in the top three levels, i.e., the rank of the malware by botnet size versus the probability of the distribution. This is perhaps the most comprehensive study of malware distribution for a single botnet with a computational distribution model.
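Two of the metrics listed above, connected components and degree distribution, can be illustrated with a short Python sketch. It is independent of the DNA framework and treats edges as undirected; the function names and the toy edge list are hypothetical.

```python
from collections import Counter, defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


def degree_distribution(edges):
    """Map each degree value to the number of nodes that have it."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(Counter(degree.values()))


toy_edges = [("a", "b"), ("b", "c"), ("x", "y")]
components = connected_components(toy_edges)
distribution = degree_distribution(toy_edges)
```

The component sizes returned by `connected_components` are the kind of quantity one would rank when checking whether cluster sizes follow a power law.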
3 Semantic Graph Model
We use a gradient arc to display the direction of edges. The alpha value decreases along the arc, from 1 at the source to 0 at the destination, and this decrease indicates the direction. This novel visual representation also enables us to add attributes to the edges [19, 20, 21].
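The alpha gradient can be sketched as follows, assuming the arc is drawn as a fixed number of short segments and that alpha falls off linearly. The text specifies only the endpoint values (1 at the source, 0 at the end); the linear falloff, the segment count, and the function name are assumptions.

```python
def gradient_alphas(segments):
    """Return one alpha value per arc segment, 1.0 at the source, 0.0 at the end.

    Linear interpolation is an assumption; only the endpoint values come
    from the description of the gradient arc.
    """
    if segments < 2:
        raise ValueError("need at least two segments to show a gradient")
    return [1.0 - i / (segments - 1) for i in range(segments)]


alphas = gradient_alphas(5)
```

A renderer would draw segment `i` of the arc with opacity `alphas[i]`, so the edge visibly fades toward its destination.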
4 Data Collection and Malware Attribution
The MDN and M-Clusters were built from our dataset collected from Google Safe Browsing (GSB) and VirusTotal.com (VT). The data set spans a period of eight months from 19 January to 25 September 2017. The collection start date was specifically chosen to capture data related to the 2017 U.S. Presidential Inauguration. The end date, unfortunately, resulted from the unavailability of GSB API services. The GSB service has been used to warn users not to visit potentially unsafe URLs. The GSB Transparency Report is an online resource providing statistics from the collected data repository. An API set was made available to automate the retrieval of data from the repository for any submitted URL. The API requires a URL as input and returns a report including the timestamp of the last visit, the source, and the destination of the transmission. However, the report does not contain specific malware information.
VirusTotal (VT), on the other hand, provides a scanning service to detect the presence of malicious code in files and URLs. VT provides specific malware information, but it does not contain the source-destination data. Scanning combines multiple commercial anti-malware products that provide both static and heuristic-based data analysis. In this study, we used the academic API service to automate submission and result retrieval for large datasets.
The site vk.net was selected as the seed website based on a four-month observation of the site reliably appearing on GSB. The GSB report, in JSON format, consisted of various statistics. The statistics of interest to us were labeled: name, sendsToAttackSites, receivesTrafficFrom, sendsToIntermediarySites, lastVisitDate, and lastMaliciousDate. A malicious node (MN) with no incoming edges in the current collection was relabeled as a Root Malicious Node (RMN). This node type is unique to our MDN graphs, as it cannot be determined from the GSB reports alone. It is revealed only once the MDN graph is completed.
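The relabeling step can be sketched in Python under stated assumptions: each report is taken to be a dictionary whose sendsToAttackSites, sendsToIntermediarySites, and receivesTrafficFrom fields hold lists of domain names. The actual report schema is not given in this section, so that shape, the function names, and the `*.example` domains are hypothetical.

```python
def build_edges(reports):
    """Turn per-domain reports into a set of directed (source, destination) edges.

    Assumes the three traffic fields are lists of domain names; this shape
    is an assumption about the report format.
    """
    edges = set()
    for report in reports:
        name = report["name"]
        outgoing = (report.get("sendsToAttackSites", [])
                    + report.get("sendsToIntermediarySites", []))
        for destination in outgoing:
            edges.add((name, destination))
        for source in report.get("receivesTrafficFrom", []):
            edges.add((source, name))
    return edges


def root_malicious_nodes(edges):
    """Return nodes that have outgoing edges but no incoming edges."""
    sources = {u for u, _ in edges}
    targets = {v for _, v in edges}
    return sorted(sources - targets)


# Made-up reports for illustration only.
sample_reports = [
    {
        "name": "b.example",
        "receivesTrafficFrom": ["a.example"],
        "sendsToAttackSites": ["c.example"],
        "sendsToIntermediarySites": [],
    },
    {"name": "c.example", "receivesTrafficFrom": ["b.example"]},
]
edges = build_edges(sample_reports)
roots = root_malicious_nodes(edges)
```

Because a root is defined by the absence of incoming edges across the whole collection, it can only be identified after all reports have been merged, which matches the observation that it is not visible in any single report.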
5 Topological Dynamic Clusters
6 Correlation of Events and Malware Clusters
7 Cyber Attribution from Topological Patterns
With the visualization and analytic model, we are able to track single Top Level Domain (TLD) nodes and reveal their “life cycle” in the malware distribution network when the TLD address has been captured by both Google Safe Browsing (GSB) and VirusTotal (VT). Figure 12 shows the dynamics of the TLD node adf.ly and its inbound and outbound edges over the eight-month period. The plot shows that the node had persistent inbound and outbound malware traffic from before January 19 through May 17, 2017, with multiple recurrences during that period. The malware did not die out until May 17, 2017. It reached its peak between February 19 and March 19, in correlation with the cyber activities during that period.
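Tracking one node's life cycle amounts to counting its inbound and outbound edges in each dated snapshot of the graph. The sketch below assumes snapshots are keyed by ISO date strings (which sort chronologically) and hold directed edge lists; the data is made up for illustration and does not reproduce Figure 12.

```python
def node_lifecycle(snapshots, node):
    """Return (date, inbound_count, outbound_count) for each snapshot, in date order."""
    timeline = []
    for date in sorted(snapshots):
        edges = snapshots[date]
        inbound = sum(1 for _, dst in edges if dst == node)
        outbound = sum(1 for src, _ in edges if src == node)
        timeline.append((date, inbound, outbound))
    return timeline


def active_span(timeline):
    """Return the first and last dates on which the node had any edge, or None."""
    active = [date for date, inbound, outbound in timeline if inbound + outbound > 0]
    return (active[0], active[-1]) if active else None


# Hypothetical snapshots; "node.example" stands in for a tracked domain.
snapshots = {
    "2017-03-01": [("x.example", "node.example"), ("z.example", "node.example"),
                   ("node.example", "y.example")],
    "2017-01-19": [("x.example", "node.example"), ("node.example", "y.example")],
    "2017-06-01": [("x.example", "y.example")],
}
timeline = node_lifecycle(snapshots, "node.example")
span = active_span(timeline)
```

The peak of the life cycle is then simply the snapshot with the largest combined edge count in `timeline`.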
We developed a crawler to collect live malware distribution network data from publicly available sources including Google Safe Browsing and VirusTotal. We then generated the graph with our visualization tool and performed malware attribution. We have discovered: 1) malware distribution networks form clusters; 2) the cluster sizes follow a power law; 3) there is a correlation between cluster size and the number of malware species in the cluster; 4) there is also a correlation between the number of malware species and cyber events; and finally, 5) infrastructure components such as bridges, hubs, and persistent links play significant roles in malware distribution dynamics.
The SHA-256 of M is 2eea543c86312c0fd361c31cba8774d2d6020c5ebcc1ce1a355482de74ed9863.
The authors would like to thank VIS research assistants Sebastian Peryt, Pedro Pimentel, and Sihan Wang for participating in 3D model prototyping and data processing. This project is funded in part by the Cyber-Security University Consortium of Northrop Grumman Corporation. The authors are grateful for the discussions with Drs. Neta Ezer, Justin King, and Paul Conoval.
- 1. Schiller, B., Deusser, C., Castrillon, J., Strufe, T.: Compile- and run-time approaches for the selection of efficient data structures for dynamic graph analysis. Appl. Network Sci. 1 (2016). Article number: 9. https://link.springer.com/article/10.1007/s41109-016-0011-2
- 2. DNA at GitHub. https://github.com/BenjaminSchiller/DNA
- 3. Carey, C.E.: Continued bot infiltration of Trump’s Facebook Pages. Data for Democracy, 1 May 2017. https://medium.com/data-for-democracy/continued-bot-infiltration-of-trumps-facebook-pages-2df82ca86b5b
- 4. Gu, G., Perdisci, R., Zhang, J., Lee, W.: BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In: Proceedings of the 17th USENIX Security Symposium (Security 2008) (2008)
- 5. Gu, G., Zhang, J., Lee, W.: BotSniffer: detecting botnet command and control channels in network traffic. In: Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS 2008), February 2008
- 6. McCoy, D., et al.: Pharmaleaks: understanding the business of online pharmaceutical affiliate programs. In: Proceedings of the 21st USENIX conference on Security symposium, ser. Security 2012, p. 1. USENIX Association, Berkeley (2012)
- 7. Karami, M., Damon, M.: Understanding the emerging threat of ddos-as-a-service. In: Proceedings of the USENIX Workshop on Large-Scale Exploits and Emergent Threats (2013)
- 8. Google safe browsing. https://developers.google.com/safe-browsing/
- 9. Zhang, J., Seifert, C., Stokes, J.W., Lee, W.: Arrow: Generating signatures to detect drive-by downloads. In: Srinivasan, S., Ramamritham, K., Kumar, A., Ravindra, M.P., Bertino, E., Kumar, R. (eds.) Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, 28 March–1 April 2011. ACM (2011)
- 11. Caballero, J., Grier, C., Kreibich, C., Paxson, V.: Measuring pay-per-install: the commoditization of malware distribution. In: Proceedings of the 20th USENIX conference on Security, ser. SEC 2011. USENIX Association, Berkeley (2011)
- 12. Goncharov, M.: Traffic direction systems as malware distribution tools. Trend Micro, Technical report (2011)
- 13. Behfarshad, Z.: Survey of malware distribution networks. Electrical and Computer Engineering, University of British Columbia, Technical report (2012)
- 14. Provos, N., McNamee, D., Mavrommatis, P., Wang, K., Modadugu, N.: The ghost in the browser analysis of web-based malware. In: Proceedings of the first Conference on First Workshop on Hot Topics in Understanding Botnets, ser. HotBots 2007. USENIX Association, Berkeley (2007)
- 15. Provos, N., Mavrommatis, P., Rajab, M.A., Monrose, F.: All your iframes point to us. In: Proceedings of the 17th Conference on Security symposium, ser. SS 2008. USENIX Association, Berkeley (2008)
- 19. Wigglesworth, V.B.: Insect Hormones, pp. 134–141. W.H. Freeman and Company (1970)
- 23. Jacobi, J.A., Benson, E.A., Linden, G.D.: Personalized recommendations of items represented within a database. US Patent. US 7113917 B2 (2006)
- 24. Peryt, S., Morales, J.A., Casey, W., Volkmann, A., Cai, Y.: Visualizing malware distribution network. In: IEEE Conference on Visualization for Security, Baltimore, October, 2016 (2016)
- 25. Rossi, R.A., Gallagher, B., Neville, J., Henderson, K.: Modeling dynamic behavior in large evolving graphs. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM 2013), pp. 667–676. ACM, New York (2013). http://dx.doi.org/10.1145/2433396.2433479