Big Data Networks
Big data networks provide a distributed infrastructure for storing, transmitting, and processing a huge volume of data to extract useful information from the data.
The world is generating data much faster than ever before. It is predicted that the total volume of data generated in the world doubles every 2 years (Gantz and Reinsel, 2012). This huge volume of data appears in various domains and applications, e.g., large and complex graphs in social networks (Nie et al., 2017; Strogatz, 2001), protein-protein interaction (PPI) networks (Stelzl et al., 2005) in bioinformatics, the huge number of search requests handled by search engines, and the big data generated by Internet of Things (IoT) devices (Sun et al., 2016).
How to efficiently store and process such a huge volume of data is an important technical issue. The idea of leveraging the resources of a set of computers to enhance processing capability and manipulate a large volume of data can be traced back to early distributed computing systems in the 1970s and 1980s (Gligor and Shattuck, 1980). High-performance computing (HPC) (Dowd et al., 1998) and grid computing (Chervenak et al., 2000) have played important roles in processing large volumes of data since the 1990s, and they remain important platforms for computation-intensive tasks today.
In 2006, Amazon released its Elastic Compute Cloud (EC2) platform (Buyya et al., 2008), which aggregates resources from a set of commodity computers to obtain the capability required to process a large volume of data. This differs from previous approaches based on HPC. In the same year, Google released MapReduce (Dean and Ghemawat, 2008) and Bigtable (Chang et al., 2008), which respectively support parallel processing of big data and management of huge volumes of structured data on clusters containing large numbers of computers. The Spark framework (Zaharia et al., 2010) improves efficiency by using in-memory computation. Moreover, Spark supports iterative processing, which is necessary in many machine learning tasks.
Big Data vs. Networks
When the volume of data becomes huge, the data can no longer be stored and processed on a single computer, so networked systems have been built to store, transmit, and process big data. First, as the data exceeds the capacity of a single computer node, it is necessary to build a network that connects a large number of nodes storing the data and exchanges data between them efficiently. This has stimulated the development of data centers. Second, to process big data quickly and in parallel, we need new network computing paradigms, such as MapReduce and Spark.
Big Data Storage
Topology design: The topologies used in DCNs can be classified into two categories: switch-centric and server-centric. In switch-centric DCN topologies, the network connection and routing functions are performed on switches, while the servers are mainly used for data storage and computation. Typical switch-centric topologies include Fat-Tree and VL2. In contrast, in server-centric topologies, the functions of interconnection and routing are performed on servers, while the switches provide only simple crossbar functions. Typical server-centric topologies include DCell, BCube, and uFix.
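As a concrete illustration of switch-centric topology design, the standard k-ary Fat-Tree built from k-port switches has k pods, each containing k/2 edge and k/2 aggregation switches, plus (k/2)² core switches, and supports k³/4 hosts. A minimal sketch of this sizing (the function name is ours):

```python
def fat_tree_size(k):
    """Size parameters of a k-ary Fat-Tree (k must be even).

    Each of the k pods holds k/2 edge and k/2 aggregation switches;
    every edge switch attaches k/2 hosts; (k/2)^2 core switches
    interconnect the pods.
    """
    if k % 2:
        raise ValueError("k must be even")
    half = k // 2
    return {
        "pods": k,
        "core_switches": half * half,
        "agg_switches": k * half,
        "edge_switches": k * half,
        "hosts": k * half * half,  # k^3 / 4
    }

# A Fat-Tree built from 48-port switches supports 27,648 hosts.
print(fat_tree_size(48)["hosts"])  # 27648
```

The cubic growth in host count with switch port count is one reason Fat-Tree became popular for commodity-switch data centers.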
TCP optimization: Due to special features of DCNs that differ from the traditional Internet, the performance of traditional TCP congestion control protocols may degrade significantly in some cases. One such problem is TCP Incast (Wu et al., 2013): when a client sends data requests to many servers simultaneously and these servers return data to the client at the same time, the switch buffers overflow, which greatly degrades the throughput of the network. Researchers have proposed several techniques to address the TCP Incast problem. Another hot research issue in TCP for DCNs is multipath TCP (MPTCP). In topologies like Fat-Tree and BCube, there are multiple paths between a pair of hosts, and in many cases MPTCP can significantly increase the data throughput in DCNs.
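The Incast effect can be illustrated with a deliberately simplified one-burst model (our own back-of-envelope sketch, not a protocol simulation): when many servers return bursts at the same instant through one switch port, any packets beyond what the switch can buffer or drain during the burst are dropped, forcing TCP timeouts and collapsing throughput.

```python
def incast_drops(n_servers, burst_pkts, buffer_pkts, drain_pkts):
    """Idealized one-burst Incast model: n_servers each return
    burst_pkts packets to the client simultaneously through a single
    switch port, which can queue buffer_pkts and forward drain_pkts
    during the burst; everything beyond that is dropped."""
    arriving = n_servers * burst_pkts
    return max(0, arriving - buffer_pkts - drain_pkts)

# One responding server fits in the buffer; forty overwhelm it.
print(incast_drops(1, 10, 64, 10))   # 0 packets dropped
print(incast_drops(40, 10, 64, 10))  # 326 packets dropped
```

The model ignores RTT dynamics and retransmission, but it captures why adding responders past the buffer size causes a sharp throughput cliff.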
Infrastructure as a Service (IaaS): In this service model, the cloud provides a set of highly scalable and automated computing resources to users in the form of bare hardware. The users can install operating systems, deploy applications, or develop programs by themselves. Examples of IaaS include Amazon AWS and Rackspace. Users can exploit the high scalability of IaaS, which allows them to dynamically scale up resources on demand instead of buying hardware outright.
Platform as a Service (PaaS): In this service model, the cloud provides a platform whose components contain the software needed by certain applications. PaaS allows customers to develop, run, and manage their applications on the platform without dealing with the low-layer infrastructure as in IaaS. PaaS thus provides a framework on which users can create their customized applications. Examples of PaaS include Google App Engine and Windows Azure.
Software as a Service (SaaS): In this service model, the cloud delivers applications to users through the Internet. In the SaaS model, users need not download and install applications on their individual computers. Customers can focus on using the application, while maintenance, support, and detailed technical issues such as data, servers, storage, and middleware are all handled by the software vendor. Examples of SaaS include Google Apps, Dropbox, and Salesforce.
Big Data Processing Framework
MapReduce (Dean and Ghemawat, 2008) is a programming model for processing parallelizable problems over huge volumes of data using a cluster of computers. MapReduce can process both unstructured data stored in a filesystem and structured data stored in a database. It takes advantage of data locality, processing data near the place where it is stored in order to minimize communication overhead. When using MapReduce, the user specifies a map function and a reduce function. The map function takes a key/value pair as input and generates a set of intermediate key/value pairs. The reduce function merges the intermediate values associated with the same intermediate key.
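The canonical example of these two functions is word counting: the map function emits a (word, 1) pair per occurrence, and the reduce function sums the counts for each word. A minimal single-process sketch (the driver loop stands in for the framework, which would run these functions across a cluster):

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Takes a (document name, document text) pair and emits an
    intermediate (word, 1) pair for every word occurrence."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Merges all intermediate values emitted for one word."""
    return word, sum(counts)

# Single-process driver standing in for the framework:
docs = {"d1": "big data big networks", "d2": "data centers"}
groups = defaultdict(list)
for name, text in docs.items():
    for word, one in map_fn(name, text):
        groups[word].append(one)
result = dict(reduce_fn(w, counts) for w, counts in groups.items())
print(result)  # {'big': 2, 'data': 2, 'networks': 1, 'centers': 1}
```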
Map: In the map step, each worker node applies the map function to its local data and writes the output to temporary storage. A master node is in charge of partitioning the input data and sending it to the worker nodes for processing, ensuring that only one copy of any redundant input data is processed.
Shuffle: In the shuffle step, the worker nodes redistribute data based on the output keys, such that all data belonging to one key are located on the same worker node.
Reduce: In the reduce step, the worker nodes process each group of output data and merge the intermediate results in parallel.
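The three steps above can be sketched in a few lines of single-process Python, with lists standing in for worker nodes and hash-based partitioning playing the role of the shuffle (function names are ours; a real framework runs the phases in parallel across machines):

```python
def wc_map(doc_name, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn, n_workers=3):
    # Map: each "worker" (a list here) applies map_fn to its share
    # of the input and keeps the intermediate pairs locally.
    local = [[] for _ in range(n_workers)]
    for i, (key, value) in enumerate(inputs):
        local[i % n_workers].extend(map_fn(key, value))
    # Shuffle: intermediate pairs are routed by hash of the key, so
    # all values for one key land on the same reduce worker.
    shuffled = [{} for _ in range(n_workers)]
    for pairs in local:
        for key, value in pairs:
            shuffled[hash(key) % n_workers].setdefault(key, []).append(value)
    # Reduce: each worker merges its own groups (sequential here,
    # parallel on a real cluster).
    output = {}
    for groups in shuffled:
        for key, values in groups.items():
            output[key] = reduce_fn(key, values)
    return output

counts = run_mapreduce([("d1", "big data big networks"), ("d2", "data centers")],
                       wc_map, wc_reduce)
```

Note that the hash partitioning only decides *where* a key is reduced, not the result, which is why the shuffle can be rebalanced freely across workers.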
There are two limitations of MapReduce. First, it does not natively support iterative algorithms, yet a large number of data processing algorithms (e.g., many machine learning algorithms) are iterative. Second, its performance is greatly affected by I/O operations. The Spark (Zaharia et al., 2010) framework was developed to overcome these limitations. Compared with MapReduce, Spark significantly improves efficiency by performing in-memory computation. It provides caching to support iterative computation and data sharing across multiple operations, which reduces the cost of data I/O. It decreases the cost of writing intermediate results to HDFS by using the directed acyclic graph (DAG) model and reduces redundant sorting and I/O operations by using a multi-thread pool. Spark also supports more flexible APIs and more languages than MapReduce.
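The benefit of caching for iterative workloads can be illustrated with a toy lazy dataset (our own sketch, not the Spark API): without cache(), every pass of an iterative algorithm would recompute the whole pipeline from the data source; with it, later passes reuse the materialized in-memory result.

```python
import time

class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    and cache() keeps the materialized result in memory so later
    iterations skip recomputation (and, in Spark, disk I/O)."""
    def __init__(self, compute):
        self._compute = compute
        self._cached = None

    def map(self, f):
        # Record the transformation; nothing runs until collect().
        return LazyDataset(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cached = self._compute()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

def expensive_load():
    time.sleep(0.01)  # stands in for reading input from HDFS
    return [1, 2, 3]

data = LazyDataset(expensive_load).map(lambda x: x * x).cache()
# An iterative algorithm reuses the cached result on every pass,
# paying the load cost only once:
total = sum(sum(data.collect()) for _ in range(10))
```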
Machine Learning for Big Data Processing
Machine learning is a powerful technique for extracting useful information or knowledge from big data. For example, face recognition based on deep learning has been successful in many security-related applications in recent years. Machine learning-based big data processing usually requires a large volume of labelled training data, with which we can build very powerful models that sometimes even outperform humans on special tasks. Recently, how to design suitable network structures for machine learning models that run on a large number of computing nodes has attracted researchers' attention. This includes designing optimized network structures to reduce parameter transmission in learning models and adaptive learning-model execution frameworks to reduce execution latency.
Big Data Network Requirements
Network resiliency: The availability of a network is critical to communication among distributed resources. Path diversity and failover can improve the resiliency of the network to failures. Path diversity means utilizing different routes rather than a single fixed route between a pair of communicating nodes. Failover means the ability to recognize failures and switch over to an alternate path quickly.
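A minimal sketch of failover over diverse paths (names and interface are ours): candidate routes are tried in order, and the flow switches to an alternate path as soon as the current one fails.

```python
def send_with_failover(paths, transmit):
    """Try candidate paths in order; on failure fall over to the next.

    paths    -- routes in preference order (path diversity)
    transmit -- callable that attempts delivery on one path and
                returns True on success
    """
    for path in paths:
        if transmit(path):
            return path
    raise RuntimeError("all candidate paths failed")

# Example: the primary path A is down, so the flow fails over to B.
link_up = {"A": False, "B": True, "C": True}
chosen = send_with_failover(["A", "B", "C"], lambda p: link_up[p])
print(chosen)  # B
```

In a real DCN the "paths" would be the equal-cost routes a topology like Fat-Tree exposes, and failure detection would come from probes or transport-layer timeouts.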
Network congestion mitigation: When many big data applications launch at the same time, their data flows might work in burst mode and cause problems like Incast. This can lead to congestion and consequently increase response latency. As previously described, network congestion can be mitigated by network topology design and TCP optimization.
Network performance consistency: Because the processing time in big data applications is usually dominated by computation rather than data transmission delay, big data networks are not strongly affected by network latency. A more important issue for big data applications is to maintain high synchronicity, meaning that different jobs need to be executed in parallel in order to support accurate analysis. In such cases, any large drop in network performance may cause failures in the outcome.
Network future scalability: As more and more big data applications are ported to data centers, it is important for the data center architecture to be able to scale up adaptively according to application requirements.
Application awareness: Big data network architectures typically use computer-cluster environments in which the nodes collectively host large data sets. Different applications might pose different requirements on the performance of the data center; for example, some applications might need high bandwidth, while others focus on low latency. If the network aims to support different tenants' applications with different requirements, it needs to differentiate, separate, and process the various workloads accordingly. When designing a big data network, the designer needs to consider all factors, including coexistence issues among data flows, processes, and application resources.
Key protein complex identification: How to identify key protein complexes and their functional modules from a protein-protein interaction (PPI) network is one of the most computation-intensive problems in bioinformatics research (Stelzl et al., 2005). The high time complexity of traditional algorithms makes them unsuitable for large-scale PPI networks. Currently, MapReduce and Spark have been used to process large-scale PPI networks (You et al., 2014; Harnie et al., 2017).
Recommendation systems in E-Business: For E-commerce companies such as Amazon and Alibaba, how to predict the preferences of different customers and make correct recommendations for them is a key issue. Amazon makes recommendation decisions with big data, which significantly increases its profit. Amazon uses item-to-item collaborative filtering to obtain accurate recommendation results, which relies on big data networks to handle billions of customers and products.
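The item-to-item scheme can be sketched in memory for a handful of customers (a toy version; at Amazon's scale the similarity computations are distributed over the cluster). Items are related by the cosine similarity of their buyer sets, and recommendations for an item are its most similar neighbors:

```python
from math import sqrt

def item_similarities(purchases):
    """purchases: {customer: set of items bought}.
    Returns cosine similarity between item pairs over the customer
    dimension -- the core of item-to-item collaborative filtering."""
    buyers = {}
    for cust, items in purchases.items():
        for it in items:
            buyers.setdefault(it, set()).add(cust)
    sims = {}
    names = list(buyers)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = len(buyers[a] & buyers[b])
            if common:
                sims[(a, b)] = common / (sqrt(len(buyers[a])) * sqrt(len(buyers[b])))
    return sims

def recommend(item, sims, top_n=2):
    """Most similar items to `item`, best first."""
    scored = [(b if a == item else a, s)
              for (a, b), s in sims.items() if item in (a, b)]
    return [it for it, _ in sorted(scored, key=lambda t: -t[1])[:top_n]]

purchases = {"c1": {"A", "B"}, "c2": {"A", "B", "C"}, "c3": {"B", "C"}}
sims = item_similarities(purchases)
print(recommend("A", sims))  # ['B', 'C']
```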
Community discovery in social networks: Community detection and discovery in large-scale social networks is another typical application that relies on big data networks to process huge datasets. Large social media networks usually have huge numbers of users; e.g., Facebook had more than 2 billion monthly active users by 2017. In order to efficiently detect and discover communities in such large-scale social networks, MapReduce has been used (Guo et al., 2015).
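One community detection algorithm that maps naturally onto MapReduce is label propagation: in each round, every node emits its label to its neighbors (map) and each node adopts the majority label among those it receives (reduce). A single-process sketch with a deterministic tie-break (our simplification; a parallel version partitions the nodes across workers):

```python
from collections import Counter

def label_propagation(edges, max_iters=20):
    """Synchronous label propagation: every node starts in its own
    community and repeatedly adopts the most frequent label among its
    neighbors; ties go to the lexicographically largest label string
    so the sketch is deterministic."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    labels = {n: n for n in nbrs}
    for _ in range(max_iters):
        new = {}
        for n in nbrs:
            counts = Counter(labels[m] for m in nbrs[n])
            new[n] = max(counts.items(), key=lambda kv: (kv[1], str(kv[0])))[0]
        if new == labels:
            break
        labels = new
    return labels

# Two disconnected triangles converge to two communities.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
labels = label_propagation(edges)
```

Each round touches only local neighborhoods, which is what lets the algorithm scale to graphs with billions of edges when the rounds are run as MapReduce jobs.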
- Al-Fares M, Radhakrishnan S, Raghavan B, Huang N, Vahdat A (2010) Hedera: dynamic flow scheduling for data center networks. In: Proceedings of NSDI, pp 89–92
- Buyya R, Yeo CS, Venugopal S (2008) Market-oriented cloud computing: vision, hype, and reality for delivering IT services as computing utilities. In: 10th IEEE international conference on high performance computing and communications (HPCC'08). IEEE, pp 5–13
- Dowd K, Severance CR, Loukides M (1998) High performance computing: RISC architectures, optimization and benchmarks. O'Reilly, Paris
- Gantz J, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future 2007(2012):1–16
- Nie F, Zhu W, Li X (2017) Unsupervised large graph embedding. In: AAAI, pp 2422–2428
- Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. HotCloud 10(10–10):95