Research on Hybrid Storage Method of Massive Heterogeneous Data for Mobile Environment

  • Shanshan Wu
  • Fan Yi
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 474)


This article presents a hybrid storage system and method for processing massive heterogeneous data, aimed mainly at real-time information collection, high-speed storage and timely indexing in the mobile environment. It provides a common metadata description model for heterogeneous data resources and a hybrid storage solution, which together standardize the sharing of massive heterogeneous data resources. It also provides an optimized index construction algorithm and a data archiving method that support the collection, indexing and persistence of massive data, so that data resources can be shared and used effectively in the mobile environment. The method addresses several complex problems of the mobile environment, for example heterogeneous data structures, huge data volumes, scattered physical locations and complex data content.


Hybrid storage · Massive heterogeneous data · Real-time information collection · Timely indexing · Mobile environment

1 Introduction

This article concerns the field of data storage, and more particularly a hybrid storage system and method for processing massive heterogeneous data in the mobile environment.

In the mobile environment, data structures are heterogeneous, the volume of data is huge, the physical location of data is decentralized, and data content is complex and diverse. In addition, the data sources, the data processing nodes and the demands of data users change dynamically. In such a complex environment, the uncertainty of both the data and the demand poses a great challenge to data storage, sharing and effective use.

A traditional relational database is generally modeled with an entity-relationship model, and the most widely used relational databases convert the conceptual model into a two-dimensional table structure to store data. This model cannot meet the needs of processing massive heterogeneous data in the mobile environment. The first limitation is the data model: the two-dimensional table model cannot effectively handle semi-structured and unstructured data with temporal information. The second limitation is performance. A relational database management system is designed for static applications and is not optimized for efficient data processing, whereas real-time systems must perform many time-consuming I/O operations. Compared with an in-memory database, a traditional relational database is a disk database, and disk reads and writes are far slower than memory reads and writes, so its performance on real-time data is unsatisfactory.

In the mobile environment, a hybrid storage and computing framework built on the non-relational databases Redis and HBase avoids these drawbacks and provides a means for the comprehensive utilization of data resources.

2 The Design of the System Framework

The objective of this article is to propose a method for storing heterogeneous data in a mobile environment. The article is directed to a hybrid storage system and a method for processing massive heterogeneous data in the context of mobile cooperative combat. The implementation flow of the method is shown in Fig. 1.
Fig. 1. The implementation flow of the method

The technical solution of this article is summarized as follows. First, we provide a unified heterogeneous data resource metadata description model to describe the heterogeneous data in the mobile environment and to standardize and unify the basic data message description form of the data resources of the whole network. Then, based on this data model, the article realizes the hybrid storage and index mechanism using the Redis and HBase databases. As an in-memory database, Redis has many advantages, such as fast reads and writes, high query efficiency and support for concurrent access; it also handles semi-structured data well. Its disadvantages are that it consumes memory and makes data difficult to maintain, so HBase is used to periodically process the data held in Redis. HBase also supports the storage of semi-structured and unstructured data. Streaming media generated in the mobile environment, such as audio and video, is stored on an additional streaming media server.

3 The Model and Method

3.1 Heterogeneous Data Resource Metadata Description Model

The first step in the whole system is to obtain the data reported by various mobile devices, including related pictures, audio, video and text descriptions, and to transform it into a unified format. The conversion is based mainly on the heterogeneous data resource metadata description model, which standardizes and unifies the basic data message description form of the data resources of the whole network and provides a unified management mechanism for data resource submission, query and service encapsulation access. The model is shown in Fig. 2.
Fig. 2. Heterogeneous data resource metadata description model

The heterogeneous data resource metadata description model includes the universally unique identifier (UUID) of the data, a basic information description, resource attributes, resource location, discovery time, and related picture, audio and video information.

  (1) The universally unique identifier (UUID) is the unique ID of the data message; a data item can be located by its UUID. It is used mainly for data storage and retrieval.

  (2) The basic information description is the text description of the discovered data object.

  (3) The resource attributes are the category identification of the discovered data object.

  (4) The resource location is the latitude, longitude and approximate height of the discovered data object.

  (5) The discovery time is the time when the data object is submitted.

  (6) The picture, audio and video information is the multimedia information related to the data object, as published by the publisher.
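The six components of the model can be sketched as a small record type. This is only an illustrative sketch: the class names, field names, the unit of height and the choice to hold media as reference strings (for example, locations on the streaming media server) are assumptions, not part of the original model definition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List
from uuid import uuid4


@dataclass
class ResourceLocation:
    longitude: float  # degrees
    latitude: float   # degrees
    height: float     # approximate height; the unit is an assumption


@dataclass
class ResourceMetadata:
    description: str            # (2) basic information description
    category: str               # (3) resource attributes (category identification)
    location: ResourceLocation  # (4) resource location
    discovery_time: datetime    # (5) time at which the data object is submitted
    # (6) multimedia information, held here as reference strings (assumption)
    pictures: List[str] = field(default_factory=list)
    audio: List[str] = field(default_factory=list)
    video: List[str] = field(default_factory=list)
    # (1) UUID, generated once for every data message
    uuid: str = field(default_factory=lambda: str(uuid4()))


report = ResourceMetadata(
    description="bridge damaged",
    category="infrastructure",
    location=ResourceLocation(longitude=118.78, latitude=32.04, height=15.0),
    discovery_time=datetime(2017, 1, 1, tzinfo=timezone.utc),
)
```

Every record receives its own UUID on creation, so it can later serve as the storage and retrieval key.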


3.2 The Multidimensional Index Structure

The second step is to store the formatted data in the Redis in-memory database and to build indexes on time, space, keywords and other dimensions.

As shown in Fig. 3, in the index layer Redis provides a variety of data structures, so an index can be built for each retrieval dimension of the formatted data. A hash (Hash) is used for the spatial index of the formatted data, a sorted set (Zset) for the time index, a hash (Hash) for the keyword index, a set (Set) for the type index, and so on. Redis's built-in data structures and algorithms ensure efficient index creation and message retrieval.
Fig. 3. The multidimensional index structure

In the Hash structure, when the spatial index is built, attributes such as the longitude, latitude and height of the spatial position are extracted and stored as strings in a Redis hash.

First, the hash function set by the dictionary is used to compute the hash value of the key built from attributes such as longitude, latitude and height:
$$ {\text{hash}} = {\text{dict}} \to {\text{type}} \to {\text{hashFunction}}\left( {\text{key}} \right). $$
Then the sizemask property of the hash table and the hash value are used to compute the index value:
$$ {\text{index}} = {\text{hash}}\,\& \,{\text{dict}} \to {\text{ht}}\left[ {\text{x}} \right].{\text{sizemask}}. $$
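The two formulas above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: `zlib.crc32` stands in for the dictionary's hash function (it is not the function Redis uses), the key format is hypothetical, and the table size is assumed to be a power of two so that the mask equals the size minus one.

```python
import zlib


def slot_index(key: str, table_size: int) -> int:
    """Return the bucket index of key in a table whose size is a power of two."""
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of two")
    sizemask = table_size - 1                      # size 8 gives mask 0b111
    hash_value = zlib.crc32(key.encode("utf-8"))   # stand-in hash function
    return hash_value & sizemask                   # index = hash & sizemask


# Hypothetical spatial key built from longitude, latitude and height strings.
bucket = slot_index("118.78:32.04:15.0", 8)
```

Because the table size is a power of two, the bitwise AND with the mask gives the same result as a modulo by the table size, but with a single cheap operation.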

In this paper, the Redis Zset (sorted set) is used to store the index relation of messages on the time dimension. A sorted set orders its elements by a weight (score) assigned to each element. Internally, Redis uses a skip list to store the elements of a sorted set. Insertion, search and deletion in a skip list take O(log N) time on average and O(N) in the worst case, and nodes can be processed in batches by sequential traversal.
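The behaviour of the time index can be sketched without a running Redis server. The class below is an in-memory stand-in, not the Redis implementation: it keeps scores in a sorted Python list (so insertion is linear rather than logarithmic), and it does not enforce the unique-member rule of a real sorted set. The class and method names are hypothetical; in Redis the corresponding commands would be ZADD and ZRANGEBYSCORE.

```python
import bisect
from typing import List


class TimeIndex:
    """In-memory stand-in for a sorted set whose score is a timestamp."""

    def __init__(self) -> None:
        self._scores: List[float] = []   # timestamps, kept in ascending order
        self._members: List[str] = []    # message IDs, parallel to _scores

    def add(self, message_id: str, timestamp: float) -> None:
        # Find the insertion point that keeps the scores sorted.
        pos = bisect.bisect_right(self._scores, timestamp)
        self._scores.insert(pos, timestamp)
        self._members.insert(pos, message_id)

    def range_by_time(self, start: float, end: float) -> List[str]:
        # Return all message IDs with start <= timestamp <= end, in time order.
        lo = bisect.bisect_left(self._scores, start)
        hi = bisect.bisect_right(self._scores, end)
        return self._members[lo:hi]


index = TimeIndex()
index.add("m2", 20.0)
index.add("m1", 10.0)
index.add("m3", 30.0)
recent = index.range_by_time(10.0, 20.0)
```

The range query returns a contiguous slice, which mirrors the batch processing of nodes by sequential traversal described above.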

In this paper, the Set structure is used to store the message type index. Internally, Redis implements a Set with an integer set when all members are integers and the set is small, and with a hash table otherwise. During data retrieval and global situation analysis, attributes such as the data type must also be checked, and this check often only needs to determine whether an element belongs to a category, which matches the characteristics of the Set structure. Therefore, the Redis Set is used to store the index of formatted messages by category. Adding N messages to the category index costs O(N) in total, since SADD is O(1) per added member, and each type check with SISMEMBER is O(1).
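A type index of this kind can be sketched with Python's built-in set, whose membership test takes constant time on average. The class and method names are hypothetical and the code is an in-memory stand-in rather than a Redis client; in Redis the corresponding commands would be SADD, SISMEMBER and SMEMBERS.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class TypeIndex:
    """In-memory stand-in for one Redis Set per message category."""

    def __init__(self) -> None:
        self._by_type: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, message_type: str, message_id: str) -> None:
        # Register the message under its category.
        self._by_type[message_type].add(message_id)

    def contains(self, message_type: str, message_id: str) -> bool:
        # Membership check; does not create an entry for unknown categories.
        return message_id in self._by_type.get(message_type, set())

    def members(self, message_type: str) -> Set[str]:
        # Return a copy so callers cannot modify the index by accident.
        return set(self._by_type.get(message_type, set()))


types = TypeIndex()
types.add("picture", "m1")
types.add("video", "m2")
types.add("picture", "m3")
```

Checking whether a message belongs to a category then needs only `types.contains("picture", "m1")`, without scanning the stored messages.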

3.3 The Lock-Free Concurrent Migration Algorithm Based on the Mark

The third step is to migrate the historical data stored in Redis to the HBase database for archiving. Given the storage format of the Redis in-memory database, this paper adopts a mark-based lock-free concurrent migration algorithm for data migration and archiving, as shown in Fig. 4.
Fig. 4. The lock-free concurrent migration algorithm based on the mark

The specific algorithm is described below:
  (1) Before data migration, lock the data migration flag in the in-memory database. Once the flag is locked, read and write requests to the in-memory database are directed to the temporary storage area, and the message record table of the current storage area is locked.

  (2) The migration program calls the partitioning module, which uses a hash algorithm to divide the set of messages to be migrated in the message record table, \( V = \left\{ F_{1}, F_{2}, F_{3}, \ldots, F_{n} \right\} \), according to the thread capacity m of the configured thread pool, giving the corresponding message groups \( VF = \left\{ VF_{1}, VF_{2}, VF_{3}, \ldots, VF_{m} \right\} \), where:

    $$ VF_{1} = \left\{ vf_{1}, vf_{2}, \ldots, vf_{n_{1}} \right\}, VF_{2} = \left\{ vf_{1}, vf_{2}, \ldots, vf_{n_{2}} \right\}, \ldots, VF_{m} = \left\{ vf_{1}, vf_{2}, \ldots, vf_{n_{m}} \right\}. $$

    $$ \left| VF_{1} \right| + \left| VF_{2} \right| + \left| VF_{3} \right| + \cdots + \left| VF_{m} \right| = n_{1} + n_{2} + n_{3} + \cdots + n_{m} = n. $$

    In other words, the hash algorithm of the partitioning module divides the message records in the current in-memory database into m groups whose sizes sum to n, so that no message is left out. After the partitioning is completed, the same number of threads is started in the thread pool, each migration thread is bound to one group, and each returns a Future object.

  (3) When a migration thread completes, its Future object returns a flag, and these flags are used to determine whether all current message data has been migrated. If the migration is complete, the data migration flag is released and the message list in the temporary storage area is stored into the message record table. After a migrated message object is committed to the HBase database, it is cleared from the in-memory database.
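The three steps can be sketched with Python's standard thread pool. This is a minimal illustration under several assumptions: plain dictionaries stand in for the in-memory database and for HBase, `zlib.crc32` stands in for the partitioning hash, the flag is an ordinary attribute rather than an atomic marker, and all class and function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class MessageStore:
    """Stand-in for the in-memory database: record table plus temporary area."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}     # message record table
        self.temporary: Dict[str, str] = {}   # temporary storage area
        self.migrating = False                # data migration flag

    def write(self, key: str, value: str) -> None:
        # While the flag is set, writes are redirected to the temporary area.
        target = self.temporary if self.migrating else self.records
        target[key] = value


def partition(keys: List[str], groups: int) -> List[List[str]]:
    """Hash every key into exactly one of `groups` buckets."""
    buckets: List[List[str]] = [[] for _ in range(groups)]
    for key in keys:
        buckets[zlib.crc32(key.encode("utf-8")) % groups].append(key)
    return buckets


def migrate(store: MessageStore,
            archive_put: Callable[[str, str], None],
            pool_size: int = 4) -> int:
    store.migrating = True                           # step (1): set the flag
    snapshot = dict(store.records)                   # records to be migrated
    buckets = partition(list(snapshot), pool_size)   # step (2): one group per thread

    def move(bucket: List[str]) -> int:
        for key in bucket:
            archive_put(key, snapshot[key])          # commit to the archive
        return len(bucket)

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(move, bucket) for bucket in buckets]
        moved = sum(f.result() for f in futures)     # step (3): wait for all Futures

    # Every group is committed: drop migrated records, promote the temporary
    # area to the record table, and release the flag.
    store.records = store.temporary
    store.temporary = {}
    store.migrating = False
    return moved


store = MessageStore()
for i in range(10):
    store.write(f"msg:{i}", f"payload-{i}")
archive: Dict[str, str] = {}
migrated = migrate(store, archive.__setitem__, pool_size=4)
```

If a worker raises an exception, `f.result()` re-raises it and the flag stays set, so a production version would need explicit failure handling; that part is outside the scope of this sketch.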


3.4 The Four-Way Coding Algorithm

For the data stored in HBase, this paper uses a four-way coding algorithm to build the data index; the algorithm is shown in Fig. 5.
Fig. 5. The four-way coding algorithm

The general process is as follows. First, the specified message file is read and the messages are classified by message type. Because data of different message categories belongs to different HBase tables, this classification simplifies subsequent processing. Then, the DOM structure of each message is obtained and merged with the existing mapping table. Finally, the extracted data is stored in the HBase database.
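The step of obtaining the DOM structure and merging it with the existing mapping can be sketched as follows. The sketch assumes the messages are XML documents; the root element name `Message`, the attribute `id` and both function names are hypothetical, while the `Name`, `Image` and `Author` nodes follow the example discussed for Fig. 5. Attribute paths are written with an "@" prefix and listed before child elements.

```python
import xml.etree.ElementTree as ET
from typing import List


def element_paths(xml_text: str) -> List[str]:
    """List the distinct element and attribute paths of a message in pre-order."""
    root = ET.fromstring(xml_text)
    paths: List[str] = []

    def visit(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        if path not in paths:
            paths.append(path)
        for name in node.attrib:              # attributes come before children
            attr_path = f"{path}/@{name}"
            if attr_path not in paths:
                paths.append(attr_path)
        for child in node:
            visit(child, path)

    visit(root, "")
    return paths


def new_paths(known: List[str], incoming: List[str]) -> List[str]:
    """Return the incoming paths that the mapping table does not contain yet."""
    return [p for p in incoming if p not in known]


old = element_paths("<Message><Name>a</Name><Author>b</Author></Message>")
new = element_paths(
    '<Message id="7"><Name>a</Name><Image>i.png</Image><Author>b</Author></Message>'
)
added = new_paths(old, new)
```

Only the paths in `added` need new codes; paths that are already mapped keep their existing HBase columns.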

To complete the bi-directional mapping of messages to HBase columns, the problem of newly added nodes must be solved.

Newly added nodes fall into two categories. The first is a node that has not appeared before, such as the Image node in Fig. 5(b), which does not appear in Fig. 5(a) and must be inserted between the Name and Author nodes. The second is an additional node with an existing name: a node of the same name can appear any number of times, so the number of such nodes is not fixed and has no upper limit, and the scheme must adapt to their dynamic expansion, as with the DiscoveryTime node in Fig. 5(b). This paper presents a four-way node coding scheme based on path indexing technology. The scheme is flexible enough to adapt both to the dynamic insertion of elements and to the expansion of nodes of the same name.

The four-way node coding method can be described by the following rules:
  (1) The document node of the message document is the root node and is encoded as 1.

  (2) Each node has a current layer number \( S_{t} \). Nodes with the same parent node have different current layer numbers. The current layer number is an integer: all nodes with the same parent node are numbered according to their access order p in the preorder sequence, and:

    $$ S_{t} = 4p + 1. $$

  (3) Except for the root node, the code of every node I is formed by taking the code \( S_{p} \) of its parent node as a prefix and appending the current layer number \( S_{t} \) of node I, with the separator "." in between. The code of node I is therefore:

    $$ S_{I} = S_{p}.S_{t}. $$

  (4) An attribute node of an element is treated as a child of that element and precedes all of its other child nodes. An attribute node is named by prefixing its original name with the "@" sign.

  (5) For a new node I, if a node A with the same path already exists in Fig. 5(a), check whether a node whose current layer number is one less than that of A, namely \( S_{tA} - 1 \), already exists. If it does not exist, create a new virtual node B with current layer number \( S_{tA} - 1 \) and place node I on the next layer below B; if node B already exists, place node I, in accordance with the previous rules, at the right end of the next layer below B.

    $$ S_{tI} = \left\{ \begin{array}{ll} \left( S_{tA} - 1 \right).1, & S_{tA} - 1 = \varnothing \\ \left( S_{tA} - 1 \right).last(), & S_{tA} - 1 \ne \varnothing \end{array} \right. $$

  (6) When a new message document is processed, its nodes are traversed first. If a node I that has not yet been encoded is found and it appears between two encoded nodes A and B, whose current layer numbers are \( S_{tA} \) and \( S_{tB} \), the current layer number of node I is:

    $$ S_{tI} = \left\{ \begin{array}{ll} \left( S_{tA} + 1 \right).1, & S_{tB} = S_{tA} + 4 \\ \left( S_{tA} + 1 \right).last(), & S_{tB} = S_{tA} + 1 \end{array} \right. $$

That is, a virtual node V is created between node A and node B, and node I becomes the first child of V; if V already exists, node I is placed at the end of the children of V.
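The coding rules can be sketched in Python as follows. Several points are assumptions rather than statements from the rules: the access order p is counted from 0 (which gives the first child the layer number 1), the children of a virtual node are themselves numbered 1, 5, 9, ... so that `last()` means the next free position in that sequence, and all function names are hypothetical.

```python
from typing import List, Set


def initial_codes(parent_code: str, child_count: int) -> List[str]:
    """Rules (2) and (3): the p-th child (p = 0, 1, ...) gets layer number 4p + 1."""
    return [f"{parent_code}.{4 * p + 1}" for p in range(child_count)]


def _append_under_virtual(virtual_code: str, used: Set[str]) -> str:
    """Create the virtual node if needed and return the code of its next child."""
    used.add(virtual_code)
    p = 0
    while f"{virtual_code}.{4 * p + 1}" in used:   # find the first free position
        p += 1
    code = f"{virtual_code}.{4 * p + 1}"
    used.add(code)
    return code


def code_for_new_node(code_a: str, used: Set[str]) -> str:
    """Rule (6): a new node after sibling A goes under virtual node S_tA + 1."""
    parent, s_ta = code_a.rsplit(".", 1)
    return _append_under_virtual(f"{parent}.{int(s_ta) + 1}", used)


def code_for_same_name(code_a: str, used: Set[str]) -> str:
    """Rule (5): a repeated node with A's path goes under virtual node S_tA - 1."""
    parent, s_ta = code_a.rsplit(".", 1)
    return _append_under_virtual(f"{parent}.{int(s_ta) - 1}", used)


siblings = initial_codes("1", 2)           # two original children of the root
used_codes = set(["1"] + siblings)
first_insert = code_for_new_node("1.1", used_codes)
second_insert = code_for_new_node("1.1", used_codes)
repeat = code_for_same_name("1.5", used_codes)
```

Existing codes never change when nodes are added: the new nodes are placed under a virtual node whose layer number lies in the gap left between two original siblings.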

According to the above rules, applying the four-way node coding scheme to Fig. 5(b) gives the result shown in Fig. 5(c). Nodes 1.2 and 1.9 are virtual nodes. A virtual node itself does not represent any path, and all of its child nodes are logically at the same level as the virtual node. When a virtual node is processed, all of its child nodes should therefore be treated as if they were at the level of the virtual node.

A general rule follows from the above. A code ending in 4n + 1 denotes an original element node; a code ending in 4n + 2 denotes newly added nodes; and a code ending in 4n denotes nodes with the same path as the node ending in 4n + 1. Codes ending in 4n + 3 are not used, because the current coding rules already meet the project requirements; they are reserved for possible later extension.
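This rule amounts to taking the last component of a code modulo 4, which can be sketched as follows. The function name and the returned labels are hypothetical.

```python
def node_kind(code: str) -> str:
    """Classify a node code by the remainder of its last component modulo 4."""
    last = int(code.rsplit(".", 1)[-1])
    remainder = last % 4
    if remainder == 1:
        return "original"     # 4n + 1: original element node
    if remainder == 2:
        return "added"        # 4n + 2: newly added nodes
    if remainder == 0:
        return "same-path"    # 4n: nodes sharing the path of the 4n + 1 node
    return "reserved"         # 4n + 3: not used, kept for later extension


kinds = [node_kind(c) for c in ("1", "1.5", "1.2", "1.4", "1.3")]
```

Because the classification depends only on the final component, it can be applied to a code without looking up any other node.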

The four-way node coding scheme eventually produces an index table, which is a bi-directional mapping between message paths and HBase columns. The index table forms the node path index: given a message path, the corresponding HBase column can be obtained quickly, and vice versa.

In HBase, the data table stores the specific message data, and the index table stores the mapping of message attributes to HBase columns. Each row of the data table stores the message name; only one column key is set up, and the column names store the attribute code values obtained from the four-way node coding algorithm. The index table differs from the data table: because all the data in one data table corresponds to a single mapping table, the index of a data table needs only one row in HBase. The row key of each row of the index table is the name of the data table to which that row's index structure corresponds.
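The relation between the two tables can be sketched with plain dictionaries. This is not HBase client code: the table name `report`, the message paths, the node codes and all class and function names are hypothetical, and the sketch only shows how one index row per data table supports translation in both directions.

```python
from typing import Dict


class PathIndex:
    """Bi-directional mapping between message paths and column names (node codes)."""

    def __init__(self, path_to_column: Dict[str, str]) -> None:
        self.path_to_column = dict(path_to_column)
        self.column_to_path = {c: p for p, c in path_to_column.items()}
        if len(self.column_to_path) != len(self.path_to_column):
            raise ValueError("each path needs its own column code")


def to_data_row(message: Dict[str, str], index: PathIndex) -> Dict[str, str]:
    """Encode a message {path: value} as a data-table row {column: value}."""
    return {index.path_to_column[path]: value for path, value in message.items()}


def from_data_row(row: Dict[str, str], index: PathIndex) -> Dict[str, str]:
    """Decode a data-table row {column: value} back into {path: value}."""
    return {index.column_to_path[column]: value for column, value in row.items()}


# Index table: one row per data table, keyed by the data table name.
index_table = {
    "report": PathIndex({"/Message/Name": "1.1", "/Message/Author": "1.5"}),
}
message = {"/Message/Name": "bridge", "/Message/Author": "unit-7"}
row = to_data_row(message, index_table["report"])
restored = from_data_row(row, index_table["report"])
```

Storing the codes as column names keeps the data rows compact, while the single index row is enough to translate any query path into its column and any column back into its path.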

4 Conclusion

The objective of this article is to propose a method for storing heterogeneous data in a mobile environment. Compared with the prior art, the method has the following advantages: (1) it effectively deals with the problems that a complex mobile environment brings to the sharing and use of data, such as heterogeneous data structures, huge data volumes, dispersed physical locations, and complex and diverse data content; (2) it provides timely and effective query access to massive data through the proposed hybrid data storage framework based on cloud computing and a one-stop query method; (3) applied to the construction of mobile information environments, it provides a unified data organization framework for mobile information services, so that data resources from different sources can be unified to support one-stop query access to the massive data resources of the whole network.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Science and Technology on Information Systems Engineering Laboratory, Nanjing Research Institute of Electronics Engineering, Nanjing, China
  2. School of Software, XiDian University, Xi’an, China
