An initial evaluation of 6Stor, a dynamically scalable IPv6-centric distributed object storage system

Precise architecture description and first benchmark results

Abstract

The exponentially growing demand for storage puts a huge stress on traditional distributed storage systems. Historically, I/Ops (Inputs/Outputs per second) of hard drives have been the main limitation of storage systems. With the rapid deployment of solid state drives (SSDs) and the expected evolutions of their capacities, price and performance, we claim that CPU and network capacities will become bottlenecks in the future. In this context, we introduce 6Stor, an innovative, software-defined distributed storage system fully integrated with the networking layer. This storage system departs from traditional approaches in two manners: it leverages IPv6 new capabilities to increase the efficiency of its data plane—notably by using directly UDP and TCP rather than HTTP—and thus its performance; and it circumvents scalability limitations of other distributed systems by using a fully distributed metadata layer of indirection to offer flexibility. In this paper, we introduce and describe in details the architecture of 6Stor, with an emphasis on dynamic scalability and robustness to failure. We also present a testbed that we use to evaluate our novel approach by using Ceph—another well known distributed storage system—as baseline. Results obtained on an extensive testbed are presented and some initial conclusions are drawn.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Notes

  1. 1.

    Agents can be colocated on the same server. For example, monitors can also act as gateways but don’t function properly when located with OSDs. There are typically multiple OSDs per server corresponding to different storage devices.

  2. 2.

    Erasure Coding schemes are being investigated to greatly reduce the storage overhead but are not yet fully integrated to our 6Stor prototype. However, most of the architecture would work exactly the same way with encoded fragments instead of replicas.

  3. 3.

    In practice, the traditionnal linux network stack does not allow to listen on an arbitrary prefix. So implementation-wise, we use the workaround of binding the socket to ANYADDR and creating a local route redirecting the prefix to the loopback interface. This way, the packet is sent to the loopback interface and matches the ANYADDR criteria. There is no MAC resolution issues as the servers are seen as next-hop routers in the routing tables.

  4. 4.

    UDP is used for metadata exchanges because they fit in one packet. It provides better latency and less computing overhead but requires a timeout to retransmit when losses happen.

  5. 5.

    The lower the parameters \(a_m\) and \(a_s\), the lesser delay before the client receives an acknowledgement that the object is written to the cluster but the lower reliability in case of simultaneous failures during the operation.

  6. 6.

    In the background, it is also possible to rebalance data from existing SNs to the new one. However, this is absolutely not mandatory and can be done object by object, without making any node or object replica unavailable as it is the case in some storage systems.

  7. 7.

    A quick handshake is being made before each replica transmission to avoid multiple retransmissions for the same object.

  8. 8.

    This is not totally true, as there need to be at most as many MNs as there are different addresses in the cluster metadata prefix, but in practice this number is in the order of millions of billions.

References

  1. 1.

    Chen, P.M., Lee, E.K., Gibson, G.A., Katz, R.H., Patterson, D.A.: Raid: High-performance, reliable secondary storage. ACM Comput. Surv. (CSUR) 26(2), 145–185 (1994)

    Article  Google Scholar 

  2. 2.

    Cabrera, L.-F., Long, D.D.: Swift: using distributed disk striping to provide high I/O data rates. University of California, Santa Cruz, Computer Research Laboratory, vol. 8523 (1991)

  3. 3.

    Hachman, M.: Notebook hard drives are dead: how SSDS will dominate mobile pc storage by 2018, December 2015. https://www.pcworld.com/article/3011441/storage/notebook-hard-drives-are-dead-how-ssds-will-dominate-mobile-pc-storage-by-2018.html

  4. 4.

    Nanavati, M., Schwarzkopf, M., Wires, J., Warfield, A.: Non-volatile storage. Queue 139, 20:33–20:56. https://doi.org/10.1145/2857274.2874238

  5. 5.

    Davis, R.: The network is the new storage bottleneck, November 2016. https://www.datanami.com/2016/11/10/network-new-storage-bottleneck/

  6. 6.

    McKusick, K., Quinlan, S.: Gfs: Evolution on fast-forward. Commun. ACM 533, 42–49 (2010). https://doi.org/10.1145/1666420.1666439

  7. 7.

    Shvachko, K.: HDFS scalability: the limits to growth. Login: the magazine of USENIX & SAGE 35, 6–16 (2010)

  8. 8.

    Lee, D.-Y., Jeong, K., Han, S.-H., Kim, J.-S., Hwang, J.-Y., Cho, S.: Understanding write behaviors of storage backends in ceph object store. In: Proceedings of the 2017 IEEE International Conference on Massive Storage Systems and Technology, vol. 10 (2017)

  9. 9.

    Ruty, G., Surcouf, A., Rougier, J.L.: Collapsing the layers: 6stor, a scalable and ipv6-centric distributed storage system. In: 2017 Fourth International Conference on Software Defined Systems (SDS), May 2017, pp. 81–86 (2017)

  10. 10.

    Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: A scalable, high-performance distributed file system. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, ser. OSDI ’06. Berkeley, CA, USA: USENIX Association, 2006, pp. 307–320. http://dl.acm.org/citation.cfm?id=1298455.1298485

  11. 11.

    DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly available key-value store. In: Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, ser. SOSP ’07, pp. 205–220. ACM, New York (2007). https://doi.org/10.1145/1294261.1294281

  12. 12.

    Arnold, J.: Openstack Swift: Using, Administering, and Developing for Swift Object Storage. O’Reilly Media, Inc., Sebastopol (2014)

    Google Scholar 

  13. 13.

    Gluster file system. https://docs.gluster.org/en/v3/Quick-Start-Guide/Architecture/

  14. 14.

    Ghemawat, S., Gobioff, H., Leung, S.-T.: The google file system. SIGOPS Oper. Syst. Rev. 37(5), 29–43 (2003). https://doi.org/10.1145/1165389.945450

  15. 15.

    Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), May 2010, pp. 1–10 (2010)

  16. 16.

    Fizians, S.: Rozofs: a fault tolerant i/o intensive distributed file system based on mojette erasure code. In: Workshop Autonomic Oct, vol. 16, p. 17 (2014)

  17. 17.

    Weil, S.A., Leung, A.W., Brandt, S.A., Maltzahn, C.: Rados: a scalable, reliable storage service for petabyte-scale storage clusters. In: Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing’07. ACM, pp. 35–44 (2007)

  18. 18.

    Weil, S.A., Brandt, S.A., Miller, E.L., Maltzahn, C.: Crush: Controlled, scalable, decentralized placement of replicated data. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ser. SC ’06. ACM, New York (2006). https://doi.org/10.1145/1188455.1188582

  19. 19.

    Metz, C.: Google remakes online empire with ‘colossus’. Wired http://www.wired.com/2012/07/google-colossus/ (2012)

  20. 20.

    The lustre file system. http://doc.lustre.org/lustre_manual.xhtml

  21. 21.

    Faibish, S., Peng, T., Bent, J.M., Gupta, U.K., Pedone, Jr, J.M.: Lustre file system, Oct. 3 2017, US Patent 9,779,108

  22. 22.

    Niazi, S., Ismail, M., Haridi, S., Dowling, J., Grohsschmiedt, S., Ronström, M.: Hopsfs: Scaling hierarchical file system metadata using newsql databases. In: 15th USENIX Conference on File and Storage Technologies (FAST 17). Santa Clara, CA: USENIX Association, pp. 89–104 (2017). https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

  23. 23.

    Noel, R.R., Lama, P.: Taming performance hotspots in cloud storage with dynamic load redistribution. In: 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), June 2017, pp. 42–49 (2017)

  24. 24.

    Krish, K.R., Anwar, A., Butt, A.R.: hats: a heterogeneity-aware tiered storage for hadoop. In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2014, pp. 502–511 (2014)

  25. 25.

    Openio. https://www.openio.io/wp-content/uploads/2016/12/OpenIO-CoreSolutionDescription.pdf

  26. 26.

    Nvfuse: a nvme user-space filesystem. https://github.com/nvfuse/nvfuse

  27. 27.

    Marinos, I., Watson, R.N., Handley, M., Stewart, R.R.: Disk, crypt, net: Rethinking the stack for high-performance video streaming. In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication, ser. SIGCOMM ’17, pp. 211–224. ACM, New York (2017). https://doi.org/10.1145/3098822.3098844

  28. 28.

    Braun, M.B., Crowcroft, J.: Sna: Sourceless network architecture. University of Cambridge, Computer Laboratory, Tech. Rep. (2014)

  29. 29.

    Deen, G., Naik, G., Brzozowski, J.J., Daigle, L., Rose, B., Townsley, M.: “Using Media Encoding Networks to address MPEG-DASH video ,” Internet Engineering Task Force, Internet-Draft draft-deen-naik-ggie-men-mpeg-dash-00, July 2016, work in Progress. https://datatracker.ietf.org/doc/html/draft-deen-naik-ggie-men-mpeg-dash-00

  30. 30.

    Srinivasan, S., Schulzrinne, H.: Ipv6 addresses as content names in information-centric networking

  31. 31.

    Ahlgren, B., Dannewitz, C., Imbrenda, C., Kutscher, D., Ohlman, B.: A survey of information-centric networking. IEEE Communications Magazine 50(7) (2012)

  32. 32.

    Muscariello, L., Carofiglio, G., Auge, J., Papalini, M.: Hybrid information-centric networking, internet engineering task force, internet-draft draft-muscariello-intarea-hicn-00, Jun. 2018, work in Progress. https://datatracker.ietf.org/doc/html/draft-muscariello-intarea-hicn-00

  33. 33.

    Carofiglio, G.: Mobile video delivery with hybrid icn. Cisco, Tech. Rep, Tech. Rep (2016)

  34. 34.

    Bonald, T., Oueslati-Boulahia, S., Roberts, J.: Ip traffic and qos control: the need for a flow-aware architecture. In: World Telecommunications Congress. Citeseer (2002)

  35. 35.

    Zhao, W., Olshefski, D., Schulzrinne, H.: Internet quality of service: an overview. Columbia University, New York, Technical Report CUCS-003-00 (2000)

  36. 36.

    Dimakis, A.G., Ramchandran, K., Wu, Y., Suh, C.: A survey on network codes for distributed storage. Proc. IEEE 99(3), 476–489 (2011)

    Article  Google Scholar 

  37. 37.

    Ford, D., Labelle, F., Popovici, F.I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., Quinlan, S.: Availability in globally distributed storage systems. In: Osdi, vol. 10, pp. 1–7 (2010)

  38. 38.

    Luo, T., Lee, R., Mesnier, M., Chen, F., Zhang, X.: hstorage-db: heterogeneity-aware data management to exploit the full capability of hybrid storage systems. Proc. VLDB Endow. 5(10), 1076–1087 (2012). https://doi.org/10.14778/2336664.2336679

  39. 39.

    Kaneko, S., Nakamura, T., Kamei, H., Muraoka, H.: A guideline for data placement in heterogeneous distributed storage systems. In: 2016 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), July 2016, pp. 942–945 (2016)

  40. 40.

    Qin, Y., Ai, X., Chen, L., Yang, W.: Data placement strategy in data center distributed storage systems. In: 2016 IEEE International Conference on Communication Systems (ICCS), Dec 2016, pp. 1–6 (2016)

  41. 41.

    Wu, C.F., Hsu, T.C., Yang, H., Chung, Y.C.: File placement mechanisms for improving write throughputs of cloud storage services based on ceph and hdfs. In: 2017 International Conference on Applied System Innovation (ICASI), May 2017, pp. 1725–1728 (2017)

  42. 42.

    Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw. 11(1), 17–32 (2003). https://doi.org/10.1109/TNET.2002.808407

  43. 43.

    Claise, B., Fullmer, M., Calato, P., Penno, R.: Ipfix protocol specification. Interrnet-draft, work in progress (2005)

  44. 44.

    Wikipedia, “NetFlow—Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=NetFlow&oldid=847024174 (2018)

  45. 45.

    Hawari, M.: “TCP fast open cookie for IPv6 prefixes,” internet engineering task force, internet-draft draft-hawari-tcpm-tfo-ipv6-prefixes-00, July 2015, work in Progress. https://datatracker.ietf.org/doc/html/draft-hawari-tcpm-tfo-ipv6-prefixes-00

  46. 46.

    Goedegebure, S.: Big buck bunny. https://peach.blender.org/

  47. 47.

    Baumann, T.: Valkaama project. http://www.valkaama.com/

  48. 48.

    wrk—a http benchmarking tool. https://github.com/wg/wrk

  49. 49.

    What is vpp? https://wiki.fd.io/view/VPP/What_is_VPP%3F

  50. 50.

    Filsfils, C., Nainar, N.K., Pignataro, C., Cardona, J.C., Francois, P.: The segment routing architecture. In: 2015 IEEE Global Communications Conference (GLOBECOM), Dec 2015, pp. 1–6 (2015)

  51. 51.

    Segment routing. http://www.segment-routing.net/

  52. 52.

    Desmouceaux, Y., Pfister, P., Tollet, J., Townsley, M., Clausen, T.: 6lb: scalable and application-aware load balancing with segment routing. IEEE/ACM Trans. Netw. 26(2), 819–834 (2018)

    Article  Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Guillaume Ruty.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ruty, G., Rougier, J., Surcouf, A. et al. An initial evaluation of 6Stor, a dynamically scalable IPv6-centric distributed object storage system. Cluster Comput 22, 1123–1142 (2019). https://doi.org/10.1007/s10586-018-02897-8

Download citation

Keywords

  • Distributed object storage
  • IPv6 centric networking
  • Software defined systems