The Temple University Hospital Digital Pathology Corpus

  • Nabila Shawki
  • M. Golam Shadin
  • Tarek Elseify
  • Luke Jakielaszek
  • Tunde Farkas
  • Yuri Persidsky
  • Nirag Jhala
  • Iyad Obeid
  • Joseph Picone


Pathology is a branch of medical science focused on the cause, origin, and nature of disease. A typical pathology laboratory workflow involves preparing a tissue specimen on a glass slide with a stain designed to enhance imaging, followed by analysis by a board-certified pathologist using a conventional light microscope. Digital pathology is the process of digitizing an analog image so that it can be manipulated by a computer. Digitizing pathology slides into whole slide images provides many benefits, including real-time, remote analysis of the specimen. Digital pathology is creating an enormous opportunity for the application of machine learning techniques to automate and accelerate the diagnostic process. Over ten million pathology slides are produced and interpreted by experts annually in the United States alone. This suggests that there is an ample supply of data to support machine learning research if it can be acquired and curated in a cost-effective manner.

In this chapter, we discuss the development of the world’s largest open source corpus of digitized pathology images and review the process being used to collect the digital images along with associated standards for annotation and archival. These images are currently being collected at Temple University Hospital and are facilitating the development of automated interpretation technology. This corpus, known as the Temple University Hospital Digital Pathology Corpus (TUHDP), is expected to reach one million images, or one petabyte of data, over the next decade. Though this corpus is currently being collected using a single digital scanner at one institution, we hope over time we can include data from other hospitals and scanning equipment. The initial phase of the project, which is described here, focuses on generating 100,000 images that will be released by December 2020. The first installment of this release, over 20,000 images, is now publicly available.
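The corpus targets above imply a rough per-image storage budget. The following back-of-envelope sketch works through that arithmetic; it assumes decimal units (1 PB = 10^15 bytes) and a uniform average image size, neither of which is stated in the text, and the function names are illustrative only.

```python
# Back-of-envelope storage estimate implied by the stated corpus targets.
# Assumptions (not from the chapter): decimal units and a uniform average
# image size across the whole corpus.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15


def average_image_bytes(total_bytes: int, n_images: int) -> float:
    """Average size of one image if total_bytes is spread over n_images."""
    return total_bytes / n_images


def projected_terabytes(n_images: int, bytes_per_image: float) -> float:
    """Projected storage, in terabytes, for n_images of the given size."""
    return n_images * bytes_per_image / TERABYTE


# One million images in one petabyte gives the average image size.
avg = average_image_bytes(PETABYTE, 1_000_000)
print(f"average image size: {avg / GIGABYTE:.1f} GB")

# Apply that average to the two release milestones named in the text.
for label, n in [("first installment", 20_000), ("initial phase", 100_000)]:
    print(f"{label}: about {projected_terabytes(n, avg):.0f} TB")
```

Under these assumptions the average image is about 1 GB, so the first installment of 20,000 images corresponds to roughly 20 TB and the 100,000-image initial phase to roughly 100 TB. Actual sizes will vary with scan resolution and compression.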

Performance of deep learning systems is heavily dependent on the breadth and quality of the data used. In this chapter, we also introduce some pilot experiments on classifying various types of images using a deep learning system that is based on a combination of convolutional neural networks and long short-term memory networks. We show that performance on relatively simple tasks, such as artifact classification, exceeds 95% sensitivity. We discuss several approaches to memory management and computational complexity issues for these ultra-high-resolution images. We demonstrate that the field of pathology is sufficiently rich to support the development of high-performance classification systems. These systems enable a new generation of decision support technology for pathologists. This directly addresses a future industry need for efficient workflows in response to the projected decline in the number of board-certified pathologists.
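One common way to keep memory use bounded when working with ultra-high-resolution images is to process them tile by tile rather than loading the full image. The sketch below illustrates that general idea only; the abstract does not say which approaches the chapter adopts, and the slide dimensions, tile size, and `tile_grid` helper are hypothetical.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def tile_grid(width: int, height: int, tile: int) -> Iterator[Box]:
    """Yield non-overlapping boxes that cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)


# Hypothetical slide dimensions and tile size, chosen only for illustration.
WIDTH, HEIGHT, TILE = 50_000, 40_000, 1024
BYTES_PER_PIXEL = 3  # assumes uncompressed 8-bit RGB

boxes = list(tile_grid(WIDTH, HEIGHT, TILE))
full_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
tile_bytes = TILE * TILE * BYTES_PER_PIXEL

print(f"tiles: {len(boxes)}")
print(f"full image in memory: {full_bytes / 10**9:.1f} GB")
print(f"one tile in memory: {tile_bytes / 2**20:.1f} MiB")
```

Because the generator yields one box at a time, a caller can read, classify, and discard each region in turn, so peak memory depends on the tile size rather than on the slide size.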


Digital pathology; Deep learning; Big data; Convolutional neural networks; CNN; Long short-term memory networks; LSTM



This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The open source libraries used to develop the deep learning model presented in this chapter are: Shapely v1.6.4, OpenSlide v1.1.1, Abstract Syntax Library, OpenCV-Python v3.4.1, NumPy v1.14.2, PIL v4.2.1, TensorFlow v1.9.0, and Keras v2.2.4.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Nabila Shawki (1)
  • M. Golam Shadin (1)
  • Tarek Elseify (1)
  • Luke Jakielaszek (1)
  • Tunde Farkas (2)
  • Yuri Persidsky (2)
  • Nirag Jhala (2)
  • Iyad Obeid (1)
  • Joseph Picone (1)

  1. The Neural Engineering Data Consortium, Temple University, Philadelphia, USA
  2. Department of Pathology and Lewis Katz School of Medicine, Temple University, Philadelphia, USA
