Static Validation of Dynamically Generated HTML Documents Based on Abstract Parsing and Semantic Processing

  • Hyunha Kim
  • Kyung-Goo Doh
  • David A. Schmidt
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7935)


Abstract parsing is a static-analysis technique that, given a reference LR(k) context-free grammar, checks whether every string a program can dynamically generate as output conforms to the grammar. The technique applies an LR(k) parser for the reference language to data-flow equations extracted from the program, directly parsing all possible string outputs to validate their syntactic well-formedness.
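The idea of parsing every possible output at once, rather than one concrete string at a time, can be illustrated with a much-simplified sketch. Everything in it is an assumption made for illustration: the mini-IR (`emit`, `seq`, `choice`), the function names, and above all the parser, which is a hand-written tag-matching pushdown rather than an LR(k) parser driven by a grammar. Loops, which would require solving the data-flow equations by fixpoint iteration, are left out.

```python
from typing import FrozenSet, Tuple

Stack = Tuple[str, ...]           # open tags still waiting for a close tag
ERROR: Stack = ("<error>",)       # sentinel stack marking a syntax error

# Hypothetical mini-IR describing what a program may print:
#   ("emit", token)          print one tag token such as "<b>" or "</b>"
#   ("seq", [p1, p2, ...])   run the parts in order
#   ("choice", p1, p2)       if/else whose condition is unknown statically

def step(stack: Stack, token: str) -> Stack:
    """Advance the toy tag-matching parser by one token."""
    if stack == ERROR:
        return ERROR
    if token.startswith("</"):
        name = token[2:-1]
        # A close tag must match the most recently opened tag.
        return stack[:-1] if stack and stack[-1] == name else ERROR
    return stack + (token[1:-1],)

def analyze(prog, stacks: FrozenSet[Stack]) -> FrozenSet[Stack]:
    """Push a set of possible parse stacks through the program."""
    kind = prog[0]
    if kind == "emit":
        return frozenset(step(s, prog[1]) for s in stacks)
    if kind == "seq":
        for part in prog[1]:
            stacks = analyze(part, stacks)
        return stacks
    if kind == "choice":
        # Both branches are possible, so keep the stacks from each.
        return analyze(prog[1], stacks) | analyze(prog[2], stacks)
    raise ValueError(f"unknown node kind: {kind}")

def always_well_formed(prog) -> bool:
    """True if every possible output leaves the parser with an empty stack."""
    return analyze(prog, frozenset({()})) == frozenset({()})

good = ("seq", [("emit", "<p>"),
                ("choice", ("seq", [("emit", "<b>"), ("emit", "</b>")]),
                           ("seq", [("emit", "<i>"), ("emit", "</i>")])),
                ("emit", "</p>")])
# In `bad`, one branch opens <b> and the other does not, yet </b> always follows.
bad = ("seq", [("emit", "<p>"),
               ("choice", ("emit", "<b>"), ("seq", [])),
               ("emit", "</b>"),
               ("emit", "</p>")])

print(always_well_formed(good))
print(always_well_formed(bad))
```

The set of stacks plays the role of the abstract parse state: a branch contributes the union of the states reachable through either arm, so a single pass over the program covers every output it could produce.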

In this paper, we extend abstract parsing to do semantic-attribute processing and apply this extension to statically verify that HTML documents generated by JSP or PHP are always valid according to the HTML DTD. The extension is necessary because the HTML DTD cannot be fully described as an LR(k) grammar. We completely define the HTML 4.01 Transitional DTD as an attributed LALR(1) grammar, carry out experiments on selected real-world JSP and PHP applications, and expose numerous HTML validation errors in those applications. In the process, we show experimentally that semantic properties defined by attribute grammars can also be verified using our technique.
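One kind of DTD constraint that does not fit naturally into a context-free grammar is an exclusion: in the HTML 4.01 DTD, for example, an `A` element may not appear anywhere inside another `A`, and a `FORM` may not appear inside another `FORM`. The sketch below checks such a rule by treating the tuple of currently open ancestors as an inherited attribute carried alongside each possible parse stack. It is only an assumed illustration of attribute checking over a set of abstract states; the names `EXCLUDED_INSIDE`, `open_tag`, and `check` are hypothetical and the paper's attributed grammar is not reproduced here.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Stack = Tuple[str, ...]   # currently open elements, outermost first

# Exclusion rules in the style of the HTML 4.01 DTD:
# an element named by a key must not contain the listed elements at any depth.
EXCLUDED_INSIDE: Dict[str, Set[str]] = {"a": {"a"}, "form": {"form"}}

def open_tag(stack: Stack, name: str) -> Tuple[Stack, List[str]]:
    """Open `name` under the given ancestors and report exclusion violations."""
    violations = [
        f"<{name}> inside <{ancestor}>"
        for ancestor in stack
        if name in EXCLUDED_INSIDE.get(ancestor, set())
    ]
    return stack + (name,), violations

def check(stacks: FrozenSet[Stack], name: str) -> Tuple[FrozenSet[Stack], Set[str]]:
    """Apply `open_tag` to every possible stack and collect all violations."""
    new_stacks: Set[Stack] = set()
    errors: Set[str] = set()
    for stack in stacks:
        new_stack, violations = open_tag(stack, name)
        new_stacks.add(new_stack)
        errors.update(violations)
    return frozenset(new_stacks), errors

# Two possible contexts reach the same output statement: one already inside <a>.
contexts = frozenset({("body", "a"), ("body",)})
result_stacks, result_errors = check(contexts, "a")
print(sorted(result_errors))
```

Because the check runs over every stack in the set, a violation is reported as soon as any feasible path through the program could produce it.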


Keywords: static analysis, string analysis, abstract parsing, HTML validation





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hyunha Kim (1)
  • Kyung-Goo Doh (1)
  • David A. Schmidt (2)
  1. Hanyang University, Ansan, South Korea
  2. Kansas State University, Kansas, USA
