Predicting V(D)J Recombination Using Conditional Random Fields

  • Raunaq Malhotra
  • Shruthi Prabhakara
  • Raj Acharya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7632)


V(D)J gene segments undergo combinatorial recombination in T-cells and B-cells, providing humans and other vertebrates with the large number of antibodies required for immunity. Each recombined sequence then undergoes further mutation, so that the resulting antibodies can recognize diverse antigens. Predicting which combination of gene segments formed a particular antibody is an essential task for the study and analysis of disease propagation. We propose a model based on conditional random fields (CRFs) for predicting the boundary positions between the V, D and J gene segments. We train the CRFs on synthetic gene recombinations generated from all of the alleles of the V, D and J gene segments. The alleles corresponding to a read can then be determined by mapping the segmented read to the DNA sequences of the gene segments using software such as BLAST and usearch. We test our method on a simulated dataset as well as on real data from the Stanford_S22 individual.
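As an illustration of the training setup described above, the following is a minimal sketch of how one synthetic recombination with per-base labels might be generated, which is the kind of labeled sequence a CRF could be trained on. It is not the authors' implementation. The function names, the toy sequences, the end trimming, the random junction bases (labeled "N"), and all parameter values are assumptions made for this example; the abstract states only that synthetic recombinations are built from the V, D and J alleles and that recombined sequences carry mutations.

```python
import random


def make_labeled_recombination(v, d, j, rng, max_trim=3, max_insert=4,
                               mutation_rate=0.02):
    """Join one V, D and J sequence into a synthetic read with per-base labels.

    Hypothetical helper: trimming, junction insertions and the mutation
    rate are illustrative assumptions, not values from the paper.
    """

    def trim(seq, left, right):
        # Drop up to max_trim bases from the requested ends, keeping >= 1 base.
        a = rng.randint(0, max_trim) if left else 0
        b = rng.randint(0, max_trim) if right else 0
        a = min(a, len(seq) - 1)
        b = min(b, len(seq) - 1 - a)
        return seq[a:len(seq) - b]

    def junction():
        # Random bases inserted between segments (assumed label: "N").
        return "".join(rng.choice("ACGT")
                       for _ in range(rng.randint(0, max_insert)))

    parts = [
        (trim(v, False, True), "V"),
        (junction(), "N"),
        (trim(d, True, True), "D"),
        (junction(), "N"),
        (trim(j, True, False), "J"),
    ]

    bases, labels = [], []
    for seq, tag in parts:
        for base in seq:
            # Point mutation: swap the base for a different one.
            if rng.random() < mutation_rate:
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
            labels.append(tag)
    return "".join(bases), labels


def boundaries(labels):
    """Return the positions at which the label changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sequences standing in for real V, D and J alleles.
    read, labels = make_labeled_recombination(
        "ACGTACGTAC", "GGGTTT", "CCCAAACCC", rng)
    print(read)
    print("".join(labels))
    print(boundaries(labels))
```

Each base of the read paired with its label gives one training sequence; the positions returned by `boundaries` are the segment boundaries a trained model would be asked to predict on unlabeled reads.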


Keywords: Conditional Random Fields, VDJ recombination, Mapping of DNA sequences



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Raunaq Malhotra (1)
  • Shruthi Prabhakara (1)
  • Raj Acharya (1)
  1. Department of Computer Science Engineering, Pennsylvania State University, University Park, USA
