Twomode clustering of genotype by trait and genotype by environment data
Abstract
In this paper, we demonstrate the use of twomode clustering for genotype by trait and genotype by environment data. In contrast to two separate (one mode) clusterings on genotypes or traits/environments, twomode clustering simultaneously produces homogeneous groups of genotypes and traits/environments. For twomode clustering, we first scan all twomode cluster solutions with all possible numbers of clusters using kmeans. After deciding on the final numbers of clusters, we continue with a twomode clustering algorithm based on a genetic algorithm. This ensures optimal solutions even for large data sets. We discuss the application of twomode clustering to multiple trait data stemming from genomic research on tomatoes as well as an application to multienvironment data on barley.
Keywords
Twomode clustering, Biclustering, Genotype by trait interaction, Genotype by environment interaction, Metabolomics, Tomato, Barley, Twomode kmeans, Genetic algorithm
Introduction
Genotype by environment interaction is the phenomenon that occurs when genotypes respond differentially to changes in the environment. An attractive approach to modelling genotype by environment interaction is to identify groups of genotypes and groups of environments that are internally homogeneous, thereby relegating the genotype by environment interaction to differences between the genotypic and environmental groups. A well-known example of this approach is the twoway (or twomode) clustering method described by Corsten and Denis (1990). The same strategy of reducing the complexity of genotype by environment twoway tables by grouping genotypes and environments can also be applied to genotype by trait data matrices, provided the traits are expressed on a scale that allows direct comparison, for example when the traits are all metabolite concentrations.
Clustering methods order objects (genotypes) or variables (environments, traits) into groups that are similar with respect to some measure, e.g. Euclidean distance or the correlation coefficient (Vandeginste et al. 1998). Clustering is popular because of its visualization possibilities and ease of use. Regular clustering, i.e. oneway clustering, aims at finding the best partitioning in one direction of a twoway table or data matrix. The best partitioning may be defined as the clustering that minimizes the sum, over all clusters, of the squared distances between the data assigned to a cluster and the corresponding cluster center (in other words, the total within cluster distance is minimal). As opposed to regular, oneway clustering, twoway, or twomode, clustering aims to find the best partitioning of the data in two directions (both genotypes and environments/traits). The added benefit in comparison with oneway clustering is that it becomes immediately clear why certain objects have been clustered together, since their variables have been clustered simultaneously.
There are different algorithms available for twomode clustering; one example is twomode kmeans (Vichi 2001; Rocci and Vichi 2008; van Rosmalen et al. 2009). Some methods have a tendency to get stuck in local optima. Other twomode cluster algorithms are based on global optimization methods, such as Simulated Annealing, Tabu Search (van Rosmalen et al. 2009) and Genetic Algorithms (GA) (Hageman et al. 2008b; Cavill et al. 2009). Recently, we have introduced twomode clustering using a Genetic Algorithm in metabolomics (Hageman et al. 2008a, b). GAs work on a group of solutions at a time, using biologically inspired operators such as mutation and crossover to explore the search space. They can take large steps in the search space, thereby reducing the risk of getting trapped in a local optimum.
Twomode clustering has been shown to be a valuable tool for the identification of biologically meaningful clusters in metabolomics data (Hageman et al. 2008b). It can clearly identify genotypes that behave similarly and simultaneously show in which environments or for which traits they behave similarly. After twomode clustering, careful scrutiny of the genotypes and corresponding molecular markers may reveal which markers are responsible for a particular phenotypic response.
We will demonstrate this by performing a genotype by trait and a genotype by environment analysis using twomode clustering on tomato and barley data.
Materials and methods
Data
The first dataset is on tomatoes and is maintained by the Center for BioSystems Genomics (CBSG, http://www.cbsg.nl/). The CBSG is a joint venture in the field of plant genomics of breeding companies, biotech companies, research institutes and universities in the Netherlands. The goal of the tomato CBSG project is to develop a marker assisted strategy for quality traits. This dataset consisted of 94 genotypes, all cultivars provided by five companies involved with the project, which fell into three major categories: cherry, round and beef tomatoes. Almost all cultivars were F1 hybrids. For twomode clustering, we used information on the metabolites and on sensory studies. Metabolic profiles were measured using GCMS and LCMS; more details on this dataset can be found in Ursem et al. (2008), van Berloo et al. (2008) and Gavai et al. (2009). Traits in this context mean metabolites and sensory attributes. The data were range scaled to bring all metabolites and sensory attributes to the same level.
Locations and years for Steptoe × Morex doubled haploid data
Environment  Location  Year 

ID91  Aberdeen, Idaho  1991 
ID92  Tetonia, Idaho  1992 
MAN92  Brandon, Manitoba  1992 
MTd91  Bozeman, Montana (dryland)  1991 
MTd92  Bozeman, Montana (dryland)  1992 
MTi91  Bozeman, Montana (irrigated)  1991 
MTi92  Bozeman, Montana (irrigated)  1992 
SKs92  Saskatoon, Saskatchewan  1992 
WA91  Pullman, Washington  1991 
WA92  Pullman, Washington  1992 
Twomode clustering
Twomode clustering tries to find clusters in objects and variables simultaneously, as opposed to oneway or onemode clustering, where either objects or variables are clustered. In this work, we aim to find the optimal twomode cluster solution between genotypes and environments or genotypes and traits. There are several algorithms available for creating twomode clusters. In this paper we will use two techniques for finding an optimal twomode partitioning: twomode kmeans and genetic algorithm based twomode clustering.

Y (I × J): data matrix of I rows and J columns

R (I × P): membership matrix for I rows (genotypes) of matrix of Y, allowing for P row clusters.

M (P × Q): matrix containing cluster averages for P row and Q column clusters

C (J × Q): membership matrix for J columns (environments or metabolites/traits) of matrix Y.

E (I × J): matrix of residuals, containing the difference between each measurement and its cluster average from matrix M.
Membership (or incidence) matrices R and C contain only zeros apart from a single one in each row. Together they uniquely assign each genotype by environment or genotype by trait element of Y to one of the P row clusters and one of the Q column clusters. The position of the one indicates membership of that particular cluster. The quality of a twomode cluster algorithm depends largely on its ability to find the best solution for the membership matrices R and C. The use of global optimizers reduces the risk of reaching suboptimal solutions, which is why we used GAs for obtaining the final solution.
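The dimensions of the matrices listed above leave only one conformable way to combine them, so the decomposition and the quantity being minimized can be written out as follows. This is a reconstruction from the definitions in this section, not a verbatim copy of the paper's numbered equations; the index maps p(i) and q(j), giving the row cluster of genotype i and the column cluster of environment or trait j, are notation introduced here for illustration.

```latex
\begin{align}
Y &= R\,M\,C^{\top} + E, \\
SS_{res} &= \lVert Y - R\,M\,C^{\top} \rVert_F^{2}
          = \sum_{i=1}^{I}\sum_{j=1}^{J}\bigl(y_{ij} - m_{p(i)\,q(j)}\bigr)^{2}.
\end{align}
```

For fixed R and C, the entries of M that minimize SS_res are the block averages of Y, which matches the description of M as the matrix of cluster averages.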
Twomode kmeans and genetic algorithm based twomode clustering both use the same decomposition of matrix Y. The difference between the two methods is how they arrive at their final solution. Twomode kmeans works on one single solution: it iteratively recalculates the cluster centers and reassigns each row and each column to the nearest cluster. For a detailed discussion of twomode kmeans see van Rosmalen et al. (2009).
The inner workings of twomode clustering with GAs are described elsewhere and are summarized here for clarity. For a more detailed discussion of twomode clustering, GAs or the combination of the two, the reader is referred to Corsten and Denis (1990), Vichi (2001), Hageman et al. (2003), Madeira and Oliveira (2004), Van Mechelen et al. (2004), Turner et al. (2005), Hageman et al. (2008b) and van Rosmalen et al. (2009).
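The alternating scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names, the random initialization, the convergence tolerance and the handling of empty clusters (their averages are set to zero) are all assumptions made for the example. The default of 50 restarts follows the settings table in the Software section.

```python
import numpy as np


def block_means(Y, rows, cols, P, Q):
    """Cluster averages M (P x Q) for given row and column labels."""
    R, C = np.eye(P)[rows], np.eye(Q)[cols]          # membership matrices
    counts = np.outer(R.sum(axis=0), C.sum(axis=0))  # cells per block
    return np.divide(R.T @ Y @ C, counts,
                     out=np.zeros((P, Q)), where=counts > 0)


def ss_res(Y, rows, cols, M):
    """Total within-cluster sum of squares of Y - R M C^T."""
    return float(((Y - M[np.ix_(rows, cols)]) ** 2).sum())


def twomode_kmeans(Y, P, Q, n_restarts=50, max_iter=100, seed=0):
    """Return (SS_res, row labels, column labels) of the best restart."""
    rng = np.random.default_rng(seed)
    I, J = Y.shape
    best = (np.inf, None, None)
    for _restart in range(n_restarts):
        rows, cols = rng.integers(P, size=I), rng.integers(Q, size=J)
        previous = np.inf
        for _step in range(max_iter):
            # Move every row to the row cluster whose averages fit it best.
            M = block_means(Y, rows, cols, P, Q)
            cost = ((Y[:, None, :] - M[:, cols][None]) ** 2).sum(axis=2)
            rows = cost.argmin(axis=1)
            # Same for the columns, using the updated row clusters.
            M = block_means(Y, rows, cols, P, Q)
            cost = ((Y.T[:, None, :] - M[rows].T[None]) ** 2).sum(axis=2)
            cols = cost.argmin(axis=1)
            current = ss_res(Y, rows, cols, block_means(Y, rows, cols, P, Q))
            if previous - current < 1e-12:   # no further improvement
                break
            previous = current
        if current < best[0]:
            best = (current, rows, cols)
    return best


# Toy example (made-up numbers): 6 genotypes x 4 traits, exact 2 x 2 blocks.
Y_demo = np.array([[0, 0, 5, 5]] * 3 + [[10, 10, 20, 20]] * 3, dtype=float)
ss, row_labels, col_labels = twomode_kmeans(Y_demo, P=2, Q=2)
```

Individual restarts can end in a poor partition, for example with all columns in one cluster, which is why only the best of several restarts is kept.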
Genetic algorithm
1. Initialization: GAs work on a group of trial solutions at a time (a group of trial solutions is called a population). At the start of the GA the population is filled with random solutions, which are simply random assignments of data elements to clusters.
2. Evaluation: each trial solution in the population is evaluated. A trial solution is a vector with cluster number assignments. In GA terminology a trial solution is called a string. In this case, for each string the total within cluster sum of squares (SS_res) is calculated as shown in Eq. (2).
3. Stop: a stop criterion is checked, usually a minimal change in SS_res over the last few iterations (called generations), otherwise a predefined maximum number of generations.
4. Selection: a fraction of the best strings, that is, the ones with the smallest SS_res, are selected for the next generation.
5. Recombination: the selected strings are recombined (called crossover) to yield new strings.
6. Mutation: small random changes (called mutation) are applied to the new strings.
An important aspect of GAs is choosing adequate values for the parameters defining the GA itself (Hageman et al. 2003). This can be done with trial and error or with an experimental design. In this case we used the parameters from our previous study with metabolomics data (Hageman et al. 2008b).
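The six steps listed above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. The default population size, number of generations, crossover rate, mutation rate and two-point crossover follow the settings table in the Software section. The string layout (row labels followed by column labels), the fraction of strings kept (`keep`), the stopping rule (`patience` generations without improvement) and the way offspring replace the rest of the population are assumptions.

```python
import numpy as np


def fitness(Y, string, P, Q):
    """SS_res of one string: I row labels followed by J column labels."""
    I = Y.shape[0]
    R, C = np.eye(P)[string[:I]], np.eye(Q)[string[I:]]
    counts = np.outer(R.sum(axis=0), C.sum(axis=0))
    M = np.divide(R.T @ Y @ C, counts, out=np.zeros((P, Q)), where=counts > 0)
    return float(((Y - R @ M @ C.T) ** 2).sum())


def ga_twomode(Y, P, Q, pop_size=200, n_gen=4000, p_cross=0.8, p_mut=0.005,
               keep=0.5, patience=200, seed=0):
    rng = np.random.default_rng(seed)
    I, J = Y.shape
    n = I + J
    upper = np.r_[np.full(I, P), np.full(J, Q)]   # number of labels per gene
    # 1. Initialization: random integer strings.
    pop = (rng.random((pop_size, n)) * upper).astype(int)
    best, best_ss, stale = None, np.inf, 0
    for _generation in range(n_gen):
        # 2. Evaluation: SS_res of every string.
        ss = np.array([fitness(Y, s, P, Q) for s in pop])
        order = np.argsort(ss)
        if ss[order[0]] < best_ss - 1e-12:
            best, best_ss, stale = pop[order[0]].copy(), ss[order[0]], 0
        else:
            stale += 1
        # 3. Stop: no improvement for `patience` generations.
        if stale >= patience:
            break
        # 4. Selection: keep the best fraction unchanged.
        parents = pop[order[:max(2, int(keep * pop_size))]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), 2, replace=False)]
            child = a.copy()
            # 5. Recombination: two-point crossover.
            if rng.random() < p_cross:
                lo, hi = np.sort(rng.choice(n + 1, 2, replace=False))
                child[lo:hi] = b[lo:hi]
            # 6. Mutation: redraw a few labels at random.
            mask = rng.random(n) < p_mut
            child[mask] = (rng.random(n) * upper).astype(int)[mask]
            children.append(child)
        pop = np.vstack([parents] + children)
    return float(best_ss), best[:I], best[I:]


# Toy example (made-up numbers) with reduced settings to keep it quick.
Y_demo = np.array([[0, 0, 5, 5]] * 3 + [[10, 10, 20, 20]] * 3, dtype=float)
ga_ss, ga_rows, ga_cols = ga_twomode(Y_demo, 2, 2, pop_size=40, n_gen=100)
```

Because the selected strings are carried over unchanged, the best SS_res in the population never gets worse from one generation to the next.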
Numbers of row and column clusters
The decomposition as shown in Eq. 1 requires predefined numbers of row and column clusters (indicated by P and Q), which are usually unknown beforehand. There are a number of methods available for estimating the optimal number of clusters (e.g. BIC, GAP statistic, knee/L/scree plots) (Milligan and Cooper 1985; Salvador and Chan 2004). We used the knee method, where the numbers of row and column clusters are plotted against SS_res, the squared within cluster distance (Hageman et al. 2008b). The point beyond which an increase in the number of clusters only marginally decreases SS_res shows up in the graph as a knee or L shape. This point is regarded as the optimal number of clusters. Since the creation of the knee plot requires the calculation of SS_res for all possible combinations of numbers of clusters, this stage was performed using twomode kmeans clustering to save computation time. Although twomode kmeans can get stuck in a local optimum, the global shape and trends of the knee plot will still show how many clusters can be considered optimal. After the numbers of clusters have been chosen, the twomode clustering is repeated from scratch with the GA based twomode clustering using the cluster numbers obtained with twomode kmeans. The idea is that if twomode kmeans is stuck in a local optimum, the GA based twomode clustering may overcome such a local optimum due to the nature of its optimization approach, and find a better twomode partitioning.
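In the paper the knee is read from the plot by eye. As a rough numerical counterpart, the sketch below takes a grid of SS_res values, one per combination of numbers of row and column clusters, and returns the smallest combination beyond which one extra row cluster or one extra column cluster gains little. The function name, the 5% tolerance and the tie-breaking rule are assumptions made for illustration. In practice the grid would be filled by running twomode kmeans for every (P, Q) combination. The numbers used here are made up.

```python
import numpy as np


def knee_from_grid(ss, tol=0.05):
    """Pick (P, Q) from a grid where ss[p - 1, q - 1] is SS_res for p row
    clusters and q column clusters.

    A combination qualifies when adding one row cluster and adding one
    column cluster each lower SS_res by less than `tol` times the SS_res of
    the single-cluster solution. The qualifying combination with the
    smallest P + Q is returned, or None if there is none.
    """
    ss = np.asarray(ss, dtype=float)
    total = ss[0, 0]
    n_p, n_q = ss.shape
    best = None
    for p in range(n_p - 1):
        for q in range(n_q - 1):
            gain_row = (ss[p, q] - ss[p + 1, q]) / total
            gain_col = (ss[p, q] - ss[p, q + 1]) / total
            if gain_row < tol and gain_col < tol:
                candidate = (p + 1, q + 1)
                if best is None or sum(candidate) < sum(best):
                    best = candidate
    return best


# Made-up grid: large drops up to 3 row clusters and 2 column clusters,
# only small drops afterwards.
row_part = np.array([600.0, 300.0, 20.0, 15.0, 12.0])
col_part = np.array([350.0, 10.0, 8.0, 6.0])
ss_grid = np.add.outer(row_part, col_part)
chosen = knee_from_grid(ss_grid)
```

A fixed threshold is cruder than inspecting the plot, so a result like this is best treated as a starting point for the visual check rather than a replacement for it.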
Software
Settings for twomode kmeans clustering and genetic algorithm based twomode clustering
Settings for twomode genetic algorithm  Value 

Data type  Integer 
Population size  200 
Mutation rate  0.005 
Number of generations  4,000 
Crossover rate  0.8 
Crossover type  Two-point crossover 
Settings for twomode kmeans  
Number of restarts  50 
To compare the results from twomode clustering, the tomato data set was also analyzed using principal component analysis. For comparison, an AMMI model (van Eeuwijk et al. 2005) was fitted to the barley data. Mixed model multienvironment QTL mapping was performed following the methods described by Boer et al. (2007), which are presented in a more basic form by Malosetti et al. (2004). All these analyses were performed in GenStat 12th edition (Payne et al. 2009).
Results
Tomato data
The greatest effect in this data set is the difference between cherry tomatoes and the other varieties. The clustering on genotype clearly shows the distinction between the cherry tomatoes and the beef and round tomatoes. One of the genotype clusters consists solely of cherry tomatoes (cluster three). Inspection of the trait direction can provide insight into what is causing this separation. Trait clusters three and four give a clear contrast between cherry tomatoes and the rest of the tomato types. Cluster three shows a higher than average concentration of glucose, fructose and sucrose (and corresponding properties like brix and the sensory attribute taste ‘sweet’). Cluster four contains the sensory attributes ‘mouth feel mealy’, ‘taste unripe’ and ‘taste watery’, which are all below average for the cherry tomatoes. The property ‘fruit weight’ is also below average for the cherry tomatoes, which is expected as they are smaller than the other types. Trait cluster one contains the sensory trait ‘taste tomato’ together with the metabolite isobutylthiazole, one of the odorants associated with the smell of tomatoes. None of the tomato types showed a higher concentration of isobutylthiazole, suggesting that all these tomatoes taste equally of tomato.
Barley data
List of markers that were close to fixation in a particular cluster (frequency higher than 0.81)
Cluster nr  Marker  Fraction in common (> 0.81)  Type  Chromosome  cM 

1  abc162  0.94  AA  2  73.5 
abg19  1.00  AA  2  58.8  
abg2  1.00  AA  2  41.2  
abg459  0.94  AA  2  47.3  
adh8  1.00  AA  2  56.0  
crg3a  0.82  AA  2  125.2  
pox  1.00  AA  2  52.6  
rbcs  0.94  AA  2  27.2  
abg377  0.82  AA  3  98.4  
abg396  0.81  AA  3  73.0  
abg453  0.82  AA  3  109.4  
abg471  0.82  AA  3  32.7  
abg703a  0.82  AA  3  83.6  
psr156  0.88  AA  3  91.3  
wg622  0.81  AA  4  1.4  
abc302  0.82  AA  7  78.2  
abg395  0.82  AA  7  45.6  
abg473  0.81  AA  7  105.2  
abr336  0.82  AA  7  44.2  
ale  0.82  AA  7  68.2  
cdo57b  0.82  AA  7  92.2  
ltp1  0.88  AA  7  52.8  
mSrh  0.82  AA  7  97.9  
rrn2  0.88  AA  7  48.2  
2  abg2  0.86  AA  2  41.2 
abg459  0.82  AA  2  47.3  
chs1b  0.86  AA  2  16.1  
prx2  0.82  AA  2  183.1  
rbcs  0.86  AA  2  27.2  
nar7  0.82  BB  6  76.0  
3  abg2  0.87  BB  2  41.2 
rbcs  0.83  BB  2  27.2  
abg396  1.00  BB  3  73.0  
abg703a  0.96  BB  3  83.0  
psr156  0.83  BB  3  91.3  
4  abg2  1.00  BB  2  41.2 
abg313a  0.83  BB  2  0.0  
abg459  0.88  BB  2  47.3  
chs1b  0.91  BB  2  16.1  
pox  0.83  BB  2  52.6  
rbcs  0.92  BB  2  27.2  
5  abg396  0.83  BB  3  73.0 
abg703a  0.83  BB  3  83.6  
6  –  –  –  –  – 
Twomode clustering can then be seen as a quick way to sample potentially interesting markers associated with the patterns of variation in the column direction (either environments or traits).
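The screening behind the marker table can be illustrated with a short sketch: for each genotype cluster, count how often each marker type occurs among the cluster members and report the markers whose most common type exceeds the 0.81 threshold. The data layout (nested dictionaries), the function name and the toy marker and line names are hypothetical. Only the threshold and the AA/BB coding are taken from the table.

```python
from collections import Counter


def near_fixed_markers(genotypes, clusters, threshold=0.81):
    """Markers close to fixation within each genotype cluster.

    genotypes: dict mapping line -> dict mapping marker -> 'AA' or 'BB'
               (missing calls are simply left out).
    clusters:  dict mapping line -> cluster number.
    Returns {cluster: [(marker, type, frequency), ...]}.
    """
    result = {}
    for k in sorted(set(clusters.values())):
        members = [line for line, c in clusters.items() if c == k]
        markers = sorted({m for line in members for m in genotypes[line]})
        hits = []
        for m in markers:
            calls = Counter(genotypes[line][m] for line in members
                            if m in genotypes[line])
            marker_type, count = calls.most_common(1)[0]
            freq = count / sum(calls.values())
            if freq > threshold:
                hits.append((m, marker_type, round(freq, 2)))
        result[k] = hits
    return result


# Toy data with hypothetical line and marker names.
toy_genotypes = {
    "line1": {"m1": "AA", "m2": "AA"},
    "line2": {"m1": "AA", "m2": "AA"},
    "line3": {"m1": "AA", "m2": "AA"},
    "line4": {"m1": "AA", "m2": "BB"},
    "line5": {"m1": "AA", "m2": "BB"},
    "line6": {"m1": "BB", "m2": "AA"},
    "line7": {"m1": "BB", "m2": "BB"},
}
toy_clusters = {"line1": 1, "line2": 1, "line3": 1, "line4": 1, "line5": 1,
                "line6": 2, "line7": 2}
fixed = near_fixed_markers(toy_genotypes, toy_clusters)
```

A cluster without any marker above the threshold yields an empty list, which corresponds to a row of dashes in the table.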
Concluding remarks
We have demonstrated the use of twomode clustering to explore genotype by trait and genotype by environment data. We first examined different numbers of clusters in the twomode clustering using kmeans and the knee plot. After deciding on the final numbers of clusters, a GA based twomode clustering was used to look for a better solution than the one found by kmeans. The GA found a better solution only for the genotype by environment data.
Twomode clustering is able to extract relevant features from both data sets. It finds genotypes with a similar response in particular sets of environments (barley data set) or sets of genotypes that share a common set of characteristics (tomato data set). In the interpretation of the twomode clustering, external information can be useful. For example, in the barley data set, a closer analysis of the genotype clusters in relation to molecular markers provided information about relevant markers associated with the differential response of genotypes in particular environments.
Twomode clustering as presented in this paper is an easy-to-use tool that gives a clear and easily interpreted graphical overview of a multidimensional data set, capturing the most relevant features of the data set under study. Our implementation of twomode clustering is very similar in spirit to the twoway clustering algorithm presented by Corsten and Denis (1990), but it seems easier to generalize to other contexts, such as threemode cluster analysis or clustering procedures for generalized linear models in place of linear models. We are currently working on such extensions to the twomode kmeans and GA based twomode clustering algorithms.
Acknowledgments
This project was (co)financed by the Centre for BioSystems Genomics (CBSG) which is part of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
Boer MP, Wright D, Feng LZ, Podlich DW, Luo L, Cooper M, van Eeuwijk FA (2007) A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics 177:1801–1813
Cavill R, Keun HC, Holmes E, Lindon JC, Nicholson JK, Ebbels TMD (2009) Genetic algorithms for simultaneous variable and sample selection in metabonomics. Bioinformatics 25:112–118
Corsten LCA, Denis JB (1990) Structuring interaction in 2-way tables by clustering. Biometrics 46:207–215
Gavai AK, Tikunov Y, Ursem R, Bovy A, van Eeuwijk F, Nijveen H, Lucas PJF, Leunissen JAM (2009) Constraint-based probabilistic learning of metabolic pathways from tomato volatiles. Metabolomics 5:419–428
Hageman JA, Streppel M, Wehrens R, Buydens LMC (2003) Wavelength selection with Tabu search. J Chemometr 17:427–437
Hageman JA, Hendriks M, Westerhuis JA, van der Werf MJ, Berger R, Smilde AK (2008a) Simplivariate models: ideas and first examples. PLoS ONE 3(9):1–12
Hageman JA, van den Berg RA, Westerhuis JA, van der Werf MJ, Smilde AK (2008b) Genetic algorithm based twomode clustering of metabolomics data. Metabolomics 4:141–149
Kleinhofs A, Kilian A, Maroof MAS, Biyashev RM, Hayes P, Chen FQ, Lapitan N, Fenwick A, Blake TK, Kanazin V, Ananiev E, Dahleen L, Kudrna D, Bollinger J, Knapp SJ, Liu B, Sorrells M, Heun M, Franckowiak JD, Hoffman D, Skadsen R, Steffenson BJ (1993) A molecular, isozyme and morphological map of the barley (Hordeum vulgare) genome. Theor Appl Genet 86:705–712
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1:24–45
Malosetti M, Voltas J, Romagosa I, Ullrich SE, van Eeuwijk FA (2004) Mixed models including environmental covariables for studying QTL by environment interaction. Euphytica 137:139–145
Mathworks (2008) Matlab 7.1
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159–179
Payne R, Harding S, Murray D, Soutar D, Baird D, Glaser A, Channing I, Welham S, Gilmour A, Thompson R, Webster R (2009) The guide to GenStat release 12, part 2: statistics. VSN International, Hemel Hempstead
Rocci R, Vichi M (2008) Twomode multipartitioning. Comput Stat Data Anal 52:1984–2003
Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. Proceedings of the 16th IEEE international conference on tools with artificial intelligence (ICTAI 2004)
Turner HL, Bailey TC, Krzanowski WJ, Hemingway CA (2005) Biclustering models for structured microarray data. IEEE/ACM Trans Comput Biol Bioinform 2:316–329
Ursem R, Tikunov Y, Bovy A, van Berloo R, van Eeuwijk F (2008) A correlation network approach to metabolic data analysis for tomato fruits. Euphytica 161:181–193
van Berloo R, Zhu AG, Ursem R, Verbakel H, Gort G, van Eeuwijk FA (2008) Diversity and linkage disequilibrium analysis within a selected set of cultivated tomatoes. Theor Appl Genet 117:89–101
van Eeuwijk FA, Malosetti M, Yin XY, Struik PC, Stam P (2005) Statistical models for genotype by environment data: from conventional ANOVA models to ecophysiological QTL models. Aust J Agric Res 56:883–894
Van Mechelen I, Bock HH, De Boeck P (2004) Twomode clustering methods: a structured overview. Stat Methods Med Res 13:363–394
van Rosmalen J, Groenen PJF, Trejos J, Castillo W (2009) Optimization strategies for twomode partitioning. J Classif 26:155–181
Vandeginste BGM, Massart DL, Buydens LMC, Jong SD, Lewi PJ, Smeyers-Verbeke J (1998) Handbook of chemometrics. Elsevier, Amsterdam
Vichi M (2001) Double kmeans clustering for simultaneous classification of objects and variables. Springer, Heidelberg