Introduction

Gene copy number variations (CNVs) have been recognized as a major source of variation in humans and other mammals (Iafrate et al. 2004; Sebat et al. 2004; Freeman et al. 2006) as well as in maize (Springer et al. 2009). The duplicated genomic segments leading to CNV are usually reported to be larger than 1 kb (Stankiewicz and Lupski 2010) and can contain one or more genes. CNVs can be classified into two groups based on their frequency in populations—recurrent CNVs and rare CNVs, which are likely to be induced by differing mechanisms. The most common genetic mechanism causing duplication or deletion type recurring CNVs in humans is NAHR (non-allelic homologous recombination) while mechanisms like FoSTeS (Fork Stalling and Template Switching) and MMBIR (microhomology-mediated break-induced replication) are involved in rare CNV events and make use of replication mechanisms (Liu et al. 2012). Non-homologous mechanisms, such as multiple NHEJ (non-homologous end joining), may also account for some complex rearrangements (Gu et al. 2008).

CNVs have not been investigated extensively in conifers, although the amount and quality of genomic information are increasing (Nystedt et al. 2013; Zimin et al. 2017) and initial studies indicate that they may be quite common in spruce (Prunier et al. 2017). Conifer genomes have several properties that may facilitate CNV formation, such as a high proportion of repetitive sequences, which can facilitate unequal crossovers and other genomic rearrangements (Gu et al. 2008), the presence of gene family clusters (Liu and Ekramoddoullah 2009; Hedman et al. 2013; Warren et al. 2015), and the presence of active transposons (Voronova and Rungis 2014). While the genome segments involved in CNV can be large, the distribution of genes in conifer genomes, averaging one gene per 705 kb in Picea abies (Nystedt et al. 2013), may imply that duplication of large genomic segments could involve only one gene. In the maize genome, the overwhelming majority of CNV events involve one gene (Swanson-Wagner et al. 2010). There is evidence suggesting that gene duplicates from whole genome duplication events diversify developmental and physiological regulation, whereas tandem duplicates increase the diversity of genes involved in environmental response, including resistance to pathogens (Salojärvi et al. 2017).

Individuals containing multiple copies of a gene can have higher levels of gene expression, thus influencing the phenotype (Chen et al. 2006; Sutton et al. 2007; Díaz et al. 2012; Mehta et al. 2014). CNV analyses of quantitative trait loci in tree species are scarce, but there are some reports in crop species. Increased copy number of the wheat Rht-D1b allele is correlated with yield (Pearce et al. 2011; Li et al. 2012), duplication of the ZMM19 MADS-box gene leads to changes in cob phenotype in maize (Wingen et al. 2012), and CNVs influencing growth and development have also been identified in the potato genome (Iovene et al. 2013). In addition, CNVs have been shown to influence pest resistance in soybean (Cook et al. 2012) and glyphosate resistance in Amaranthus palmeri (Gaines et al. 2010).

CNVs can be detected using several methods including representational oligonucleotide microarray analysis (ROMA) (Lucito et al. 2003), fosmid paired end sequencing (Tuzun et al. 2005), fluorescent in situ hybridization (FISH), comparative genomic hybridization (CGH) (Kallioniemi et al. 1992; Ju et al. 2010), array comparative genomic hybridization (aCGH) (Perry et al. 2008), use of high-density whole-genome SNP microarrays (Huang et al. 2004), digital PCR (Dube et al. 2008), several next-generation sequencing approaches (Krumm et al. 2012; Duan et al. 2013; Wang et al. 2014; D’Aurizio et al. 2016), and pyrosequencing (Cantsilieris et al. 2013). However, real-time PCR remains the reference method most often used to confirm CNVs identified by other methods (Hashemi et al. 2013; Ghosh et al. 2014).

Scots pine is ecologically and commercially the most important tree species in Latvian forests, being the dominant species in 29% of forests (more than 0.97 million ha) (Ministry of Agriculture of the Republic of Latvia 2014). A breeding program for Scots pine has been established in Latvia, and the infrastructure of this breeding program includes seed orchards and tree nurseries. One of the traits of interest for pine breeding is the resistance to root rot caused by Heterobasidion annosum, but this trait has not been included in the breeding program as it is difficult to characterize the degree of resistance against H. annosum in Scots pine. Research into the molecular genetic responses of conifers to Heterobasidion infection has identified differentially expressed resistance genes (Adomas et al. 2007) as well as differences in expression levels between individuals (Danielsson et al. 2011).

To further investigate the basis of this variation, qPCR and C-HRM (Borun et al. 2014) were utilized to analyze CNV of the Scots pine thaumatin-like protein (PsTLP) gene. In vitro analyses have shown that the protein encoded by this gene inhibits the growth of H. annosum as well as a range of other fungi (Snepste et al., submitted). An initial investigation by qPCR using one primer set revealed evidence of CNV of the PsTLP gene in Latvian Scots pine populations (Šķipars et al. 2011). In this study, three primer sets were used to determine the relative amplicon quantities of different regions of the PsTLP gene using qPCR, and two primer sets were used for C-HRM analysis. Three endogenous control genes were used in qPCR. In addition, a limited number of samples were also analyzed using digital PCR (dPCR). Usually, detection of CNV using qPCR utilizes reference samples with predetermined copy number (D’haene et al. 2010). However, there are no Scots pine reference samples with well-characterized gene copy numbers that could be utilized. Additional evidence for the existence of multiple copies of the TLP gene was obtained by analysis of Pinus sylvestris transcriptome data obtained from a single individual and of publicly available genomic sequence scaffolds of Pinus taeda and Pinus lambertiana.

Materials and methods

Experimental material

Twenty-three mature Scots pine individuals (GE05, GE06, and GE09–GE29) were utilized for CNV analyses of the PsTLP gene using the qPCR and C-HRM methods. The trees originate from a pine breeding program progeny trial established in 1979 located in Kalsnava district, Latvia. Samples are open-pollinated progeny obtained from a number of different mother trees, and are a subset of samples previously analyzed for CNVs (Šķipars et al. 2011). DNA was extracted from fresh needles using the Genomic DNA isolation kit (Thermo Fisher Scientific) and quantified using a Qubit fluorometer and the dsDNA BR kit (Thermo Fisher Scientific). The integrity of DNA was assessed by electrophoresis on a 2% agarose gel.

qPCR

CNV of the PsTLP gene (GenBank accession no. JX461338.1, total length 936 bp, CDS 43-121, 267-892) was analyzed using three different primer sets, amplifying separate, non-overlapping regions of the PsTLP gene. Primer set TLP3′ amplifies the region from nucleotide 797 to 861 of the PsTLP gene, primer set TLPc amplifies the region from nucleotide 474 to 693, and primer set TLP5′ amplifies the region from nucleotide 62 to 161. Amplicons of primer sets TLP3′ and TLPc are entirely within the protein coding region of the gene, while the amplicon of primer pair TLP5′ contains both protein coding and intron sequence. The amplicon of TLPc partly covers the sequence encoding the signature amino acids of the thaumatin family (Prosite accession no. PS00316) (Fig. 1).
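The amplicon positions stated above can be checked against the CDS coordinates with a short calculation. The following is a minimal sketch, assuming that all coordinates are 1-based and inclusive and that the quoted amplicon ranges include the primer sequences; the function and variable names are illustrative only and are not part of the original analysis.

```python
# CDS segments and amplicon ranges as quoted in the text (assumed 1-based, inclusive).
CDS_SEGMENTS = [(43, 121), (267, 892)]
AMPLICONS = {"TLP5'": (62, 161), "TLPc": (474, 693), "TLP3'": (797, 861)}


def coding_overlap(start, end, segments):
    """Number of amplicon bases that fall inside any CDS segment."""
    total = 0
    for seg_start, seg_end in segments:
        overlap = min(end, seg_end) - max(start, seg_start) + 1
        if overlap > 0:
            total += overlap
    return total


def describe(start, end, segments):
    """Amplicon length split into coding and non-coding bases."""
    length = end - start + 1
    coding = coding_overlap(start, end, segments)
    return {"length": length, "coding": coding, "non_coding": length - coding}


summary = {name: describe(s, e, CDS_SEGMENTS) for name, (s, e) in AMPLICONS.items()}

for name, info in summary.items():
    print(name, info)
```

Under these assumptions, the TLPc and TLP3′ amplicons lie entirely within the second CDS segment, while the TLP5′ amplicon extends past position 121 into the sequence between the two CDS segments.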

Fig. 1

Schematic depiction of the PsTLP gene. Coding regions are presented by gray bars; 5′ and 3′UTRs and the intron are represented as a black line. Regions amplified by primer pairs used in this study are depicted below

All PsTLP-specific primer sets were analyzed with three different endogenous controls: glyceraldehyde 3-phosphate dehydrogenase (GAPDH), an Avr9/Cf-9 rapidly elicited (ACRE) gene homolog (PsACRE), and a Pinus taeda water-stress-inducible protein (Lp3-1) gene. The GAPDH gene is widely utilized as a control gene for quantification of gene expression; therefore, the utility of this gene as a control for gene copy number was investigated. The PsACRE gene was chosen as a control because it had given consistent amplification results from DNA samples in previous experiments in our laboratory. The Lp3-1 gene (NCBI accession number U52865) has been characterized as a conserved ortholog sequence in conifers (Krutovsky et al. 2006), and therefore was expected to be stable and conserved within P. sylvestris. All primers were screened for specificity in silico using NCBI BLAST (Altschul et al. 1990), and no significant similarities with other sequences were found. The utilized primer sequences are given in Table 1.

Table 1 Sequences of PCR primers utilized in this study

The qPCR protocol for determination of relative CN of the PsTLP gene (reaction volume 10 μl) was as follows: 2 μl of 5× HOT FIREPol® EvaGreen® HRM Mix (Solis BioDyne), 250 nM each of forward and reverse primer, 5 ng of Scots pine DNA, deionized water. Thermal cycling conditions were as follows: 15 min at 95 °C for initial denaturation and polymerase activation, followed by 40 cycles of 95 °C for 15 s, 60 °C for 20 s, and 72 °C for 1 min. Data interpretation is described in D’haene et al. (2010) and Šķipars et al. (2011).

For valid ΔΔCT calculations, the amplification efficiencies of target amplicons (TLP3′, TLPc, TLP5′) and endogenous control amplicons (GAPDH, PsACRE, Lp3-1) must be within 10% of each other (Schmittgen and Livak 2008). Amplification efficiency was determined by the CT slope method. CT values were measured over a twofold dilution range (1.75–14 ng) of three DNA samples (individuals GE05, GE14, GE15). The amplification efficiencies were as follows: TLP5′, 100.41%; TLPc, 96.93%; TLP3′, 96.22%; GAPDH, 102.32%; PsACRE, 98.07%; Lp3-1, 95.62%. In previous reports, standard samples with known target gene copy number were utilized for calculation of the rescaling factor. However, such standard samples are not available for Scots pine; therefore, samples utilized for rescaling factor calculations were chosen from among the analyzed samples. As it is expected that the majority of samples would show similar results (representing the most common gene copy number), the samples belonging to the majority (by relative quantitation results) could be used as reference samples. The rescaling factor is utilized to classify quantitative results into discrete relative gene copy number classes, and therefore does not influence the relative ranking of the individuals, but may affect the interpretation of the gene copy number for individuals with quantitation results close to the boundaries between relative gene copy number classes. Samples used for rescaling factor calculation and the calculated rescaling factors are given in Table 2.
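The efficiency and ΔΔCT calculations referred to above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: it assumes the commonly used relation E = 10^(−1/slope) − 1 for the CT slope method, interprets "within 10%" as a difference of at most 10 percentage points, and uses hypothetical CT values for the dilution series. Only the efficiency percentages are taken from the text.

```python
import math


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def amplification_efficiency(dna_ng, ct_values):
    """Percent efficiency from the slope of CT against log10(template amount)."""
    log_input = [math.log10(q) for q in dna_ng]
    slope = fit_slope(log_input, ct_values)
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def relative_quantity(ct_target, ct_control, ct_target_ref, ct_control_ref):
    """2^(-ddCT) of a sample relative to a reference sample."""
    delta_sample = ct_target - ct_control
    delta_ref = ct_target_ref - ct_control_ref
    return 2.0 ** (-(delta_sample - delta_ref))


def max_efficiency_gap(target_eff, control_eff):
    """Largest target-versus-control efficiency difference (percentage points)."""
    return max(abs(t - c) for t in target_eff.values() for c in control_eff.values())


# Hypothetical CT values for a twofold dilution series (1.75-14 ng).
dilution_ng = [1.75, 3.5, 7.0, 14.0]
example_ct = [27.0, 26.0, 25.0, 24.0]
efficiency = amplification_efficiency(dilution_ng, example_ct)

# Efficiencies reported in the text.
targets = {"TLP5'": 100.41, "TLPc": 96.93, "TLP3'": 96.22}
controls = {"GAPDH": 102.32, "PsACRE": 98.07, "Lp3-1": 95.62}
gap = max_efficiency_gap(targets, controls)

print(round(efficiency, 2), round(gap, 2), gap <= 10.0)
```

With the reported efficiencies, the largest target-to-control difference is between TLP3′ and GAPDH, which remains below the 10-point limit under the interpretation assumed here.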

Table 2 Samples used for calculation of rescaling factors

C-HRM

Due to the multiplex nature of the C-HRM reaction, the GAPDH and PsACRE genes, which were used as endogenous controls in the qPCR analysis, could not be used as controls because the melting temperatures of their amplicons overlap with that of the amplicon of interest. Therefore, only the Lp3-1 gene was used as the endogenous control for C-HRM analysis. Efficiency of the multiplex C-HRM reaction was tested by determining the PCR efficiency for individual primer sets and for the multiplex reaction using fivefold serial dilutions (2.4 to 300 ng per reaction). qPCR reactions with 2.4 to 60 ng of DNA per reaction showed no sign of significant PCR inhibition, in contrast to reactions with 300 ng of DNA per reaction (TLP3′ 110.98%, Lp3-1 91.57%, multiplex 100.95%). C-HRM efficiency of the reaction with the TLPc primer set was determined by analyzing the results of a twofold serial dilution (2.5–20 ng), and TLPc amplification efficiency was 104.22% (multiplex 94.41%). Fifteen nanograms of DNA per reaction were utilized for C-HRM analyses. Primer set TLP5′ was not used in C-HRM analysis because the melting temperature of its amplicon overlaps with those of the control amplicons. Reaction conditions (total volume 20 μl) were as follows: 4 μl of 5× HOT FIREPol® EvaGreen® HRM Mix (Solis BioDyne), 250 nM each of forward and reverse primer, 15 ng of Scots pine DNA, deionized water. Thermal cycling conditions were as follows: 15 min at 95 °C for initial denaturation and polymerase activation, followed by 26 cycles of 95 °C for 15 s, 60 °C for 20 s, and 72 °C for 20 s, followed by a high-resolution melting curve stage. The reaction was performed on an Applied Biosystems StepOnePlus instrument.

In the original report about this method, standard samples with known copy number of the target genes were used to obtain the peak height ratio for use in data normalization, assuming that, relative to the peak height ratio of the standard samples, the peak height ratio is 0.5× for samples with a gene deletion (n copies) and 1.5× for samples with a duplication (3n copies) (Borun et al. 2014). Data normalization essentially means that the peak height ratio of a sample is divided by the average peak height ratio of the standard samples (Borun et al. 2014). After including standard deviations, the authors of this method created an approximate scale for assignment of analyzed samples to different sample groups: a normalized peak height ratio below 0.6 indicates a deletion, a ratio between 0.9 and 1.1 indicates unchanged copy number compared to controls, and a ratio above 1.4 indicates a duplication (Borun et al. 2014). We extrapolated this scale so that a value of 2.0 ± 0.1 would correspond to a relative copy number of 4n.
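The classification scale described above can be expressed as a small decision function. This is a sketch under stated assumptions: the group labels are illustrative, ratios that fall between the published intervals are reported as "unassigned", and the extrapolated 4n interval (2.0 ± 0.1) is checked before the general "above 1.4" duplication rule. The raw peak height ratios in the example are hypothetical.

```python
def normalize_ratio(raw_peak_ratio, standard_ratio):
    """Divide a sample's peak height ratio by the average ratio of the standards."""
    return raw_peak_ratio / standard_ratio


def classify_normalized_ratio(ratio):
    """Assign a normalized peak height ratio to a copy number group."""
    if ratio < 0.6:
        return "deletion"
    if 0.9 <= ratio <= 1.1:
        return "unchanged"
    if 1.9 <= ratio <= 2.1:
        return "4n (extrapolated)"
    if ratio > 1.4:
        return "duplication"
    return "unassigned"  # falls between the defined intervals


# Hypothetical raw ratios, normalized with a hypothetical standard ratio of 0.8.
standard = 0.8
raw_ratios = {"S1": 0.4, "S2": 0.8, "S3": 1.0, "S4": 1.2, "S5": 1.6}
groups = {
    name: classify_normalized_ratio(normalize_ratio(raw, standard))
    for name, raw in raw_ratios.items()
}
print(groups)
```

The "unassigned" outcome makes explicit that the scale leaves gaps (0.6 to 0.9 and 1.1 to 1.4) in which a sample cannot be placed in a group.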

Digital PCR

Digital PCR was performed on the Life Technologies QuantStudio® 3D Digital PCR System using QuantStudio™ 3D Digital PCR Master Mix, and data were analyzed using QuantStudio® 3D AnalysisSuite™ Cloud Software.

The composition of one reaction with total volume of 15 μl was 7.5 μl of Digital PCR Master Mix, PsTLP assay containing primers TLP3'-F and TLP3'-R with final concentration 900 nM and probe TLP3'-P with final concentration 300 nM, GAPDH assay containing primers GAPDH-F and GAPDH-R with final concentration 900 nM and probe GAPDH-P with final concentration 300 nM, 20 ng of genomic DNA. Each reaction (14.5 μl) was loaded onto a QuantStudio™ 3D Digital PCR 20K chip. PCR was performed on a GeneAmp PCR System 9700. The cycling conditions were 10 min at 96.0 °C followed by 39 cycles of 2 min at 60.0 °C and 30 s at 98.0 °C, then a hold for 2 min at 60.0 °C followed by storage at 10.0 °C in the instrument until the reading of the chips.

Transcriptome sequencing

Transcribed sequences were obtained from analysis of one clone (sample GE24) after inoculation with H. annosum (strain V Str 28). RNA was extracted following the method described in Šķipars et al. (2014); obtained RIN (RNA integrity number) values exceeded 7. Ribosomal RNA was removed using the Thermo RiboMinus™ Plant kit for RNA-Seq, and the transcriptome libraries were prepared using the Ion Total RNA-Seq Kit v2 (both kits from Thermo Fisher Scientific). The following steps including emulsion PCR and IonTorrent sequencing were performed at the Latvian Biomedical Research and Study Centre. Transcriptome reads were aligned against expected amplicon sequences for primer sets TLP5′, TLPc, and TLP3′, alignment limited to 100 best hits. For graphical depiction, transcriptome sequences were trimmed and aligned to the amplicons; singleton sequences were removed. Sequences were grouped into haplotypes manually. The transcriptome read database used for the analysis contained 60 million reads. Software analyses were performed using CLC Genomics Workbench (Qiagen) and Vector NTI (InforMax Inc.).

Results

Analysis of the qPCR results (supplementary Table 1) indicated that both the gene region amplified and the endogenous control used in the analysis can have an effect on estimated relative copy number of the PsTLP gene (Table 3). Comparison of the calculated relative gene copy number interpretation results revealed that, depending on the endogenous control used, six to seven samples had the same copy number of all three PsTLP gene regions. Four samples (GE06, GE10, GE16, and GE21) had the same copy number for each gene region regardless of the endogenous control utilized. Two samples, GE16 and GE21, had the same calculated relative gene copy number for all three gene regions with all endogenous controls. Sample GE19 had the same region-specific calculated relative gene copy number with all endogenous controls. In four cases, the relative copy number of the TLP3′ region was increased by two copies or more compared to the estimated copy numbers of the other two gene regions. These samples include GE09, GE17, GE27, and GE29, regardless of the endogenous control used. In contrast, in sample GE18, the copy number of the TLP3′ region decreased by two copies compared to copy numbers of the other gene regions, regardless of the endogenous control utilized. This indicates that the 3′ region was more variable in terms of copy number in comparison to the 5′ and central regions, regardless of the endogenous control utilized. The calculated relative copy number of the TLPc region for sample GE22 increased by two copies when GAPDH was used as the endogenous control and a decreased copy number for sample GE14 was calculated when Lp3-1 was used as the endogenous control. The copy number of the TLP5′ region did not have a difference of more than two copies between the analyzed individuals, regardless of the endogenous control. Samples GE06, GE10, and GE13 had an endogenous control-specific increase or decrease of estimated gene copy number. These results highlight the necessity of using several gene regions and several endogenous controls to ensure accurate CNV assay results.

Table 3 Comparison of calculated relative copy number values by qPCR of the three regions of the PsTLP gene using three endogenous controls

All qPCR analyses were performed using 5 ng of template DNA. This provides the opportunity to not only use the relative quantity values determined by use of reference samples and endogenous controls but to also analyze the raw Ct values, which are expected to be very similar between samples for the endogenous controls as well as the target amplicons for samples with similar gene region copy numbers. Analysis of the deviation of sample Ct values from average Ct values for endogenous control amplicons reveals differences in Ct values between individuals (Fig. 2).

Fig. 2

a Comparison of deviations from average Ct value for each sample depending on target region. b Comparison of deviations from average Ct value for each sample depending on endogenous control

The observed deviations can be expressed as the theoretical influence of the deviation in the endogenous control reaction on the relative quantitation results (supplementary Table 2). Examination of the relative gene copy number in conjunction with the information about raw Ct values and the deviations of the Ct values from the average allows assessment of whether the change in calculated relative quantity of target amplicon is due to amplification of the target amplicon or to unexpected variation in amplification of the endogenous control, which would indicate that the change in relative gene copy number may be artefactual. For example, the increase in the relative gene copy number in sample GE06 (using Lp3-1 as the endogenous control) is probably artefactual due to the anomalous amplification of the Lp3-1 endogenous control in this individual (Fig. 3). However, use of this information for correction of relative quantification results and interpretation of relative gene copy number is complicated by possible variations in qPCR efficiency and technical replicate Ct values as well as by the fact that the average Ct value is calculated from all samples, including those with deviations. Therefore, this information can be utilized as an indicator to identify possibly anomalous results or samples, which should be further investigated with regard to CNV of the target gene or gene region.
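The relationship between a Ct deviation in the endogenous control and its theoretical effect on relative quantitation can be illustrated as follows. This is a minimal sketch assuming 100% amplification efficiency, so that a control Ct that is d cycles above the average inflates the relative quantity of the target by a factor of 2^d. The sample names and Ct values are hypothetical.

```python
def ct_deviations(ct_by_sample):
    """Deviation of each sample's Ct from the mean Ct across all samples."""
    mean_ct = sum(ct_by_sample.values()) / len(ct_by_sample)
    return {sample: ct - mean_ct for sample, ct in ct_by_sample.items()}


def fold_effect_on_relative_quantity(control_ct_deviation):
    """Theoretical fold change in relative quantity caused by a control Ct deviation.

    Assumes a doubling of product per cycle (100 % efficiency).
    """
    return 2.0 ** control_ct_deviation


# Hypothetical endogenous control Ct values for four samples.
control_ct = {"A": 24.0, "B": 24.0, "C": 25.5, "D": 24.5}
deviations = ct_deviations(control_ct)
effects = {s: fold_effect_on_relative_quantity(d) for s, d in deviations.items()}

for sample in control_ct:
    print(sample, deviations[sample], round(effects[sample], 3))
```

As noted in the text, the mean includes the deviating samples themselves, so such values serve as an indicator of possibly anomalous samples rather than as a correction factor.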

Fig. 3

Relative gene copy number of three different PsTLP gene regions, endogenous control Lp3-1; dashed lines represent borders between relative gene copy number groups

To confirm these results using an alternative CNV detection technique, the 23 individuals were analyzed using C-HRM. The multiplex C-HRM reaction produces two amplicons with distinct melting temperatures in a single reaction. Of the three previously analyzed PsTLP primer sets, only the TLP3′ and TLPc primer sets were compatible with the Lp3-1 endogenous control for C-HRM analysis. Differences in the copy number of an amplicon are detected by calculating the peak height ratio of the target amplicon and the control amplicon and comparing these values between samples (Fig. 4, Table 4).

Fig. 4

Derivative melting curves of C-HRM analyses of samples GE05 and GE09. The left peaks (between 75 and 80 °C) are the reference amplicon (Lp3-1) melting curve peaks and the right peaks (between 80 and 85 °C) are the PsTLP amplicon melting curve peaks (primer set TLP3′)

Table 4 C-HRM results of PsTLP assays

As no reference samples were available, we calculated the normalization factor from the peak height ratio values obtained in our experiments. The distribution of peak height ratio results was estimated by dividing the samples into peak height ratio groups in increments of 0.1. As we expect most of the samples to have the same gene copy number, the average value of the samples from the largest groups was used as the standard. For the TLP3′ primer set, the normalization factor was calculated to be ~0.796 (the average value of samples with a peak height ratio within the range 0.6–0.9), and for the TLPc primer set, it was calculated to be ~2.076 (the average value of samples with a peak height ratio within the range 2.0–2.2). The distribution of raw peak ratio results is shown in Fig. 5.
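The grouping and averaging procedure described above can be sketched as follows. This is an illustration under stated assumptions: the range of the "largest groups" is chosen by inspecting the binned distribution and passed in explicitly (the text does not specify an automatic selection rule), the upper bound of the range is treated as exclusive, and the peak height ratios are hypothetical.

```python
import math
from collections import Counter


def bin_counts(ratios, bin_width=0.1):
    """Count samples per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    # round() guards against floating-point noise before flooring.
    return Counter(math.floor(round(r / bin_width, 9)) for r in ratios)


def normalization_factor(ratios, low, high):
    """Mean of the ratios inside the chosen range [low, high)."""
    selected = [r for r in ratios if low <= r < high]
    return sum(selected) / len(selected)


# Hypothetical raw peak height ratios.
raw_ratios = [0.65, 0.75, 0.85, 0.75, 1.25, 1.65]
counts = bin_counts(raw_ratios)

# Range chosen after inspecting the distribution (here bins 6 to 8).
factor = normalization_factor(raw_ratios, 0.6, 0.9)
normalized = [r / factor for r in raw_ratios]

print(sorted(counts.items()), round(factor, 3))
```

Because the factor is derived from the same samples that are being classified, it shifts the scale but does not change the ranking of samples by peak height ratio.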

Fig. 5

Distribution of raw peak ratio results by target region

Interpretation of the C-HRM results is problematic because many samples are outside the predefined boundaries for segregation of the samples into different gene copy number groups. Six of 23 samples analyzed with TLP3′ primer set and 12 of 23 samples analyzed with primer set TLPc fall outside of these boundaries. Nevertheless, it is possible to use the C-HRM results for visualization of differences between samples even if there are some issues regarding gene copy number interpretation (Fig. 6).

Fig. 6

Comparison of normalized C-HRM results with relative quantification results obtained by use of qPCR. In both methods primer sets TLP3′ and TLPc were used, endogenous control Lp3-1

The quantitation results obtained by C-HRM are highly correlated with the qPCR results (R² = 0.88 for TLP3′; R² = 0.92 for TLPc). As mentioned previously, the interpretation of the quantitation results and assignment of samples into discrete gene copy number groups was complicated by the absence of reference samples with a pre-determined gene copy number. Not all of the samples estimated to have increased copy number by qPCR were confirmed by the C-HRM method, indicating that while the quantitation results were well correlated, the interpretation of gene copy number, particularly in the absence of well-defined reference samples, is more uncertain using the C-HRM method in comparison to qPCR. In addition, the C-HRM method is not as widely applicable to all target gene/endogenous control combinations due to the required differences in amplicon melting temperatures.
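A coefficient of determination of the kind reported above can be computed from paired measurements as the squared Pearson correlation. The sketch below assumes a simple linear relationship between the two methods; the paired values are hypothetical and do not reproduce the study data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical paired results for the same samples.
qpcr_relative_quantity = [1.0, 1.5, 2.0, 2.5]
chrm_normalized_ratio = [0.9, 1.6, 1.9, 2.6]

r2 = r_squared(qpcr_relative_quantity, chrm_normalized_ratio)
print(round(r2, 3))
```

A high R² of this kind reflects agreement in the ranking and spacing of samples and does not by itself imply that the two methods assign samples to the same discrete copy number groups.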

A limited number of individuals with differing CN as determined by the qPCR and C-HRM analysis were also analyzed using digital PCR (dPCR) (Table 5). Although the dPCR method yields absolute amplicon numbers (copies/μl, which can be converted into copies/genome if the mass of DNA per genome is known), a reference gene (in this instance GAPDH) was included in the analysis for normalization. While the absolute calculated amplicon numbers of the PsTLP 3′ region were different, the relative values determined by dPCR were correlated with the qPCR data for the analyzed samples (R² = 0.90) (which were also normalized using the GAPDH control gene). However, the sample number is too small for any meaningful conclusions to be made, and these data should be viewed only as additional supporting information.
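The two ways of expressing dPCR results mentioned above, normalization to a reference gene and conversion to copies per genome, can be sketched as follows. All numerical values, including the genome mass, are hypothetical placeholders chosen for easy arithmetic and are not measurements from this study.

```python
def reference_normalized(target_copies_per_ul, reference_copies_per_ul):
    """Target concentration relative to the reference gene concentration."""
    return target_copies_per_ul / reference_copies_per_ul


def copies_per_genome(copies_per_ul, dna_ng_per_ul, genome_mass_pg):
    """Convert copies/ul to copies per genome, given the DNA mass per genome."""
    genomes_per_ul = dna_ng_per_ul * 1000.0 / genome_mass_pg  # 1 ng = 1000 pg
    return copies_per_ul / genomes_per_ul


# Hypothetical example values.
target = 300.0       # copies/ul of the target amplicon
reference = 150.0    # copies/ul of the reference amplicon
ratio = reference_normalized(target, reference)

per_genome = copies_per_genome(copies_per_ul=200.0, dna_ng_per_ul=1.0, genome_mass_pg=20.0)
print(ratio, per_genome)
```

The reference-normalized ratio does not depend on the DNA input or on the genome mass, which may be why a reference gene is useful even when absolute counts are available.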

Table 5 Results of CNV analysis using dPCR

Transcriptome data were also used to investigate the copy number of the analyzed PsTLP gene regions within the genome of one individual (GE24). This individual showed similar quantitative and calculated relative copy number results for all three PsTLP gene regions. The transcript sequences were aligned to the sequences of amplicons produced by the primer sets TLP5′, TLPc and TLP3′ (supplementary Fig. 1). BLAST results were limited to 100 reads most similar to the amplicon sequences. For the TLP5′ region, the expressed sequence reads only mapped to the exon. After unique (singleton) reads were removed, 14 SNPs were identified in the TLP5′ region (corresponding to 4 haplotypes), 30 SNPs were identified in the TLPc region (corresponding to 5 haplotypes), and 20 SNPs were identified in the TLP3′ region (corresponding to 8 haplotypes). These results indicate that there are several variants of the PsTLP gene within the genome of individual GE24, with a differing number of haplotypes of each amplified gene region. While no copy number differences between the three analyzed gene regions were identified in this individual, the identification of multiple haplotypes found in the gene transcripts supports the presence of multiple copy numbers of all or part of the PsTLP gene. In addition, a larger number of transcribed sequence variants were found corresponding to the 3′ region of the PsTLP gene, suggesting a higher copy number of this region of the gene, indicating that different regions of the PsTLP gene may have differing copy numbers, which corresponds to the overall results obtained by real-time PCR.

The PsTLP gene sequence was also used to search the Pinus taeda genome (NCBI accession no. APFE000000000.3) sequence scaffolds using NCBI BLAST. More than one match was found to several scaffolds—130,911 (2 hits), 85,527 (3 hits), 51,749 (3 hits) (accession numbers APFE031015264.1, APFE030842585.1, and APFE031073227.1, respectively). In many cases, these matches are missing the 5′ part of the gene and the aligned sequence starts after the intron (all matches from scaffolds 85,527 and 51,749) and one of the matches to scaffold 51,749 contains only a 197 nt long sequence from the 3′ region of the gene. Similarly, BLAST analysis of the whole genome shotgun sequencing project of Pinus lambertiana (NCBI accession no. LMTP000000000.1) using the full-length PsTLP gene as the query sequence identified six scaffolds with more than one match to the query sequence. One matching sequence contained the entire TLP gene; three hits to different scaffolds contained the central and the 5′ regions of the gene, while other hits included only one of the gene regions. Five hits from three different scaffolds contained the intron sequence. In this description, we use “5′ region, central region, and 3′ region” to describe whether the matching sequences contain the sequences of the amplicons generated with our primer sets. Detailed alignment information is provided in supplementary Table 3.

Discussion

Results of this study show that to reliably detect gene CNV by quantitative PCR methods, it is necessary to use several primer sets targeted to different regions of the target gene, as demonstrated by the detected differences in relative quantity between different regions of PsTLP. The PsTLP gene copy number results obtained using the different methods used in this study are comparable, best demonstrated by the strong correlation of raw data values. Given that the PCR-based methods utilized the same primer sets, this is not unexpected. However, there were quantitative differences between the three analyzed gene regions within some individuals, suggesting the presence of partial duplications of the PsTLP gene. In addition, the endogenous control genes utilized also had an influence on the calculated relative gene copy number results, indicating that several control genes should be utilized in order to detect false positive gene CNV results. Comparing the PCR-based methods utilized in this study, C-HRM is more limited regarding experimental design compared to qPCR, as target and control amplicons are multiplexed and therefore are required to have differing melting temperatures. In addition, reference samples with defined target gene copy numbers are required for accurate gene copy number interpretation. The digital PCR results, which can be utilized for determination of absolute gene copy numbers, can also present difficulties in interpretation if the structure of CNV polymorphism is complicated (e.g., a small fold change) without the use of pre-characterized reference samples. A single region of a gene can be utilized to identify CNVs when the gene and surrounding regions have been well characterized by sequencing or other approaches (Anhuf et al. 2003; Kulka et al. 2006; Díaz et al. 2012; Cook et al. 2014).
One of the advantages of using CGH for detection of gene CNVs is that thousands of probes per array can be utilized, and the probe design process can include criteria such as the minimum number of probes per gene and distance between probes (Swanson-Wagner et al. 2010; Prunier et al. 2017). Yet, in the interpretation of CGH results, variable signal intensity ratios from different probes from a single gene are often used as criteria for omitting a gene from analysis (Swanson-Wagner et al. 2010; Prunier et al. 2017) without considering possible partial duplication. An alternative CGH-based approach where CNVs are detected based on assessment of signal ratios of adjacent probes (Springer et al. 2009) could be more suited for identification of partial gene duplications. High-throughput sequencing (HTS) methods could provide an alternative to qPCR or CGH for studying CNV, as more information is obtained about possible structural variations (SVs). In addition, HTS would not be influenced by SNPs in primer binding sites of a qPCR assay (SNPs in primer binding sites were observed in our analysis of transcriptome data). However, the short read lengths that are a feature of the majority of current HTS technologies complicate the analysis of complex genomic SVs (including CNVs), even in well-characterized genomes (Sudmant et al. 2015). The increasing availability of long read sequencing technologies will simplify the identification and characterization of these complex SVs; however, high-quality reference genomes will still be required to provide accurate genotyping of these SVs and their functional significance (Couldrey et al. 2017).

The significance of gene CNVs is likely to have been underestimated due to the technical difficulties of accurate detection and determination of polymorphisms within populations. Duplicated genome regions have been implicated in the formation of gene families and pseudogenes (Zhang 2003); however, these have been studied after sequence divergence of the duplicated regions. The functions of full-length gene duplicates are retained at a comparatively high frequency, suggesting that positive selection, via several mechanisms, can reduce the rate of pseudogene formation (Moore and Purugganan 2005; Panchy et al. 2016). Most CNV studies have emphasized the detection of duplicated full length genes; however, partial gene duplications can also have a functional role. Partially duplicated genes have been shown to contribute to formation of new genes, frequently with altered or novel functions (Toll-Riera et al. 2011). Examples include the HvARM1 gene from Hordeum vulgare, which contributes to resistance against the powdery mildew fungus Blumeria graminis (Rajaraman et al. 2017). In addition, a partial duplication of the A17 protein encoding gene in vaccinia virus provides resistance to rifampin (Erlandson et al. 2014), while partial duplication of the COL1A2 gene (Raff et al. 2000) and other genes (Hu and Worton 1992) can cause disease in humans. In addition to qPCR evidence for full or partial duplications of the PsTLP gene, analysis of the transcriptome obtained from a single individual identified a number of haplotypes, suggesting that several different PsTLP copies are transcriptionally active. Divergence of gene expression between duplicated genes has been reported, and the degree of divergence depends on the mechanisms by which these duplications were formed and the time since duplication (Wang et al. 2011).

The structure and evolution of CNV polymorphism of the PsTLP gene are not clear, but an ancestral gene duplication event can be proposed (present in P. taeda and P. lambertiana), with additional duplications in P. sylvestris resulting in the observed differences between P. sylvestris individuals. Whole genome duplication events are proposed to have occurred at least twice in the evolution of major gymnosperm clades (Li et al. 2015); thus, these events might have contributed to the observed copy number variations. Copy number variations or structural variations in resistance- or stress response-linked genes were found to be common in other studies (Neiman et al. 2009; DeBolt 2010; McHale et al. 2012; Boocock et al. 2015; Prunier et al. 2017). Additional copies of resistance-linked sequences should increase the rate of formation of new resistance-linked genes, as they would increase the number of sequences available for CNV events linked to homologous recombination. The observed enrichment of defense- and immunity-related CNVs among the entire set of CNVs identified in Picea species (Prunier et al. 2017) might suggest positive selection effects. The formation of resistance gene clusters can generate and maintain high haplotypic diversity, thus facilitating rapid evolution of novel resistance genes (Friedman and Baker 2007). This may indicate that resistance-related genes are preferentially duplicated and have a higher frequency of CNVs. Recent studies on CNV mechanisms in different species show that a portion of CNV events are recurrent and occur at specific genomic locations (Zmienko et al. 2016). Analysis of the genomic regions surrounding these duplications may identify particular sequence motifs that are implicated in CNV formation.

In conclusion, the real-time PCR-based methods utilized in this study identified reproducible quantitative differences in the copy number of all or part of the PsTLP gene. Our results indicate that two of 23 samples (8.7%) have increased relative copy number regardless of target region and endogenous control, while other individuals have increased copy numbers of only some regions of the PsTLP gene. While in some cases this interpretation could be a result of technical variation in the utilized methods, transcriptome and genome alignment analyses provide additional evidence of partial duplications of the TLP gene in conifers. Further analysis of the genomic regions surrounding the PsTLP loci in P. sylvestris, together with transcription profiling of different PsTLP transcripts and linking of these data to phenotype, will enable a more thorough characterization of the CNV events, provide insight into their evolution, and help explain the observed differences between P. sylvestris individuals.
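The distinction drawn above between an increase across all target regions and an increase in only some regions can be made concrete with a small sketch. It assumes the common comparative Ct (2^-ΔΔCt) calculation with amplification efficiency close to 100%, which is not necessarily the calculation used in this study; the function names, the region labels, and the gain threshold of 1.5 are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def relative_copy_number(ct_target: float, ct_control: float,
                         ct_target_cal: float, ct_control_cal: float) -> float:
    """Relative copy number by the comparative Ct method (2^-ddCt).

    Assumes roughly 100% amplification efficiency for both assays.
    `*_cal` values come from a calibrator sample taken as the baseline.
    """
    delta_sample = ct_target - ct_control
    delta_calibrator = ct_target_cal - ct_control_cal
    return 2.0 ** -(delta_sample - delta_calibrator)


def regions_with_gain(region_ratios: Dict[str, float],
                      gain_threshold: float = 1.5) -> List[str]:
    """List target regions whose relative copy number reaches the
    (hypothetical) gain threshold."""
    return [region for region, ratio in region_ratios.items()
            if ratio >= gain_threshold]


def describe_sample(region_ratios: Dict[str, float],
                    gain_threshold: float = 1.5) -> str:
    """Summarize whether all, some, or none of the regions show a gain."""
    gained = regions_with_gain(region_ratios, gain_threshold)
    if not gained:
        return "no increase"
    if len(gained) == len(region_ratios):
        return "increase in all regions"
    return "increase in some regions"


# Illustrative values only: the target amplifies one cycle earlier than in
# the calibrator, relative to the same endogenous control.
example_ratio = relative_copy_number(24.0, 25.0, 25.0, 25.0)
```

With these assumptions, a sample in which every target region exceeds the threshold would be consistent with duplication of the whole gene, whereas a gain confined to some regions would be consistent with a partial duplication, subject to the technical caveats noted above.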