The advent of Massively Parallel Sequencing (MPS) has enabled high-resolution characterization of Short Tandem Repeat (STR) loci, revealing variants in the repeat and flanking regions that are critical for distinguishing isoalleles and enhancing probabilistic genotyping. In 2024, the International Society for Forensic Genetics (ISFG) published updated nomenclature recommendations [1] that promote the use of sequence codes based on the minimum range to avoid masking polymorphisms and encourage the adoption of a new bracketed repeat format. Maintaining backward compatibility with length-based nomenclature derived from Capillary Electrophoresis (CE) remains essential for continuity in forensic databases.
To comply with these updated recommendations, we introduce a hybrid alignment strategy for generating sequence codes within the minimum range defined by the Forensic STR Sequence Guide (FSSG). The method, implemented in GeneMarker®HTS software v2.7.0, combines flanking region alignment and locus sequence alignment. GeneMarker®HTS software provides dedicated panels for the Promega PowerSeq® kit and the Nimagen IDSeek OmniSTR Global kit. It also includes a general panel derived from the GRCh38 reference for data produced by other kits.
To validate this approach, a concordance study was conducted. Stutter artifacts and low-coverage alleles were filtered to ensure accurate CE allele calling. A conversion algorithm from minimum-range sequence length to repeat count was also implemented. The method achieved 100% concordance with CE allele calls for NIST SRM-2391d reference samples and 99.95% concordance across 650 samples in the NIST-Promega dataset using the PowerSeq® 46GY System, with discordance due to low coverage. It also achieved 100% concordance for the 2800M control sample with the Nimagen IDSeek OmniSTR Global system. This hybrid alignment-based strategy effectively resolves allele naming discrepancies caused by flanking and repeat region variants and supports reliable CE allele reporting for forensic applications.
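The conversion from minimum-range sequence length to a CE-style repeat count can be illustrated with a minimal Python sketch. It is not the GeneMarker®HTS implementation: the function name `ce_allele_name` and the per-locus `offset` parameter (bases inside the minimum range that do not count toward the CE allele number) are assumptions made for illustration. The only convention relied on is the standard one that an incomplete repeat is written as `N.x`, where `x` is the number of extra bases.

```python
def ce_allele_name(min_range_length: int, repeat_unit: int, offset: int = 0) -> str:
    """Convert a minimum-range sequence length (bp) to a CE-style allele name.

    offset is a hypothetical per-locus constant: the number of bases inside
    the minimum range that do not count toward the CE repeat number.
    """
    if repeat_unit <= 0:
        raise ValueError("repeat_unit must be positive")
    counted = min_range_length - offset
    if counted < 0:
        raise ValueError("offset exceeds sequence length")
    # Whole repeats plus any leftover bases from an incomplete repeat.
    whole, partial = divmod(counted, repeat_unit)
    return str(whole) if partial == 0 else f"{whole}.{partial}"


# A 39 bp tetranucleotide sequence holds 9 full repeats plus 3 bases.
print(ce_allele_name(39, 4))
```

In practice the offset would differ per locus and would have to come from the panel definition.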
According to the updated ISFG nomenclature recommendations, the minimum range, rather than the traditional repeat region, should be captured. Following this guidance, we applied flanking region alignment with a local alignment method [2] to extract the minimum range. In this method, sequences of 20-30 base pairs are taken from the 5’ and 3’ flanking regions outside the minimum range, based on the GRCh38 reference. These flanking sequences are designed to identify the 5’ and 3’ boundaries of the minimum range for each locus accurately, even in the presence of variants within the flanking regions.
Because different kits yield different read lengths, some reads do not span the full minimum range. For loci with shorter flanking regions, such as the vWA locus in both the Promega and Nimagen kits (Figure 1), locus-specific alignment against the GRCh38 reference is used instead. With this locus-specific approach, GeneMarker®HTS software can still capture the minimum range using padding or approximation when reads are shorter.
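The idea of locating the minimum range between two flanking anchors can be sketched in Python as follows. This is a simplified stand-in, not the local alignment method of [2]: it places each anchor by ungapped sliding comparison and tolerates a few mismatches, so it handles substitutions in the flanks but not indels. The function names, the `max_mismatches` threshold and the short 8 bp anchors in the example are all assumptions for illustration (the method described here uses 20-30 bp flanks).

```python
def best_match(read: str, anchor: str, max_mismatches: int = 2):
    """Return (start, mismatches) for the best ungapped placement of anchor
    in read, or None if no placement is within max_mismatches."""
    best = None
    for start in range(len(read) - len(anchor) + 1):
        window = read[start:start + len(anchor)]
        mismatches = sum(a != b for a, b in zip(window, anchor))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


def extract_minimum_range(read: str, flank5: str, flank3: str, max_mismatches: int = 2):
    """Return the sequence between the 5' and 3' anchors, or None if either
    anchor cannot be placed (for example, when the read is too short)."""
    hit5 = best_match(read, flank5, max_mismatches)
    if hit5 is None:
        return None
    begin = hit5[0] + len(flank5)
    # Search for the 3' anchor only downstream of the 5' anchor.
    hit3 = best_match(read[begin:], flank3, max_mismatches)
    if hit3 is None:
        return None
    return read[begin:begin + hit3[0]]


# Toy read: the 5' flank carries one substitution (ACGTAGCA vs ACGTTGCA).
read = "TT" + "ACGTAGCA" + "TCTATCTATCTA" + "GGATCCAA" + "CC"
print(extract_minimum_range(read, "ACGTTGCA", "GGATCCAA"))
```

A read that lacks either anchor returns `None`, which corresponds to the case where a fallback such as locus-specific alignment would be needed.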
Figure 2. Result of STR analysis
In GeneMarker®HTS software, four panels are available for Promega kits, two panels for Nimagen kits, and a general panel applicable across all kit systems, as shown in Figure 3. To sort reads to the correct loci, each panel includes either primer sequences or flanking-region sequences specifically designed for locus identification. The panels also encode the information needed to generate the correct bracketed repeat format and the corresponding CE-based allele calls.
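To illustrate what a bracketed repeat string looks like, the following Python sketch collapses consecutive identical units into `[UNIT]n` blocks. It is a naive, fixed-frame run-length encoding and not the panel-driven logic of GeneMarker®HTS: the function name `bracketed_repeats` and the fixed `unit_length` are assumptions, whereas real loci need locus-specific motif definitions and reading frames.

```python
def bracketed_repeats(sequence: str, unit_length: int = 4) -> str:
    """Collapse runs of identical fixed-length units into [UNIT]count blocks.

    Trailing bases that do not fill a whole unit are appended unbracketed.
    """
    parts = []
    i = 0
    while i + unit_length <= len(sequence):
        unit = sequence[i:i + unit_length]
        count = 1
        j = i + unit_length
        # Extend the run while the next unit is identical.
        while sequence[j:j + unit_length] == unit:
            count += 1
            j += unit_length
        parts.append(f"[{unit}]{count}")
        i = j
    if i < len(sequence):
        parts.append(sequence[i:])
    return " ".join(parts)


print(bracketed_repeats("TCTATCTATCTATCTGTCTA"))
```

Two sequences of the same length can produce different bracketed strings, which is how isoalleles that CE cannot separate become distinguishable.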
The four panels designed for Promega kits correspond to Promega® PowerSeq® 46GY, 56GMY, CRM and Mito systems. In these panels, primers from these systems are used to assign reads to loci. Promega® PowerSeq® 46GY includes 22 autosomal STR loci, 22 Y-STR loci and amelogenin.
Nimagen offers two kits: IDSeek OmniSTR Global and IDSeek mYSTR. Compared with Promega® PowerSeq® 46GY, the IDSeek OmniSTR Global kit includes six additional loci: D4S2408, D6S1043, D9S1122, D17S1301, D20S482 and SE33. Because reads from these kits have higher sequencing error rates, panels that use flanking region sequences from the GRCh38 reference were designed to avoid random errors at or near the boundaries of the minimum range region.
The general panel in GeneMarker®HTS software for all kits is the fssg_all panel, which includes 35 autosomal STRs, 38 X/Y STRs and amelogenin, as specified in the ISFG Forensic STR Sequence Guide (FSSG, version 6). This panel can be applied to the Precision ID GlobalFiler™ NGS STR Panel v2 kits (Applied Biosystems, Thermo Fisher Scientific) and the ForenSeq MainstAY/Signature kits (Verogen, Qiagen), among others.
NIST SRM 2391d [3] includes paired-end FASTQ data for three samples (Components A, B and C). With the built-in “Promega® PowerSeq® 46GY” panel, GeneMarker®HTS software produced results for 22 autosomal STR loci, 22 Y-STR loci and amelogenin. The resulting genotype calls are 100% concordant with the certified genotypes/haplotypes.
For the IDSeek OmniSTR Global kit, the 2800M control sample (available from Nimagen: https://www.nimagen.com/downloads) was used to assess CE concordance. This kit covers 30 loci, and the method achieved 100% concordance with the CE allele calls.
Table 1. CE concordance study for different datasets
| Sample/Dataset | Kit Name | Count of Sampled Loci | Concordance |
| --- | --- | --- | --- |
| NIST SRM 2391d A/B/C samples | Promega® PowerSeq® 46GY | 3*45 = 135 | 100% |
| NIST 650 sample dataset | Promega® PowerSeq® 46GY | 650*45 = 29,250 | 99.95% |
| 2800M sample | Nimagen IDSeek OmniSTR Global | 1*30 = 30 | 100% |
The National Institute of Standards and Technology (NIST), in conjunction with Promega Corporation, generously provided paired-end FASTQ files for 650 samples with corresponding CE allele calls. Samples were amplified with the PowerSeq® Auto/Y System and sequenced on an Illumina® MiSeq. GeneMarker®HTS software achieved 99.95% concordance with the CE allele calls across 29,250 sampled loci.
Allele-calling rules used for CE concordance:
1. Homozygosity: If the top allele’s relative proportion is ≥90% and its read count is >2, call the locus homozygous for that allele.
2. Noise filtering: Discard any allele with a relative proportion <10%.
3. Stutter detection: Classify a candidate at −1 repeat as stutter if its proportion difference is ≥40% of the main allele.
4. Y-STRs (excluding DYS385a/b): If multiple candidates are present, retain an additional allele only if its proportion difference is within 15% of the top allele.
5. Autosomal STRs and DYS385a/b: If more than two candidates are present, retain an additional allele only if its proportion difference is within 5% of the second allele.
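The first two rules above can be sketched in Python as follows. This is an illustrative reading of the rules, not the software's code: the function name `filter_and_call` is hypothetical, and "relative proportion" is assumed to mean an allele's read count divided by the total reads at the locus. The stutter and retention rules (3 to 5) are left out because they depend on how "proportion difference" is defined, which is not assumed here.

```python
def filter_and_call(read_counts: dict) -> list:
    """Apply the homozygosity and noise-filtering rules to one locus.

    read_counts maps allele name -> read count. Relative proportion is
    assumed to be read count divided by total reads at the locus.
    """
    total = sum(read_counts.values())
    if total == 0:
        return []
    ranked = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_allele, top_reads = ranked[0]
    # Rule 1: dominant allele with more than 2 reads -> homozygous call.
    if top_reads / total >= 0.90 and top_reads > 2:
        return [top_allele]
    # Rule 2: drop alleles below 10% relative proportion.
    return [allele for allele, reads in ranked if reads / total >= 0.10]


print(filter_and_call({"12": 95, "11": 5}))
print(filter_and_call({"12": 50, "13": 45, "11": 5}))
```

The remaining candidates returned by the second branch would still need the stutter and retention rules applied before a final call.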
As shown in Table 2, only 14 of the 29,250 sampled loci in the NIST 650-sample dataset were discordant with the CE allele names. The discordant calls occurred at FGA, D5S818, D8S1179, D21S11, CSF1PO, D13S317, vWA, PentaE, DYS439 and DYS456. The discordance is due to low read coverage at these loci.
Table 2. CE concordance study for 29,250 sampled loci: 14 discordant allele names, for an accuracy of 99.95%. The number in parentheses is the count of mismatches at that locus.
| | Count of Discordant Allele Names | Loci |
| --- | --- | --- |
| Autosomal | 12 | FGA(2), D5S818(1), D8S1179(2), D21S11(1), CSF1PO(1), D13S317(2), vWA(1), PentaE(2) |
| ChrY | 2 | DYS439, DYS456 |
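The 99.95% figure can be checked directly from the per-locus mismatch counts in Table 2. The short Python calculation below uses only numbers reported here; the counts of 1 for DYS439 and DYS456 are inferred from the ChrY total of 2 across two loci, and the variable names are arbitrary.

```python
# Mismatch counts per locus, taken from Table 2.
discordant = {
    "FGA": 2, "D5S818": 1, "D8S1179": 2, "D21S11": 1,
    "CSF1PO": 1, "D13S317": 2, "vWA": 1, "PentaE": 2,
    "DYS439": 1, "DYS456": 1,  # inferred: ChrY total is 2 over two loci
}

total_loci = 650 * 45          # 650 samples x 45 loci per sample
n_discordant = sum(discordant.values())
concordance = 100 * (1 - n_discordant / total_loci)

print(f"{n_discordant} discordant of {total_loci:,}: {concordance:.2f}% concordant")
```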
The GeneMarker®HTS software [4-7] provides a streamlined workflow for forensic mitochondrial and STR DNA data analysis across major high-throughput sequencing (HTS) systems and chemistries.
A hybrid alignment-based approach of flanking region alignment and locus sequence alignment is proposed to capture the minimum range regions of the autosomal STR/Y-STR alleles accurately, even in the presence of variants within the repeat region or the flanking areas. Using this method, results were 100% concordant with CE allele calls for NIST SRM 2391d samples and 99.95% concordant with CE allele calls across 29,250 sampled loci in the NIST-Promega dataset with the Promega® PowerSeq® 46GY system.
With the Nimagen panel and the general panel, the hybrid alignment-based approach also achieved 100% concordance for Nimagen data and produced good results for Precision ID GlobalFiler and ForenSeq MainstAY/Signature data.
High-throughput sequencing data can reveal additional information that is not available from traditional CE data, and this additional sequence information can be beneficial in forensic casework. Strengths of these data include both their resolving power for excluding an individual and the ability to determine potential relationships between evidence and suspects through Mendelian inheritance of nuclear DNA.
We would like to thank Dr. Peter Vallone at the National Institute of Standards and Technology (NIST) for generously supplying data to complete the concordance study between the CE results and GeneMarker®HTS STR/Y-STR results. We would also like to thank Promega Corporation, Madison, WI, USA for providing Autosomal and Y-STR data, and Drs. Mitchell Holland and Jennifer McElhoe at Penn State University for their comments/suggestions during the mitochondrial/STR analysis development.