A Hybrid Framework for Autosomal and Y-STR Allele Identification

Written by SoftGenetics Team | Jan 28, 2026 9:52:17 AM

Featuring Research Presented as Posters at ISHI 2025

Lidong Luo, Yiqiong Wu, Edward Bouton, Sarah Copeland

Introduction

The advent of Massively Parallel Sequencing (MPS) has enabled high-resolution characterization of Short Tandem Repeat (STR) loci, revealing variants in the repeat and flanking regions that are critical for distinguishing isoalleles and enhancing probabilistic genotyping. In 2024, the International Society for Forensic Genetics (ISFG) published updated nomenclature recommendations [1] that promote the use of sequence codes based on the minimum range, to avoid masking polymorphisms, and encourage the adoption of a new bracketed repeat format. Maintaining backward compatibility with length-based nomenclature derived from Capillary Electrophoresis (CE) remains essential for continuity in forensic databases.

To comply with these updated recommendations, we introduce a hybrid alignment strategy for generating sequence codes within the minimum range defined by the Forensic STR Sequence Guide (FSSG). The method, implemented in GeneMarker®HTS software v2.7.0, combines flanking region alignment and locus sequence alignment. GeneMarker®HTS software provides dedicated panels for the Promega PowerSeq® kit and the Nimagen IDSeek OmniSTR Global kit. It also includes a general panel derived from the GRCh38 reference for data produced by other kits.

To validate this approach, a concordance study was conducted. Stutter artifacts and low-coverage alleles were filtered to ensure accurate CE allele calling. A conversion algorithm from minimum-range sequence length to repeat count was also implemented. The method achieved 100% concordance with CE allele calls for NIST SRM-2391d reference samples and 99.95% concordance across 650 samples in the NIST-Promega dataset using the PowerSeq® 46GY System, with discordance due to low coverage. It also achieved 100% concordance for the 2800M control sample with the Nimagen IDSeek OmniSTR Global system. This hybrid alignment-based strategy effectively resolves allele naming discrepancies caused by flanking and repeat region variants and supports reliable CE allele reporting for forensic applications.
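The length-to-repeat-count conversion mentioned above can be illustrated with a minimal sketch. It assumes a fixed repeat period per locus and a hypothetical per-locus offset (`non_repeat_bases`) for bases inside the minimum range that CE nomenclature does not count as repeats. Neither the function name nor the offset parameter comes from GeneMarker®HTS software; the actual conversion may handle locus-specific cases differently.

```python
def ce_allele_from_length(seq_len: int, period: int, non_repeat_bases: int = 0) -> str:
    """Convert a minimum-range sequence length to a CE-style allele name.

    period           : repeat unit length in bases (e.g. 4 for a tetranucleotide STR)
    non_repeat_bases : hypothetical per-locus offset, i.e. bases inside the
                       minimum range that are not counted as repeats
    """
    counted = seq_len - non_repeat_bases
    if counted < 0:
        raise ValueError("sequence is shorter than the non-repeat offset")
    full, partial = divmod(counted, period)
    # Partial repeats are written CE-style, e.g. 9 full repeats + 3 bases -> "9.3"
    return f"{full}.{partial}" if partial else str(full)


print(ce_allele_from_length(48, 4))      # 12 full tetranucleotide repeats
print(ce_allele_from_length(39, 4))      # 9 full repeats plus 3 bases
print(ce_allele_from_length(50, 4, 2))   # 2 bases excluded before counting
```

Returning the allele as a string keeps microvariant names such as 9.3 exact, which a floating-point value would not guarantee.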

Method

According to the updated nomenclature recommendations issued by the ISFG, the minimum range, rather than the traditional repeat region, should be captured. Following this guidance, we applied flanking region alignment with a local alignment method [2] to obtain the minimum range. In this method, sequences of 20-30 base pairs are extracted from the 5’ and 3’ flanking regions outside the minimum range, based on the GRCh38 reference. These flanking sequences are designed to identify the 5’ and 3’ boundaries of the minimum range accurately for each locus, even in the presence of variants within the flanking regions.

Figure 1. Example of read spans from different kits

Because different kits yield different read lengths, some reads do not span the full minimum range. For loci with shorter flanking regions, such as the vWA locus in both the Promega and Nimagen kits (Figure 1), locus-specific alignment against the GRCh38 reference is used instead. With this locus-specific sequence approach, GeneMarker®HTS software can accurately capture the minimum range regions using padding or approximation when reads are shorter.
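The flanking-region step described in this section can be sketched as follows. This is a simplified illustration, not the software's implementation: it replaces the Smith-Waterman local alignment of [2] with a gap-free search that tolerates a few mismatches, and the flank sequences, function names and mismatch threshold are all made up for the example.

```python
from typing import Optional


def find_flank(read: str, flank: str, max_mismatches: int = 2, start: int = 0) -> Optional[int]:
    """Return the start index of the best gap-free match of `flank` in read[start:].

    Returns None when no window has at most `max_mismatches` mismatches.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for i in range(start, len(read) - len(flank) + 1):
        mm = sum(1 for a, b in zip(read[i:i + len(flank)], flank) if a != b)
        if mm < best_mm:
            best_pos, best_mm = i, mm
    return best_pos


def extract_between_flanks(read: str, left_flank: str, right_flank: str,
                           max_mismatches: int = 2) -> Optional[str]:
    """Return the read segment between the two flanks, or None if either is missing."""
    left = find_flank(read, left_flank, max_mismatches)
    if left is None:
        return None
    begin = left + len(left_flank)
    right = find_flank(read, right_flank, max_mismatches, start=begin)
    if right is None:
        return None  # read does not span the region; a fallback alignment would be needed
    return read[begin:right]


# Toy 20-base flanks (not real GRCh38 sequence) around a five-repeat TCTA block.
LEFT = "GATTACAGGCTTAACCGGTA"
RIGHT = "CCATGGTTAACGCTAGCTAG"
LEFT_WITH_VARIANT = "GATTACAGGCTTTACCGGTA"  # one substitution inside the flank
read = "NN" + LEFT_WITH_VARIANT + "TCTA" * 5 + RIGHT + "AA"
print(extract_between_flanks(read, LEFT, RIGHT))
```

A gap-free search cannot absorb insertions or deletions inside a flank, which is one reason a true local alignment is the better fit for real data. Reads that return None here correspond to the case handled by locus-specific alignment.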

Figure 2. Result of STR analysis

STR Panels

In GeneMarker®HTS software, four panels are available for Promega kits, two panels for Nimagen kits and a general panel applicable across all kit systems as shown in Figure 3. To sort reads to different loci accurately, each panel includes either primer sequences or flanking-region sequences specifically designed for locus identification. The panels also encode the information to generate the correct bracket-repeat format and corresponding CE-based allele calls.
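To give a sense of what the bracketed repeat format looks like, here is a minimal sketch that collapses consecutive identical motifs into `[MOTIF]n` blocks. It assumes a fixed motif length and a simple left-to-right reading frame. The real panels encode locus-specific information, so the output of this toy function (`bracket_format` is a hypothetical name) is not guaranteed to match FSSG conventions for any given locus.

```python
def bracket_format(seq: str, period: int = 4) -> str:
    """Collapse runs of identical motifs of length `period` into [MOTIF]n blocks."""
    parts = []
    i = 0
    while i + period <= len(seq):
        motif = seq[i:i + period]
        count = 1
        # Extend the run while the next window repeats the same motif
        while seq[i + count * period: i + (count + 1) * period] == motif:
            count += 1
        parts.append(f"[{motif}]{count}")
        i += count * period
    if i < len(seq):
        parts.append(seq[i:])  # trailing bases shorter than one motif
    return " ".join(parts)


print(bracket_format("TCTA" * 4 + "TCTG" * 2 + "TCTA"))
print(bracket_format("TCTATCTATC"))
```

Two alleles of the same length, such as `[TCTA]7` and `[TCTA]4 [TCTG]2 [TCTA]1`, receive the same CE name but different bracketed strings, which is how isoalleles become visible in sequence data.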

Figure 3. An illustration of STR panels

Promega Panels

The four panels designed for Promega kits correspond to Promega® PowerSeq® 46GY, 56GMY, CRM and Mito systems. In these panels, primers from these systems are used to assign reads to loci. Promega® PowerSeq® 46GY includes 22 autosomal STR loci, 22 Y-STR loci and amelogenin.
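Primer-based read sorting can be sketched in a few lines. The example below assumes each read begins with its locus primer and uses exact prefix matching. The primer strings are placeholders, not real PowerSeq® primers, and the function names are hypothetical; a production implementation would likely tolerate sequencing errors in the primer region.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Placeholder sequences for illustration only; these are NOT real kit primers.
EXAMPLE_PRIMERS: Dict[str, str] = {
    "vWA": "AAAACCCCGGGG",
    "FGA": "TTTTGGGGCCCC",
}


def assign_locus(read: str, primers: Dict[str, str]) -> Optional[str]:
    """Return the locus whose primer the read starts with, or None."""
    for locus, primer in primers.items():
        if read.startswith(primer):
            return locus
    return None


def sort_reads(reads: Iterable[str], primers: Dict[str, str]) -> Dict[str, List[str]]:
    """Group reads by locus; reads that match no primer are skipped."""
    by_locus: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        locus = assign_locus(read, primers)
        if locus is not None:
            by_locus[locus].append(read)
    return dict(by_locus)


reads = ["AAAACCCCGGGGTCTATCTA", "TTTTGGGGCCCCCTTTCTTT", "GGGGGGGGGGGG"]
for locus, group in sort_reads(reads, EXAMPLE_PRIMERS).items():
    print(locus, len(group))
```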

Nimagen Panels

Nimagen offers two kits: IDSeek OmniSTR Global and IDSeek mYSTR. Compared with Promega® PowerSeq® 46GY, IDSeek OmniSTR Global kit includes six additional loci: D4S2408, D6S1043, D9S1122, D17S1301, D20S482 and SE33. Because reads from these kits have higher sequencing error rates, panels that use flanking region sequences from the GRCh38 reference were designed to avoid random errors at or near the boundaries of the minimum range region.

General Panel for Different Kits

The general panel in GeneMarker®HTS software for all kits is the fssg_all panel, which includes 35 autosomal STRs, 38 X/Y STRs and amelogenin, as specified in the ISFG Forensic STR Sequence Guide (FSSG, version 6). This panel can be applied to data from the Precision ID GlobalFiler™ NGS STR Panel v2 kits (Applied Biosystems, Thermo Fisher Scientific) and the ForenSeq MainstAY/Signature kits (Verogen, Qiagen), among others.

Results

Concordance Study

NIST SRM 2391d [3] includes paired-end FASTQ data for three samples (Components A, B, and C). With the built-in “Promega® PowerSeq® 46GY” panel, GeneMarker®HTS software produced results for 22 autosomal STR loci, 22 Y-STR loci and amelogenin. The resulting genotype calls are 100% concordant with the certified genotypes/haplotypes.

For the IDSeek OmniSTR Global kit, the 2800M control sample (available from Nimagen: https://www.nimagen.com/downloads) was used to assess CE concordance. This kit covers 30 loci, and the method achieved 100% concordance with the CE allele calls.

Table 1. CE concordance study for different datasets

Sample/Dataset	Kit Name	Count of Sampled Loci	Concordance
NIST SRM 2391d A/B/C samples	Promega® PowerSeq® 46GY	3*45 = 135	100%
NIST 650 sample dataset	Promega® PowerSeq® 46GY	650*45 = 29,250	99.95%
2800M sample	Nimagen IDSeek OmniSTR Global	1*30 = 30	100%

The National Institute of Standards and Technology (NIST), in conjunction with Promega Corporation, generously provided paired-end FASTQ files for 650 samples with corresponding CE allele calls. Samples were amplified with the PowerSeq® Auto/Y System and sequenced on an Illumina® MiSeq. GeneMarker®HTS software achieved 99.95% concordance with the CE allele calls across 29,250 sampled loci.

Allele-calling rules used for CE concordance:

1. Homozygosity: If the top allele’s relative proportion is ≥90% and its read count is >2, call the locus homozygous for that allele.
2. Noise filtering: Discard any allele with a relative proportion <10%.
3. Stutter detection: Classify a candidate at −1 repeat as stutter if its proportion difference is ≥40% of the main allele.
4. Y-STRs (excluding DYS385a/b): If multiple candidates are present, retain another allele only if its proportion difference is within 15% of the top allele.
5. Autosomal STRs and DYS385a/b: If more than two candidates are present, retain another allele only if its proportion difference is within 5% of the second allele.
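The rules above leave some room for interpretation, so the following sketch states its reading explicitly rather than claiming to reproduce the software's behavior. It assumes that "proportion difference" means the gap between two alleles' read proportions, measured relative to the larger one; that the "main allele" for a −1 candidate is the allele exactly one repeat longer; and that the rules are applied in the order listed. The function name and the `haploid` flag are hypothetical.

```python
from typing import Dict, List


def call_alleles(reads: Dict[float, int], haploid: bool = False) -> List[float]:
    """Apply the five concordance rules to {CE allele: read count} for one locus.

    haploid=True follows rule 4 (Y-STRs other than DYS385a/b);
    haploid=False follows rule 5 (autosomal STRs and DYS385a/b).
    """
    total = sum(reads.values())
    if total == 0:
        return []
    prop = {a: n / total for a, n in reads.items()}
    ranked = sorted(prop, key=lambda a: prop[a], reverse=True)
    top = ranked[0]

    # Rule 1: dominant allele with enough reads -> homozygous call
    if prop[top] >= 0.90 and reads[top] > 2:
        return [top]

    # Rule 2: drop alleles below 10% of the reads
    cands = [a for a in ranked if prop[a] >= 0.10]

    # Rule 3 (assumed reading): a -1 repeat candidate is stutter when it falls
    # short of its parent allele by at least 40% of the parent's proportion
    def is_stutter(a: float) -> bool:
        parent = round(a + 1, 1)
        return parent in cands and (prop[parent] - prop[a]) >= 0.40 * prop[parent]

    cands = [a for a in cands if not is_stutter(a)]
    if not cands:
        return []

    if haploid:
        # Rule 4 (assumed reading): keep others only if within 15% of the top allele
        ref = cands[0]
        cands = [a for a in cands if (prop[ref] - prop[a]) <= 0.15 * prop[ref]]
    elif len(cands) > 2:
        # Rule 5 (assumed reading): beyond the top two, keep only alleles
        # within 5% of the second allele
        second = cands[1]
        extra = [a for a in cands[2:] if (prop[second] - prop[a]) <= 0.05 * prop[second]]
        cands = cands[:2] + extra

    return sorted(cands)


print(call_alleles({12: 95, 11: 5}))                # dominant allele
print(call_alleles({14: 60, 13: 25, 10: 15}))       # 13 treated as stutter of 14
print(call_alleles({15: 70, 17: 30}, haploid=True)) # minor Y allele dropped
```

If the thresholds are instead meant as absolute percentage points, only the three comparison lines marked "assumed reading" would need to change.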

As shown in Table 2, only 14 of the 29,250 sampled loci in the NIST 650-sample dataset were discordant with the CE allele names. The discordant calls occurred at FGA, D5S818, D8S1179, D21S11, CSF1PO, D13S317, vWA, PentaE, DYS439 and DYS456, and are attributed to low read coverage at those loci.

Table 2. CE concordance study for 29,250 sampled loci: 14 discordant allele names (99.95% accuracy). The number in parentheses is the count of mismatches at that locus.

Type	Count of Discordant Allele Names	Loci (mismatches)
Autosomal	12	FGA(2), D5S818(1), D8S1179(2), D21S11(1), CSF1PO(1), D13S317(2), vWA(1), PentaE(2)
ChrY	2	DYS439, DYS456
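The 99.95% figure follows directly from the counts in Tables 1 and 2. A short check, using only numbers reported in this post (the function and variable names are ours):

```python
def concordance_pct(total_loci: int, discordant: int) -> float:
    """Percentage of sampled loci whose call matches the CE allele name."""
    return 100.0 * (total_loci - discordant) / total_loci


# Per-locus mismatch counts for the autosomal loci, as listed in Table 2
autosomal_mismatches = {
    "FGA": 2, "D5S818": 1, "D8S1179": 2, "D21S11": 1,
    "CSF1PO": 1, "D13S317": 2, "vWA": 1, "PentaE": 2,
}
chr_y_mismatches = 2  # DYS439 and DYS456 combined

total_loci = 650 * 45  # 650 samples, 45 loci each
discordant = sum(autosomal_mismatches.values()) + chr_y_mismatches

print(total_loci, discordant, f"{concordance_pct(total_loci, discordant):.2f}%")
```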

Conclusion

The GeneMarker®HTS software [4-7] provides a streamlined workflow for forensic mitochondrial and STR DNA data analysis across major high throughput sequencing (HTS) systems and chemistries.

A hybrid alignment-based approach of flanking region alignment and locus sequence alignment is proposed to capture the minimum range regions of the autosomal STR/Y-STR alleles accurately, even in the presence of variants within the repeat region or the flanking areas. Using this method, results were 100% concordant with CE allele calls for NIST SRM 2391d samples and 99.95% concordant with CE allele calls across 29,250 sampled loci in the NIST-Promega dataset with the Promega® PowerSeq® 46GY system.

With the Nimagen panel and the general panel, the hybrid alignment-based approach also achieved 100% concordance for Nimagen data and produced good results for Precision ID GlobalFiler and ForenSeq MainstAY/Signature data.

High throughput sequencing data can reveal additional information that is not available with traditional CE data. The additional sequence information can be beneficial in forensic casework applications. Strengths of this data include both its resolving power for excluding an individual and the ability to determine potential relationships between evidence and suspects due to Mendelian inheritance of nuclear DNA.

Acknowledgements

We would like to thank Dr. Peter Vallone at the National Institute of Standards and Technology (NIST) for generously supplying data to complete the concordance study between the CE results and GeneMarker®HTS STR/Y-STR results. We would also like to thank Promega Corporation, Madison, WI, USA for providing Autosomal and Y-STR data, and Drs. Mitchell Holland and Jennifer McElhoe at Penn State University for their comments/suggestions during the mitochondrial/STR analysis development.

References

[1] Katherine B. Gettings, Martin Bodner, Lisa A. Borsuk, Jonathan L. King, David Ballard, Walther Parson, Corina C. G. Benschop, et al. Recommendations of the DNA Commission of the International Society for Forensic Genetics (ISFG) on Short Tandem Repeat Sequence Nomenclature. Forensic Science International: Genetics 2024;68:102946.
[2] Mengyao Zhao, Wan-Ping Lee, Erik P. Garrison, Gabor T. Marth. SSW library: An SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One. 2013 Dec 4;8(12):e82138.
[3] NIST Certificate of Analysis Standard Reference Material 2391d PCR-Based DNA Profiling Standard https://www-s.nist.gov/srmors/certificates/2391d.pdf
[4] MM Holland, E Pack, JA McElhoe. Evaluation of GeneMarker® HTS for improved alignment of mtDNA MPS data, haplotype determination, and heteroplasmy assessment. Forensic Science International: Genetics 2017, 28, pp. 90-98.
[5] MD Brandhagen, RS Just, JA Irwin. Validation of NGS for mitochondrial DNA casework at the FBI Laboratory. Forensic Sci Int Genet. 2020 Jan;44:102151.
[6] C.S. Liu, L. Luo, J. McGuigan, J. Wu, J. Todd, C. Prosser, S. Copeland, T. Snyder-Leiby. High throughput sequencing data analysis workflow: mtDNA variant detection and identification of STR/Y-STR alleles and iso-alleles. Forensic Science International: Genetics Supplement Series, 2019, Volume 7, Issue 1, Pages 639-640.
[7] L. Luo, Y. Wu, J. Todd, J. Ruth, E. Podlaszewski, S. Copeland, T. Snyder-Leiby, C.S. Liu. Identification of STR/Y-STR alleles with tolerance for variants and stutter detection using GeneMarker®HTS software. Forensic Science International: Genetics Supplement Series, 2022. 8. 10.1016.
