Blast2SNP Tutorial: From Sequence Alignment to High-Confidence SNPs

Overview

This tutorial walks through a practical pipeline for going from BLAST sequence alignments to high-confidence single-nucleotide polymorphisms (SNPs) with Blast2SNP. It covers input preparation, running BLAST, parsing results, calling candidate SNPs, filtering, and basic validation. It assumes you have reference and query FASTA files and a Unix-like environment.

Requirements

  • Blast2SNP (installed and on PATH)
  • BLAST+ (blastn or blastp depending on input)
  • samtools (for basic sequence handling)
  • bcftools (filtering and VCF tools)
  • Python or Perl (optional scripts for parsing)
  • Reference FASTA and query FASTA(s)

1. Prepare inputs

  1. Reference: Ensure the reference FASTA is indexed:
    • samtools faidx reference.fasta
  2. Queries: Clean query sequences (trim adapters, low-quality ends) and format as FASTA.
  3. Naming: Use unique sequence IDs in FASTA headers; include sample identifiers if processing multiple samples.
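
The naming rule in step 3 can be checked before running BLAST. The following is a minimal sketch, assuming the sequence ID is the first whitespace-delimited token of each FASTA header (which is how BLAST reports IDs by default); the function names and the example records are hypothetical.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # An empty header has no ID; report it as an empty string.
            yield header.split()[0] if header else ""


def duplicate_ids(lines):
    """Return a sorted list of IDs that appear more than once."""
    counts = Counter(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Hypothetical query records; in practice, pass an open file handle instead.
example = [
    ">sampleA_read1 first read",
    "ACGT",
    ">sampleA_read2",
    "ACGA",
    ">sampleA_read1 duplicate",
    "ACGT",
]
print(duplicate_ids(example))
```

Any ID returned here would make downstream hits ambiguous, so rename those records before continuing.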

2. Run BLAST

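Build a nucleotide BLAST database from the reference, then use blastn for nucleotide sequences:

Code

makeblastdb -in reference.fasta -dbtype nucl
blastn -query queries.fasta -db reference.fasta -outfmt 6 -evalue 1e-6 -num_threads 8 -max_target_seqs 5 -out blast_results.tsv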
  • outfmt 6 provides tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
  • Adjust the e-value, thread count, and max_target_seqs as needed.
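
To inspect the tabular output before handing it to Blast2SNP, the twelve default columns can be read into named fields. This is a sketch only: the `BlastHit` type and the `keep_hit` helper are hypothetical, and the default thresholds simply mirror the identity and length values used later in this tutorial rather than anything Blast2SNP does internally.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_hit(line):
    """Parse one line of default BLAST outfmt 6 into a BlastHit."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(f)}")
    return BlastHit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


def keep_hit(hit, min_identity=90.0, min_length=50):
    """Pre-filter a hit on percent identity and alignment length."""
    return hit.pident >= min_identity and hit.length >= min_length


# Hypothetical hit line for illustration.
line = "read1\tchr1\t98.000\t100\t2\t0\t1\t100\t501\t600\t1e-45\t180"
hit = parse_hit(line)
print(hit.sseqid, hit.pident, keep_hit(hit))
```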

3. Parse BLAST hits for candidate variants

Blast2SNP accepts BLAST tabular output. Typical parsing steps:

  • For each hit, compute alignment orientation and map query positions to reference positions.
  • Extract mismatched columns between query and reference alignment; each mismatch is a candidate SNP.
  • Record for each candidate: reference chromosome/contig, reference position, reference base, query base, strand, alignment score, percent identity, read/query ID.

Blast2SNP will perform these mapping steps automatically when provided proper BLAST output and the reference FASTA (see command below).
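
For readers who want to see what the mapping involves, here is a rough sketch of the steps listed above. It is not Blast2SNP's implementation. It assumes the aligned sequences are available, which the default 12-column output does not include; BLAST+ can add them with -outfmt "6 std qseq sseq", though whether Blast2SNP accepts that extended format is not something this tutorial establishes. It also assumes blastn conventions: the query is reported on the plus strand and a minus-strand hit has sstart greater than send. The function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def candidate_snps(qseq, sseq, qstart, sstart, send):
    """List mismatched alignment columns as candidate SNPs.

    qseq and sseq are the gapped aligned strings; coordinates are 1-based.
    Returns tuples of (ref_pos, ref_base, query_base, query_pos, strand),
    with bases reported on the reference plus strand.
    """
    minus = sstart > send
    step = -1 if minus else 1
    q_pos, ref_pos = qstart, sstart
    snps = []
    for q, s in zip(qseq, sseq):
        # Only columns with a base on both sides can be substitutions.
        if q != "-" and s != "-" and q.upper() != s.upper():
            ref_base, alt_base = s.upper(), q.upper()
            if minus:
                # The subject is shown reverse-complemented on minus hits.
                ref_base = ref_base.translate(COMPLEMENT)
                alt_base = alt_base.translate(COMPLEMENT)
            snps.append((ref_pos, ref_base, alt_base, q_pos,
                         "-" if minus else "+"))
        # Gaps consume a position on one sequence only.
        if q != "-":
            q_pos += 1
        if s != "-":
            ref_pos += step
    return snps


# Hypothetical plus-strand and minus-strand alignments.
print(candidate_snps("ACGTAC", "ACCTAC", 1, 101, 106))
print(candidate_snps("ACGT", "ACTT", 1, 204, 201))
```

Gap columns are skipped here because they represent indels, not SNPs, but they still advance the coordinate on the ungapped sequence.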

4. Run Blast2SNP

Basic Blast2SNP invocation:

Code

blast2snp --blast blast_results.tsv --ref reference.fasta --out candidates.vcf --min-identity 90 --min-align-length 50

Key options:

  • --blast: BLAST tabular file
  • --ref: reference FASTA
  • --out: output VCF file
  • --min-identity: filter low-identity alignments
  • --min-align-length: discard short alignments that produce unreliable SNP calls

If processing multiple samples, run per-sample and later merge VCFs or provide per-sample BLAST files if Blast2SNP supports multi-sample input.

5. Initial filters and annotations

After Blast2SNP produces candidates.vcf, apply basic filters with bcftools:

Code

bcftools filter -i 'QUAL>=30 && DP>=5' candidates.vcf -o candidates.filtered.vcf
  • QUAL: variant quality (threshold 30 is a common starting point)
  • DP: read depth (>=5 helps reduce false positives)

If your VCF lacks DP, compute depth from the alignments, for example with samtools mpileup or a custom script.
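
The same QUAL and DP thresholds can be prototyped in Python when you want to see exactly which records fail and why. This sketch assumes DP is stored as an INFO key and treats a missing QUAL or DP as a failure; both choices, and the helper names, are assumptions for illustration.

```python
def info_value(info, key):
    """Return the value of one INFO key, or None if the key is absent."""
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def passes(vcf_line, min_qual=30.0, min_depth=5):
    """Apply QUAL>=min_qual and INFO/DP>=min_depth to one VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual, info = cols[5], cols[7]
    depth = info_value(info, "DP")
    # Records without a usable QUAL or DP are rejected in this sketch.
    if qual == "." or depth is None:
        return False
    return float(qual) >= min_qual and int(depth) >= min_depth


# Hypothetical records: one passing, one without DP, one with low QUAL.
records = [
    "chr1\t103\t.\tC\tG\t45\t.\tDP=12",
    "chr1\t250\t.\tA\tT\t50\t.\tAF=0.5",
    "chr1\t300\t.\tA\tT\t20\t.\tDP=12",
]
for record in records:
    print(record.split("\t")[1], passes(record))
```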

Annotate variants (optional) with snpEff or VEP to add functional context:

Code

snpEff ann referencedb candidates.filtered.vcf > candidates.ann.vcf

6. Advanced filtering strategies

  • Strand bias: Remove SNPs supported predominantly by one strand.
  • Allele balance: For heterozygous calls, require allele fraction within expected range (e.g., 0.3–0.7).
  • Repetitive regions: Mask or remove variants in low-complexity or repetitive sequence (use RepeatMasker tracks or k-mer uniqueness).
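  • Proximity filters: Flag SNPs within N bp of indels, and clustered SNPs, which may be alignment artifacts.

Example bcftools expression for an allele-frequency filter. Note that fill-tags derives AF from the genotypes in the VCF, so it only approximates allele balance; a read-level allele fraction needs per-allele depths such as FORMAT/AD: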

Code

bcftools +fill-tags candidates.filtered.vcf -- -t AF | bcftools filter -i 'AF>0.3 && AF<0.7' -o candidates.het.vcf
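
Two of the strategies above, allele balance and the proximity filter, are easy to sketch directly. The code below assumes you already have per-variant counts of queries supporting the reference and alternate alleles, and a list of SNP positions on a single contig; the function names, the 10 bp window, and the inclusive 0.3 to 0.7 range are illustrative assumptions.

```python
def alt_fraction(ref_count, alt_count):
    """Fraction of supporting sequences carrying the alternate allele."""
    total = ref_count + alt_count
    return alt_count / total if total else None


def is_balanced(ref_count, alt_count, low=0.3, high=0.7):
    """True if the alternate-allele fraction lies within [low, high]."""
    frac = alt_fraction(ref_count, alt_count)
    return frac is not None and low <= frac <= high


def clustered(positions, window=10):
    """Return positions that have another SNP within `window` bp.

    All positions are assumed to lie on the same contig.
    """
    ordered = sorted(positions)
    flagged = set()
    for left, right in zip(ordered, ordered[1:]):
        if right - left <= window:
            flagged.update((left, right))
    return flagged


print(is_balanced(6, 4), is_balanced(9, 1))
print(sorted(clustered([100, 105, 300, 1000, 1008])))
```

Flagged positions are candidates for manual review, not automatic removal, since true variants can also occur close together.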

7. Validation and confirmation

  • Visualize candidate SNPs in IGV or similar genome browsers by creating a BAM of query alignments against the reference:
    • Convert BLAST alignments to SAM/BAM if using BLAST-based mapping, or realign queries with a short-read aligner (bwa mem) for better visualization.
    • samtools view -bS alignments.sam | samtools sort -o alignments.sorted.bam
    • samtools index alignments.sorted.bam
  • Confirm top-priority SNPs by Sanger sequencing or independent sequencing runs.
  • Cross-sample comparison: variants seen across multiple independent samples increase confidence.
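
The cross-sample comparison can be approximated by counting how many samples carry each variant. This is a small sketch, assuming each sample's calls have been reduced to (chromosome, position, ref, alt) tuples; the sample names and variants are made up for illustration.

```python
from collections import Counter


def recurrence(calls_by_sample):
    """Count the number of samples in which each variant was called."""
    counts = Counter()
    for variants in calls_by_sample.values():
        # set() guards against a variant listed twice in one sample.
        counts.update(set(variants))
    return counts


# Hypothetical per-sample call sets.
calls = {
    "sampleA": {("chr1", 103, "C", "G"), ("chr1", 250, "A", "T")},
    "sampleB": {("chr1", 103, "C", "G")},
    "sampleC": {("chr1", 103, "C", "G"), ("chr2", 77, "G", "A")},
}
counts = recurrence(calls)
shared = sorted(v for v, n in counts.items() if n >= 2)
print(shared)
```

Recurrence only adds confidence when the samples are independent; a systematic alignment artifact will also appear in every sample.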

8. Reporting results

Provide a final VCF and a short TSV summary with key columns:

  • Chromosome, Position, Ref, Alt, QUAL, DP, AF, Sample

Include the filters applied and the thresholds used.
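
A summary with these columns can be produced from a single-sample VCF with a few lines of Python. The sketch below assumes DP and AF are INFO keys and writes "." when either is missing; the function names and the sample label are hypothetical.

```python
import csv
import io

COLUMNS = ["Chromosome", "Position", "Ref", "Alt", "QUAL", "DP", "AF", "Sample"]


def summary_rows(vcf_lines, sample):
    """Yield one summary row per VCF data line."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        # Split INFO into key/value pairs; flag-style keys get an empty value.
        info = dict(entry.partition("=")[::2] for entry in cols[7].split(";"))
        yield [cols[0], cols[1], cols[3], cols[4], cols[5],
               info.get("DP", "."), info.get("AF", "."), sample]


def write_summary(vcf_lines, sample, handle):
    """Write the header and all summary rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(summary_rows(vcf_lines, sample))


# Hypothetical VCF content; in practice, pass open file handles.
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t103\t.\tC\tG\t45\tPASS\tDP=12;AF=0.5",
]
buffer = io.StringIO()
write_summary(vcf, "sampleA", buffer)
print(buffer.getvalue(), end="")
```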

Example minimal pipeline (commands)

  1. Index the reference and build the BLAST database:

Code

samtools faidx reference.fasta
makeblastdb -in reference.fasta -dbtype nucl
  2. BLAST:

Code

blastn -query sample.fasta -db reference.fasta -outfmt 6 -evalue 1e-6 -num_threads 8 -out sample.blast.tsv
  3. Blast2SNP:

Code

blast2snp --blast sample.blast.tsv --ref reference.fasta --out sample.vcf --min-identity 90 --min-align-length 50
  4. Filter:

Code

bcftools filter -i 'QUAL>=30 && DP>=5' sample.vcf -o sample.filtered.vcf

Tips and best practices

  • Use stricter identity and length thresholds for divergent sequences.
  • Always inspect a subset of calls manually in a genome browser.
  • Keep metadata linking query IDs to samples to trace variants back to source sequences.
  • Document all parameters for reproducibility.

Troubleshooting

  • Few SNPs: loosen min-identity or alignment length, or check query quality.
  • Many false positives: increase quality/DP thresholds, mask low-complexity regions, or require multiple supporting queries.
  • Misplaced coordinates: ensure BLAST output and reference FASTA use identical contig names and coordinate systems.
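
The contig-name mismatch in the last item can be detected mechanically. This sketch compares the subject IDs in the BLAST table (second column) with the first token of each reference FASTA header; the function names and example data are hypothetical, and the comparison is deliberately case-sensitive.

```python
def fasta_contigs(fasta_lines):
    """Collect contig names (first token of each non-empty FASTA header)."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def unknown_subjects(blast_lines, fasta_lines):
    """Return subject IDs in the BLAST table that are absent from the FASTA."""
    contigs = fasta_contigs(fasta_lines)
    subjects = {
        line.split("\t")[1]
        for line in blast_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(subjects - contigs)


# Hypothetical inputs: the second hit uses a differently cased contig name.
reference = [">chr1 assembly v1", "ACGT", ">chr2", "GGCC"]
hits = [
    "read1\tchr1\t98.0\t100\t2\t0\t1\t100\t501\t600\t1e-45\t180",
    "read2\tChr2\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t190",
]
print(unknown_subjects(hits, reference))
```

An empty result means every subject ID resolves to a reference contig; coordinate-system differences still need to be checked separately.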
