Blast2SNP Tutorial: From Sequence Alignment to High-Confidence SNPs
Overview
This tutorial walks through a practical pipeline from BLAST sequence alignments to high-confidence single-nucleotide polymorphisms (SNPs) using Blast2SNP. It covers input preparation, running BLAST, parsing results, calling candidate SNPs, filtering, and basic validation. It assumes you have reference and query FASTA files and a Unix-like environment.
Requirements
- Blast2SNP (installed and on PATH)
- BLAST+ (blastn for nucleotide input, which SNP calling requires; makeblastdb to build the reference database)
- samtools (for basic sequence handling)
- bcftools (filtering and VCF tools)
- Python or Perl (optional scripts for parsing)
- Reference FASTA and query FASTA(s)
1. Prepare inputs
- Reference: Ensure the reference FASTA is indexed (this writes reference.fasta.fai):
  samtools faidx reference.fasta
- Queries: Clean query sequences (trim adapters, low-quality ends) and format as FASTA.
- Naming: Use unique sequence IDs in FASTA headers; include sample identifiers if processing multiple samples.
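The uniqueness requirement on sequence IDs can be checked before running BLAST. The following is a minimal sketch, assuming the ID is the first whitespace-delimited token of each header line; the function names are illustrative and not part of Blast2SNP.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            yield tokens[0] if tokens else ""


def duplicate_ids(lines):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Small in-memory example; in practice pass an open file handle instead.
example = [">sampleA_read1 lane1", "ACGT", ">sampleA_read2", "ACGA", ">sampleA_read1", "ACGG"]
print(duplicate_ids(example))
```

Any ID reported here should be renamed before alignment, since BLAST tabular output identifies queries by ID only.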
2. Run BLAST
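Use BLASTN for nucleotide sequences. The -db argument expects a BLAST database rather than a bare FASTA file, so build one first with makeblastdb -in reference.fasta -dbtype nucl (the database name defaults to the input file name), then run: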
Code
blastn -query queries.fasta -db reference.fasta -outfmt 6 -evalue 1e-6 -num_threads 8 -max_target_seqs 5 -out blast_results.tsv
- outfmt 6 provides tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
- Adjust -evalue, -num_threads, and -max_target_seqs as needed.
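For custom scripts, the twelve default columns can be read into named fields. This is a minimal sketch of such a parser, with an identity and length check that mirrors the thresholds used later in this tutorial; BlastHit, parse_hit, and passes are hypothetical names, not Blast2SNP functions.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


# One converter per default -outfmt 6 column, in order.
CONVERTERS = [str, str, float, int, int, int, int, int, int, int, float, float]


def parse_hit(line):
    """Parse one tab-separated -outfmt 6 line into a BlastHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(CONVERTERS):
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return BlastHit(*(conv(value) for conv, value in zip(CONVERTERS, fields)))


def passes(hit, min_identity=90.0, min_length=50):
    """True if the hit meets the identity and alignment-length thresholds."""
    return hit.pident >= min_identity and hit.length >= min_length


hit = parse_hit("q1\tchr1\t98.000\t100\t2\t0\t1\t100\t501\t600\t1e-45\t174")
print(hit.sseqid, hit.sstart, hit.send, passes(hit))
```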
3. Parse BLAST hits for candidate variants
Blast2SNP accepts BLAST tabular output. Typical parsing steps:
- For each hit, compute alignment orientation and map query positions to reference positions.
- Extract mismatched columns between query and reference alignment; each mismatch is a candidate SNP.
- Record for each candidate: reference chromosome/contig, reference position, reference base, query base, strand, alignment score, percent identity, read/query ID.
Blast2SNP will perform these mapping steps automatically when provided proper BLAST output and the reference FASTA (see command below).
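To make the mapping steps concrete, here is a minimal sketch of how mismatch columns can be turned into forward-strand reference coordinates. It is an illustration under stated assumptions, not Blast2SNP's actual implementation. It assumes the aligned sequences are available, which the default -outfmt 6 columns do not provide; BLAST+ can add them with a custom format such as -outfmt "6 std qseq sseq". It also assumes blastn conventions: coordinates are 1-based, and a minus-strand hit is reported with sstart greater than send and with sseq shown as the reverse complement of the reference.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def candidate_snps(qseq, sseq, sstart, send):
    """Return (ref_pos, ref_base, alt_base) on the forward reference strand
    for every mismatch column of one gapped pairwise alignment."""
    if len(qseq) != len(sseq):
        raise ValueError("aligned sequences must have equal length")
    step = 1 if sstart <= send else -1  # minus-strand hits run backwards
    ref_pos = sstart
    snps = []
    for q_base, s_base in zip(qseq.upper(), sseq.upper()):
        if s_base == "-":
            continue  # insertion in the query: no reference base consumed
        if q_base in "ACGT" and s_base in "ACGT" and q_base != s_base:
            if step == 1:
                snps.append((ref_pos, s_base, q_base))
            else:
                # Report both alleles on the forward reference strand.
                snps.append((ref_pos,
                             s_base.translate(COMPLEMENT),
                             q_base.translate(COMPLEMENT)))
        ref_pos += step  # also advances over deletions in the query
    return snps


print(candidate_snps("ACGTAC", "ACCTAC", 101, 106))
```

Gap columns and ambiguous bases such as N are skipped rather than reported, since only single-base substitutions are candidates here.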
4. Run Blast2SNP
Basic Blast2SNP invocation:
Code
blast2snp --blast blast_results.tsv --ref reference.fasta --out candidates.vcf --min-identity 90 --min-align-length 50
Key options:
- --blast: BLAST tabular file
- --ref: reference FASTA
- --out: output VCF file
- --min-identity: filter low-identity alignments
- --min-align-length: discard short alignments that produce unreliable SNP calls
If processing multiple samples, run Blast2SNP once per sample and merge the resulting VCFs afterwards, or supply per-sample BLAST files in a single run if your Blast2SNP version supports multi-sample input.
5. Initial filters and annotations
After Blast2SNP produces candidates.vcf, apply basic filters with bcftools:
Code
bcftools filter -i 'QUAL>=30 && DP>=5' candidates.vcf -o candidates.filtered.vcf
- QUAL: variant quality (threshold 30 is a common starting point)
- DP: read depth (>=5 helps reduce false positives)
If your VCF lacks DP, compute depth from the alignments or add coverage with samtools mpileup or custom scripts.
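As one example of a custom script for the missing-DP case, depth at a candidate site can be approximated by counting the BLAST hits whose reference interval covers it. This is a rough sketch: it treats each hit as a contiguous span and so ignores gaps inside alignments, and the function name is illustrative.

```python
def depth_at(hits, contig, pos):
    """Count hits on `contig` whose reference interval covers 1-based `pos`.

    `hits` is an iterable of (sseqid, sstart, send) tuples taken from the
    BLAST tabular output; minus-strand hits have sstart > send.
    """
    depth = 0
    for sseqid, sstart, send in hits:
        lo, hi = min(sstart, send), max(sstart, send)
        if sseqid == contig and lo <= pos <= hi:
            depth += 1
    return depth


hits = [("chr1", 100, 200), ("chr1", 250, 150), ("chr2", 100, 200)]
print(depth_at(hits, "chr1", 180))
```

If several hits come from the same query (for example with -max_target_seqs above 1), consider keeping only the best hit per query first so that one sequence is not counted twice.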
Annotate variants (optional) with snpEff or VEP to add functional context:
Code
snpEff ann referencedb candidates.filtered.vcf > candidates.ann.vcf
6. Advanced filtering strategies
- Strand bias: Remove SNPs supported predominantly by one strand.
- Allele balance: For heterozygous calls, require allele fraction within expected range (e.g., 0.3–0.7).
- Repetitive regions: Mask or remove variants in low-complexity or repetitive sequence (use RepeatMasker tracks or k-mer uniqueness).
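- Proximity filters: Flag SNPs within N bp of indels, and clusters of closely spaced SNPs, which may be alignment artifacts.
Example bcftools expression for allele fraction. Note that the AF tag written by the fill-tags plugin is the allele frequency computed from genotypes across samples, not a per-sample read-level allele balance; for the latter, use per-sample allelic depths (FORMAT/AD) if your VCF carries them:
Code
bcftools +fill-tags candidates.filtered.vcf -- -t AF | bcftools filter -i 'AF>0.3 && AF<0.7' -o candidates.het.vcf -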
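The proximity filter for clustered SNPs can be sketched with a sliding window over sorted positions on one contig. The window size and the number of SNPs tolerated per window below are illustrative defaults, not recommended values, and flag_clustered is a hypothetical helper.

```python
def flag_clustered(positions, window=10, max_in_window=2):
    """Return the set of positions that fall in any group of more than
    `max_in_window` SNPs spanning fewer than `window` bp (one contig)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    positions = sorted(positions)
    flagged = set()
    start = 0
    for end in range(len(positions)):
        # Shrink the window from the left until it spans < `window` bp.
        while positions[end] - positions[start] >= window:
            start += 1
        if end - start + 1 > max_in_window:
            flagged.update(positions[start:end + 1])
    return flagged


print(sorted(flag_clustered([100, 103, 105, 500, 900])))
```

Run it separately for each contig, and treat flagged sites as candidates for manual inspection rather than automatic removal, since genuinely variable regions also produce clusters.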
7. Validation and confirmation
- Visualize candidate SNPs in IGV or a similar genome browser by creating a BAM of query alignments against the reference:
  - Convert BLAST alignments to SAM/BAM if using BLAST-based mapping, or realign queries with a short-read aligner (bwa mem) for better visualization.
  - samtools view -bS alignments.sam | samtools sort -o alignments.sorted.bam -
  - samtools index alignments.sorted.bam
- Confirm top-priority SNPs by Sanger sequencing or independent sequencing runs.
- Cross-sample comparison: variants seen across multiple independent samples increase confidence.
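The cross-sample comparison can be sketched by keying each call on (contig, position, ref, alt) and counting the samples that support it. The input structure and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def recurrent_variants(calls_by_sample, min_samples=2):
    """Map each variant key to the sorted samples supporting it, keeping
    only variants seen in at least `min_samples` samples.

    `calls_by_sample` maps a sample name to an iterable of
    (contig, pos, ref, alt) tuples.
    """
    support = defaultdict(set)
    for sample, calls in calls_by_sample.items():
        for key in calls:
            support[key].add(sample)
    return {key: sorted(samples)
            for key, samples in support.items()
            if len(samples) >= min_samples}


calls = {
    "S1": [("chr1", 103, "C", "G"), ("chr1", 250, "A", "T")],
    "S2": [("chr1", 103, "C", "G")],
}
print(recurrent_variants(calls))
```

Recurrence raises confidence only when the samples are independent; a shared library preparation or a systematic alignment artifact can produce the same call in every sample.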
8. Reporting results
Provide a final VCF and a short TSV summary with key columns:
- Chromosome, Position, Ref, Alt, QUAL, DP, AF, Sample
Include filters applied and thresholds used.
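A TSV summary with these columns can be produced from a single-sample VCF with a short script. This sketch assumes DP and AF, when present, are stored as INFO tags and that the sample name is supplied by the caller; missing values are written as ".". The helper names are illustrative.

```python
HEADER = ["Chromosome", "Position", "Ref", "Alt", "QUAL", "DP", "AF", "Sample"]


def info_to_dict(info):
    """Split a VCF INFO string into a dict; flag tags map to True."""
    tags = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            tags[key] = value
        elif item and item != ".":
            tags[item] = True
    return tags


def vcf_line_to_row(line, sample):
    """Convert one VCF data line into the summary columns."""
    chrom, pos, _id, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    tags = info_to_dict(info)
    return [chrom, pos, ref, alt, qual,
            str(tags.get("DP", ".")), str(tags.get("AF", ".")), sample]


def vcf_to_tsv(lines, sample):
    """Return the TSV summary text for all data lines of a VCF."""
    rows = ["\t".join(HEADER)]
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        rows.append("\t".join(vcf_line_to_row(line, sample)))
    return "\n".join(rows) + "\n"


vcf = ["##fileformat=VCFv4.2",
       "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
       "chr1\t103\t.\tC\tG\t45\tPASS\tDP=12;AF=0.5"]
print(vcf_to_tsv(vcf, "sampleA"), end="")
```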
Example minimal pipeline (commands)
- Index reference and build the BLAST database:
Code
samtools faidx reference.fasta
makeblastdb -in reference.fasta -dbtype nucl
- BLAST:
Code
blastn -query sample.fasta -db reference.fasta -outfmt 6 -evalue 1e-6 -num_threads 8 -out sample.blast.tsv
- Blast2SNP:
Code
blast2snp --blast sample.blast.tsv --ref reference.fasta --out sample.vcf --min-identity 90 --min-align-length 50
- Filter:
Code
bcftools filter -i 'QUAL>=30 && DP>=5' sample.vcf -o sample.filtered.vcf
Tips and best practices
- Use stricter identity and length thresholds for divergent sequences.
- Always inspect a subset of calls manually in a genome browser.
- Keep metadata linking query IDs to samples to trace variants back to source sequences.
- Document all parameters for reproducibility.
Troubleshooting
- Few SNPs: loosen --min-identity or --min-align-length, or check query quality.
- Many false positives: increase quality/DP thresholds, mask low-complexity regions, or require multiple supporting queries.
- Misplaced coordinates: ensure BLAST output and reference FASTA use identical contig names and coordinate systems.
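The contig-name and coordinate check can be automated by comparing the BLAST subject columns with the samtools .fai index, whose first two tab-separated columns are the contig name and length. This is a minimal sketch assuming default -outfmt 6 column order and 1-based BLAST coordinates; the function names are illustrative.

```python
def contig_lengths(fai_lines):
    """Read contig name -> length from the lines of a .fai index."""
    lengths = {}
    for line in fai_lines:
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def check_hits(blast_lines, fai_lines):
    """Return (sseqid, problem) for hits that name an unknown contig or
    whose subject coordinates fall outside the contig."""
    lengths = contig_lengths(fai_lines)
    problems = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid, sstart, send = fields[1], int(fields[8]), int(fields[9])
        if sseqid not in lengths:
            problems.append((sseqid, "contig not in reference index"))
        elif min(sstart, send) < 1 or max(sstart, send) > lengths[sseqid]:
            problems.append((sseqid, "coordinates outside contig"))
    return problems


fai = ["chr1\t1000\t6\t60\t61"]
blast = ["q1\tchr1\t98.0\t100\t2\t0\t1\t100\t501\t600\t1e-45\t174",
         "q2\tchr2\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180"]
print(check_hits(blast, fai))
```

An empty result means every hit refers to a contig in the index with in-range coordinates; it does not prove that the BLAST database was built from the same FASTA file.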