Chapter 3 Data Integration
3.1 Input Requirements
The package accepts mutation data in either CSV format(produced by snippy) or standard VCF format. A minimal input should contain these columns:
## CHROM POS TYPE REF ALT EVIDENCE FTYPE STRAND NT_POS AA_POS
## 1 NC_000962 1718 snp C A A:20 C:0
## 2 NC_000962 1977 snp A G G:20 A:0
## 3 NC_000962 2532 snp T C C:20 T:0 CDS + 481/1209 161/402
## EFFECT LOCUS_TAG GENE
## 1
## 2
## 3 synonymous_variant c.481T>C p.Leu161Leu Rv0002 dnaN
## PRODUCT
## 1
## 2
## 3 DNA polymerase III subunit beta
The snp_file_list
contains mutation information identified by the snippy software. A standard mutation file should include the following 14 columns:
Column | Description |
---|---|
CHROM |
Chromosome identifier |
POS |
Genomic position of the mutation |
TYPE |
Mutation type (SNP, INDEL, etc.) |
REF |
Reference allele |
ALT |
Alternate allele |
EVIDENCE |
Read depth supporting the mutation |
FTYPE |
Functional type of mutation |
STRAND |
Coding strand (+/-) |
NT_POS |
Nucleotide position within gene |
AA_POS |
Amino acid position in protein |
EFFECT |
Functional consequence |
LOCUS_TAG |
Gene locus identifier |
GENE |
Gene name |
PRODUCT |
Protein product description |
Mutation matrices are typically represented as binary (0/1) indicators denoting absence/presence/ of mutations across samples. Three key observations motivate our storage approach:
- Sparsity: Most mutations occur in only a few samples.
- Annotation redundancy: Many mutation types are identical across samples.
- Scalability: Whole-genome datasets can be extremely large.
We therefore decompose the data into two components:
- snp_anno: Stores the union of all mutation annotations (shared across samples)
- snp_data: Stores the sparse binary matrix of mutation occurrences
This decomposition provides significant advantages:
- Reduced storage requirements (Each unique annotations stored once)
- Faster matrix operations on sparse data
- Efficient memory usage during analysis
# Demo to provide snp files and sample names for integration and production of
# snp_anno and snp_data
snp_file_list <- list.files("example",recursive =T, full.names = T)
all_smp_name <- str_extract(snp_file_list,"(?<=snippy_data/)[^/]+")
IntegrateData(
file_path = snp_file_list,
smp_name = all_smp_name,
output_dir = "output",
chunk_size = 1000, # Process in chunks of 1000 samples to reduce memory usage
nworkers = 4 # Parallel processing
)
The IntegrateData function generates two files:snp_anno and snp_data, as mentioned above.
For large-scale analyses (>10,000 samples), we recommend processing in chunks (set chunk_size appropriately).
This approach has been tested with >50,000 bacterial genomes while maintaining memory efficiency.
Now, we directly use snp_anno and snp_data files produced by IntegrateData
function to construct the SNPObject.
# read in snp_anno, qread is from qs package
snp_anno <- qread("example/snp_anno.qs",nthreads = 8)
# edit Mtb gene names to make it more readable (optional).
snp_anno$LOCUS_TAG <- gsub("VBD_","v",snp_anno$LOCUS_TAG)
snp_anno$GENE <- ifelse(str_detect(snp_anno$GENE,"RVBD_RS"),snp_anno$LOCUS_TAG,snp_anno$GENE)
head(snp_anno)
## CHROM POS TYPE FTYPE STRAND NT_POS AA_POS
## A8Csnp NC_018143 8 snp CDS + 8/1524 3/507
## G10Csnp NC_018143 10 snp CDS + 10/1524 4/507
## A11Csnp NC_018143 11 snp CDS + 11/1524 4/507
## C12Tsnp NC_018143 12 snp CDS + 12/1524 4/507
## C15Tsnp NC_018143 15 snp CDS + 15/1524 5/507
## G16Csnp NC_018143 16 snp CDS + 16/1524 6/507
## EFFECT LOCUS_TAG_NEW GENE
## A8Csnp missense_variant c.8A>C p.Asp3Ala RVBD_RS00005 dnaA
## G10Csnp missense_variant c.10G>C p.Asp4His RVBD_RS00005 dnaA
## A11Csnp missense_variant c.11A>C p.Asp4Ala RVBD_RS00005 dnaA
## C12Tsnp synonymous_variant c.12C>T p.Asp4Asp RVBD_RS00005 dnaA
## C15Tsnp synonymous_variant c.15C>T p.Pro5Pro RVBD_RS00005 dnaA
## G16Csnp missense_variant c.16G>C p.Gly6Arg RVBD_RS00005 dnaA
## PRODUCT AA_effect_short
## A8Csnp chromosomal replication initiator protein DnaA D3A
## G10Csnp chromosomal replication initiator protein DnaA D4H
## A11Csnp chromosomal replication initiator protein DnaA D4A
## C12Tsnp chromosomal replication initiator protein DnaA D4D
## C15Tsnp chromosomal replication initiator protein DnaA P5P
## G16Csnp chromosomal replication initiator protein DnaA G6R
## LOCUS_TAG
## A8Csnp Rv0001
## G10Csnp Rv0001
## A11Csnp Rv0001
## C12Tsnp Rv0001
## C15Tsnp Rv0001
## G16Csnp Rv0001
# read in snp_data
snp_data <- qread("example/snp_data.qs",nthreads = 8)
# snp_data is a large sparse matrix, with over 50000 rows(samples) and >700k
# non-redundant mutations.
snp_data[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## A8Csnp G10Csnp A11Csnp C12Tsnp C15Tsnp G16Csnp
## DRR034340 . . . . . .
## DRR034341 . . . . . .
## DRR034342 . . . . 1 .
## DRR034343 . . . . . .
## DRR034344 . . . . . .
## DRR034345 . . . . . .
3.2 Creating the SNP Object
# Be careful! The first column of metadata must named as Sample_name.
sample_info <- fread("example/sample_info.txt",data.table = F) %>%
filter(ID %in% rownames(snp_data)) %>%
dplyr::rename(Sample_name=ID) %>%
mutate(Lineage=paste0("L",str_extract(Lineage,"(?<=LINEAGE).+")))
head(sample_info)
## Sample_name Genome_coverage Sequencing_depth Lineage Sublineage BioPrject
## 1 DRR034340 0.976338 52.9203 L2 lineage2.2.1 PRJDB3875
## 2 DRR034341 0.971547 36.7679 L2 lineage2.2.1 PRJDB3875
## 3 DRR034342 0.981057 40.2434 L4 lineage4.9.x.x PRJDB3875
## 4 DRR034343 0.977891 62.5146 L2 lineage2.2.1 PRJDB3875
## 5 DRR034344 0.982537 48.4552 L4 lineage4.9.x.x PRJDB3875
## 6 DRR034345 0.976601 42.5142 L4 lineage4.5 PRJDB3875
# Create the object
Mtb_mut <- CreateSNPObj(snp_anno = snp_anno,
snp_data = snp_data,
metadata = sample_info,
remove_genes_dt = "Mtb", # remove genes for Mtb(including PE/PPE,phage seq,etc)
verbose=T) # show the progress
## [1/2] Remove selected genes
## [2/2] Annotate amino acid changes
The qMut package utilizes R6 objects that contain multiple accessible attributes and methods for comprehensive mutation analysis; we will show them in next chapter.