Chapter 3 Data Integration

3.1 Input Requirements

The package accepts mutation data in either CSV format(produced by snippy) or standard VCF format. A minimal input should contain these columns:

exp_data <- read.csv("example/snippy_data/snps.csv")

head(exp_data,n=3)
##       CHROM  POS TYPE REF ALT EVIDENCE FTYPE STRAND   NT_POS  AA_POS
## 1 NC_000962 1718  snp   C   A A:20 C:0                              
## 2 NC_000962 1977  snp   A   G G:20 A:0                              
## 3 NC_000962 2532  snp   T   C C:20 T:0   CDS      + 481/1209 161/402
##                                    EFFECT LOCUS_TAG GENE
## 1                                                       
## 2                                                       
## 3 synonymous_variant c.481T>C p.Leu161Leu    Rv0002 dnaN
##                           PRODUCT
## 1                                
## 2                                
## 3 DNA polymerase III subunit beta

The snp_file_list contains mutation information identified by the snippy software. A standard mutation file should include the following 14 columns:

Column Description
CHROM Chromosome identifier
POS Genomic position of the mutation
TYPE Mutation type (SNP, INDEL, etc.)
REF Reference allele
ALT Alternate allele
EVIDENCE Read depth supporting the mutation
FTYPE Functional type of mutation
STRAND Coding strand (+/-)
NT_POS Nucleotide position within gene
AA_POS Amino acid position in protein
EFFECT Functional consequence
LOCUS_TAG Gene locus identifier
GENE Gene name
PRODUCT Protein product description

Mutation matrices are typically represented as binary (0/1) indicators denoting absence/presence/ of mutations across samples. Three key observations motivate our storage approach:

  1. Sparsity: Most mutations occur in only a few samples.
  2. Annotation redundancy: Many mutation types are identical across samples.
  3. Scalability: Whole-genome datasets can be extremely large.

We therefore decompose the data into two components:

  1. snp_anno: Stores the union of all mutation annotations (shared across samples)
  2. snp_data: Stores the sparse binary matrix of mutation occurrences

This decomposition provides significant advantages:

  • Reduced storage requirements (Each unique annotations stored once)
  • Faster matrix operations on sparse data
  • Efficient memory usage during analysis
# Demo to provide snp files and sample names for integration and production of 
# snp_anno and snp_data
snp_file_list <- list.files("example",recursive =T, full.names = T)
all_smp_name <- str_extract(snp_file_list,"(?<=snippy_data/)[^/]+")

IntegrateData(
  file_path = snp_file_list,
  smp_name = all_smp_name,
  output_dir = "output",
  chunk_size = 1000,  # Process in chunks of 1000 samples to reduce memory usage
  nworkers = 4        # Parallel processing
)

The IntegrateData function generates two files:snp_anno and snp_data, as mentioned above.

For large-scale analyses (>10,000 samples), we recommend processing in chunks (set chunk_size appropriately).

This approach has been tested with >50,000 bacterial genomes while maintaining memory efficiency.

Now, we directly use snp_anno and snp_data files produced by IntegrateData function to construct the SNPObject.

# read in snp_anno, qread is from qs package
snp_anno <- qread("example/snp_anno.qs",nthreads = 8)

# edit Mtb gene names to make it more readable (optional). 
snp_anno$LOCUS_TAG <- gsub("VBD_","v",snp_anno$LOCUS_TAG)
snp_anno$GENE <- ifelse(str_detect(snp_anno$GENE,"RVBD_RS"),snp_anno$LOCUS_TAG,snp_anno$GENE)

head(snp_anno)
##             CHROM POS TYPE FTYPE STRAND  NT_POS AA_POS
## A8Csnp  NC_018143   8  snp   CDS      +  8/1524  3/507
## G10Csnp NC_018143  10  snp   CDS      + 10/1524  4/507
## A11Csnp NC_018143  11  snp   CDS      + 11/1524  4/507
## C12Tsnp NC_018143  12  snp   CDS      + 12/1524  4/507
## C15Tsnp NC_018143  15  snp   CDS      + 15/1524  5/507
## G16Csnp NC_018143  16  snp   CDS      + 16/1524  6/507
##                                       EFFECT LOCUS_TAG_NEW GENE
## A8Csnp     missense_variant c.8A>C p.Asp3Ala  RVBD_RS00005 dnaA
## G10Csnp   missense_variant c.10G>C p.Asp4His  RVBD_RS00005 dnaA
## A11Csnp   missense_variant c.11A>C p.Asp4Ala  RVBD_RS00005 dnaA
## C12Tsnp synonymous_variant c.12C>T p.Asp4Asp  RVBD_RS00005 dnaA
## C15Tsnp synonymous_variant c.15C>T p.Pro5Pro  RVBD_RS00005 dnaA
## G16Csnp   missense_variant c.16G>C p.Gly6Arg  RVBD_RS00005 dnaA
##                                                PRODUCT AA_effect_short
## A8Csnp  chromosomal replication initiator protein DnaA             D3A
## G10Csnp chromosomal replication initiator protein DnaA             D4H
## A11Csnp chromosomal replication initiator protein DnaA             D4A
## C12Tsnp chromosomal replication initiator protein DnaA             D4D
## C15Tsnp chromosomal replication initiator protein DnaA             P5P
## G16Csnp chromosomal replication initiator protein DnaA             G6R
##         LOCUS_TAG
## A8Csnp     Rv0001
## G10Csnp    Rv0001
## A11Csnp    Rv0001
## C12Tsnp    Rv0001
## C15Tsnp    Rv0001
## G16Csnp    Rv0001
# read in snp_data
snp_data <- qread("example/snp_data.qs",nthreads = 8)

# snp_data is a large sparse matrix, with over 50000 rows(samples) and >700k 
# non-redundant mutations.
snp_data[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##           A8Csnp G10Csnp A11Csnp C12Tsnp C15Tsnp G16Csnp
## DRR034340      .       .       .       .       .       .
## DRR034341      .       .       .       .       .       .
## DRR034342      .       .       .       .       1       .
## DRR034343      .       .       .       .       .       .
## DRR034344      .       .       .       .       .       .
## DRR034345      .       .       .       .       .       .

3.2 Creating the SNP Object

# Be careful! The first column of metadata must named as Sample_name.
sample_info <- fread("example/sample_info.txt",data.table = F) %>%
  filter(ID %in% rownames(snp_data)) %>%
  dplyr::rename(Sample_name=ID) %>% 
  mutate(Lineage=paste0("L",str_extract(Lineage,"(?<=LINEAGE).+")))

head(sample_info)
##   Sample_name Genome_coverage Sequencing_depth Lineage     Sublineage BioPrject
## 1   DRR034340        0.976338          52.9203      L2   lineage2.2.1 PRJDB3875
## 2   DRR034341        0.971547          36.7679      L2   lineage2.2.1 PRJDB3875
## 3   DRR034342        0.981057          40.2434      L4 lineage4.9.x.x PRJDB3875
## 4   DRR034343        0.977891          62.5146      L2   lineage2.2.1 PRJDB3875
## 5   DRR034344        0.982537          48.4552      L4 lineage4.9.x.x PRJDB3875
## 6   DRR034345        0.976601          42.5142      L4     lineage4.5 PRJDB3875
# Create the object
Mtb_mut <- CreateSNPObj(snp_anno = snp_anno,
                        snp_data = snp_data,
                        metadata = sample_info,
                        remove_genes_dt = "Mtb", # remove genes for Mtb(including PE/PPE,phage seq,etc)
                        verbose=T) # show the progress
## [1/2] Remove selected genes
## [2/2] Annotate amino acid changes

The qMut package utilizes R6 objects that contain multiple accessible attributes and methods for comprehensive mutation analysis; we will show them in next chapter.

3.3 Merging Objects for Expanded Analysis

When incorporating new samples, the MergeData() function enables seamless integration of multiple qMut objects.

obj_merged <- MergeData(list(obj_old,obj_new))