Chapter 3 Data Integration

3.1 Input Requirements

The package accepts mutation data in either CSV format(produced by snippy) or standard VCF format. A minimal input should contain these columns:

exp_data <- read.csv("example/snippy_data/snps.csv")

head(exp_data,n=3)

##       CHROM  POS TYPE REF ALT EVIDENCE FTYPE STRAND   NT_POS  AA_POS
## 1 NC_000962 1718  snp   C   A A:20 C:0                              
## 2 NC_000962 1977  snp   A   G G:20 A:0                              
## 3 NC_000962 2532  snp   T   C C:20 T:0   CDS      + 481/1209 161/402
##                                    EFFECT LOCUS_TAG GENE
## 1                                                       
## 2                                                       
## 3 synonymous_variant c.481T>C p.Leu161Leu    Rv0002 dnaN
##                           PRODUCT
## 1                                
## 2                                
## 3 DNA polymerase III subunit beta

The snp_file_list contains mutation information identified by the snippy software. A standard mutation file should include the following 14 columns:

Column	Description
`CHROM`	Chromosome identifier
`POS`	Genomic position of the mutation
`TYPE`	Mutation type (SNP, INDEL, etc.)
`REF`	Reference allele
`ALT`	Alternate allele
`EVIDENCE`	Read depth supporting the mutation
`FTYPE`	Functional type of mutation
`STRAND`	Coding strand (+/-)
`NT_POS`	Nucleotide position within gene
`AA_POS`	Amino acid position in protein
`EFFECT`	Functional consequence
`LOCUS_TAG`	Gene locus identifier
`GENE`	Gene name
`PRODUCT`	Protein product description

Mutation matrices are typically represented as binary (0/1) indicators denoting absence/presence/ of mutations across samples. Three key observations motivate our storage approach:

Sparsity: Most mutations occur in only a few samples.
Annotation redundancy: Many mutation types are identical across samples.
Scalability: Whole-genome datasets can be extremely large.

We therefore decompose the data into two components:

snp_anno: Stores the union of all mutation annotations (shared across samples)
snp_data: Stores the sparse binary matrix of mutation occurrences

This decomposition provides significant advantages:

Reduced storage requirements (Each unique annotations stored once)
Faster matrix operations on sparse data
Efficient memory usage during analysis

# Demo to provide snp files and sample names for integration and production of 
# snp_anno and snp_data
snp_file_list <- list.files("example",recursive =T, full.names = T)
all_smp_name <- str_extract(snp_file_list,"(?<=snippy_data/)[^/]+")

IntegrateData(
  file_path = snp_file_list,
  smp_name = all_smp_name,
  output_dir = "output",
  chunk_size = 1000,  # Process in chunks of 1000 samples to reduce memory usage
  nworkers = 4        # Parallel processing
)

The IntegrateData function generates two files:snp_anno and snp_data, as mentioned above.

For large-scale analyses (>10,000 samples), we recommend processing in chunks (set chunk_size appropriately).

This approach has been tested with >50,000 bacterial genomes while maintaining memory efficiency.

Now, we directly use snp_anno and snp_data files produced by IntegrateData function to construct the SNPObject.

# read in snp_anno, qread is from qs package
snp_anno <- qread("example/snp_anno.qs",nthreads = 8)

# edit Mtb gene names to make it more readable (optional). 
snp_anno$LOCUS_TAG <- gsub("VBD_","v",snp_anno$LOCUS_TAG)
snp_anno$GENE <- ifelse(str_detect(snp_anno$GENE,"RVBD_RS"),snp_anno$LOCUS_TAG,snp_anno$GENE)

head(snp_anno)

##             CHROM POS TYPE FTYPE STRAND  NT_POS AA_POS
## A8Csnp  NC_018143   8  snp   CDS      +  8/1524  3/507
## G10Csnp NC_018143  10  snp   CDS      + 10/1524  4/507
## A11Csnp NC_018143  11  snp   CDS      + 11/1524  4/507
## C12Tsnp NC_018143  12  snp   CDS      + 12/1524  4/507
## C15Tsnp NC_018143  15  snp   CDS      + 15/1524  5/507
## G16Csnp NC_018143  16  snp   CDS      + 16/1524  6/507
##                                       EFFECT LOCUS_TAG_NEW GENE
## A8Csnp     missense_variant c.8A>C p.Asp3Ala  RVBD_RS00005 dnaA
## G10Csnp   missense_variant c.10G>C p.Asp4His  RVBD_RS00005 dnaA
## A11Csnp   missense_variant c.11A>C p.Asp4Ala  RVBD_RS00005 dnaA
## C12Tsnp synonymous_variant c.12C>T p.Asp4Asp  RVBD_RS00005 dnaA
## C15Tsnp synonymous_variant c.15C>T p.Pro5Pro  RVBD_RS00005 dnaA
## G16Csnp   missense_variant c.16G>C p.Gly6Arg  RVBD_RS00005 dnaA
##                                                PRODUCT AA_effect_short
## A8Csnp  chromosomal replication initiator protein DnaA             D3A
## G10Csnp chromosomal replication initiator protein DnaA             D4H
## A11Csnp chromosomal replication initiator protein DnaA             D4A
## C12Tsnp chromosomal replication initiator protein DnaA             D4D
## C15Tsnp chromosomal replication initiator protein DnaA             P5P
## G16Csnp chromosomal replication initiator protein DnaA             G6R
##         LOCUS_TAG
## A8Csnp     Rv0001
## G10Csnp    Rv0001
## A11Csnp    Rv0001
## C12Tsnp    Rv0001
## C15Tsnp    Rv0001
## G16Csnp    Rv0001

# read in snp_data
snp_data <- qread("example/snp_data.qs",nthreads = 8)

# snp_data is a large sparse matrix, with over 50000 rows(samples) and >700k 
# non-redundant mutations.
snp_data[1:6,1:6]

## 6 x 6 sparse Matrix of class "dgCMatrix"
##           A8Csnp G10Csnp A11Csnp C12Tsnp C15Tsnp G16Csnp
## DRR034340      .       .       .       .       .       .
## DRR034341      .       .       .       .       .       .
## DRR034342      .       .       .       .       1       .
## DRR034343      .       .       .       .       .       .
## DRR034344      .       .       .       .       .       .
## DRR034345      .       .       .       .       .       .

3.2 Creating the SNP Object

# Be careful! The first column of metadata must named as Sample_name.
sample_info <- fread("example/sample_info.txt",data.table = F) %>%
  filter(ID %in% rownames(snp_data)) %>%
  dplyr::rename(Sample_name=ID) %>% 
  mutate(Lineage=paste0("L",str_extract(Lineage,"(?<=LINEAGE).+")))

head(sample_info)

##   Sample_name Genome_coverage Sequencing_depth Lineage     Sublineage BioPrject
## 1   DRR034340        0.976338          52.9203      L2   lineage2.2.1 PRJDB3875
## 2   DRR034341        0.971547          36.7679      L2   lineage2.2.1 PRJDB3875
## 3   DRR034342        0.981057          40.2434      L4 lineage4.9.x.x PRJDB3875
## 4   DRR034343        0.977891          62.5146      L2   lineage2.2.1 PRJDB3875
## 5   DRR034344        0.982537          48.4552      L4 lineage4.9.x.x PRJDB3875
## 6   DRR034345        0.976601          42.5142      L4     lineage4.5 PRJDB3875

# Create the object
Mtb_mut <- CreateSNPObj(snp_anno = snp_anno,
                        snp_data = snp_data,
                        metadata = sample_info,
                        remove_genes_dt = "Mtb", # remove genes for Mtb(including PE/PPE,phage seq,etc)
                        verbose=T) # show the progress

## [1/2] Remove selected genes

## [2/2] Annotate amino acid changes

The qMut package utilizes R6 objects that contain multiple accessible attributes and methods for comprehensive mutation analysis; we will show them in next chapter.

3.3 Merging Objects for Expanded Analysis

When incorporating new samples, the MergeData() function enables seamless integration of multiple qMut objects.

obj_merged <- MergeData(list(obj_old,obj_new))