2026-04-10
Reference-free -omics at scale
Hold k-mer counts for as many TCGA + GTEx samples as possible in a single table, then join it against a mass-spec-sequenced aeTSA peptide table.
KCT (RNA-Seq, n samples) aeTSA peptide table
┌────────────────────────────┐ ┌───────────────────┐
│ k-mer │ s1 │ s2 │ ... │ sn │ │ peptide │ source │
│ MAEK │ 3 │ 0 │ ... │ 1 │ │ MAEKFW │ sample1 │
│ AEKF │ 0 │ 5 │ ... │ 0 │ │ ... │ ... │
│ ... │ │ │ │ │ └───────────────────┘
└────────────────────────────┘
↓ join on k-mer overlap
per-sample expression of each candidate aeTSA
across the full TCGA + GTEx cohort
TCGA + GTEx is on the order of thousands of samples. Every saved byte means more samples held in memory at the same time.
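As a mental model, the join can be sketched with a toy in-memory table. All names and values here are hypothetical; the real KCT is bit-packed, sorted, and orders of magnitude larger than a Dict:

```julia
# Toy sketch of the k-mer join (hypothetical names; the real KCT is
# bit-packed and sorted, not a Dict).
kmers(seq::String, k::Int) = [seq[i:i+k-1] for i in 1:length(seq)-k+1]

# Mini "KCT": 3 samples, k = 4 (the diagram's MAEK / AEKF come from MAEKFW)
kct = Dict(
    "MAEK" => [3, 0, 1],
    "AEKF" => [2, 5, 0],
    "EKFW" => [2, 1, 0],
)

# Per-sample support of a candidate aeTSA peptide: the minimum count over
# its constituent k-mers (a k-mer absent from the table counts as zero).
function peptide_support(kct, peptide, k, n_samples)
    counts = [get(kct, km, zeros(Int, n_samples)) for km in kmers(peptide, k)]
    return reduce((a, b) -> min.(a, b), counts)
end

peptide_support(kct, "MAEKFW", 4, 3)  # -> [2, 0, 0]: only sample 1 expresses it
```

Taking the minimum over constituent k-mers is one reasonable aggregation; the actual pipeline may aggregate differently.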
| Component | Role |
|---|---|
| JelloFish.jl | Multi-threaded FASTQ k-merization + parallel hash merge |
| AAAlphabet.jl | 5-bit AA alphabet; k-mers ≤ 12 AA fit in one UInt64 |
| PackedArray.jl | Variable-width bit-packed count storage + boundary bitmap |
| NeoKCT.jl | K-mer count table + prefix-indexed binary search + push! logic |
| KCTLoader.jl | Versioned binary serialization, backward-compatible |
| KCTBenchmarker.jl | Component sizes, query speed, cardinality, elapsed time |
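The 5-bit claim is easy to verify: 20 amino acids need 5 bits, and 12 residues × 5 bits = 60 ≤ 64. A minimal packing sketch; the code assignment below is an assumption, and AAAlphabet.jl's actual encoding may differ:

```julia
# Minimal sketch of a 5-bit amino-acid packing (assumed code assignment;
# the real AAAlphabet.jl encoding may differ).
const AAS = collect("ACDEFGHIKLMNPQRSTVWY")          # 20 codes, 0..19
const AA_CODE = Dict(aa => UInt64(i - 1) for (i, aa) in enumerate(AAS))

# Pack a k-mer (k ≤ 12) into one UInt64, 5 bits per residue.
function pack(kmer::String)
    @assert length(kmer) <= 12 "12 × 5 bits = 60 ≤ 64"
    bits = UInt64(0)
    for aa in kmer
        bits = (bits << 5) | AA_CODE[aa]
    end
    return bits
end

# Unpack k residues back out of the integer.
function unpack(bits::UInt64, k::Int)
    buf = Char[]
    for i in (k - 1):-1:0
        push!(buf, AAS[((bits >> (5i)) & 0x1f) + 1])
    end
    return String(buf)
end

unpack(pack("MAEK"), 4)  # round-trips to "MAEK"
```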
I went through a major refactoring of the KCT code to make it more maintainable, essentially catching up on a lot of “temporary solutions” to make the whole codebase more robust. I also plan to abstract JelloFish behind a standalone module.
Every k-mer lives at a different heap address. Iterating the table means one pointer chase per entry.
Serialization had to write type metadata per entry.
Worse: a bug caused count words to not be deduplicated:
collapse! doesn’t really work in place. The ! is there to warn that it makes the “in-place” kct unusable; the collapsed table is returned instead. Because of this, collapsing was never properly applied.
Counts dominate: mostly wasted words from duplication.
function Base.push!(kct::NeoKCT{K, Ab, W}, sample_hashtable::Dict{UInt64, UInt32}; collapse::Bool=false) where {K, Ab<:Alphabet, W<:Unsigned}
@showprogress desc="Adding Sample $(kct.samples.x+1) to Table..." for (k_bits, count) in sample_hashtable
tmp_seq = Kmer{Ab, K, 1}(Kmers.unsafe, (k_bits,))
k_pos = findfirst(kct, tmp_seq)
word_id = 0::Int
if k_pos == 0
word_id = _push_new_kmer_counts!(kct.counts, kct.samples.x, count)
ke = K_Element{K, Ab}(tmp_seq, NTuple{1, UInt32}(UInt32(word_id)))
push!(kct.table, ke)
else
word_id = push!(kct.counts, UInt64(count), kct.table[k_pos].chunk_ids[end])
if kct.table[k_pos].chunk_ids[end] != UInt32(word_id)
kct.table[k_pos] = K_Element{K, Ab}(kct.table[k_pos].seq, push(kct.table[k_pos].chunk_ids, UInt32(word_id)))
end
end
end
sort!(kct)
compute_index!(kct)
kct = collapse ? collapse!(kct) : kct  # BUG: rebinds the local name only; the caller's kct is never collapsed
kct.samples.x += 1
return kct
end
function build_kct(samples::AbstractVector{String}, K::Int=30, chunks::Int = 500_000; save_at_samples::AbstractVector{Int}=Int[], save_path::String = "", collapse_every::Int=1)
kct = NeoKCT{K÷3, AAAlphabet, UInt64}(jello_superthreaded_hash(popfirst!(samples), K, chunks))
for (i, sample) in enumerate(samples)
sample_hashtable = jello_superthreaded_hash(sample, K, chunks)
push!(kct, sample_hashtable, collapse = (i+1) % collapse_every == 0)
i+1 in save_at_samples && save(kct, save_path*"$(day(now()))_$(month(now()))_$(year(now()))_$(i+1)samples__neokct.kct")
end
return kct
end

collapse! here does create a new collapsed kct that is returned… but it is never caught, because push! is assumed to work in place (which it does, except for that…)
Once fixed, packing density matches theory:
~8-20 small counts per UInt128 word at 50 to 100 samples.
The count table went from dominant cost to basically a rounding error, thanks to bit-packing and word deduplication.
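The packing-density arithmetic can be sketched with a fixed-width toy packer. PackedArray.jl is variable-width with a boundary bitmap; this sketch assumes a single width per word, which is enough to show why small counts collapse:

```julia
# Sketch of fixed-width bit-packing into 128-bit words: at width w,
# 128 ÷ w counts fit per UInt128 word (e.g. 21 six-bit counts per word).
function pack_counts(counts::Vector{Int}, w::Int)
    per_word = 128 ÷ w
    words = UInt128[]
    for (i, c) in enumerate(counts)
        slot = (i - 1) % per_word
        slot == 0 && push!(words, UInt128(0))   # start a new word
        words[end] |= UInt128(c) << (w * slot)  # place count in its slot
    end
    return words
end

# Read count i back out of the packed words.
unpack_count(words, i, w) =
    Int((words[(i - 1) ÷ (128 ÷ w) + 1] >> (w * ((i - 1) % (128 ÷ w)))) &
        ((UInt128(1) << w) - 1))

counts = [i % 64 for i in 0:99]   # 100 counts that fit in 6 bits
words  = pack_counts(counts, 6)   # 21 counts per word -> 5 words for 100 counts
@assert all(unpack_count(words, i, 6) == counts[i] for i in 1:100)
```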
Reminder:
Count table collapsed. K-mer sequences & chunk indexes now dominate. This is a huge improvement over previous tables!
Vector{K_Element} decomposed into flat parallel arrays, the same layout as a CSR sparse matrix (Yale format):
# [V1.0](https://github.com/lemieux-lab/NeoKCT/tree/edd96dd73a8c0e2d0e75f8d62003fc8e5a4e0b43)
kct.table::Vector{K_Element} # heap-scattered objects, pointer per k-mer
# [V1.2](https://github.com/lemieux-lab/NeoKCT/tree/be421db7dd7052bfd41cdf6348f1bfab4d9f3363)
struct NeoKCT{K, Ab, W, C}
seqs::Vector{UInt64} # sorted k-mer bit-representations: contiguous
offsets::Vector{UInt32} # offsets[i] = start of k-mer i's CIDs in flat_cids
flat_cids::Vector{UInt32} # all PackedArray word IDs, concatenated
counts::PackedArray{UInt32, W}
# ...
end

Everything lives in contiguous flat vectors of known sizes (no per-k-mer heap objects).
Avoiding scattered heap allocations saves on pointer cost, build, write & load times, and makes the garbage collector hate me less.
| | V1.0 | V1.2 |
|---|---|---|
| Memory layout | Heap-scattered | Contiguous flat vectors |
| Full pointer walkthrough | O(n_kmers) pointer chases | Bounded: sizeof each array |
| Disk write | Per-entry metadata + data | Bulk write(io, vec) per array |
| Build time (100 samples) | ~36 hours | ~19.5 hours |
| Load time (100 samples) | ~20 min | ~30 sec |
Offset array is the only indirection. Everything else is a direct slice into a flat vector.
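That single indirection can be shown with a toy version of the layout (a hypothetical ToyKCT mirroring the field names above): k-mer i's word IDs are the contiguous slice flat_cids[offsets[i] : offsets[i+1]-1].

```julia
# Toy CSR-style layout (hypothetical struct mirroring the V1.2 fields).
struct ToyKCT
    seqs::Vector{UInt64}      # sorted k-mer bit-representations
    offsets::Vector{UInt32}   # length n_kmers + 1; sentinel at the end
    flat_cids::Vector{UInt32} # all PackedArray word IDs, concatenated
end

# k-mer i's CIDs: one offset lookup, then a direct slice. No pointer chasing.
cids(t::ToyKCT, i::Int) = @view t.flat_cids[t.offsets[i]:t.offsets[i+1]-1]

t = ToyKCT(
    UInt64[10, 20, 30],
    UInt32[1, 3, 4, 6],      # k-mer 1 has 2 CIDs, k-mer 2 has 1, k-mer 3 has 2
    UInt32[7, 9, 2, 5, 8],
)
cids(t, 1)  # -> [7, 9]
cids(t, 3)  # -> [5, 8]
```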
Predates V1.2, but worth showing again in its latest form.
function Base.findfirst(kct::NeoKCT{K, Ab}, key::Kmer{Ab, K}) where {K, Ab}
# Extract the 4-AA prefix (20 bits) to index into pre-computed bucket ranges
idx_key = key.data[1] >> (idx_prefix_size(kct) * bits_per_symbol(Ab())) + 1
r = kct.idx[2][idx_key] # UnitRange{Int64}: start:stop for this bucket
i = searchsortedfirst(kct.seqs, key.data[1], r.start, r.stop, Base.Forward)
return (i in r) && kct.seqs[i] == key.data[1] ? i : 0
end

~160,000 buckets (20^4 possible 4-AA prefixes): each bucket covers ~1/160,000th of the table.
Lookup is O(log(n_kmers / 160,000)). With V1.2, lookups run at 1.67e7 k-mers per second at 100 samples (3.5 billion unique k-mers).
Note: the index itself is a fixed fill(0:-1, 4^15): ~16 GB of constant overhead, always present regardless of table size.
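The bucket idea can be sketched on small integers. This toy uses 4-bit prefixes over UInt16 keys (the real index is built over AA prefixes and is far larger), but the mechanics are the same: precompute one range per prefix, then binary-search only inside the key's bucket.

```julia
# Top `pbits` bits of a key select its bucket.
prefix(x, pbits, total_bits) = Int(x >> (total_bits - pbits))

# Build bucket ranges over a *sorted* seqs vector: equal prefixes are contiguous.
function build_index(seqs::Vector{UInt16}, pbits::Int)
    idx = fill(1:0, 2^pbits)                 # empty ranges by default
    for (i, s) in enumerate(seqs)
        p = prefix(s, pbits, 16) + 1
        idx[p] = isempty(idx[p]) ? (i:i) : (first(idx[p]):i)
    end
    return idx
end

# Lookup = binary search restricted to the key's bucket.
function lookup(seqs, idx, key::UInt16, pbits::Int)
    r = idx[prefix(key, pbits, 16) + 1]
    i = searchsortedfirst(seqs, key, first(r), last(r), Base.Forward)
    return (i in r && seqs[i] == key) ? i : 0
end

seqs = sort!(UInt16[0x0011, 0x0102, 0x1005, 0x1006, 0xf000])
idx  = build_index(seqs, 4)
lookup(seqs, idx, 0x1005, 4)  # -> 3
lookup(seqs, idx, 0x2222, 4)  # -> 0 (absent)
```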
# Simplified: the old logic was flat, mixed concerns, and did not fork shared words correctly
function Base.push!(kct::NeoKCT, sample_hashtable::Dict{UInt64, UInt32})
shared_words = find_shared_words(kct) # O(n_flat_cids) scan every push
for (k_bits, count) in sample_hashtable
k_pos = findfirst(kct, ...)
if k_pos == 0
# inline: push new word to counts, append to seqs/flat_cids/n_cids
else
# inline: check shared words, fill missing samples, append CID
end
end
sort!(kct) # full re-sort of all arrays
end

New k-mers and existing k-mers are handled inline; the sort runs unconditionally at the end.
Indexes (offsets) are now visible. seqs + flat_cids are the bulk. But offsets has a scaling problem: what happens when the number of words goes above typemax(UInt32)?
offsets stores absolute positions into flat_cids:
Replace with per-k-mer CID count. Recover offsets on the fly via cumsum:
| | offsets (V1.2) | n_cids (V1.3) |
|---|---|---|
| Type | UInt32 | UInt16 |
| Length | n_kmers + 1 | n_kmers |
| Bytes/k-mer | 4 | 2 |
| Overflow risk | Yes (~4B CIDs) | No |
| V1.2 compat | – | n_cids = UInt16.(diff(offsets)) |
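The offsets ↔ n_cids round trip can be sketched directly (values are illustrative):

```julia
# V1.3 stores a per-k-mer CID count; V1.2 offsets are recovered via cumsum.
n_cids  = UInt16[2, 1, 2]                          # CIDs per k-mer (V1.3)
offsets = UInt32[1; 1 .+ cumsum(UInt32.(n_cids))]  # -> UInt32[1, 3, 4, 6] (V1.2)

# Round-trip back, as in the compat formula: n_cids = UInt16.(diff(offsets))
@assert UInt16.(diff(offsets)) == n_cids

# k-mer i's slice of flat_cids is still offsets[i]:offsets[i+1]-1
@assert offsets[2]:offsets[3]-1 == 3:3   # k-mer 2 owns exactly one CID slot
```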
Concern: doesn’t that make finding a k-mer’s count slower?
Cumulative sums in Julia are extremely fast: at 100 samples, lookup speed is virtually unchanged at 1.56e7 k-mers per second.
Old flat logic split into three clearly scoped private functions:
function Base.push!(kct::NeoKCT, sample_hashtable::Dict{UInt64, UInt32})
shared_words = find_shared_words(kct)
offsets = _kmer_offsets(kct.n_cids) # O(n_kmers), computed once
ext_buf = Dict{Int, Vector{UInt32}}() # buffer new CIDs for existing k-mers
new_seqs = UInt64[]
new_cids = Vector{UInt32}[]
for (k_bits, count) in sample_hashtable
k_pos = findfirst(kct, ...)
if k_pos == 0
push!(new_seqs, k_bits)
push!(new_cids, _push_new_kmer_counts!(kct.counts, kct.samples.x, count))
else
_push_existing_kmer_counts!(kct, ext_buf, k_pos, shared_words, count, offsets)
end
end
_merge_and_sort!(kct, ext_buf, new_seqs, new_cids, offsets)
end

ext_buf collects new CIDs for existing k-mers during the scan.
_merge_and_sort! does a single parallel merge-sort pass at the end via psortperm.
_push_new_kmer_counts!

# k-mer seen for the first time: backfill zeros for all previous samples, then add count
function _push_new_kmer_counts!(counts::PackedArray{UInt32, W}, prev_samples::Int, count::UInt32) where {W}
wid = new_word!(counts)
chunk_ids = UInt32[wid]
for _ in 1:prev_samples # fill missing sample slots with 0
new_wid = push!(counts, W(0), wid)
if new_wid != wid
push!(chunk_ids, UInt32(new_wid))
wid = new_wid
end
end
new_wid = push!(counts, W(count & typemax(W)), wid)
new_wid != wid && push!(chunk_ids, UInt32(new_wid))
return chunk_ids
end

Returns the vector of PackedArray word IDs for this k-mer, appended directly to flat_cids. Note that we don’t store “trailing” zeros: they are implied and recovered from the sample count.
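The zero convention can be sketched with a hypothetical helper: leading zeros (samples before the k-mer was first seen) are stored explicitly, trailing zeros are implied by the total sample count.

```julia
# Sketch of the zero convention: `stored` holds the counts actually written
# for one k-mer (backfilled leading zeros + observed counts); trailing zeros
# are not stored and are padded back in at read time.
function per_sample_counts(stored::Vector{Int}, n_samples::Int)
    @assert length(stored) <= n_samples
    return vcat(stored, zeros(Int, n_samples - length(stored)))  # implied zeros
end

# k-mer first seen at sample 3 of 6, then absent from samples 5-6:
stored = [0, 0, 7, 2]           # two backfilled zeros + two real counts
per_sample_counts(stored, 6)    # -> [0, 0, 7, 2, 0, 0]
```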
K-mer sequences are now the biggest component of the table, and with the CSR no longer able to overflow, we’re set to tackle them next.
CSR Table
Note that at 80 samples we switch to OV files, which are much bigger.
Count Table
Query Speed
Growth that somewhat follows the growth of k-mers in the table.
K-mer Cardinality
Expected a more sub-linear growth of k-mers (even before the switch to OV cancer files)
Build time is still substantial, but GC time is now a static ~2% instead of growing over time due to the heap mess.
LabRegistry · Slurm Utils
lemieux-lab/LabRegistry: two packages are now registered:
NMacros: macro-based CairoMakie plotting
@nscatter @nlines @nhist
@nboxplot @nheatmap @nhexbin
JuBox is deprecated. Migrate now if you depend on it.
Sent to Patrick – will be cluster-wide once deployed.
# Preset submission
micro_sbatch job python analyze.py # 2 CPU, 25G
kilo_sbatch job ./train.sh # 10 CPU, 50G
mega_sbatch job nextflow run nf # 20 CPU, 100G
giga_sbatch job ./pipeline.sh # 50 CPU, 500G
tera_sbatch job ./huge_job.sh # 100 CPU, 1TB
# Custom
var_sbatch job 48 384G bwa mem ref.fa reads.fastq

| | Detail |
|---|---|
| Delta-encoded seqs | seqs is sorted, so consecutive entries are close in value; storing deltas instead of full UInt64s should compress substantially. In progress |
| Full cohort benchmarks | V1.3 on TCGA + GTEx |
| API stabilization | README flags it unstable |
| Peptide join | Connect KCT output to aeTSA discovery pipeline |
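The delta-encoding idea from the roadmap, sketched (illustrative only; the in-progress NeoKCT implementation may differ):

```julia
# seqs is sorted, so gaps between consecutive UInt64 values are small:
# store the first value plus the gaps, which need far fewer bits than 64.
deltas(seqs::Vector{UInt64}) = [seqs[1]; diff(seqs)]
undeltas(d::Vector{UInt64})  = cumsum(d)   # exact inverse for sorted input

seqs = UInt64[1_000, 1_003, 1_010, 1_500]
d = deltas(seqs)                    # -> [1000, 3, 7, 490]: small gaps dominate
@assert undeltas(d) == seqs
ndigits(maximum(d[2:end]), base=2)  # 9 bits instead of 64 for the gaps
```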
github.com/lemieux-lab/NeoKCT
Lab Meeting · Nicolas Jacquin