The fastest open-source foundation model for DNA.

Today we're releasing Carbon — three model sizes (500M, 3B, and 8B parameters), shipping with the full training code, the data pipeline, and the model weights. All open-source on the Hugging Face Hub.

Fig · Benchmark Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.

Figure: throughput in base pairs per second (log scale, x-axis) against win rate in % (y-axis) for Evo2 1B, Evo2 7B, Evo2 20B, GENERator-v2 1.2B, GENERator-v2 3B, Carbon 3B, and Carbon 8B. Carbon 3B and 8B sit at roughly 275 times the throughput of Arc Evo2 7B at comparable or better win rates.
Background

What Carbon reads

The model is fed long strings of four letters: A, C, G, T. Those letters are the bases of DNA. Stretches of it are genes, which cells copy into RNA and translate into proteins. A century of molecular biology has been spent working out how. Carbon is given only the letters. What they mean is what it has to learn.

§1 · Bases
A four-letter alphabet

DNA is written in four small molecules: adenine, cytosine, guanine, thymine. Two are purines (A and G, twin-ring), two are pyrimidines (C and T, single-ring). Everything that follows is built from these four.

A adenine
C cytosine
G guanine
T thymine
§2 · DNA
The double helix

Each base hangs off a sugar-phosphate backbone. Two backbones run anti-parallel and twist into a double helix. The bases on opposite strands pair by chemistry: A always with T, G always with C, so one strand fully determines the other. A human genome is about 3 billion base pairs of this.

AT
2 H bonds
GC
3 H bonds
complementary base pairing
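
As a minimal sketch of "one strand fully determines the other", the pairing rule can be written as a lookup followed by a reversal (the strands run anti-parallel). The function name and the example strand are illustrative, not part of any Carbon tooling.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return strand.translate(COMPLEMENT)[::-1]

top = "ATGGCCGAACTG"
bottom = reverse_complement(top)
print(top, bottom)
```

Applying the function twice returns the original strand, which is the sense in which either strand carries the full information.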
§3 · Gene
Promoter, exons, introns

A gene is a stretch of DNA that the cell turns into protein. Most of the genome is not. Each gene begins with a promoter, where the cell starts reading. What follows is broken into two kinds of segment: exons, which the cell keeps, and introns, which it splices out and which often serve regulatory purposes.

TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT
Legend promoter exon intron
§4 · RNA
Splicing into the working copy

The cell copies the gene into RNA. Then it splices out the introns and joins the exons together. What's left is the working mRNA: just the exons, in order. (T is rewritten as U along the way: a small alphabet quirk between DNA and RNA.)

TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT
AUGGCCGAACUGCCCGGGUGGUUCAGCCGU
Legend promoter exon intron
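
The splicing step can be sketched on the toy gene shown here. The exon coordinates below are an assumption: they were inferred by lining up the DNA with the mRNA in the figure, not read from an annotation file, and the function name is illustrative.

```python
# Toy gene from the figure. Exon coordinates are inferred by comparing
# the DNA with the mRNA shown (an assumption, not a real annotation).
GENE = "TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT"
EXONS = [(6, 18), (30, 42), (54, 60)]  # 0-based, half-open intervals

def splice(dna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exons in order and rewrite T as U to get the mRNA."""
    return "".join(dna[start:end] for start, end in exons).replace("T", "U")

mrna = splice(GENE, EXONS)
print(mrna)
```

In this toy example both skipped stretches start with GT and end with AG, the pattern most real introns follow. The sketch skips the intermediate full-length RNA copy and goes straight to the spliced product.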
§5 · Protein
From chain to function

Every three RNA letters (a codon) encode one amino acid. There are only 20 amino acids in the standard alphabet; every protein in nature is built from this same set. The chain then folds into a 3D shape, and that shape is the function: hemoglobin · insulin · collagen · antibodies · enzymes.

mRNA AUGGCCGAACUGCCCGGGUGGUUCAGCCGU amino acids MAELPGWFSR MetAlaGluLeuProGlyTrpPheSerArg
fold
Human hemoglobin
the molecule that carries oxygen in your blood
4 chains · PDB 1A3N
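
The codon-to-amino-acid step described in this section can be sketched with the standard genetic code. The compact table layout (codons enumerated in U, C, A, G order) is a common convention; the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Translate an mRNA reading frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

protein = translate("AUGGCCGAACUGCCCGGGUGGUUCAGCCGU")
print(protein)
```

Run on the mRNA from the example, this reproduces the ten-residue chain shown above.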
§6 · Applications
What can the model do in the real world?

A model that understands and writes DNA is useful wherever DNA is the input or the output. There are three interesting use-cases for such models: tuning the genetics of the food we grow, designing the regulatory and coding sequences that drive biomanufacturing, and helping interpret the variants that show up in clinical sequencing.

Biotechnology · precision breeding
Crops and livestock

Map genotype to phenotype across crops and livestock: surface the variants that drive yield, quality, disease and pest resistance, and tolerance to drought, heat, cold, or salinity, so breeders can select for them directly.

Synthetic biology · biomanufacturing
Designing what cells express, and how

Design and tune promoters, enhancers, UTRs, and terminators to control expression strength, tissue specificity, timing, and inducibility. The same machinery powers codon optimization and host-specific engineering, letting microbial strains turn out enzymes, chemicals, fuels, antibiotics, and natural products more efficiently.

Biomedicine · diagnosis and personalized medicine
Triaging variants, designing therapies

Help prioritize the variants of uncertain significance that crowd clinical sequencing in rare disease and cancer, where it's often unclear whether a DNA change is actually driving the phenotype. Further out, support patient-tailored therapeutic design: mRNA vaccines, therapeutic proteins, enzymes, and antimicrobial peptides, with expression efficiency, stability, and manufacturability in the loop.

Intro

Carbon-3B is a 3-billion-parameter language model for DNA. It is trained on roughly 1 trillion tokens (6 trillion base pairs) of genomic sequence with a simple objective: given some DNA, predict what comes next, autoregressively and six bases at a time. The objective is simple, but the resulting model is versatile. In the DNA lab you can explore what a DNA model can do.

Carbon-3B was trained without supervision, apart from some simple tags for species and gene biotype. It wasn't trained to tell which mutations are pathogenic or how genes differ between species. The sections below highlight what it picked up anyway: autocomplete a gene §1, see structure emerge in its confidence §2, score a disease variant against a healthy one §3, recognise a gene's species of origin §4, and then push further into folded protein structure §5, the embedding manifold §6, and the species tree §7. Each demo runs against the public HuggingFaceBio/Carbon-3B checkpoint behind a live inference endpoint.

§1 · Autocomplete
Autocomplete for the genome

Same idea as GPT completing a sentence, but for DNA. We feed the model a DNA sequence and it streams back a continuation one 6-base token at a time. Carbon predicts a gene's exons better than its introns: exons are the protein-coding parts of a gene and sit under strong evolutionary constraint, so they should be the most predictable stretches of DNA. Introns, which often serve regulatory purposes, are harder to predict. We overlay the real exon/intron annotations on the output so you can compare what Carbon produces to what's actually there.

gene
exon intron prompt → generated
pick a gene and hit generate
model output · prompt in gray · generated colored by logprob (red = uncertain) · _ match · _ mismatch
identity·
in-exon·
in-intron·
tokens·
mean logprob·
perplexity·

Try it Drag the dark ▼ ▲ markers to slide the prompt window and the green ▼ to set where generation stops, then hit ▶ generate. Land the green-shaded region inside an exon (dark green block) and note the count of green-underlined matches; repeat with a similar-length window over an intron and compare.

What to look for Exons are under selection pressure, so getting them right takes real biological understanding, not just DNA statistics. Boundaries between high- and low-confidence stretches in Carbon's output also tend to fall near real exon/intron edges, even though the model has never seen a single annotation.
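
The identity and perplexity readouts can be sketched as follows. These are the usual definitions and an assumption about what the demo reports, since the page does not spell them out; the function names are illustrative.

```python
import math

def identity(generated: str, reference: str) -> float:
    """Fraction of aligned positions where the two sequences share a base."""
    n = min(len(generated), len(reference))
    if n == 0:
        return 0.0
    return sum(g == r for g, r in zip(generated, reference)) / n

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

ident = identity("ATGGCC", "ATGGTC")                  # one mismatch in six
ppl = perplexity([math.log(0.5), math.log(0.25)])     # made-up token scores
print(ident, ppl)
```

Position-by-position identity assumes the continuation and the reference stay in register; an insertion or deletion would need a proper alignment instead.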

Run this from code
from huggingface_hub import get_token
from openai import OpenAI

# Carbon-3B can be served behind any OpenAI-compatible API (vLLM, TGI, an
# HF inference endpoint, etc.). Point base_url at your deployment.
client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# First ~60 bp of HBB. Replace with whatever gene opening you want.
prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"

r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=10,        # 10 6-mer tokens ~= 60 bp of continuation
    temperature=0.5, top_p=0.9,
)
print(r.choices[0].text)

# Alternative: the same completion, run locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=10,        # ~60 bp at 6 bp / token
        temperature=0.5, top_p=0.9, do_sample=True,
    )

# Slice off the prompt so we just print the continuation.
new_ids = out[0, inputs["input_ids"].shape[1]:]
print(tok.decode(new_ids))
§2 · Structure
Recognizing gene structure

Carbon assigns every 6-base chunk a log-probability given the context that precedes it: how "expected" or "likely" that stretch of DNA is. Plotted along a real gene, the score dips and rises. We overlay the exon/intron annotation on top: confidence reliably climbs in protein-coding regions and falls in repetitive or unconstrained intronic stretches, even though the model never saw a single label. The same score, summed over a window, is what powers the variant-effect call in §3 below.

gene
exon (shaded) y-axis: log P per 6-bp token (higher = more confident) 0 bp
mean (exon)·
mean (intron)·
Δ (exon − intron)·
tokens·
mean (overall)·

Try it Pick a gene and watch its per-token confidence curve. Each gene's exons are highlighted in green; the curve underneath is Carbon's log-probability for each 6-base token along the sequence.

What to look for Exons, especially the protein-coding portions, tend to score noticeably higher than introns because they're evolutionarily conserved and full of constrained patterns the model has learned to predict. The Δ tells you how strongly Carbon "noticed" the difference for this gene. Keep this curve in mind for §3: a variant that flips a base inside a high-confidence exon stretch is the kind of edit that should make Carbon surprised.
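
A minimal sketch of how the Δ (exon − intron) readout could be computed from per-token scores. The rule for assigning a token to an exon (by its first base) is a simplifying assumption, and the scores in the example are made up.

```python
from statistics import fmean

TOKEN_BP = 6  # bases per token, per the 6-mer tokenizer

def exon_intron_delta(token_logprobs, exons):
    """Mean log-prob of exon tokens minus mean log-prob of intron tokens.

    token_logprobs: one value per 6-bp token, in sequence order.
    exons: 0-based half-open (start, end) intervals in base pairs.
    A token counts as exonic when its first base lies inside an exon;
    tokens straddling a boundary are not split. Both groups must be
    non-empty, otherwise fmean raises StatisticsError.
    """
    exon_lp, intron_lp = [], []
    for i, lp in enumerate(token_logprobs):
        pos = i * TOKEN_BP
        in_exon = any(start <= pos < end for start, end in exons)
        (exon_lp if in_exon else intron_lp).append(lp)
    return fmean(exon_lp) - fmean(intron_lp)

# Made-up scores for a 30 bp stretch whose first 12 bp are exonic.
delta = exon_intron_delta([-1.0, -2.0, -6.0, -7.0, -8.0], [(0, 12)])
print(delta)
```

A positive Δ means the exon tokens were, on average, more expected than the intron tokens.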

Run this from code
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# Echoed scoring: forward-pass the prompt and return per-token logprobs
# (no generation). The score per 6-mer chunk is what the per-base
# confidence track is built from.
# gene_sequence is a placeholder: supply your own string of A/C/G/T.
prompt = "<dna>" + gene_sequence    # full gene, up to ~32k tokens

r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=0, echo=True, logprobs=1, temperature=0,
)

for tok, lp in zip(r.choices[0].logprobs.tokens,
                   r.choices[0].logprobs.token_logprobs):
    print(f"{tok}\t{lp}")

# Alternative: the same per-token scoring, run locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

ids = tok("<dna>" + gene_sequence, return_tensors="pt",
          add_special_tokens=False).input_ids.to("cuda")

with torch.inference_mode():
    logits = model(ids).logits

# Per-token log-prob of the actual next token (the standard "echo" pattern).
logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
per_tok_lp = logp.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()),
                 per_tok_lp.tolist()):
    print(f"{t}\t{lp:.3f}")
§3 · Variant effect
Predicting mutation effects

§2 showed that Carbon's per-token confidence rises and falls in step with gene structure. Now we use the same log-likelihood to score individual mutations. For a real ClinVar variant we score a ~4 kb window of human DNA two ways: once with the original base, once with the mutation. Then we check which version looks more like real, functioning human sequence. Carbon was never trained on what "pathogenic" means; it just learned what natural DNA looks like. Variants that disrupt protein-coding or regulatory function show up as less likely sequence under the model's distribution.

variant

Try it Pick a known variant from the pills, then click any base in the mutation row to introduce a different change. The model re-scores on every edit.

What to look for Read each row two ways: the dot color is what ClinVar says (red = pathogenic, orange = risk, green = benign); the bar direction is what Carbon says (red bar pointing left = mutation less likely than original; charcoal bar pointing right = mutation looks fine or more likely). Watch the two VHL rows for the cleanest demonstration: a premature stop codon (c.475A>T) swings the bar hundreds of nats to the left, while a common 3' UTR variant (c.*820A>G) in the very same gene sits at zero. Same model, same window length, opposite verdicts. Carbon learned the distinction from raw sequence alone, with no labels.

Run this from code
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def score_sum(seq):
    """Sum of per-token log-probs for the given DNA sequence."""
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + seq,
        max_tokens=0, echo=True, logprobs=1, temperature=0,
    )
    return sum(lp for lp in r.choices[0].logprobs.token_logprobs if lp is not None)

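# Score the same ~4 kb window two ways: original vs the one-base mutation.
# ref_seq and var_seq are placeholders: supply the window without and with
# the mutation as plain A/C/G/T strings.
delta = score_sum(var_seq) - score_sum(ref_seq)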
print(f"delta = {delta:+.2f}  (less likely if negative)")

# Alternative: the same delta, computed locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

def score_sum(seq):
    ids = tok("<dna>" + seq, return_tensors="pt",
              add_special_tokens=False).input_ids.to("cuda")
    with torch.inference_mode():
        logits = model(ids).logits
    logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
    return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

delta = score_sum(var_seq) - score_sum(ref_seq)
print(f"delta = {delta:+.2f}  (less likely if negative)")
§4 · Species
Species-specific generation

The same gene (insulin, p53) exists in humans, mice, and chickens, but the surrounding sequence has accumulated different mutations along each lineage for hundreds of millions of years. For each species we feed Carbon up to ~400 bp and ask it to continue. Each continuation should match that species' real DNA better than another species' would. The model handles these species well (mouse, and even chicken, which is ~300 My from human); the further you go back in evolutionary time, the more the surrounding sequence drifts and the harder this setup becomes.

gene prefix generate
prompt in gray generated colored by logprob mismatches in reference highlighted

Try it Pick a gene shared across species, set the prefix length, then hit run all to score every species in parallel. Try the same gene at prefix 200 vs 400 and watch the per-species identity respond.

What to look for With 400 bp of context the model usually recognises which species' DNA it's been given and continues in that species' style; identity to that species' reference often runs 65–90% on the next 60 bp. Cut the prefix to 200 and the signal collapses to near-random: a few hundred bases is what it takes to "lock in" on a lineage. The gap between mouse and chicken is where you can read the evolutionary signal: 300+ My since the last common ancestor is enough drift that a 400 bp prefix still locks Carbon in, but the per-base identity sits a notch below mouse.

Run this from code
from huggingface_hub import get_token
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def continue_species(species_prefix):
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + species_prefix,
        max_tokens=10,
        temperature=0.5, top_p=0.9,
    )
    return r.choices[0].text

# species_prefixes = { "human": ..., "mouse": ..., "chicken": ... }
with ThreadPoolExecutor() as pool:
    results = dict(zip(species_prefixes, pool.map(continue_species, species_prefixes.values())))

for name, cont in results.items():
    print(f"{name:10s}  {cont}")

# Alternative: the same continuations, batched locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()
tok.padding_side = "left"
if tok.pad_token is None: tok.pad_token = tok.eos_token

# Batch all species in one forward pass via left-padding.
prompts = ["<dna>" + p for p in species_prefixes.values()]
enc = tok(prompts, return_tensors="pt", padding=True, add_special_tokens=False).to("cuda")

with torch.inference_mode():
    out = model.generate(
        **enc, max_new_tokens=10,
        temperature=0.5, top_p=0.9, do_sample=True,
    )
new_ids = out[:, enc["input_ids"].shape[1]:]
for name, ids in zip(species_prefixes, new_ids):
    print(f"{name:10s}  {tok.decode(ids)}")
§5 · Folding
From DNA to proteins

When Carbon completes a protein-coding region of a gene, the resulting bases translate to a protein, and that protein folds. We feed the translated sequence into ESMFold (a structure predictor similar to AlphaFold) and render the 3D structure inline, alongside the same protein folded from the reference sequence, so you can see whether Carbon's continuation produced something similar.

gene
·
carbon · aa
click fold
reference · aa
·
mismatches vs reference aligned position by position
carbon completion
no structure yet
reference
no structure yet
pLDDT low → high · drag to rotate
residues·
pLDDT mean (carbon)·
pLDDT mean (ref)·
identity (1D)·
What to look for A high pLDDT means ESMFold is confident in the predicted structure at that residue. The interesting case is when Carbon's completion diverges at the base level — sometimes drastically, like CFTR at ~22% identity — but still folds with high confidence into a shape that mirrors the reference backbone. That's the model reaching past memorization for the structural grammar underneath the sequence.
§6 · Embedding space
Mapping out genomes

We embed 571,810 genes from 27 species across six kingdoms (vertebrates, invertebrates, plants, fungi, bacteria, viruses) with Carbon, project them to 2D with UMAP, and color the points by attribute. Depending on the attribute, different organizations emerge from the same points: the model's embedding space encodes multiple axes of biology at once, most of which were never labeled.

color by
highlights

loading 571K points · ~5.8 MB gzipped
points·
species·
embedding dim · 3072
render·
drag to pan · wheel to zoom · hover for details
What to look for Switch coloring from species to biotype: same points, completely different organization emerges. The macro-clusters trace six kingdoms (vertebrates, invertebrates, plants, fungi, bacteria, viruses), discovered from raw sequence alone. Switch again to gc content and a perpendicular axis appears: AT-rich (cool blue) vs GC-rich (warm amber) regions cut across the species clusters, revealing the composition gradient the model has internalised. Points: 571,810 real Carbon 3B embeddings, projected to 2D via UMAP.
§7 · Species tree
Reconstructing the tree of life

If we take the same 571,810 sequences from §6 and average each species' embeddings into a single 3072-dim vector, then cluster those 27 centroids with hierarchical clustering, we can find species the model regards as closely related. Carbon was never trained on what the relation between organisms is. Yet the resulting tree groups vertebrates together, separates bacteria from fungi, and pairs sister clades (primates with primates, rodents with rodents, monocots with monocots).
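
A minimal sketch of the centroid step, assuming each gene embedding is a plain vector tagged with its species. It stops at cosine nearest neighbours (the quantity the widget compares against NCBI groups) rather than building the full Ward or UPGMA tree, and the 3-dim toy vectors stand in for real 3072-dim Carbon embeddings.

```python
import math
from collections import defaultdict

def centroids(embeddings, species):
    """Average each species' embedding vectors into one centroid."""
    groups = defaultdict(list)
    for vec, name in zip(embeddings, species):
        groups[name].append(vec)
    return {name: [sum(col) / len(vecs) for col in zip(*vecs)]
            for name, vecs in groups.items()}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def nearest_neighbour(cents):
    """For every species, the other species with the closest centroid."""
    return {a: min((b for b in cents if b != a),
                   key=lambda b: cosine_distance(cents[a], cents[b]))
            for a in cents}

# Toy vectors (made up): two "human", one "mouse", two "yeast" genes.
emb = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.8, 0.3, 0.1],
       [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
names = ["human", "human", "mouse", "yeast", "yeast"]
nn = nearest_neighbour(centroids(emb, names))
print(nn)
```

With real embeddings, the same centroid matrix could be handed to a hierarchical clustering routine to obtain the tree itself.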

linkage vs ncbi
· ·
match · ncbi kingdom
hover a row to see its top neighbours · toggle linkage / scope above
cosine distance ←
vertebrates invertebrates plants fungi bacteria viruses nearest carbon neighbour shares the ncbi group doesn't ·solo (no ncbi sibling in the dataset)
species·
sequences·
embedding dim · 3072
distance · cosine
What to look for Toggle kingdom-level vs sister-level. At the kingdom scale the embedding is nearly perfect: vertebrates cluster with vertebrates, bacteria with bacteria. At the sister scale (primate-with-primate, etc.) agreement is lower because distances inside a kingdom are extremely tight (~0.0001) and the strict nearest neighbour bounces around; the model nails the broad strokes but blurs the fine branches at this resolution. Switch linkage from Ward to UPGMA to see how much of the structure is method-independent. Tree built from species centroids of mean-pooled Carbon-3B embeddings.
Intro

Carbon's architecture is deliberately vanilla. What's not vanilla, and what gets the headline numbers in the DNA Lab tab, is three things: a 6-mer tokenizer that lets the model see ~6× more genomic context per forward pass, a Factorized Nucleotide Supervision (FNS) loss that gives the model partial credit for near-miss tokens once cross-entropy training starts to wobble, and a multi-stage curated data mixture, biased toward functional genomic regions. Everything else (architecture, optimizer) is standard recipe. The technical report details each choice and the ablations behind it.

The sections below walk through each of those choices: how the tokenizer changes what a "token" means in DNA §1, how FNS rescues training in the BF16 regime §2, how bp-level generation and scoring fall out of the same marginalization §3, what's in the training corpus §4, what the architecture looks like §5, how 8k-token pretraining reaches 786 kbp at inference §6, how Carbon stacks up against Evo2-7B and GENERator-v2 on the full training-free suite §7, and why the model runs so fast §8.

§1 · Tokenizer
Read DNA in 6-base chunks

The most direct way to model DNA is one base per token. It works, but for an L-base sequence Transformer attention costs O(L²), and DNA contexts are long. Carbon instead reads in fixed 6-base blocks. Same DNA span, ⅙ the tokens, and because attention is quadratic, up to 36× cheaper at the same coverage. BPE was a tempting middle ground, but its variable-length tokens collide badly with autoregressive next-token prediction: DNA doesn't have stable "words."
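
The token-count and attention arithmetic can be checked with a short sketch. The chunking function is illustrative only: how Carbon's real tokenizer treats a trailing partial chunk or special tokens is not covered here.

```python
K = 6  # bases per token

def kmer_tokens(dna: str, k: int = K) -> list[str]:
    """Split DNA into non-overlapping k-base chunks (a trailing partial chunk is kept)."""
    return [dna[i:i + k] for i in range(0, len(dna), k)]

dna = "ATGGCCGAACTGCCCGGGTGGTTCAGCCGT"   # 30 bp
one_mer = kmer_tokens(dna, 1)
six_mer = kmer_tokens(dna)

# Self-attention compares every token with every other, so cost grows as n^2.
ratio = (len(one_mer) ** 2) / (len(six_mer) ** 2)
vocab_size = 4 ** K
print(len(one_mer), len(six_mer), ratio, vocab_size)
```

The price of the shorter sequence is a larger vocabulary: 4 symbols become 4⁶ = 4,096 possible 6-mers.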

type DNA 30 bp
1-mer · one token per base
6-mer (carbon) · one token per 6 bases
1-mer tokens·
1-mer attention·
1-mer vocab · 4
6-mer tokens·
6-mer attention·
6-mer vocab · 4,096
same DNA span ▼ shorter token sequence = cheaper attention 36× cheaper
Why not BPE BPE works for English because words have stable boundaries. DNA motifs don't: the TATA box is a family of patterns (TATATA, TATATT, …), not a single string. Worse, in autoregressive mode, BPE penalizes the model for predicting a valid prefix of the target token. 6-mer is a deterministic, neutral compression that avoids this trap.
§2 · Training objective
Partial credit for near-misses

Cross-entropy treats every 6-mer token as atomic: predict TATATT when the target was TATATA, get zero credit even though five of six bases matched. That gets brittle late in training. Carbon switches to Factorized Nucleotide Supervision: instead of one 4096-way classification, the model is supervised on six parallel 4-way nucleotide marginals derived from the same logits. Near-miss tokens get partial credit proportional to how many bases they got right.

target 6-mer
What the switch buys you CE first: the model learns the joint structure of bases inside each 6-mer (codon constraints, splice signals, motif composition). FNS later: when CE turns brittle (the "loss staircase" appears and BF16 inference starts diverging from FP32), FNS smooths the objective and restores numerical robustness without giving up the joint prior CE built.
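
A toy sketch of the partial-credit idea, under stated assumptions: the loss form below (a plain sum of six per-base cross-entropies on the marginals) is a guess at the general shape, not the exact FNS formula from the technical report, and the two-token "predictions" are made up.

```python
import math

BASES = "ACGT"

def base_marginals(kmer_probs: dict[str, float], k: int = 6):
    """Per-position base distributions from a distribution over k-mers."""
    marg = [dict.fromkeys(BASES, 0.0) for _ in range(k)]
    for kmer, p in kmer_probs.items():
        for pos, base in enumerate(kmer):
            marg[pos][base] += p
    return marg

def ce_loss(kmer_probs, target):
    """Token-level cross-entropy: only the exact 6-mer earns credit."""
    return -math.log(kmer_probs.get(target, 0.0) + 1e-12)

def fns_loss(kmer_probs, target):
    """Sum of per-base cross-entropies on the marginals (assumed form)."""
    marg = base_marginals(kmer_probs, len(target))
    return -sum(math.log(marg[pos][base] + 1e-12)
                for pos, base in enumerate(target))

target = "TATATA"
near = {"TATATT": 0.9, "TATATA": 0.1}   # confident in a 5-of-6 near-miss
far = {"CGCGCG": 0.9, "TATATA": 0.1}    # confident in a 0-of-6 miss

print(ce_loss(near, target), ce_loss(far, target))    # identical under CE
print(fns_loss(near, target), fns_loss(far, target))  # near-miss is cheaper
```

Cross-entropy cannot tell the two predictions apart because both give the target the same probability; the per-base view penalizes the near-miss at one position and the far miss at all six.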
§3 · BP-level inference
Bases, not 6-mers

The 6-mer tokenizer makes Carbon fast, but it's coarse in both directions of inference. When generating, each step advances the sequence by 6 bases at once and temperature acts on a 4,096-way distribution rather than per nucleotide. When scoring an existing sequence, the raw next-token likelihood answers "how likely is this 6-mer in context?", not "how likely is this exact base at this exact position?", which is the version you want for variant-effect prediction. The same marginalization that powers FNS at training time fixes both: softmax over the 6-mer logits, then for each position p sum the probabilities of every 6-mer that shares a given base at p, and you recover six per-position 4-way base distributions. To generate, sample (or argmax) each independently and force the matching 6-mer token. To score, read P(actual base | context) directly off the marginals at every position. Same logits, same math, two endpoints.

per-step pipeline · 4,096-way 6-mer logits → 6 × 4-way base marginals → reassembled token
step 1 · softmax over 4,096 DNA tokens
▼   sum over 6-mers sharing a base at position p
step 2 · six 4-way per-base distributions
pos 1 · pos 2 · pos 3 · pos 4 · pos 5 · pos 6 (each a 4-way distribution over A, T, C, G)
▼   same marginals feed two endpoints: generate (force a token) or score (read off P(base))
step 3a · generate
ACGTAT
argmax / multinomial → force matching 6-mer token
step 3b · score
.83 .71 .92 .67 .48 .79
read P(actual base | context) at each position
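
The three steps of the pipeline can be sketched in plain Python. This illustrates the marginalization only and is not the shipped implementation: the vocabulary ordering is an assumption (the real tokenizer's ids may differ and include special tokens), and the logits are toy values.

```python
import math
from itertools import product

BASES = "ACGT"
VOCAB = ["".join(p) for p in product(BASES, repeat=6)]  # all 4,096 6-mers

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def base_marginals(logits):
    """Steps 1 and 2: softmax over 6-mers, then six 4-way base distributions."""
    marg = [dict.fromkeys(BASES, 0.0) for _ in range(6)]
    for kmer, p in zip(VOCAB, softmax(logits)):
        for pos, base in enumerate(kmer):
            marg[pos][base] += p
    return marg

def generate_token(marg):
    """Step 3a: argmax each position independently, reassemble the 6-mer."""
    return "".join(max(dist, key=dist.get) for dist in marg)

def score_token(marg, observed):
    """Step 3b: P(observed base | context) at each of the six positions."""
    return [marg[pos][base] for pos, base in enumerate(observed)]

# Toy logits: strongly favour ACGTAT, mildly favour ACGTAA.
logits = [0.0] * len(VOCAB)
logits[VOCAB.index("ACGTAT")] = 10.0
logits[VOCAB.index("ACGTAA")] = 8.0

marg = base_marginals(logits)
token = generate_token(marg)
print(token, [round(p, 2) for p in score_token(marg, "ACGTAA")])
```

Swapping the argmax for a multinomial draw per position gives the sampling variant, with temperature applied to each 4-way distribution instead of the 4,096-way one.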
When to switch on bp-level Use plain 6-mer decoding when 6-base granularity is fine: throughput-bound generation, long retrieval haystacks, large-scale screening. Reach for bp-level generation when you need exact base counts, per-position masks, or temperature applied at the base axis rather than the 4,096-way 6-mer axis. Reach for bp-level scoring whenever the task is about a specific base: variant-effect prediction, single-nucleotide mutational scans, comparing the likelihood of a reference and an alternate allele at one position. Two complementary delivery paths: generation ships as a transformers custom_generate method at HuggingFaceBio/carbon-generate that works on the plain Carbon-3B/8B/500M checkpoints (standard LlamaForCausalLM, no custom modeling file). Scoring ships in the -remote variants of those same checkpoints, which add a score_sequence(seq) method that returns per-base distributions and the probability of the observed base at every position.
Run this from code
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    dtype=torch.bfloat16, device_map="auto",
)

prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# `custom_generate` injects a logits processor that marginalizes the
# 6-mer logits to per-base distributions and samples each of the 6
# positions independently, then forces the matching 6-mer token. All
# standard generation knobs (temperature, top_p, top_k, repetition_penalty)
# still apply, they just act on the per-base marginals.
out = model.generate(
    **inputs,
    max_new_tokens=128,         # 128 6-mer tokens ~= 768 bp of continuation
    custom_generate="HuggingFaceBio/carbon-generate",
    trust_remote_code=True,
    tokenizer=tok,
    do_sample=True, temperature=0.8, top_p=0.9,
)

# Slice off the prompt and decode the continuation as plain DNA.
new_ids = out[0, inputs["input_ids"].shape[1]:]
print(tok.decode(new_ids, skip_special_tokens=True))

# Separate example: bp-level scoring with the -remote checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, math

# The -remote variants bundle modeling code that exposes
# `score_sequence(seq)` directly on the model. It returns, for every
# position in the input DNA, the marginal P(base | context) and the
# probability of the observed base.
tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B-remote", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B-remote",
    trust_remote_code=True,
    dtype=torch.bfloat16, device_map="auto",
)

ref = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
alt = ref[:20] + "G" + ref[21:]          # single-base substitution at pos 20

# bp_probs: [seq_len, 4]   marginal P(A/T/C/G | context) at each position
# actual:   [seq_len]      P(observed base | context) at each position
bp_probs_ref, actual_ref = model.score_sequence(ref)
bp_probs_alt, actual_alt = model.score_sequence(alt)

# log-likelihood delta at the substituted position
# is the per-base variant-effect score in its simplest form.
delta = math.log(actual_alt[20].item() + 1e-12) \
      - math.log(actual_ref[20].item() + 1e-12)
print(f"log P(alt) - log P(ref) at pos 20: {delta:+.3f}")
§4 · Data
Genomes are mostly background

A naive read of "more data is better" misses something specific to DNA: most of a eukaryotic genome is repeats, low-complexity sequence, and weakly constrained background. Train on raw sequence and much of your loss is dominated by easy-to-predict noise. Carbon's corpus is an annotation-aware mixture, biased toward gene-centric, transcript, and bacterial sequence, so the model spends more of its gradient updates on biologically meaningful sequence.

corpus composition · 1T tokens (6T base pairs)
signal-to-noise · raw genome vs annotation-aware curation
functional / annotated background curating raises the density of biological signal in the gradient
metadata templates · the model sees mixed contexts so it works with or without labels
The signal-to-noise math If only 5% of a raw corpus is informative, but you keep 80% of informative regions while discarding 95% of background, the effective informative fraction jumps from 5% to ≈ 46%. Same training compute, ~9× more learning signal per gradient step.
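
The arithmetic behind those figures, as a small sketch (the function and parameter names are illustrative):

```python
def informative_fraction(raw_informative, keep_informative, keep_background):
    """Share of informative sequence left after filtering a corpus."""
    kept_signal = raw_informative * keep_informative
    kept_noise = (1.0 - raw_informative) * keep_background
    return kept_signal / (kept_signal + kept_noise)

# Numbers from the text: 5% informative, keep 80% of it,
# discard 95% of background (that is, keep 5% of it).
after = informative_fraction(0.05, 0.80, 1.0 - 0.95)
gain = after / 0.05
print(f"{after:.1%} informative after curation, {gain:.1f}x the raw 5%")
```

Kept signal is 0.05 × 0.80 = 0.04 of the raw corpus and kept background is 0.95 × 0.05 = 0.0475, so the informative share becomes 0.04 / 0.0875.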
§5 · Architecture
A deliberately vanilla transformer

Decoder-only, RMSNorm + SwiGLU + RoPE + grouped-query attention, tied I/O embeddings, 8k-token context. Nothing exotic. The architectural surface is intentionally familiar so that any improvement Carbon shows on genomic tasks is attributable to the data, the tokenizer, and the loss, not to a custom block or a hand-crafted attention variant.

vocabulary = 4,096 6-mer DNA tokens + small set of special / metadata tokens · total 155,776
Why this matters Architecture innovation is one of the cheapest things to claim and one of the hardest things to attribute. Carbon's results (competitive with Evo2-7B at 3B parameters, ahead of it on a majority of tasks at 8B) come from changes that aren't the architecture. That's where the room for genomic foundation models still is.
§6 · Long context
Pretrain at 8k, retrieve at 786 kbp

Carbon's nominal training context is short by megabase-scale standards (8k tokens, ≈49 kbp). The reach comes from a two-step extension. First, a training-time long-context phase lifts the context to 32k tokens (≈197 kbp) with RoPE θ rescaled from 500k to 5M. Then, at inference, YaRN pushes that further: 2× to 65k tokens for the 3B model, 4× to 131k tokens for the 8B (≈786 kbp, the size of a small bacterial genome). The 8B has more capacity to absorb the YaRN stretch, which is why it extends further than the 3B.
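
The token-to-base-pair conversions in this paragraph can be reproduced in a few lines. The sketch assumes "8k" and "32k" mean 8 × 1024 and 32 × 1024 tokens, which is what the quoted ≈49 kbp and ≈786 kbp figures imply.

```python
BP_PER_TOKEN = 6

def reach_bp(tokens: int) -> int:
    """Base pairs covered by a context of `tokens` 6-mer tokens."""
    return tokens * BP_PER_TOKEN

pretrain = 8 * 1024       # 8k-token pretraining context (assumed 8,192)
extended = 32 * 1024      # after the long-context training phase
yarn_3b = extended * 2    # YaRN 2x at inference (3B)
yarn_8b = extended * 4    # YaRN 4x at inference (8B)

for name, tokens in [("pretrain", pretrain), ("long-context phase", extended),
                     ("3B + YaRN 2x", yarn_3b), ("8B + YaRN 4x", yarn_8b)]:
    print(f"{name:20s} {tokens:>7,d} tokens  ~{reach_bp(tokens) / 1000:.0f} kbp")
```

The 3B figure (about 393 kbp) is the same length used for the Genome-NIAH comparison in §7.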

context length · log scale, base pairs of DNA reachable in a single forward pass
Genome-NIAH retrieval · plain variant · find a planted 24 bp value inside a real-genome haystack
Carbon 8B (YaRN) Carbon 3B (YaRN) Evo2-7B (native 1M) accuracy at exact-match retrieval, 500 samples per cell
The headline number At 786 kbp, Carbon-8B retrieves the planted needle at 65% accuracy. Evo2-7B, natively trained at 1M tokens of single-nucleotide context (≈8× more wall-clock per token), scores 53% at the same length. So a 6-mer model trained to 32k tokens plus YaRN-4× at inference reaches further than a 1M-native single-nucleotide model, which is the entire bet of the Carbon recipe: nominal context length is not the same as effective context utilization.
§7 · Results
Training-free, head-to-head

Eight training-free tasks across four capability axes: generative sequence recovery, variant-effect prediction (BRCA2, TraitGym, ClinVar coding / non-coding), sequence-level perturbation (synthetic motif insertion and synonymous codon shuffling), and long-context retrieval (Genome-NIAH at 393 kbp). No fine-tuning, no head training, all four frozen pretrained models scored under the same protocol. Carbon-3B is competitive with Evo2-7B despite less than half the parameters; Carbon-8B is ahead on five of eight.

Carbon 8B Carbon 3B Evo2-7B GENERator-v2 3B
How to read it Carbon-8B leads on sequence recovery, BRCA2, ClinVar non-coding, triplet expansion, and Genome-NIAH at 393 kbp. Evo2-7B holds onto TraitGym Mendelian (a hard non-coding variant set), and edges Carbon-8B on ClinVar coding and synonymous codon shuffling by a fraction of a point each — small enough to be effectively a tie. The pattern is broad rather than peaky: Carbon's gains come from data, tokenizer, and objective design, distributed across tasks, not from a single specialised benchmark.
§8 · Efficiency
Why Carbon is fast

The throughput story is a two-factor multiplication, not one big trick. First, the architecture is deliberately vanilla: a stock Llama-3-shaped decoder. That means Carbon drops straight into vLLM and inherits the same paged-attention, fused kernels, and CUDA-graph capture that the open-source LLM stack has been optimizing for two years. Custom blocks would forfeit all of that. Second, 6-mer tokenization compresses a given DNA span by 6× at the input, which under quadratic attention is up to a 36× reduction in prefill cost, and the decode loop emits 6 bases per step instead of one. Stacking the two: standard-stack inference speedups, multiplied by tokenizer compression, get you the order-of-magnitude gap over Evo2 reported in the paper.

Inference throughput · output bp/s · single H100
Legend Carbon-8B Carbon-3B Carbon-500M Evo2 1B Evo2 7B Evo2 20B Evo2 40B
Source · carbon-inference-evals vLLM for Carbon · Evo2 native runner
The compound effect Neither factor on its own would be a story. Vanilla architecture without 6-mer compression would land Carbon at roughly Llama-3 throughput: fine but not remarkable. 6-mer compression on a custom architecture would force a hand-rolled inference stack to keep up with vLLM. Doing both together is what makes a 3B-parameter DNA model usable for large-scale evaluation on commodity hardware.
Intro

Open-ended DNA continuation. Type any prefix in {A, C, G, T}, watch the model continue token by token. Toggle base-coloring or per-token logprob coloring to see where Carbon is confident and where it's guessing. Track GC content, perplexity, and throughput live.

§ Input

Prompt

DNA prefix in {A, C, G, T}: pick an example or type your own.

Connected to
examples
color
§ Output

Sequence

Streams as the model generates · live stats on the right.

prompt + generated bases will stream here
prompt · 0 bp
generated · 0 bp
tokens · 0
elapsed · 0.0 s
throughput · 0 bp/s
GC content·
mean logprob·
perplexity·
token logprob
···