Carbon

The fastest open-source foundation model for DNA.

Today we're releasing Carbon in three model sizes (500M, 3B, and 8B parameters), along with the full training code, the data pipeline, and the model weights. Everything is open source on the Hugging Face Hub.

Fig · Benchmark Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.

Background

What Carbon reads

The model is fed long strings of four letters: A, C, G, T. Those letters are the bases of DNA. Stretches of it are genes, which cells copy into RNA and translate into proteins. A century of molecular biology has been spent working out how. Carbon is given only the letters.

What they mean is what it has to learn.

§1 · Bases

A four-letter alphabet

DNA is written in four small molecules: adenine, cytosine, guanine, thymine. Two are purines (A and G, twin-ring), two are pyrimidines (C and T, single-ring). Everything that follows is built from these four.

A adenine

C cytosine

G guanine

T thymine

§2 · DNA

The double helix

Each base hangs off a sugar-phosphate backbone. Two backbones run anti-parallel and twist into a double helix. The bases on opposite strands pair by chemistry: A always with T, G always with C, so one strand fully determines the other. A human genome is about 3 billion base pairs of this.

A═T

2 H bonds

G≡C

3 H bonds

complementary base pairing
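
The pairing rule is simple enough to state in code. Below is a minimal Python sketch (the function names are ours, chosen for illustration) that derives one strand from the other:

```python
# Watson-Crick pairing from the text: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Base-by-base partner of a strand, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """The opposite strand read in its own direction (strands run anti-parallel)."""
    return complement(strand)[::-1]

top = "ATGGCC"
print(complement(top))          # TACCGG
print(reverse_complement(top))  # GGCCAT
```

Applying `reverse_complement` twice returns the original strand, which is the "one strand fully determines the other" property in practice.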

§3 · Gene

Promoter, exons, introns

A gene is a stretch of DNA that the cell turns into protein. Most of the genome is not. Each gene begins with a promoter, where the cell starts reading. What follows is broken into two kinds of segment: exons, which the cell keeps, and introns, which it splices out and which often serve regulatory purposes.

TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT

Legend promoter exon intron

§4 · RNA

Splicing into the working copy

The cell copies the gene into RNA. Then it splices out the introns and joins the exons together. What's left is the working mRNA: just the exons, in order. (T is rewritten as U along the way: a small alphabet quirk between DNA and RNA.)

TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT

AUGGCCGAACUGCCCGGGUGGUUCAGCCGU

Legend promoter exon intron
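
As a rough sketch of the two steps above, the snippet below transcribes the example gene and joins its exons. The exon coordinates are not stated in the text; they are inferred here by aligning the mRNA against the gene sequence shown above, and `splice` is a hypothetical helper name.

```python
GENE = "TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGGTGGTTCGTACGCCATTAGAGCCGT"

# Half-open [start, end) exon coordinates. Assumption: inferred by aligning
# the example mRNA against the example gene, not given explicitly in the text.
EXONS = [(6, 18), (30, 42), (54, 60)]

def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Transcribe DNA to RNA (T becomes U), then keep only the exons, in order."""
    pre_mrna = gene.replace("T", "U")
    return "".join(pre_mrna[start:end] for start, end in exons)

mrna = splice(GENE, EXONS)
print(mrna)  # AUGGCCGAACUGCCCGGGUGGUUCAGCCGU
```

In this toy gene both introns start with GT and end with AG, the most common splice-site signature in real genes.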

§5 · Protein

From chain to function

Every three RNA letters (a codon) encode one amino acid. There are only 20 amino acids in the standard alphabet; every protein in nature is built from this same set. The chain then folds into a 3D shape, and that shape is the function: hemoglobin · insulin · collagen · antibodies · enzymes.

mRNA · AUG GCC GAA CUG CCC GGG UGG UUC AGC CGU
↓
amino acids · Met Ala Glu Leu Pro Gly Trp Phe Ser Arg (MAELPGWFSR)
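
The codon-to-amino-acid mapping can be written out directly. This is a minimal sketch using the standard genetic code; `translate` is an illustrative helper, not part of Carbon.

```python
# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna: str) -> str:
    """Read an mRNA three letters at a time and stop at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCGAACUGCCCGGGUGGUUCAGCCGU"))  # MAELPGWFSR
```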

↓

fold

Human hemoglobin

the molecule that carries oxygen in your blood

4 chains · PDB 1A3N

§6 · Applications

What can the model do in the real world?

A model that understands and writes DNA is useful wherever DNA is the input or the output: tuning the genetics of the food we grow, designing the regulatory and coding sequences that drive biomanufacturing, and helping interpret the variants that show up in clinical sequencing.

Biotechnology · precision breeding

Crops and livestock

Map genotype to phenotype across crops and livestock: surface the variants that drive yield, quality, disease and pest resistance, and tolerance to drought, heat, cold, or salinity, so breeders can select for them directly.

Synthetic biology · biomanufacturing

Designing what cells express, and how

Design and tune promoters, enhancers, UTRs, and terminators to control expression strength, tissue specificity, timing, and inducibility. The same machinery powers codon optimization and host-specific engineering, letting microbial strains turn out enzymes, chemicals, fuels, antibiotics, and natural products more efficiently.

Biomedicine · diagnosis and personalized medicine

Triaging variants, designing therapies

Help prioritize the variants of uncertain significance that crowd clinical sequencing in rare disease and cancer, where it's often unclear whether a DNA change is actually driving the phenotype. Further out, support patient-tailored therapeutic design: mRNA vaccines, therapeutic proteins, enzymes, and antimicrobial peptides, with expression efficiency, stability, and manufacturability in the loop.

Intro

Carbon-3B is a 3-billion-parameter language model for DNA. It is trained on roughly 1 trillion tokens (6 trillion base pairs) of genomic sequence with a simple objective: given some DNA, predict what comes next (six bases at a time, autoregressively). Even though the objective is simple, the resulting model is versatile. In the DNA Lab you can explore what a DNA model like this can do.

Carbon-3B was trained without supervision, apart from simple tags for species and gene biotype. It wasn't trained to tell which mutations are pathogenic or how genes differ between species. The sections below highlight what it picked up anyway: autocomplete a gene §1, see structure emerge in its confidence §2, score a disease variant against a healthy one §3, recognise a gene's species of origin §4, and then push further into folded protein structure §5, the embedding manifold §6, and the species tree §7. Each demo runs against the public HuggingFaceBio/Carbon-3B checkpoint behind a live inference endpoint.

§1 · Autocomplete

Autocomplete for the genome

Same idea as GPT completing a sentence, but for DNA. We feed the model a DNA sequence and it streams back a continuation, one 6-base token at a time. Carbon is better at predicting a gene's exons: they are the protein-coding parts of the gene and are under strong evolutionary constraint, so they should be the most predictable stretches of DNA. Introns, which are spliced out and often serve regulatory purposes, are harder to predict. We overlay the real exon/intron annotations on top of the output so you can compare what Carbon produces to what's actually there.

Legend · exon · intron · prompt → generated

Model output · prompt in gray · generated bases colored by logprob (red = uncertain) · matches and mismatches underlined

Stats · identity · in-exon · in-intron · tokens · mean logprob · perplexity

Try it Drag the dark ▼ ▲ markers to slide the prompt window and the green ▼ to set where generation stops, then hit ▶ generate. Land the green-shaded region inside an exon (dark green block) and note the count of green-underlined matches; repeat with a similar-length window over an intron and compare.

What to look for Exons are under selection pressure, so getting them right takes real biological understanding, not just DNA statistics. Boundaries between high- and low-confidence stretches in Carbon's output also tend to fall near real exon/intron edges, even though the model has never seen a single annotation.
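
The demo reports identity and perplexity for each run. The page does not spell out its exact definitions, so the sketch below assumes the usual ones: identity is the fraction of positions where the generated base equals the reference base, and perplexity is the exponential of the negative mean per-token log-probability. Both function names are ours.

```python
import math

def identity(generated: str, reference: str) -> float:
    """Fraction of matching bases over the overlapping length."""
    n = min(len(generated), len(reference))
    if n == 0:
        return 0.0
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / n

def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log-prob); here a 'token' is one 6-base chunk."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)

gen = "ATGGCCGAACTG"
ref = "ATGGCCGATCTG"  # differs from gen at one position
print(f"identity = {identity(gen, ref):.1%}")  # identity = 91.7%
print(f"perplexity = {perplexity([-0.5, -1.5]):.3f}")
```

Running `identity` separately on an exon window and an intron window of similar length gives the comparison suggested above.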

Run this from code

from huggingface_hub import get_token
from openai import OpenAI

# Carbon-3B can be served behind any OpenAI-compatible API (vLLM, TGI, an
# HF inference endpoint, etc.). Point base_url at your deployment.
client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# First ~60 bp of HBB. Replace with whatever gene opening you want.
prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"

r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=10,        # 10 6-mer tokens ~= 60 bp of continuation
    temperature=0.5, top_p=0.9,
)
print(r.choices[0].text)

# --- Alternative: run the same completion locally with transformers ---
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=10,        # ~60 bp at 6 bp / token
        temperature=0.5, top_p=0.9, do_sample=True,
    )

# Slice off the prompt so we just print the continuation.
new_ids = out[0, inputs["input_ids"].shape[1]:]
print(tok.decode(new_ids))

§2 · Structure

Recognizing gene structure

Carbon assigns every 6-base chunk a log-probability given the preceding context: how "expected" or "likely" that stretch of DNA is. Plotted along a real gene, the score dips and rises. We overlay the exon/intron annotation on top: confidence reliably climbs in protein-coding regions and falls in repetitive or unconstrained intronic stretches, even though the model never saw a single label. The same score, summed up, is what powers the variant-effect call in §3 below.

Legend · exon (shaded) · y-axis: log P per 6-bp token (higher = more confident) · x-axis: position (bp)

Stats · mean (exon) · mean (intron) · Δ (exon − intron) · tokens · mean (overall)

Try it Pick a gene and watch its per-token confidence curve. Each gene's exons are highlighted in green; the curve underneath is Carbon's log-probability for each 6-base token along the sequence.

What to look for Exons, especially the protein-coding portions, tend to score noticeably higher than introns because they're evolutionarily conserved and full of constrained patterns the model has learned to predict. The Δ tells you how strongly Carbon "noticed" the difference for this gene. Keep this curve in mind for §3: a variant that flips a base inside a high-confidence exon stretch is the kind of edit that should make Carbon surprised.

Run this from code

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# Echoed scoring: forward-pass the prompt and return per-token logprobs
# (no generation). The score per 6-mer token is what the confidence
# track is built from.
# gene_sequence is a placeholder: supply your own A/C/G/T string.
prompt = "<dna>" + gene_sequence    # full gene, up to ~32k tokens

r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=0, echo=True, logprobs=1, temperature=0,
)

for tok, lp in zip(r.choices[0].logprobs.tokens,
                   r.choices[0].logprobs.token_logprobs):
    print(f"{tok}\t{lp}")

# --- Alternative: compute the same per-token log-probs locally ---
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

ids = tok("<dna>" + gene_sequence, return_tensors="pt",
          add_special_tokens=False).input_ids.to("cuda")

with torch.inference_mode():
    logits = model(ids).logits

# Per-token log-prob of the actual next token (the standard "echo" pattern).
logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
per_tok_lp = logp.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()),
                 per_tok_lp.tolist()):
    print(f"{t}\t{lp:.3f}")
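
To turn per-token log-probs into the exon/intron means and the Δ shown in the demo, each token has to be mapped back to a genomic position. The sketch below is one way to do it, under stated assumptions: the list holds one log-prob per 6-base DNA token (the `<dna>` tag token already dropped), token i covers bases [6i, 6i+6), and a token counts as exonic when its midpoint falls inside an annotated exon. The page may assign tokens differently; `exon_intron_delta` is a hypothetical helper.

```python
K = 6  # bases per token

def _mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else float("nan")

def exon_intron_delta(token_logprobs, exons, k=K):
    """Return (mean exon logprob, mean intron logprob, exon minus intron).

    exons: list of half-open (start, end) intervals in base coordinates.
    """
    exon_lp, intron_lp = [], []
    for i, lp in enumerate(token_logprobs):
        midpoint = i * k + k // 2
        in_exon = any(start <= midpoint < end for start, end in exons)
        (exon_lp if in_exon else intron_lp).append(lp)
    mean_exon, mean_intron = _mean(exon_lp), _mean(intron_lp)
    return mean_exon, mean_intron, mean_exon - mean_intron

# Toy values, not model output: five tokens over a 30 bp stretch.
toy_logprobs = [-1.0, -1.2, -3.0, -3.4, -0.8]
toy_exons = [(0, 12), (24, 30)]
mean_exon, mean_intron, delta = exon_intron_delta(toy_logprobs, toy_exons)
print(f"exon {mean_exon:.2f}  intron {mean_intron:.2f}  delta {delta:+.2f}")
```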

§3 · Variant effect

Predicting mutation effects

§2 showed that Carbon's per-token confidence rises and falls in step with gene structure. Here we use the same log-likelihood to score individual mutations. For a real ClinVar variant we score a ~4 kb window of human DNA two ways: once with the original base, once with the mutation. Then we check which version looks more like real, functioning human sequence. Carbon was never trained on what "pathogenic" means; it just learned what natural DNA looks like. Variants that disrupt protein-coding or regulatory function show up as less likely sequence under the model's distribution.

Try it Pick a known variant from the pills, then click any base in the mutation row to introduce a different change. The model re-scores on every edit.

What to look for Read each row two ways: the dot color is what ClinVar says (red = pathogenic, orange = risk, green = benign); the bar direction is what Carbon says (red bar pointing left = mutation less likely than original; charcoal bar pointing right = mutation looks fine or more likely). Watch the two VHL rows for the cleanest demonstration: a premature stop codon (c.475A>T) swings the bar hundreds of nats to the left, while a common 3' UTR variant (c.*820A>G) in the very same gene sits at zero. Same model, same window length, opposite verdicts. Carbon learned the distinction from raw sequence alone, with no labels.

Run this from code

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def score_sum(seq):
    """Sum of per-token log-probs for the given DNA sequence."""
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + seq,
        max_tokens=0, echo=True, logprobs=1, temperature=0,
    )
    return sum(lp for lp in r.choices[0].logprobs.token_logprobs if lp is not None)

# Score the same ~4 kb window two ways: original vs the one-base mutation.
delta = score_sum(var_seq) - score_sum(ref_seq)
print(f"delta = {delta:+.2f}  (less likely if negative)")

# --- Alternative: score both windows locally with transformers ---
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

def score_sum(seq):
    ids = tok("<dna>" + seq, return_tensors="pt",
              add_special_tokens=False).input_ids.to("cuda")
    with torch.inference_mode():
        logits = model(ids).logits
    logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
    return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

delta = score_sum(var_seq) - score_sum(ref_seq)
print(f"delta = {delta:+.2f}  (less likely if negative)")
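
The snippets above take `ref_seq` and `var_seq` as given. A small sketch of how the mutated window might be built from the reference window follows; `apply_snv` is a hypothetical helper, and the toy sequence stands in for a real ~4 kb window.

```python
def apply_snv(ref_seq: str, offset: int, ref_base: str, alt_base: str) -> str:
    """Substitute one base at a 0-based offset, checking the reference first."""
    if ref_seq[offset] != ref_base:
        raise ValueError(
            f"expected {ref_base} at offset {offset}, found {ref_seq[offset]}"
        )
    return ref_seq[:offset] + alt_base + ref_seq[offset + 1:]

ref_seq = "ATGGCCGAACTGCCCGGG"  # toy window, not a real locus
var_seq = apply_snv(ref_seq, 8, "A", "T")
print(ref_seq)
print(var_seq)
```

Because a substitution keeps the window length unchanged, the 6-mer token boundaries stay aligned between the two sequences, so the difference in summed log-prob reflects the base change rather than a shifted tokenization. Insertions and deletions would shift the frame and need separate handling.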

§4 · Species

Species specific generation

The same gene (insulin, p53) exists in human, mouse, and chicken, but the surrounding sequence has accumulated different mutations along each lineage for hundreds of millions of years. For each species we feed Carbon up to ~400 bp and ask it to continue. Each continuation should match that species' real DNA better than it matches another species'. The model handles these species well (mouse, and even chicken at ~300 My from human); the further you go back in evolutionary time, the more the surrounding sequence drifts and the harder this setup becomes.

Legend · prompt in gray · generated colored by logprob · mismatches in reference highlighted

Try it Pick a gene shared across species, set the prefix length, then hit run all to score every species in parallel. Try the same gene at prefix 200 vs 400 and watch the per-species identity respond.

What to look for With 400 bp of context the model usually recognises which species' DNA it's been given and continues in that species' style; identity to that species' reference often runs 65–90% on the next 60 bp. Cut the prefix to 200 and the signal collapses to near-random: a few hundred bases is what it takes to "lock in" on a lineage. The gap between mouse and chicken is where you can read the evolutionary signal: 300+ My since the last common ancestor is enough drift that a 400 bp prefix still locks Carbon in, but the per-base identity sits a notch below mouse.

Run this from code

from huggingface_hub import get_token
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def continue_species(species_prefix):
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + species_prefix,
        max_tokens=10,
        temperature=0.5, top_p=0.9,
    )
    return r.choices[0].text

# species_prefixes = { "human": ..., "mouse": ..., "chicken": ... }
with ThreadPoolExecutor() as pool:
    results = dict(zip(species_prefixes, pool.map(continue_species, species_prefixes.values())))

for name, cont in results.items():
    print(f"{name:10s}  {cont}")

# --- Alternative: batch all species locally with transformers ---
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Batch all species in one forward pass via left-padding.
prompts = ["<dna>" + p for p in species_prefixes.values()]
enc = tok(prompts, return_tensors="pt", padding=True, add_special_tokens=False).to("cuda")

with torch.inference_mode():
    out = model.generate(
        **enc, max_new_tokens=10,
        temperature=0.5, top_p=0.9, do_sample=True,
    )
new_ids = out[:, enc["input_ids"].shape[1]:]
for name, ids in zip(species_prefixes, new_ids):
    print(f"{name:10s}  {tok.decode(ids)}")

§5 · Folding

From DNA to proteins

When Carbon completes a protein-coding region of a gene, the resulting bases translate to a protein, and that protein folds. We feed the translated sequence into ESMFold (a structure predictor similar to AlphaFold) and render the 3D structure inline, alongside the same protein folded from the reference sequence, so you can see whether Carbon's continuation produced something similar.

Viewer · Carbon completion and reference side by side: amino-acid sequences (mismatches vs reference aligned position by position) and folded 3D structures colored by pLDDT (low → high) · drag to rotate

Stats · residues · pLDDT mean (carbon) · pLDDT mean (ref) · identity (1D)

What to look for A high pLDDT means ESMFold is confident in the predicted structure at that residue. The interesting case is when Carbon's completion diverges at the base level (sometimes drastically, like CFTR at ~22% identity) but still folds with high confidence into a shape that mirrors the reference backbone. That's the model reaching past memorization for the structural grammar underneath the sequence.

§6 · Embedding space

Mapping out genomes

We embed 571,810 genes from 27 species across six kingdoms (vertebrates, invertebrates, plants, fungi, bacteria, viruses) with Carbon, project them to 2D with UMAP, and color the points by attribute. Depending on the attribute, different kinds of organization emerge from the same points: the model's embedding space encodes multiple axes of biology at once, most of which were never labeled.

Viewer · 571K points · highlights · drag to pan · wheel to zoom · hover for details

Stats · points · species · embedding dim 3072 · render

What to look for Switch coloring from species to biotype: same points, completely different organization emerges. The macro-clusters trace six kingdoms (vertebrates, invertebrates, plants, fungi, bacteria, viruses), discovered from raw sequence alone. Switch again to gc content and a perpendicular axis appears: AT-rich (cool blue) vs GC-rich (warm amber) regions cut across the species clusters, revealing the composition gradient the model has internalised. Points: 571,810 real Carbon 3B embeddings, projected to 2D via UMAP.
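
For reference, the gc content attribute is a simple composition statistic. The sketch below assumes the common definition (G plus C over all unambiguous bases); how the page handles ambiguous bases such as N is not stated, so excluding them is our assumption.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")  # skips N and other codes
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

print(f"{gc_content('ATGGCC'):.3f}")  # 0.667
print(f"{gc_content('ATATAT'):.3f}")  # 0.000
```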

§7 · Species tree

How Carbon groups species from DNA

We take 571,789 of the sequences from §6 (excluding the two viruses, which are not part of the tree of life), average each species' embeddings into a single 3072-dim vector, and cluster the resulting 25 centroids with hierarchical clustering to find the species the model regards as closely related. This dendrogram is not intended as a phylogenetic tree. Instead, it asks a simpler question: whether a model trained only on DNA sequences learns representations whose geometry reflects broad biological structure. Carbon was never trained on what the relation between organisms is. Yet the resulting tree groups vertebrates together, separates bacteria from fungi, and pairs sister clades (primates with primates, rodents with rodents, monocots with monocots).

Viewer · hover a row to see its top neighbours · toggle linkage / scope · x-axis: cosine distance

Legend · vertebrates · invertebrates · plants · fungi · bacteria · ✓ nearest Carbon neighbour shares the NCBI group · ✗ doesn't · solo (no NCBI sibling in the dataset)

Stats · species · sequences · embedding dim 3072 · distance: cosine

What to look for Toggle kingdom-level vs sister-level. At the kingdom scale the grouping is strong and stable: animals cluster with animals, bacteria with bacteria. At the sister scale (primate-with-primate, etc.) agreement is lower because distances are extremely small, so the nearest neighbor can change with sampling, pooling, or linkage choice. The model nails the broad strokes but blurs the fine branches at this resolution. Switch linkage from Ward to UPGMA to see how much of the structure is method-independent. Tree built from species centroids of mean-pooled Carbon-3B embeddings.
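
The centroid and nearest-neighbour steps behind the ✓/✗ marks can be sketched in a few lines. This is an illustration with toy 3-dim vectors and made-up species labels, not Carbon embeddings (the real ones are 3072-dim), and it stops before the hierarchical linkage step, which a clustering library would normally handle.

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    """Average a species' gene embeddings into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbour(centroids: dict[str, list[float]]) -> dict[str, str]:
    """For each species, the other species at the smallest cosine distance."""
    result = {}
    for name, vec in centroids.items():
        candidates = (
            (cosine_distance(vec, other_vec), other)
            for other, other_vec in centroids.items()
            if other != name
        )
        result[name] = min(candidates)[1]
    return result

# Toy data with hypothetical labels.
toy_embeddings = {
    "primate_a": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "primate_b": [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]],
    "fungus_a": [[0.0, 1.0, 0.9], [0.1, 0.9, 1.0]],
}
centroids = {name: centroid(vecs) for name, vecs in toy_embeddings.items()}
neighbours = nearest_neighbour(centroids)
print(neighbours)
```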

Intro

Carbon's architecture is deliberately vanilla. What's not vanilla, and what gets the headline numbers in the DNA Lab tab, is three things: a 6-mer tokenizer that lets the model see ~6× more genomic context per forward pass, a Factorized Nucleotide Supervision (FNS) loss that gives the model partial credit for near-miss tokens once cross-entropy training starts to wobble, and a multi-stage curated data mixture, biased toward functional genomic regions. Everything else (architecture, optimizer) is standard recipe. The technical report details each choice and the ablations behind it.

The sections below walk through each of those choices: how the tokenizer changes what a "token" means in DNA §1, how FNS rescues training in the BF16 regime §2, how bp-level generation and scoring fall out of the same marginalization §3, what's in the training corpus §4, what the architecture looks like §5, how 8k-token pretraining reaches 786 kbp at inference §6, how Carbon stacks up against Evo2-7B and GENERator-v2 on the full training-free suite §7, and why the model runs so fast §8.

§1 · Tokenizer

Read DNA in 6-base chunks

The most direct way to model DNA is one base per token. It works, but for an L-base sequence Transformer attention costs L², and DNA contexts are long. Carbon instead reads in fixed 6-base blocks. Same DNA span, ⅙ the tokens, and because attention is quadratic, up to 36× cheaper at the same coverage. BPE was a tempting middle ground, but its variable-length tokens collide badly with autoregressive next-token prediction: DNA doesn't have stable "words."

1-mer · one token per base · vocab 4

6-mer (carbon) · one token per 6 bases · vocab 4,096

Stats · 1-mer tokens · 1-mer attention · 6-mer tokens · 6-mer attention

same DNA span → shorter token sequence = cheaper attention (36× cheaper)

Why not BPE BPE works for English because words have stable boundaries. DNA motifs don't: the TATA box is a family of patterns (TATATA, TATATT, …), not a single string. Worse, in autoregressive mode, BPE penalizes the model for predicting a valid prefix of the target token. 6-mer is a deterministic, neutral compression that avoids this trap.
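
A toy version of fixed 6-mer tokenization makes the token-count and attention arithmetic concrete. The id assignment (lexicographic over A, C, G, T) and the handling of leftover bases are assumptions made for this sketch; the released tokenizer may differ on both.

```python
from itertools import product

K = 6
# All 4**6 possible 6-mers. The ordering here is an arbitrary choice.
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]
TOKEN_ID = {kmer: i for i, kmer in enumerate(VOCAB)}

def tokenize(seq: str, k: int = K) -> list[int]:
    """Cut a sequence into non-overlapping k-mers; trailing bases are dropped here."""
    usable = len(seq) - len(seq) % k
    return [TOKEN_ID[seq[i:i + k]] for i in range(0, usable, k)]

seq = "TATAAAATGGCCGAACTGGTAAGCATATAGCCCGGG"  # 36 bp
ids = tokenize(seq)

print(len(VOCAB))                     # 4096
print(len(seq), len(ids))             # 36 6
print(len(seq) ** 2 / len(ids) ** 2)  # 36.0, the quadratic attention ratio
```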

§2 · Training objective

Partial credit for near-misses

Cross-entropy treats every 6-mer token as atomic: predict TATATT when the target was TATATA, get zero credit even though five of six bases matched. That gets brittle late in training. Carbon switches to Factorized Nucleotide Supervision: instead of one 4096-way classification, the model is supervised on six parallel 4-way nucleotide marginals derived from the same logits. Near-miss tokens get partial credit proportional to how many bases they got right.

What the switch buys you CE comes first: the model learns the joint structure of bases inside each 6-mer (codon constraints, splice signals, motif composition). FNS comes later, once CE turns brittle (the "loss staircase" appears and BF16 inference starts diverging from FP32): it smooths the objective and restores numerical robustness without giving up the joint prior CE built.
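
To make the partial-credit idea concrete, here is a toy comparison of the two losses on a single prediction. It is a sketch of the idea as described above, not the training code: averaging the six per-position terms is our assumption (the technical report defines the exact form), and all names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"
K = 6
KMERS = ["".join(p) for p in product(BASES, repeat=K)]  # 4,096 tokens

def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def base_marginals(probs: list[float]) -> list[dict[str, float]]:
    """Six 4-way distributions: sum the 6-mer probs sharing a base at each position."""
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(K)]
    for kmer, p in zip(KMERS, probs):
        for pos, base in enumerate(kmer):
            marginals[pos][base] += p
    return marginals

def ce_loss(probs: list[float], target: str) -> float:
    """Token-level cross-entropy: only the exact 6-mer counts."""
    return -math.log(probs[KMERS.index(target)])

def fns_loss(probs: list[float], target: str) -> float:
    """Mean per-position negative log marginal of the target bases (assumed form)."""
    marginals = base_marginals(probs)
    return -sum(math.log(marginals[pos][b]) for pos, b in enumerate(target)) / K

# A model that is confident in TATATT.
logits = [0.0] * len(KMERS)
logits[KMERS.index("TATATT")] = 12.0
probs = softmax(logits)

for target in ("TATATA", "CGCGCG"):  # near miss (5 of 6 bases) vs far miss
    print(target, f"CE={ce_loss(probs, target):.2f}", f"FNS={fns_loss(probs, target):.2f}")
```

Cross-entropy charges the same loss for both targets, while the factorized loss is much smaller for the near miss, which is the behaviour the section describes.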

§3 · BP-level inference

Bases, not 6-mers

The 6-mer tokenizer makes Carbon fast, but it's coarse in both directions of inference. When generating, each step advances the sequence by 6 bases at once and temperature acts on a 4,096-way distribution rather than per nucleotide. When scoring an existing sequence, the raw next-token likelihood answers "how likely is this 6-mer in context?", not "how likely is this exact base at this exact position?", which is the version you want for variant-effect prediction. The same marginalization that powers FNS at training time fixes both: softmax over the 6-mer logits, then for each position p sum the probabilities of every 6-mer that shares a given base at p, and you recover six per-position 4-way base distributions. To generate, sample (or argmax) each independently and force the matching 6-mer token. To score, read P(actual base | context) directly off the marginals at every position. Same logits, same math, two endpoints.

per-step pipeline · 4,096-way 6-mer logits → 6 × 4-way base marginals → reassembled token

step 1 · softmax over 4,096 DNA tokens
▼ sum over 6-mers sharing a base at position p
step 2 · six 4-way per-base distributions (pos 1 to pos 6, each over A, T, C, G)
▼ same marginals feed two endpoints: generate (force a token) or score (read off P(base))
step 3a · generate · argmax / multinomial → force matching 6-mer token (e.g. ACGTAT)
step 3b · score · read P(actual base | context) at each position (e.g. .83 .71 .92 .67 .48 .79)
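
The two endpoints of the pipeline can be sketched on a toy distribution. This is an illustration of the marginalization described above, not the code shipped with the checkpoints; the helper names are ours and the probabilities are invented.

```python
from itertools import product

BASES = "ACGT"
K = 6
KMERS = ["".join(p) for p in product(BASES, repeat=K)]

def base_marginals(probs: list[float]) -> list[dict[str, float]]:
    """Step 2: for each position, sum the probs of every 6-mer sharing a base there."""
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(K)]
    for kmer, p in zip(KMERS, probs):
        for pos, base in enumerate(kmer):
            marginals[pos][base] += p
    return marginals

def greedy_token(marginals: list[dict[str, float]]) -> str:
    """Step 3a: argmax each position independently, then force that 6-mer."""
    return "".join(max(BASES, key=m.get) for m in marginals)

def score_bases(marginals: list[dict[str, float]], observed: str) -> list[float]:
    """Step 3b: read P(actual base | context) at each position."""
    return [m[base] for m, base in zip(marginals, observed)]

# Toy step-1 output: all probability mass on three 6-mers.
probs = [0.0] * len(KMERS)
for kmer, p in [("ACGTAT", 0.40), ("ACGTAA", 0.35), ("TCGTAA", 0.25)]:
    probs[KMERS.index(kmer)] = p

marginals = base_marginals(probs)
print(greedy_token(marginals))            # ACGTAA
print(score_bases(marginals, "ACGTAT"))   # roughly 0.75, 1, 1, 1, 1, 0.40
```

In this toy case the per-base argmax picks ACGTAA even though ACGTAT is the single most likely 6-mer, which shows that base-level and token-level decoding can disagree.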

When to switch on bp-level Use plain 6-mer decoding when 6-base granularity is fine: throughput-bound generation, long retrieval haystacks, large-scale screening. Reach for bp-level generation when you need exact base counts, per-position masks, or temperature applied at the base axis rather than the 4,096-way 6-mer axis. Reach for bp-level scoring whenever the task is about a specific base: variant-effect prediction, single-nucleotide mutational scans, comparing the likelihood of a reference and an alternate allele at one position. Both paths ship together on the fns revision of the Carbon-3B/8B/500M checkpoints: plain .generate() already produces bp-resolution output (the tokenizer exposes the kmer width as tokenizer.k), and the model gains a score_sequence(seqs) method that batches a list of sequences and returns per-base distributions plus the probability of the observed base at every position.

Run this from code

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceBio/Carbon-3B"
revision = "fns"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
n_bp = 60

inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=math.ceil(n_bp / tokenizer.k),
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]

print(generated_dna)

# --- bp-level scoring with score_sequence (same fns revision) ---
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceBio/Carbon-3B"
revision = "fns"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"

# score_sequence accepts a list of sequences and returns, for each one,
# the [seq_len, 4] marginal P(A/T/C/G | context) and the [seq_len]
# probability of the observed base.
with torch.no_grad():
    bp_probs, actual_probs = model.score_sequence([reference, perturbed])

scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]

print(f"reference mean bp logp: {scores[0]:.4f}")
print(f"perturbed mean bp logp: {scores[1]:.4f}")
print(f"reference preferred: {scores[0] > scores[1]}")

§4 · Data

Genomes are mostly background

A naive read of "more data is better" misses something specific to DNA: most of a eukaryotic genome is repeats, low-complexity sequence, and weakly constrained background. Train on raw sequence and much of your loss is dominated by easy-to-predict noise. Carbon's corpus is an annotation-aware mixture, biased toward gene-centric, transcript, and bacterial sequence, so the model spends more of its gradient updates on biologically meaningful sequence.

corpus composition · 1T tokens (6T base pairs)

signal-to-noise · raw genome vs annotation-aware curation

functional / annotated background curating raises the density of biological signal in the gradient

metadata templates · the model sees mixed contexts so it works with or without labels

The signal-to-noise math If only 5% of a raw corpus is informative, but you keep 80% of informative regions while discarding 95% of background, the effective informative fraction jumps from 5% to ≈ 46%. Same training compute, ~9× more learning signal per gradient step.
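
The arithmetic behind those numbers, written out as a short sketch (the function name is ours; "discarding 95% of background" is read as keeping 5% of it):

```python
def informative_fraction(p_informative: float, keep_informative: float,
                         keep_background: float) -> float:
    """Share of the curated corpus that is informative after filtering."""
    kept_informative = p_informative * keep_informative
    kept_background = (1.0 - p_informative) * keep_background
    return kept_informative / (kept_informative + kept_background)

fraction = informative_fraction(0.05, keep_informative=0.80, keep_background=0.05)
print(f"{fraction:.1%}")         # 45.7%
print(f"{fraction / 0.05:.1f}x")  # 9.1x
```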

§5 · Architecture

A deliberately vanilla transformer

Decoder-only, RMSNorm + SwiGLU + RoPE + grouped-query attention, tied I/O embeddings, 8k-token context. Nothing exotic. The architectural surface is intentionally familiar so that any improvement Carbon shows on genomic tasks is attributable to the data, the tokenizer, and the loss, not to a custom block or a hand-crafted attention variant.

vocabulary = 4,096 6-mer DNA tokens + small set of special / metadata tokens · total 155,776

Why this matters Architecture innovation is one of the cheapest things to claim and one of the hardest things to attribute. Carbon's results (competitive with Evo2-7B at 3B parameters, ahead of it on a majority of tasks at 8B) come from changes that aren't the architecture. That's where the room for genomic foundation models still is.

§6 · Long context

Pretrain at 8k, retrieve at 786 kbp

Carbon's nominal training context is short by megabase-scale standards (8k tokens, ≈49 kbp). The reach comes from a two-step extension. First, a training-time long-context phase lifts the context to 32k tokens (≈197 kbp) with RoPE θ rescaled from 500k to 5M. Then, at inference, YaRN pushes that further: 2× to 65k tokens for the 3B model, 4× to 131k tokens for the 8B (≈786 kbp, the size of a small bacterial genome). The 8B has more capacity to absorb the YaRN stretch, which is why it extends further than the 3B.
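
The token-to-base-pair conversions above follow from the 6 bp per token ratio. A quick check, assuming "8k" and "32k" mean 8×1024 and 32×1024 tokens (the text gives only the rounded figures):

```python
K = 6  # base pairs per token

def kbp(tokens: int) -> float:
    """Base pairs covered by a token budget, in thousands."""
    return tokens * K / 1000

pretrain = 8 * 1024      # nominal pretraining context
extended = 32 * 1024     # after the long-context training phase
yarn_3b = 2 * extended   # YaRN 2x at inference (3B)
yarn_8b = 4 * extended   # YaRN 4x at inference (8B)

for label, tokens in [("pretrain", pretrain), ("extended", extended),
                      ("3B + YaRN", yarn_3b), ("8B + YaRN", yarn_8b)]:
    print(f"{label:10s} {tokens:7d} tokens ≈ {kbp(tokens):.0f} kbp")
```

Under that assumption the four rows come out at about 49, 197, 393, and 786 kbp, matching the figures quoted in this section and the 393 kbp retrieval length used in §7.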

context length · log scale, base pairs of DNA reachable in a single forward pass

Genome-NIAH retrieval · plain variant · find a planted 24 bp value inside a real-genome haystack

Carbon 8B (YaRN) Carbon 3B (YaRN) Evo2-7B (native 1M) accuracy at exact-match retrieval, 500 samples per cell

The headline number At 786 kbp, Carbon-8B retrieves the planted needle at 65% accuracy. Evo2-7B, natively trained at 1M tokens of single-nucleotide context (≈8× more wall-clock per token), scores 53% at the same length. So a 6-mer model trained to 32k tokens plus YaRN-4× at inference reaches further than a 1M-native single-nucleotide model, which is the entire bet of the Carbon recipe: nominal context length is not the same as effective context utilization.

§7 · Results

Training-free, head-to-head

Eight training-free tasks across four capability axes: generative sequence recovery, variant-effect prediction (BRCA2, TraitGym, ClinVar coding / non-coding), sequence-level perturbation (synthetic motif insertion and synonymous codon shuffling), and long-context retrieval (Genome-NIAH at 393 kbp). No fine-tuning, no head training, all four frozen pretrained models scored under the same protocol. Carbon-3B is competitive with Evo2-7B despite less than half the parameters; Carbon-8B is ahead on five of eight.

Carbon 8B Carbon 3B Evo2-7B GENERator-v2 3B

How to read it Carbon-8B leads on sequence recovery, BRCA2, ClinVar non-coding, triplet expansion, and Genome-NIAH at 393 kbp. Evo2-7B holds onto TraitGym Mendelian (a hard non-coding variant set), and edges Carbon-8B on ClinVar coding and synonymous codon shuffling by a fraction of a point each, small enough to be effectively a tie. The pattern is broad rather than peaky: Carbon's gains come from data, tokenizer, and objective design, distributed across tasks, not from a single specialised benchmark.

§8 · Efficiency

Why Carbon is fast

The throughput story is a two-factor multiplication, not one big trick. First, the architecture is deliberately vanilla: a stock Llama-3-shaped decoder. That means Carbon drops straight into vLLM and inherits the same paged-attention, fused kernels, and CUDA-graph capture that the open-source LLM stack has been optimizing for two years. Custom blocks would forfeit all of that. Second, 6-mer tokenization compresses a given DNA span by 6× at the input, which under quadratic attention is up to a 36× reduction in prefill cost, and the decode loop emits 6 bases per step instead of one. Stacking the two: standard-stack inference speedups, multiplied by tokenizer compression, gets you the order-of-magnitude gap over Evo2 reported in the paper.

Inference throughput · output bp/s · single H100

Legend Carbon-8B Carbon-3B Carbon-500M Evo2 1B Evo2 7B Evo2 20B Evo2 40B

Source · carbon-inference-evals · vLLM for Carbon · Evo2 native runner

The compound effect Neither factor on its own would be a story. Vanilla architecture without 6-mer compression would land Carbon at roughly Llama-3 throughput: fine but not remarkable. 6-mer compression on a custom architecture would force a hand-rolled inference stack to keep up with vLLM. Doing both together is what makes a 3B-parameter DNA model usable for large-scale evaluation on commodity hardware.
