Today we're releasing Carbon — three model sizes (500M, 3B, and 8B parameters), shipping with the full training code, the data pipeline, and the model weights. All open-source on the Hugging Face Hub.
Fig · Benchmark Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.
The model is fed long strings of four letters: A, C, G, T. Those letters are the bases of DNA. Stretches of it are genes, which cells copy into RNA and translate into proteins. A century of molecular biology has been spent working out how. Carbon is given only the letters. What they mean is what it has to learn.
DNA is written in four small molecules: adenine, cytosine, guanine, thymine. Two are purines (A and G, twin-ring), two are pyrimidines (C and T, single-ring). Everything that follows is built from these four.
Each base hangs off a sugar-phosphate backbone. Two backbones run anti-parallel and twist into a double helix. The bases on opposite strands pair by chemistry: A always with T, G always with C, so one strand fully determines the other. A human genome is about 3 billion base pairs of this.
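Because pairing is fixed, the opposite strand can be computed rather than stored. The snippet below is a minimal sketch of that rule; the function name `reverse_complement` and the example sequence are illustrative and not part of the Carbon codebase.

```python
# Watson-Crick pairing: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # Complement every base, then reverse because the strands run anti-parallel.
    return strand.upper().translate(COMPLEMENT)[::-1]

forward = "ATGCGCTAG"  # illustrative sequence
reverse = reverse_complement(forward)
print(forward, reverse)
```

Applying the function twice returns the original strand, which is the "one strand fully determines the other" property in code form.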
A gene is a stretch of DNA that the cell turns into protein. Most of the genome is not. Each gene begins with a promoter, where the cell starts reading. What follows is broken into two kinds of segment: exons, which the cell keeps, and introns, which it splices out and which often serve regulatory purposes.
The cell copies the gene into RNA. Then it splices out the introns and joins the exons together. What's left is the working mRNA: just the exons, in order. (T is rewritten as U along the way: a small alphabet quirk between DNA and RNA.)
Every three RNA letters (a codon) encode one amino acid. There are only 20 amino acids in the standard alphabet; every protein in nature is built from this same set. The chain then folds into a 3D shape, and that shape is the function: hemoglobin · insulin · collagen · antibodies · enzymes.
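The copy, splice, translate pipeline described above can be sketched in a few lines. This is a toy illustration: the gene, the exon coordinates, and the helper names are made up for the example, and only the standard genetic code table is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna: str) -> str:
    """Copy DNA into RNA: same letters, with T rewritten as U."""
    return dna.replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based, end-exclusive), joined in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def translate(mrna: str) -> str:
    """Read codons three letters at a time until a stop codon."""
    dna = mrna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical toy gene: exon 1 + intron + exon 2.
gene = "ATGGTGCAC" + "GTAAGTTTTCAG" + "CTGTAA"
exons = [(0, 9), (21, 27)]

mrna = splice(transcribe(gene), exons)
protein = translate(mrna)
print(mrna, protein)
```

Real genes also carry untranslated regions and alternative splice forms, which this sketch leaves out.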
A model that understands and writes DNA is useful wherever DNA is the input or the output. There are three interesting use-cases for such models: tuning the genetics of the food we grow, designing the regulatory and coding sequences that drive biomanufacturing, and helping interpret the variants that show up in clinical sequencing.
Map genotype to phenotype across crops and livestock: surface the variants that drive yield, quality, disease and pest resistance, and tolerance to drought, heat, cold, or salinity, so breeders can select for them directly.
Design and tune promoters, enhancers, UTRs, and terminators to control expression strength, tissue specificity, timing, and inducibility. The same machinery powers codon optimization and host-specific engineering, letting microbial strains turn out enzymes, chemicals, fuels, antibiotics, and natural products more efficiently.
Help prioritize the variants of uncertain significance that crowd clinical sequencing in rare disease and cancer, where it's often unclear whether a DNA change is actually driving the phenotype. Further out, support patient-tailored therapeutic design: mRNA vaccines, therapeutic proteins, enzymes, and antimicrobial peptides, with expression efficiency, stability, and manufacturability in the loop.
Carbon-3B is a 3-billion-parameter language model for DNA. It is trained on roughly 1 trillion tokens (6 trillion base pairs) of genomic sequence with a simple objective: given some DNA, predict what comes next (six bases at a time, autoregressively). Even though the objective is simple, the resulting model is versatile. In the DNA Lab you can explore what a DNA model can do.
Carbon-3B was trained without supervision, apart from some simple tags for species and gene biotypes.
It wasn't trained to tell which mutations are pathogenic or how genes differ between species.
The sections below highlight what it picked up anyway: autocomplete a gene §1, see structure emerge in its confidence §2, score a disease variant against a healthy one §3, recognise a gene's species of origin §4, and then push further into folded protein structure §5, the embedding manifold §6, and the species tree §7. Each demo runs against the public HuggingFaceBio/Carbon-3B checkpoint behind a live inference endpoint.
Same idea as GPT completing a sentence, but for DNA. We feed the model a DNA sequence and it streams back a continuation, one 6-base token at a time. The model is better at predicting a gene's exons: they are the protein-coding parts of the gene and are under strong evolutionary constraint, so they should be the most predictable stretches of DNA. Introns, on the other hand, serve regulatory purposes and are harder to predict. We overlay the real exon/intron annotations on the output so you can compare what Carbon produces to what's actually there.
Try it: Drag the dark ▼ ▲ markers to slide the prompt window and the green ▼ to set where generation stops, then hit ▶ generate. Land the green-shaded region inside an exon (dark green block) and note the count of green-underlined matches; repeat with a similar-length window over an intron and compare.
What to look for: Exons are under selection pressure, so getting them right takes real biological understanding, not just DNA statistics. Boundaries between high- and low-confidence stretches in Carbon's output also tend to fall near real exon/intron edges, even though the model has never seen a single annotation.
from huggingface_hub import get_token
from openai import OpenAI

# Carbon-3B can be served behind any OpenAI-compatible API (vLLM, TGI, an
# HF inference endpoint, etc.). Point base_url at your deployment.
client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# First ~60 bp of HBB. Replace with whatever gene opening you want.
prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"
r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=10,  # 10 6-mer tokens ~= 60 bp of continuation
    temperature=0.5, top_p=0.9,
)
print(r.choices[0].text)

# Same example, run locally with transformers instead of a served endpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

prompt = "<dna>AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTT"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=10,  # ~60 bp at 6 bp / token
        temperature=0.5, top_p=0.9, do_sample=True,
    )
# Slice off the prompt so we just print the continuation.
new_ids = out[0, inputs["input_ids"].shape[1]:]
print(tok.decode(new_ids))
The Carbon model assigns every 6-base chunk a log-probability given the preceding context: how "expected" or "likely" that stretch of DNA is. Plotted along a real gene, the scores form a curve that dips and rises. We overlay the exon/intron annotation on top: confidence reliably climbs in protein-coding regions and falls in repetitive or unconstrained intronic stretches, even though the model never saw a single label. The same score, summed up, is what powers the variant-effect call in §3 below.
Try it: Pick a gene and watch its per-token confidence curve. Each gene's exons are highlighted in green; the curve underneath is Carbon's log-probability for each 6-base token along the sequence.
What to look for: Exons, especially the protein-coding portions, tend to score noticeably higher than introns because they're evolutionarily conserved and full of constrained patterns the model has learned to predict. The Δ tells you how strongly Carbon "noticed" the difference for this gene. Keep this curve in mind for §3: a variant that flips a base inside a high-confidence exon stretch is the kind of edit that should make Carbon surprised.
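One plausible way to compute such a Δ from per-token scores is sketched below. The exact definition used in the demo is not given here, so this assumes Δ is the mean exon log-probability minus the mean intron log-probability, and that a token counts as exonic when its midpoint falls inside an annotated exon. The function name and the toy numbers are illustrative.

```python
from statistics import mean

TOKEN_BP = 6  # each Carbon token covers six bases

def exon_intron_delta(token_logprobs, exons, token_bp=TOKEN_BP):
    """Mean exon log-prob minus mean intron log-prob (assumed definition of the delta).

    token_logprobs: one log-prob per 6-mer token, in sequence order,
                    with the <dna> tag token already dropped.
    exons: (start, end) base intervals, 0-based, end-exclusive.
    """
    exon_lps, intron_lps = [], []
    for i, lp in enumerate(token_logprobs):
        midpoint = i * token_bp + token_bp // 2
        in_exon = any(start <= midpoint < end for start, end in exons)
        (exon_lps if in_exon else intron_lps).append(lp)
    # statistics.mean raises if either group is empty, e.g. a gene with no introns.
    return mean(exon_lps) - mean(intron_lps)

# Toy numbers for illustration, not model output.
toy_logprobs = [-2.0, -1.0, -1.0, -6.0, -7.0, -1.5]
toy_exons = [(6, 18), (30, 36)]
delta = exon_intron_delta(toy_logprobs, toy_exons)
print(f"delta = {delta:+.3f}")
```

A positive value means exon tokens were, on average, more expected than intron tokens.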
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

# Echoed scoring: forward-pass the prompt and return per-token logprobs
# (no generation). The score per 6-mer chunk is what the per-base
# confidence track is built from.
# gene_sequence is a placeholder: a plain A/C/G/T string you have loaded yourself.
prompt = "<dna>" + gene_sequence  # full gene, up to ~32k tokens
r = client.completions.create(
    model="HuggingFaceBio/Carbon-3B",
    prompt=prompt,
    max_tokens=0, echo=True, logprobs=1, temperature=0,
)
for tok, lp in zip(r.choices[0].logprobs.tokens,
                   r.choices[0].logprobs.token_logprobs):
    print(f"{tok}\t{lp}")

# Same scoring, run locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

ids = tok("<dna>" + gene_sequence, return_tensors="pt",
          add_special_tokens=False).input_ids.to("cuda")
with torch.inference_mode():
    logits = model(ids).logits
# Per-token log-prob of the actual next token (the standard "echo" pattern).
logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
per_tok_lp = logp.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()),
                 per_tok_lp.tolist()):
    print(f"{t}\t{lp:.3f}")
§2 showed that Carbon's per-base confidence rises and falls in step with gene structure. Now we use the same log-likelihood, but as a measure for individual mutations. For a real ClinVar variant we score a ~4 kb window of human DNA two ways: once with the original base, once with the mutation. Then we check which version looks more like real, functioning human sequence. Carbon was never trained on what "pathogenic" means; it just learned what natural DNA looks like. Variants that disrupt protein-coding or regulatory function show up as less likely sequence under the model's distribution.
Try it: Pick a known variant from the pills, then click any base in the mutation row to introduce a different change. The model re-scores on every edit.
What to look for: Read each row two ways: the dot color is what ClinVar says (red = pathogenic, orange = risk, green = benign); the bar direction is what Carbon says (red bar pointing left = mutation less likely than original; charcoal bar pointing right = mutation looks fine or more likely). Watch the two VHL rows for the cleanest demonstration: a premature stop codon (c.475A>T) swings the bar hundreds of nats to the left, while a common 3' UTR variant (c.*820A>G) in the very same gene sits at zero. Same model, same window length, opposite verdicts. Carbon learned the distinction from raw sequence alone, with no labels.
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def score_sum(seq):
    """Sum of per-token log-probs for the given DNA sequence."""
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + seq,
        max_tokens=0, echo=True, logprobs=1, temperature=0,
    )
    return sum(lp for lp in r.choices[0].logprobs.token_logprobs if lp is not None)

# Score the same ~4 kb window two ways: original vs the one-base mutation.
# ref_seq and var_seq are placeholders for those two windows.
delta = score_sum(var_seq) - score_sum(ref_seq)
print(f"delta = {delta:+.2f} (less likely if negative)")

# Same scoring, run locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()

def score_sum(seq):
    ids = tok("<dna>" + seq, return_tensors="pt",
              add_special_tokens=False).input_ids.to("cuda")
    with torch.inference_mode():
        logits = model(ids).logits
    logp = F.log_softmax(logits.float(), dim=-1)[:, :-1, :]
    return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

delta = score_sum(var_seq) - score_sum(ref_seq)
print(f"delta = {delta:+.2f} (less likely if negative)")
The same gene (insulin, p53) exists in human, mouse, and chicken, but the surrounding sequence has accumulated different mutations along each lineage for hundreds of millions of years. For each species we feed Carbon up to ~400 bp and ask it to continue. Each continuation should match that species' real DNA better than another species' would. The model handles relatively close species well (mouse, and even chicken, which is ~300 My from human); the further you go back in evolutionary time, the more the surrounding sequence drifts and the harder this setup becomes.
Try it: Pick a gene shared across species, set the prefix length, then hit run all to score every species in parallel. Try the same gene at prefix 200 vs 400 and watch the per-species identity respond.
What to look for: With 400 bp of context the model usually recognises which species' DNA it's been given and continues in that species' style; identity to that species' reference often runs 65–90% on the next 60 bp. Cut the prefix to 200 and the signal collapses to near-random: a few hundred bases is what it takes to "lock in" on a lineage. The gap between mouse and chicken is where you can read the evolutionary signal: chicken's 300+ My since its last common ancestor with human is enough drift that, although a 400 bp prefix still locks Carbon in, the per-base identity sits a notch below mouse.
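The identity figure can be computed with a plain position-by-position comparison. The sketch below assumes an ungapped comparison over the overlapping length; the demo's exact scoring is not spelled out here, and the function name and toy sequences are illustrative.

```python
def percent_identity(generated: str, reference: str) -> float:
    """Percent of positions where two sequences agree (ungapped, shorter length)."""
    n = min(len(generated), len(reference))
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    matches = sum(g == r for g, r in zip(generated[:n], reference[:n]))
    return 100.0 * matches / n

# Toy continuation and toy per-species references, not real model output.
continuation = "ACGTAC"
references = {"human": "ACGTTT", "mouse": "ACGTAC", "chicken": "TTGTAA"}

scores = {name: percent_identity(continuation, ref) for name, ref in references.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

For uniformly random bases the expected identity is about 25%, which is a useful floor to keep in mind when reading the per-species numbers.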
from huggingface_hub import get_token
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor

client = OpenAI(
    base_url="https://<your-endpoint>/v1/",
    api_key=get_token(),
)

def continue_species(species_prefix):
    r = client.completions.create(
        model="HuggingFaceBio/Carbon-3B",
        prompt="<dna>" + species_prefix,
        max_tokens=10,
        temperature=0.5, top_p=0.9,
    )
    return r.choices[0].text

# species_prefixes = { "human": ..., "mouse": ..., "chicken": ... }
with ThreadPoolExecutor() as pool:
    results = dict(zip(species_prefixes, pool.map(continue_species, species_prefixes.values())))
for name, cont in results.items():
    print(f"{name:10s} {cont}")

# Same comparison, run locally with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("HuggingFaceBio/Carbon-3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda").eval()
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# Batch all species in one forward pass via left-padding.
prompts = ["<dna>" + p for p in species_prefixes.values()]
enc = tok(prompts, return_tensors="pt", padding=True, add_special_tokens=False).to("cuda")
with torch.inference_mode():
    out = model.generate(
        **enc, max_new_tokens=10,
        temperature=0.5, top_p=0.9, do_sample=True,
    )
new_ids = out[:, enc["input_ids"].shape[1]:]
for name, ids in zip(species_prefixes, new_ids):
    print(f"{name:10s} {tok.decode(ids)}")
When Carbon completes a protein-coding region of a gene, the resulting bases translate to a protein: a protein that folds. We translate the completed sequence, feed the protein into ESMFold (similar to AlphaFold), and render the 3D structure inline, alongside the same protein folded from the reference sequence, so you can see whether Carbon's continuation produced something similar.
We embed 571,810 genes from 27 species across six broad groups (vertebrates, invertebrates, plants, fungi, bacteria, viruses) with Carbon, project them to 2D with UMAP, and color the points by attribute. Depending on the attribute, different kinds of organization emerge from the same points: the model's embedding space encodes multiple axes of biology at once, most of which were never labeled.
If we take the same 571,810 sequences from §6, average each species' embeddings into a single 3072-dim vector, and then cluster those 27 centroids hierarchically, we can see which species the model regards as closely related. Carbon was never told how organisms are related. Yet the resulting tree groups vertebrates together, separates bacteria from fungi, and pairs sister clades (primates with primates, rodents with rodents, monocots with monocots).
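The centroid-then-cluster procedure can be sketched without any dependencies. The text does not say which distance or linkage is used, so cosine distance and average linkage are assumptions here, as are the function names and the 3-dimensional toy embeddings standing in for the 3072-dim vectors.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Element-wise mean of a species' embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def average_linkage_tree(centroids):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    members = {name: [name] for name in centroids}  # tree node -> species under it

    def distance(a, b):
        pairs = [(x, y) for x in members[a] for y in members[b]]
        return sum(cosine_distance(centroids[x], centroids[y]) for x, y in pairs) / len(pairs)

    while len(members) > 1:
        a, b = min(combinations(members, 2), key=lambda pair: distance(*pair))
        members[(a, b)] = members.pop(a) + members.pop(b)
    return next(iter(members))

# Toy embeddings for illustration only, not Carbon outputs.
toy_embeddings = {
    "human": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "mouse": [[0.9, 0.3, 0.0], [1.0, 0.2, 0.1]],
    "ecoli": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
centroids = {name: centroid(vecs) for name, vecs in toy_embeddings.items()}
tree = average_linkage_tree(centroids)
print(tree)
```

With real data a library routine such as SciPy's hierarchical clustering would be the practical choice; the loop above only makes the merge logic explicit.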
Carbon's architecture is deliberately vanilla. What's not vanilla, and what gets the headline numbers in the DNA Lab tab, are three things: a 6-mer tokenizer that lets the model see ~6× more genomic context per forward pass, a Factorized Nucleotide Supervision (FNS) loss that gives the model partial credit for near-miss tokens once cross-entropy training starts to wobble, and a multi-stage curated data mixture, biased toward functional genomic regions. Everything else (architecture, optimizer) is standard recipe. The technical report details each choice and the ablations behind it.
The sections below walk through each of those choices: how the tokenizer changes what a "token" means in DNA §1, how FNS rescues training in the BF16 regime §2, how bp-level generation and scoring fall out of the same marginalization §3, what's in the training corpus §4, what the architecture looks like §5, how 8k-token pretraining reaches 786 kbp at inference §6, how Carbon stacks up against Evo2-7B and GENERator-v2 on the full training-free suite §7, and why the model runs so fast §8.
The most direct way to model DNA is one base per token. It works, but for an L-base sequence Transformer attention cost scales as L², and DNA contexts are long. Carbon instead reads in fixed 6-base blocks. Same DNA span, ⅙ the tokens, and because attention is quadratic, up to 36× cheaper at the same coverage.
BPE was a tempting middle ground, but its variable-length tokens collide badly with autoregressive next-token prediction: DNA doesn't have stable "words." A motif is a family of near-identical strings (TATATA, TATATT, …), not a single string. Worse, in autoregressive mode, BPE penalizes the model for predicting a valid prefix of the target token. 6-mer is a deterministic, neutral compression that avoids this trap.
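To make the token arithmetic concrete, here is a minimal sketch of non-overlapping 6-mer tokenization. The id assignment (lexicographic over A, C, G, T) and the handling of a trailing partial block are assumptions for illustration; the released tokenizer may order its vocabulary differently and also carries special tokens such as `<dna>`.

```python
from itertools import product

K = 6
# All 4**6 = 4096 possible 6-mers. Lexicographic ids are an assumption, not the real vocab.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def tokenize_6mer(seq: str) -> list[int]:
    """Split into non-overlapping 6-base blocks; a trailing remainder is dropped here."""
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]

seq = "TATATATATATTGGCCAA"  # 18 bp, illustrative
ids = tokenize_6mer(seq)

# Pairwise attention interactions at base level vs 6-mer level.
base_pairs_attended = len(seq) ** 2
token_pairs_attended = len(ids) ** 2
print(len(VOCAB), ids, base_pairs_attended / token_pairs_attended)
```

The two near-identical blocks TATATA and TATATT map to different ids, which is exactly the "zero credit for a near miss" situation that the next section addresses.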
Cross-entropy treats every 6-mer token as atomic: predict TATATT when the target was TATATA, get zero credit even though five of six bases matched. That gets brittle late in training. Carbon switches to Factorized Nucleotide Supervision: instead of one 4096-way classification, the model is supervised on six parallel 4-way nucleotide marginals derived from the same logits. Near-miss tokens get partial credit proportional to how many bases they got right.
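One way to write this down, as a sketch and not necessarily the exact form used in the technical report: let z be the 4096 logits at one step, q the softmax over 6-mer tokens, and y = (y_1, …, y_6) the target bases. The notation, and the choice to sum the six terms without weights, are assumptions.

```latex
\begin{align}
q &= \operatorname{softmax}(z), \qquad z \in \mathbb{R}^{4096} \\
p_k(b) &= \sum_{t \,:\, t_k = b} q_t, \qquad k = 1,\dots,6,\quad b \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\} \\
\mathcal{L}_{\mathrm{CE}} &= -\log q_{y} \\
\mathcal{L}_{\mathrm{FNS}} &= -\sum_{k=1}^{6} \log p_k(y_k)
\end{align}
```

If the model puts most of its mass on TATATT when the target is TATATA, the cross-entropy term is large because q_y is small, while under the factorized form five of the six terms are close to zero and only the last position is penalized.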
The 6-mer tokenizer makes Carbon fast, but it's coarse in both directions of inference. When generating, each step advances the sequence by 6 bases at once and temperature acts on a 4,096-way distribution rather than per nucleotide. When scoring an existing sequence, the raw next-token likelihood answers "how likely is this 6-mer in context?", not "how likely is this exact base at this exact position?", which is the version you want for variant-effect prediction. The same marginalization that powers FNS at training time fixes both: softmax over the 6-mer logits, then for each position p sum the probabilities of every 6-mer that shares a given base at p, and you recover six per-position 4-way base distributions. To generate, sample (or argmax) each independently and force the matching 6-mer token. To score, read P(actual base | context) directly off the marginals at every position. Same logits, same math, two endpoints.
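The marginalization itself is short enough to spell out. The sketch below uses plain Python and made-up logits; the token ordering and the helper names are assumptions for illustration, and the shipped implementations operate on tensors inside the generation and scoring code instead.

```python
import math
from itertools import product

BASES = "ACGT"
# 4096 6-mer tokens; lexicographic ordering is an assumption, not the real vocab order.
KMERS = ["".join(p) for p in product(BASES, repeat=6)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def base_marginals(logits):
    """Collapse 4096 6-mer logits into six 4-way base distributions."""
    probs = softmax(logits)
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(6)]
    for kmer, p in zip(KMERS, probs):
        for position, base in enumerate(kmer):
            marginals[position][base] += p
    return marginals

def greedy_token(marginals):
    """Argmax each position independently, then force the matching 6-mer."""
    return "".join(max(m, key=m.get) for m in marginals)

def score_observed(marginals, observed):
    """log P(observed base | context) at each of the six positions."""
    return [math.log(m[base]) for m, base in zip(marginals, observed)]

# Made-up logits: most mass on TATATA, some on the near miss TATATT.
logits = [0.0] * len(KMERS)
logits[KMERS.index("TATATA")] = 12.0
logits[KMERS.index("TATATT")] = 11.0

marginals = base_marginals(logits)
token = greedy_token(marginals)
per_base_logp = score_observed(marginals, "TATATA")
print(token, [round(lp, 3) for lp in per_base_logp])
```

In this toy case the first five positions are near-certain and only the last one is split between A and T, which is the per-base resolution the token-level likelihood hides. Note that sampling each position independently ignores dependencies between positions within a token.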
Generation ships as a custom_generate method at HuggingFaceBio/carbon-generate that works on the plain Carbon-3B/8B/500M checkpoints (standard LlamaForCausalLM, no custom modeling file). Scoring ships in the -remote variants of those same checkpoints, which add a score_sequence(seq) method that returns per-base distributions and the probability of the observed base at every position.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B",
    dtype=torch.bfloat16, device_map="auto",
)

prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# `custom_generate` injects a logits processor that marginalizes the
# 6-mer logits to per-base distributions and samples each of the 6
# positions independently, then forces the matching 6-mer token. All
# standard generation knobs (temperature, top_p, top_k, repetition_penalty)
# still apply, they just act on the per-base marginals.
out = model.generate(
    **inputs,
    max_new_tokens=128,  # 128 6-mer tokens ~= 768 bp of continuation
    custom_generate="HuggingFaceBio/carbon-generate",
    trust_remote_code=True,
    tokenizer=tok,
    do_sample=True, temperature=0.8, top_p=0.9,
)
# Slice off the prompt and decode the continuation as plain DNA.
new_ids = out[0, inputs["input_ids"].shape[1]:]
print(tok.decode(new_ids, skip_special_tokens=True))

from transformers import AutoModelForCausalLM, AutoTokenizer
import math
import torch

# The -remote variants bundle modeling code that exposes
# `score_sequence(seq)` directly on the model. It returns, for every
# position in the input DNA, the marginal P(base | context) and the
# probability of the observed base.
tok = AutoTokenizer.from_pretrained(
    "HuggingFaceBio/Carbon-3B-remote", trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceBio/Carbon-3B-remote",
    trust_remote_code=True,
    dtype=torch.bfloat16, device_map="auto",
)

ref = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
alt = ref[:20] + "G" + ref[21:]  # single-base substitution (C>G) at pos 20
# bp_probs: [seq_len, 4] marginal P(A/T/C/G | context) at each position
# actual:   [seq_len]    P(observed base | context) at each position
bp_probs_ref, actual_ref = model.score_sequence(ref)
bp_probs_alt, actual_alt = model.score_sequence(alt)
# log-likelihood delta at the substituted position
# is the per-base variant-effect score in its simplest form.
delta = math.log(actual_alt[20].item() + 1e-12) \
    - math.log(actual_ref[20].item() + 1e-12)
print(f"log P(alt) - log P(ref) at pos 20: {delta:+.3f}")
A naive read of "more data is better" misses something specific to DNA: most of a eukaryotic genome is repeats, low-complexity sequence, and weakly constrained background. Train on raw sequence and much of your loss is dominated by easy-to-predict noise. Carbon's corpus is an annotation-aware mixture, biased toward gene-centric, transcript, and bacterial sequence, so the model spends more of its gradient updates on biologically meaningful sequence.
Decoder-only, RMSNorm + SwiGLU + RoPE + grouped-query attention, tied I/O embeddings, 8k-token context. Nothing exotic. The architectural surface is intentionally familiar so that any improvement Carbon shows on genomic tasks is attributable to the data, the tokenizer, and the loss, not to a custom block or a hand-crafted attention variant.
Carbon's nominal training context is short by megabase-scale standards (8k tokens, ≈49 kbp). The reach comes from a two-step extension. First, a training-time long-context phase lifts the context to 32k tokens (≈197 kbp) with RoPE θ rescaled from 500k to 5M. Then, at inference, YaRN pushes that further: 2× to 65k tokens for the 3B model, 4× to 131k tokens for the 8B (≈786 kbp, the size of a small bacterial genome). The 8B has more capacity to absorb the YaRN stretch, which is why it extends further than the 3B.
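The base-pair figures follow directly from six bases per token. The short check below assumes "8k" and "32k" mean 8,192 and 32,768 tokens, which is consistent with the ≈49 kbp and ≈197 kbp quoted above; the variable names are illustrative.

```python
BP_PER_TOKEN = 6

def tokens_to_bp(tokens: int) -> int:
    """Convert a context length in 6-mer tokens to base pairs."""
    return tokens * BP_PER_TOKEN

pretrain_ctx = 8 * 1024    # nominal training context (assumed 8,192 tokens)
long_ctx = 32 * 1024       # after the long-context phase (assumed 32,768 tokens)
yarn_factor = {"Carbon-3B": 2, "Carbon-8B": 4}  # inference-time YaRN multipliers

print(f"pretraining:        {pretrain_ctx:>7,} tokens = {tokens_to_bp(pretrain_ctx):>7,} bp")
print(f"long-context phase: {long_ctx:>7,} tokens = {tokens_to_bp(long_ctx):>7,} bp")
for name, factor in yarn_factor.items():
    extended = long_ctx * factor
    print(f"{name} with YaRN:  {extended:>7,} tokens = {tokens_to_bp(extended):>7,} bp")
```

The 3B figure with YaRN works out to the same 393 kbp used for the Genome-NIAH retrieval task.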
Eight training-free tasks across four capability axes: generative sequence recovery, variant-effect prediction (BRCA2, TraitGym, ClinVar coding / non-coding), sequence-level perturbation (synthetic motif insertion and synonymous codon shuffling), and long-context retrieval (Genome-NIAH at 393 kbp). No fine-tuning, no head training, all four frozen pretrained models scored under the same protocol. Carbon-3B is competitive with Evo2-7B despite less than half the parameters; Carbon-8B is ahead on five of eight.
The throughput story is a two-factor multiplication, not one big trick. First, the architecture is deliberately vanilla: a stock Llama-3-shaped decoder. That means Carbon drops straight into vLLM and inherits the same paged-attention, fused kernels, and CUDA-graph capture that the open-source LLM stack has been optimizing for two years. Custom blocks would forfeit all of that. Second, 6-mer tokenization compresses a given DNA span by 6× at the input, which under quadratic attention is up to a 36× reduction in prefill cost, and the decode loop emits 6 bases per step instead of one. Stacking the two: standard-stack inference speedups, multiplied by tokenizer compression, gets you the order-of-magnitude gap over Evo2 reported in the paper.
Open-ended DNA continuation. Type any prefix in {A, C, G, T}, watch the model continue token by token. Toggle base-coloring or per-token logprob coloring to see where Carbon is confident and where it's guessing. Track GC content, perplexity, and throughput live.
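The live statistics are simple to reproduce offline. The sketch below shows one way to compute them from a generated sequence and its per-token log-probabilities; whether the playground reports perplexity per token or per base is not stated, so both are shown, and the function names are illustrative.

```python
import math

BP_PER_TOKEN = 6

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return (seq.count("G") + seq.count("C")) / len(seq)

def perplexity(token_logprobs, per_base: bool = False) -> float:
    """exp of the negative mean natural-log probability, per token or per base."""
    lps = [lp for lp in token_logprobs if lp is not None]
    units = len(lps) * (BP_PER_TOKEN if per_base else 1)
    return math.exp(-sum(lps) / units)

def throughput_bp_per_s(n_tokens: int, seconds: float) -> float:
    """Bases generated per second, at six bases per token."""
    return n_tokens * BP_PER_TOKEN / seconds

# Toy values for illustration, not model output.
toy_seq = "GGCCAATT"
toy_logprobs = [math.log(0.25)] * 4
print(gc_content(toy_seq), perplexity(toy_logprobs), throughput_bp_per_s(100, 2.0))
```

A per-token perplexity of 4096 would correspond to uniform guessing over all 6-mers, so values far below that indicate the model is using context.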