By Mark Hahnel in AI — 17 Jun 2026

BioSingularity. The substrate determines the ceiling

Biology becoming computationally tractable depends less on model size than on the data substrate underneath. The ceiling is infrastructure.

Why the infrastructure underneath the models decides how far they can go.

In the past two years a phrase has been circulating in the AI-in-biology community that I find genuinely useful for thinking about where the field is heading. People talk about a “biosingularity”, meaning the point at which biology becomes computationally tractable in the way that weather or fluid dynamics already are. The cell becomes a system you can simulate. The drug becomes a thing you can design before you make it. Jensen Huang has been making the same argument from the NVIDIA side, framing it as biology shifting from a life science into life engineering.

I agree with this hypothetical end goal. The path to it is still very bloody complex. An AI system is only ever as good as the verifiability and the provenance of the thing it reasons over. We call this the substrate. The substrate determines the ceiling.

Biology is trying to build a substrate it can trust

Figure 1. Foundation models now span almost every scale of biology. The substrate they rely on is the cross-cutting requirement.

There are now foundation models at almost every scale of biology. AlphaFold and its successors at the molecular level. Models like scGPT and Geneformer at the single-cell level. UNI and Virchow at the pathology level. Organ digital twins above that. It is fantastic that we are building these things. The Cell Perspective by the Chan Zuckerberg-led group on the AI Virtual Cell, along with the Arc Institute Virtual Cell Challenge launched in 2025, gives a sense of how much organised effort (and money) is now going into building the next layer of the stack.

There is, however, a caveat. A model fit to observational data only ever learns the correlation structure of that data. It learns the patterns, and it learns the biases, and it cannot tell the difference between them without help. The single-cell company Noetik has described its own model’s outputs as “not at all guaranteed to be causal, merely strongly correlated.” Recent benchmarking by Ahlmann-Eltze and colleagues found that five state-of-the-art perturbation models could not beat a simple linear baseline when tested on gene knockouts they had genuinely never seen. The models are excellent at interpolating within what they have observed, and unreliable outside of that.

The fix the field has converged on is to engineer causal structure into the substrate itself. Rather than gathering more observational data, the move is to feed the models perturbation experiments, the results of actually intervening on a system rather than just watching it. You build a substrate that encodes cause as well as correlation, and the model’s ceiling rises accordingly. The Virtual Cell Challenge is, in a sense, a community-scale attempt to do exactly this in public.
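To make the difference between watching and intervening concrete, here is a minimal toy simulation. Everything in it is an assumption for illustration: two made-up genes, a hidden factor that drives both, and noise levels picked arbitrarily. It is not any real perturbation pipeline. Gene A has no effect on gene B, yet the observational data shows a strong correlation. Once A is set by the experimenter, the correlation disappears.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hidden factor (hypothetical): drives both genes in the observational data.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Observational data: A and B both follow the hidden factor.
# A has no causal effect on B in this toy world.
a_observed = [h + rng.gauss(0, 0.3) for h in hidden]
b_observed = [h + rng.gauss(0, 0.3) for h in hidden]

# Perturbation data: the experimenter sets A, independently of the hidden factor.
a_perturbed = [rng.gauss(0, 1) for _ in range(n)]
b_after = [h + rng.gauss(0, 0.3) for h in hidden]

r_observational = pearson(a_observed, b_observed)
r_interventional = pearson(a_perturbed, b_after)

print(f"observational correlation:  {r_observational:.2f}")
print(f"interventional correlation: {r_interventional:.2f}")
```

A model trained only on the first dataset has no way of knowing that the A-B relationship is inherited from the hidden factor. The second dataset carries that information in its structure.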

Software can do this, because its substrate was verifiable all along

The reason AI agents arrived in software engineering before almost anywhere else is that code is executable and feedback-rich. Kenny Workman of LatchBio makes this argument well. You can run code, watch it fail, read the error, and try again. The substrate hands you ground truth on every loop. Or, as he puts it, “Agentic biology is shaped like software.”

Figure 2. The gradient from locally verifiable to globally verifiable. After Kenny Workman (2026).

Take a hard scientific question, like whether a particular set of gene mutations drives a disease. At the top level it has no global verifiability, with no single correct answer to grade against. But it decomposes into smaller steps, some of which are locally verifiable. Did this cell line pass quality control? Do these genes show differential expression? Each of those has a crisp answer. The way to answer the top-level question is to build up from the locally verifiable steps toward the globally uncertain claim. That gradient, from locally verifiable to globally verifiable, is the single most useful lens I have found for the biology problem. It tells you where progress is fast (the left end) and where it is slow (the right end). It also tells you what kind of infrastructure each end needs.
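One way to picture the decomposition is as a list of small checks, each returning a plain yes or no. The sketch below is hypothetical throughout: the viability score, the expression values and both thresholds are invented for illustration and are not taken from any real protocol.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Step:
    question: str                # the locally verifiable question
    check: Callable[[], bool]    # a function that answers it with True or False

# Toy inputs (invented for illustration).
viability = 0.93
control_expression = [1.0, 1.1, 0.9]
mutant_expression = [2.1, 2.4, 1.9]

# Thresholds are assumptions, not field standards.
steps = [
    Step("Did this cell line pass quality control?",
         lambda: viability >= 0.8),
    Step("Do these genes show differential expression?",
         lambda: mean(mutant_expression) / mean(control_expression) >= 2.0),
]

# Each local step gets a crisp answer that can be logged and audited.
results = {step.question: step.check() for step in steps}
all_local_checks_pass = all(results.values())

for question, answer in results.items():
    print(f"{question} {'yes' if answer else 'no'}")
```

Passing every local check does not settle the global claim about the disease. It only means the claim is now resting on steps that can each be rerun and graded.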

What this looks like in drug discovery

For pharma, the locally verifiable layer is where most of the immediate value sits. Was this target mentioned in a paper? Which grants funded that paper? Did the trials supporting it use the right comparator? Has the molecule been characterised in patents? Each of these has a crisp, checkable answer, and each is the kind of question that AI ought to be able to answer with confidence, but only if the underlying substrate is structured and provenance-bearing rather than a soup of free text.

This is where the kind of resource we have spent a long time building at Digital Science comes in. The Dimensions Knowledge Graph links publications, grants, patents, clinical trials, datasets and policy documents into a single typed structure. Every entity carries a persistent identifier. Every relationship can be traced back to its source. If a model says “this target appears in seventeen recent papers funded by this consortium and three clinical trials at this phase,” that string is not coming from a statistical pattern. It is coming from edges in a graph. The model becomes a query interface to a substrate, and the substrate carries the truth.
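A toy version of the idea, to show what "coming from edges in a graph" means in practice. This is a minimal sketch and not the Dimensions data model or API: the `Edge` type, the relation names and every identifier below are hypothetical. The point is that the answer comes back as the edges that support it, each with its source attached, and the count is derived from those edges.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str   # identifier of the source entity
    relation: str  # typed relationship, e.g. "mentions" or "funded_by"
    obj: str       # identifier of the target entity
    source: str    # provenance: where this assertion came from

# Hypothetical identifiers, invented for illustration only.
EDGES = [
    Edge("pub:1", "mentions", "target:X", "full-text index"),
    Edge("pub:2", "mentions", "target:X", "full-text index"),
    Edge("pub:3", "mentions", "target:Y", "full-text index"),
    Edge("pub:1", "funded_by", "grant:A", "funder acknowledgement"),
    Edge("pub:2", "funded_by", "grant:B", "funder acknowledgement"),
    Edge("trial:1", "investigates", "target:X", "trial registry"),
]

def evidence_for(target, edges):
    """Return the edges backing a claim about `target`, grouped by kind."""
    papers = [e for e in edges if e.relation == "mentions" and e.obj == target]
    paper_ids = {e.subject for e in papers}
    grants = [e for e in edges
              if e.relation == "funded_by" and e.subject in paper_ids]
    trials = [e for e in edges
              if e.relation == "investigates" and e.obj == target]
    return {"papers": papers, "grants": grants, "trials": trials}

evidence = evidence_for("target:X", EDGES)

# The headline numbers are counts of traceable edges, not generated text.
summary = {kind: len(found) for kind, found in evidence.items()}
print(summary)
```

Because every count is a list of edges underneath, anyone who doubts the number can open the list and follow each edge back to where it was asserted.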

The same principle holds one layer down. Figshare exists in part because the perturbation data that biology is now scrambling to generate needs persistent identifiers, version history and provenance metadata, or it cannot be reused. A virtual cell model trained on undocumented data is a virtual cell model whose ceiling is set by the lowest-quality dataset it ingested. The persistent identifier (DOI for publications and datasets, ORCID for researchers, ROR for institutions) is the plumbing that decides whether any of this works downstream.

FAIR has been pointing here for a decade

The FAIR data principles (findable, accessible, interoperable, reusable) have been making this argument since 2016, long before anyone needed a large language model to care. For most of that decade they sounded like infrastructure housekeeping, the sort of thing you nod along to and deprioritise. What has changed is that AI has suddenly made the cost of an unverifiable substrate visible, and urgent, and measurable in failed drug trials.

A trial fails for many reasons, but a meaningful fraction of late-stage failures trace back to evidence chains that no one ever audited at the start. If the original target paper relied on a misinterpreted figure, or the dose-finding work cited a study with a flawed control arm, that information was always in principle recoverable. It just was not connected to the system that picked the molecule. A substrate with provenance built in does that connecting by default.

We’re on a slope as opposed to waiting for a Eureka moment

The biosingularity is probably better understood as a slope than a moment. The labs and companies that climb it fastest will be the ones whose models reason over the most verifiable, best-provenanced substrate, rather than the ones with the cleverest models. The ceiling is set by the ground, not by what stands on it. We need to continue to improve provenance and reproducibility in research. We need to continue to push for Open Science. The infrastructure work, persistent identifiers and knowledge graphs and perturbation datasets and FAIR-compliant repositories, is what determines how high the ceiling rises. We are getting better at this as a research community. Preprints help. So does open data. The substrate beneath biology is next, and the alternative is the same one playing out elsewhere in the research ecosystem. Ungrounded generation rushes in to fill the gap, with things that were never true.

As mentioned in my last post, “We have a deadline, because something is already busy filling the substrate with things that were never true.”