A Simple Heuristic
First written: 22 July 2025
Last published: 22 July 2025
Will artificial intelligence take your job?
Ask the question: Is it easy to explain good work in your line of work?
First written: 14 July 2025
Last published: 20 July 2025
Software drives the majority of the world's productivity. It enables global commerce, exchange of information and coordination of human services. It is heavily used recreationally. Software is dominated by global monopolies (with exceptions in countries like China, which have national monopolies). The consolidation of power and infrastructure enables the human institutions that emerge around these monopolies to have unmatched influence over the lives of humanity. Decisions made by these institutions are always an attempt to ensure their hegemony.
The incumbents use their ecosystems to leverage user data and create AI integrations. But people don't like "artificial intelligence" features being forced into existing applications, because it simply isn't idiomatic. Microsoft Word was not designed with next-token predictors in mind. We are entering a period of necessary reinvention. Why would the incumbents force products which aren't profitable? A mixture of FOMO, herd thinking and genuine utility. And they are probably beginning to become profitable: inference is not that expensive, and attaching AI features to existing products is an excuse to raise prices.
Productizing large language models should be done on the basis that their power lies in the breadth and depth of their training data, and in how sensible the inputs and outputs were during post-training reinforcement (the suitability of the reward model). That's it. Everything else is optimization. The best use of current large language models is as companions for dialectics (including human-in-the-loop agentic systems). Beyond using them as classifiers or labellers, no-human-in-the-loop agentic use is not recommended for anything worth doing.
But what about the Fortune 100 CEOs who claim AI is boosting productivity, generating more lines of code, replacing developers, giving them superpowers, or all of the above? What do they actually mean? What parts of the job are being replaced? What is the role of natural language in programming?
Programming languages are languages made up by humans to specify programs for computers to execute. Depending on the kind of program needed to fulfil a piece of business logic, most of the time spent programming might be reading documentation, creating helper functions and organising the code. We build on top of abstractions and standards that were invented by humans and maintained by humans.
The W3C develops standards for the web that browser creators are expected to follow. There is nobody using force to make sure that the standards are followed. The internet is a collection of agreements and rules, strict and formalised, but which also can't be enforced by its issuing body. These standards and conventions are human inventions: see, for example, HTTP code 418: I'm a teapot. Any application that interfaces with the web has to conform to certain standards. Consider the standardisation around JavaScript, and hence the invention of TypeScript. Further, frameworks exist to provide tooling for common functionality. As our technology gets more advanced, we need clearer standards and a strong global open source community to support and create those standards. All our modern software is built on open source foundations. And this must continue to be the case for a decentralised, free future.
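As a small aside on how these human conventions propagate into tooling, here is a minimal sketch using Python's standard library (assuming Python 3.9 or later, where code 418 is exposed):

```python
# Even the joke status code 418 ("I'm a teapot") is baked into standard tooling,
# because the convention is the standard. Requires Python 3.9+.
from http import HTTPStatus

status = HTTPStatus.IM_A_TEAPOT
print(status.value, status.phrase)  # 418 I'm a Teapot
```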
Any large language model would need to be able to query up-to-date documentation (which is now a standard feature in all modern coding agents). But access to documentation is not the bottleneck. Subjective choices are needed when choosing dependencies or architecting software systems, and these are choices large language models are not able to handle. Lateral thinking is needed to get out of ruts, which large language models are infamously bad at, despite modern post-training techniques. So what about pair programming with LLMs? This is now a standard feature in IDEs, alongside coding agents that submit PRs and CLI tools.
The arithmetic seems simple. Spend $1000 a year on developer AI tooling per software engineer and, even if you get only a 5% productivity boost, that's a win. But this needs an understanding of what developer productivity is. Number of lines of code written? Tests passed? Tickets closed? Pull requests accepted? The problems: unmaintainable codebases, inability to debug and skill atrophy. A deadly trio that pushes the problem down the line. LLMs can help with boilerplate and greenfield code for developers who don't have specific expertise but are generally competent. (Can't reading documentation do that too?) Rubber ducking with LLMs can help software engineers spot problems and find solutions quicker. But this is a rubber duck that talks back! And if you're not careful, the rubber duck may lie to you or lead you astray. LLMs can also help with code reviews, spotting potential issues missed by humans. But this requires judicious context management and dependency tracing to make sure the LLM performs at its best.
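To make that back-of-the-envelope arithmetic concrete, here is a minimal sketch; the fully-loaded engineer cost is an assumption for illustration, not a figure from this essay:

```python
# Back-of-the-envelope ROI for developer AI tooling.
TOOLING_COST_PER_YEAR = 1_000             # from the arithmetic above
PRODUCTIVITY_BOOST = 0.05                 # the 5% boost above
ASSUMED_ENGINEER_COST_PER_YEAR = 150_000  # assumption: fully-loaded annual cost

value_of_boost = ASSUMED_ENGINEER_COST_PER_YEAR * PRODUCTIVITY_BOOST
net_gain = value_of_boost - TOOLING_COST_PER_YEAR
print(f"Value of boost: ${value_of_boost:,.0f}")   # $7,500
print(f"Net gain per engineer: ${net_gain:,.0f}")  # $6,500
```

The catch, of course, is that the 5% figure presumes we can measure productivity at all.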
Quality control needs to be higher than ever. Which raises the question: do LLMs speed up or slow down the development process in aggregate? Consider the time spent by humans reviewing LLM-generated code, the importance of testing, and the importance of code familiarity and deep understanding. Is it easier to evaluate code than to write it? Programming with LLMs is a skill in itself. Knowing when and what to delegate. Knowing when to override. Knowing what tasks it excels at, and what it doesn't.
If you execute the same piece of code a million times, you should expect the same output a million times. The power of code is creating reusable functions with precisely defined behaviour. Hence, test-driven development. However, functions can be non-deterministic at runtime, e.g. for generating random secrets. We interface with the machine at various levels of abstraction: from machine code to Python. Now we introduce a new level of abstraction: prompt-driven development. It represents the first time we inject non-determinism directly at the code-writing level. But large language models can also be used as functions at runtime. For example, agentic RAG, where LLMs generate search queries and evaluate results. However, there is no stack trace for an LLM response. LLMs are deterministic if we set temperature to 0 (i.e. the most likely token is always output). Their entropy lives both at runtime and in what is baked into their weights and the chaining of autoregressive generation (similar to a hash function).
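As a sketch of what an LLM-as-runtime-function looks like, here is a minimal example of the agentic RAG step above, assuming the `openai` Python package and an OpenAI-compatible endpoint; the model name is illustrative:

```python
# A minimal sketch: an LLM used as a runtime function that generates a search
# query for a retrieval step. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def generate_search_query(question: str) -> str:
    """Turn a user question into a short search query.

    temperature=0 makes sampling greedy, but provider-side factors can still
    vary outputs across runs, and there is no stack trace to explain a bad one.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a short web search query."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```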
Thinking beyond language tokens (and token-centric AI) is the new frontier for research and development. Programming may be more important than ever. Without people to make decisions about new systems, we succumb to stagnation. But LLMs are fundamentally empowering for humanity, especially as they become more decentralised and open source. We need customisable tools for context and prompt management to maximise output quality. The next wave of programmers will have more agency than ever before, if they can wield it properly. Designing and making programs. Defining inputs and outputs. Deterministic symbolic representation of logic will always be important. Deep understanding of the primitives of technology is more important than ever, lest we become cultish slaves to systems that came before us.
First written: 15 June 2025
Last published: 14 July 2025
The heart of medicine is the differential diagnosis. Humans share an underlying biology, so it is likely that whatever you are sick with has been discovered before. And if it has been discovered before, hopefully its aetiology and treatment have been researched enough for the medical system to treat you.
Why not trust a large language model to diagnose you? If you give an LLM an exhaustive and accurate list of your symptoms (the complete set of facts), signs and test results (like scenarios in a licensing exam), it will be able to make the correct diagnosis, faster than your human doctor. With tool calls and retrieval-augmented generation, it might even be able to quote up-to-date clinical guidelines. There are problems with overfitting and non-representative training sets. However, it is generally true that large language models will outperform any doctor in a standardised written testing environment for broad, well-defined and long-standing medical knowledge. Especially if the model is appropriately trained and given access to functions that allow it to query up-to-date information (i.e. agentically). Given the cost of running these models, compared to the cost of hiring and training human doctors, this seems amazing. So when do we see large language models and vision language model robots running clinics? Or an app on your phone that acts as your primary care physician? In what year will the first step of a medical consultation be an interaction with an automated system? Before we get ahead of ourselves, let's assess the actual process of medical diagnosis and why benchmark performance masks real-world limitations.
Current medicine is overwhelmingly centered around the diagnosis of disease: a human-defined classification for a biological state. Syndromes are also defined, and the diagnosis is often made clinically - by definition, syndromes are constellations of symptoms and signs that we've been unable to find an underlying cause for. Alternatively, diseases have defined pathophysiology and, despite some diseases having idiopathic causes, pathophysiology dictates the intervention, pharmacology or protocol that we prescribe.
The invention of medical practice normally follows this pattern: characterise the disease and its pathophysiology, derive an intervention from that mechanism, then validate the intervention with evidence.
But in the fullness of time, not all treatments were created nor all diseases defined in this manner. Medical tradition predates the invention of the scientific method and evidence-based medicine, with the safety and effectiveness of treatments known historically without a full understanding of disease and treatment mechanisms. And so, subsequently, evidence-based medicine rushes to re-evaluate and justify clinical practice. This is to say: current evidence-based practice rests on historical understanding, biochemistry, understanding of human physiology and the scientific method to validate claims, justify use and create clinical practice guidelines. In addition, it rests on clinical expertise in the form of native pattern recognition and the subjective judgement of patient values and preferences by human doctors.
Large language models seem proficient at pattern recognition and in most instances respond reasonably sensitively to human preferences. Not only that, but they seem exceptionally well poised to ingest large amounts of medical literature (which the large model providers have already trained on), and reinforcement learning in post-training and at inference time seems like the perfect mechanism to improve performance at making successful diagnoses. What seems to be the problem?
If all medical practice were scoped to taking in clean patient information as input and outputting a diagnosis, we could train a great model that could make clinical diagnoses now (or even use contemporaneous frontier models). The issue is acquiring that information and assembling it in context. Differential diagnosis is more akin to detective work than anything else. Your top differentials are your suspects, and you have to navigate lying witnesses, suspects masquerading as each other, limited detective agency resources and limited facilities. Having a physical body is useful: you need to interact with and examine the suspects in three dimensions, lest you miss a knife in the back pocket that they forgot to tell you about. Witnesses might only tell the truth when spoken to in certain ways, and many of them forget and need rigorous cross-examination. Another analogy is translation. What about clinical coders, people whose job it is to create discrete codes from medical documentation? Or interpreters? Expert generation of documentation is itself a dialectic. Early Google encoder-decoder transformers were used for human language translation.
A large language model excels at putting together a list of diagnoses after all the facts are laid bare. But I argue that is the easiest part of medicine: applying the definition of disease. The hard part is getting the facts straight and dealing with environments in which time and resources are constrained, and with people who have varying abilities of articulation. Large language models are designed to finish text. They are good at answering questions and sustaining conversations when the user wants something from them. Differential diagnosis is a process. For large language models, it’s a journey fraught with semantic traps. People hope that, because they are so capable in those settings, the rest will follow. I know, almost for certain, that human health management can be handled by artificial intelligence. Health is a fundamentally statistical process with a ground truth of underlying physical biology. Through systematic measurement and enquiry, we can transform objective conditions.
I don’t mention vision language models, which, if factored in, introduce a plethora of additional biases and opportunities for failure. They confidently overfit and lack physical intuition in ways that can completely derail a diagnosis. Misinterpreting physical signs happens even with doctors operating outside their expertise: one of the reasons for the existence of the specialisation system. And we have to completely ignore the importance of physical examination by a skilled expert (self-examination is not useful for any non-trivial diagnosis).
There are certain devotees for whom artificial intelligence has taken on a religious element, where blind faith in scaling laws and building systems with current architectures leads to real adaptive intelligence. If just given enough faith. And money. There is reason to believe that future systems may possess more generalisable capabilities, but there is no straight shot from large language models to superintelligence. The problem has never been access to information; it has always been the accessibility of information. All the information you could ever want is available behind a search engine. Large language models treat language statistically and programmatically: a new wave of non-deterministic programming interfaces.
This is why skepticism of AI in healthcare is prudent. AI is so broadly useful and has become so much of a buzzword that it’s difficult not to believe in it. It’s overwhelmingly likely that all these companies will be steamrolled in 10 years, because they’ve been taking the wrong approach. Also, unlike other fields, selling real patient data as training data is dangerous, unprecedented and unethical. Patient data needs to have sovereignty, with cryptographic auditability of who has accessed it, when they accessed it, and why they accessed it. We need opt-out deidentified data collection for medical system training, likely implemented in national public health systems. There should be no way for human workers who work with these systems to access raw data unless it is specifically consented to, or with special clinical or legal overrides. We also need additional data sanitation, federated systems for machine learning and zero-trust systems. The upside potential of creating machine learning systems for medicine is enormous; large language models will remain useful and, more importantly, have shown that neural networks can reason and generalise.
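As a sketch of what cryptographic auditability of access could look like, here is a minimal hash-chained audit record; the field names and the assumption of a trusted appending service are illustrative, not a proposed standard:

```python
# A minimal sketch of a hash-chained audit log for patient-data access.
# Each record commits to the previous one, so tampering with history is detectable.
import hashlib
import json
import time

def append_access_record(log: list[dict], accessor: str, patient_id: str, reason: str) -> dict:
    """Append a who/when/why record whose hash chains to the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "accessor": accessor,      # who accessed the data
        "patient_id": patient_id,  # whose data was accessed
        "reason": reason,          # why it was accessed
        "timestamp": time.time(),  # when it was accessed
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

audit_log: list[dict] = []
append_access_record(audit_log, "dr_hypothetical", "patient-123", "clinical review")
```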
There are three necessary steps to make this work: