How we benchmark against the OpenAI labor market study

Today someone sent us a paper with a note: "use their method." The paper was GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models, by Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock at OpenAI and the University of Pennsylvania. It was published in 2023, appeared in peer-reviewed form in Science in 2024, and is arguably the most-cited study on AI and the labour market to date. You can read it at arxiv.org/abs/2303.10130.

We read it carefully. We did not simply adopt their method, but we did add it to the site as a benchmark. Here is what they did, how it differs from our approach, and what we think the two together are worth.

What the paper does

Eloundou et al. took the O*NET database (roughly 19,000 tasks across 923 US occupations) and asked a simple question for every task: could access to an LLM reduce the time required to complete this by at least 50%? They called this "exposure." A task was rated E0 (no exposure), E1 (direct exposure: an LLM on its own, through a chat-style interface, is enough to meet that threshold), or E2 (LLM+ exposure: the threshold is met only with additional software built on top of the LLM). Human annotators labelled a sample. GPT-4 labelled the rest. The labels were then aggregated into an occupation-level score, the share of an occupation's tasks that count as exposed, with E1 + 0.5×E2 as their primary estimate (they call it β).
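
To make the β arithmetic concrete, here is a minimal sketch in Python. It assumes every task in an occupation counts equally, which is a simplification for illustration and may not match the paper's exact weighting. The function name beta_score and the sample labels are hypothetical.

```python
from collections import Counter

VALID_LABELS = {"E0", "E1", "E2"}


def beta_score(labels):
    """Return E1 + 0.5 * E2 for one occupation's task labels.

    E1 and E2 are the shares of tasks carrying each label.
    Assumption: every task is weighted equally.
    """
    if not labels:
        raise ValueError("need at least one task label")
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    e1_share = counts["E1"] / total
    e2_share = counts["E2"] / total
    return e1_share + 0.5 * e2_share


# Hypothetical occupation with four tasks: one E1, two E2, one E0.
example_labels = ["E1", "E2", "E0", "E2"]
example_beta = beta_score(example_labels)  # 0.25 + 0.5 * 0.5 = 0.5
```

An occupation where every task is E2 would score 0.5 under this rule, the same as one where half the tasks are E1 and the rest E0. That is the sense in which β treats software-dependent exposure as worth half of direct exposure.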

The headline finding: around 80% of US workers could have at least 10% of their tasks affected. About 19% could see half or more of their tasks exposed. Higher-income, more educated roles showed more exposure, not less — a reversal of how earlier automation waves played out.

How automatable.me works differently

Our method is not a single binary question. We use three layers.

First, the I/O layer: what goes into the task, and what comes out? A task that starts and ends entirely in digital form (DD) sits in the high-automation zone by default. A task that requires physical presence or real-world output (AA) is the furthest from automation. This gives us a naive score — a ceiling based on the nature of the work alone.

Second, substrate thickness: how much informal knowledge does this task require? A task anyone could do on their first day scores differently from one that takes years of accumulated context to do well. Thick substrate compresses the score significantly.

Third, social function: is human presence part of the value itself? A negotiation, a therapy session, a performance review — these are not just tasks that happen to involve people. The human relationship is the product. Removing the person does not automate the task; it destroys it.

The result is an effective score that is almost always lower than the naive ceiling. That gap — between what AI could theoretically handle and what it realistically can — is the core insight we try to show every person who uses the site.
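
The relationship between the naive ceiling and the effective score can be sketched as follows. This is an illustration only, not the site's actual formula: the numeric values for DD and AA, the 0-to-1 scales for substrate and social function, and the multiplicative discount are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical ceilings for the two I/O classes named in the text.
NAIVE_BY_IO = {"DD": 0.9, "AA": 0.1}


@dataclass
class Task:
    name: str
    io_class: str     # "DD" or "AA"
    substrate: float  # 0.0 = learnable on day one, 1.0 = years of informal context
    social: float     # 0.0 = presence irrelevant, 1.0 = the relationship is the product


def naive_score(task):
    """Layer 1: ceiling based on the nature of the inputs and outputs."""
    return NAIVE_BY_IO[task.io_class]


def effective_score(task):
    """Layers 2 and 3: discount the ceiling for substrate and social function.

    Assumption: each layer scales the score by (1 - value), so the
    result can never exceed the naive ceiling.
    """
    for value in (task.substrate, task.social):
        if not 0.0 <= value <= 1.0:
            raise ValueError("substrate and social must be between 0 and 1")
    return naive_score(task) * (1.0 - task.substrate) * (1.0 - task.social)


# Hypothetical tasks for illustration.
report = Task("draft weekly report", "DD", substrate=0.5, social=0.0)
review = Task("performance review", "DD", substrate=0.5, social=1.0)

gap = naive_score(report) - effective_score(report)  # ceiling minus realistic score
```

In this sketch the performance review lands at zero even though it is fully digital, because the social layer is set to its maximum. Any real scoring rule would need calibrated weights; the point here is only the shape of the calculation.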

A comparison

The Eloundou method is simpler, standardised, peer-reviewed, and covers all US occupations from a consistent base. It answers one question cleanly: can an LLM cut this task's time in half? That is exactly the right question for a labour economist building a macro picture of AI exposure across the economy.

Our method is personalised. The task list is generated for the specific person doing the assessment, not pulled from a database of averages. The person themselves allocates their time across those tasks. The substrate and social layers capture dimensions that a time-reduction threshold cannot: whether a task requires trust, relationship, or years of informal knowledge that no LLM can replicate. That is exactly the right framing for an individual trying to understand their own position.

Neither method is simply better. They are answering related but different questions. The Eloundou score tells you where the typical person in your occupation sits in a research dataset. Our score tells you where you sit, given how you actually spend your time.
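
One way to picture "given how you actually spend your time" is a time-weighted average of per-task scores. The sketch below assumes that is how the allocation feeds into the overall number, which the text does not spell out. The function name and the sample week are hypothetical.

```python
def time_weighted_score(tasks):
    """Average per-task scores, weighted by the hours spent on each task.

    `tasks` is a list of (score, hours) pairs, with each score between 0 and 1.
    """
    total_hours = sum(hours for _, hours in tasks)
    if total_hours <= 0:
        raise ValueError("total hours must be positive")
    return sum(score * hours for score, hours in tasks) / total_hours


# Hypothetical week: 10 hours on a highly exposed task, 30 on a protected one.
week = [(0.8, 10), (0.2, 30)]
overall = time_weighted_score(week)  # (0.8*10 + 0.2*30) / 40, about 0.35
```

Two people with the same job title and the same task list can end up with different overall scores purely because they split their hours differently, which is something an occupation-level average cannot show.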

What we built

Every results page and profession page on automatable.me now shows both scores side by side. Your personalised effective score appears alongside the research benchmark for your occupation — drawn directly from the human-annotated β estimates in the Eloundou et al. dataset.

If your score is above the benchmark, your specific task mix or context may carry more exposure than the average for your role. If it is below, the way you have described your work suggests more protection than the typical person in your field. If it is close, you are roughly where the outside research would expect.
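
The three cases above amount to a simple comparison with a tolerance band. A minimal sketch, assuming both scores are on a 0-to-1 scale: the tolerance of 0.05 is a hypothetical value chosen for the example, since the text does not say how near counts as "close".

```python
def compare_to_benchmark(personal, benchmark, tolerance=0.05):
    """Classify a personal score against the occupation benchmark.

    Returns "above", "below", or "close". The tolerance is an assumed value.
    """
    difference = personal - benchmark
    if abs(difference) <= tolerance:
        return "close"
    return "above" if difference > 0 else "below"


# Hypothetical scores for illustration.
verdict = compare_to_benchmark(personal=0.30, benchmark=0.45)  # "below"
```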

We think showing both is more honest than showing either alone. One number, however carefully derived, is always a reduction. Two numbers from different methods, pointing in a similar direction, start to feel like evidence.