Human in the Loop

Week 1: Can Linear Regression Predict Tomorrow’s Market?

Oyetade — Sat, 20 Jun 2026 11:27:29 GMT

If you want to understand why this belong in my “Human in the Loop” series, skip to the end of the article.

This is the first of fifty-two weekly articles. Each one takes a single machine learning technique and turns it loose on financial data that anybody can obtain without paying for it. The series will climb, over the course of the year, from the plainest tool in the box to the architectures that sit behind the present excitement about artificial intelligence. We begin, deliberately, at the bottom of that climb, with ordinary linear regression, and we point it at the most unforgiving target there is: the daily return of the stock market.

The choice is not an accident. Before any of the clever methods arrive, it is worth establishing a discipline, because in this field the discipline matters a great deal more than the algorithm. A linear regression carried out with care will teach you more about financial machine learning than a transformer thrown together without it. So I want to start with a clear and slightly provocative question. Using nothing more than the recent history of returns and trading volume, can a simple linear model tell us whether the market will rise or fall tomorrow?

Saying the result out loud before we begin

It is only fair to state, before we touch the data, what we expect to find. Daily index returns behave very nearly like a random walk. Tomorrow’s move is governed for the most part by information that has not yet arrived, and the small remainder that might in principle be foreseeable is exactly the remainder that thousands of well-resourced participants are paid to compete away. If a model a hobbyist could build from a handful of past returns were able to call tomorrow’s direction reliably, that edge would have been bid out of existence long ago.

The honest expectation, then, is an out-of-sample fit hovering around zero, very possibly the wrong side of it, and a hit rate on direction that sits within a whisker of one half. I want to be plain that this is not the experiment failing. This is the experiment succeeding at telling us the truth. The whole value of the week lies in the method by which we reach that truth without deceiving ourselves, because it is the very same method that will keep us honest later, when the models are powerful enough to manufacture a thoroughly convincing illusion of skill.

The data

We use the daily closing price and trading volume of the S&P 500, retrieved through the yfinance library. It is free, it is everywhere, and it is entirely sufficient. The script caches what it downloads to a local file, so that the analysis can be reproduced even after the data provider has quietly changed something underneath it, which it will. It also records the exact span of dates it used, because a result you cannot date is a result you cannot trust.

One practical warning, which will return again and again across this series. The single most common way to produce a spectacular and entirely false result in financial machine learning is to allow information from the future to seep into the past. We will guard against it with something close to paranoia.

Building features without cheating

Every feature must be something we could genuinely have known at the moment we would have had to act on it. We are predicting the return from today’s close to tomorrow’s close, so every input has to be available by today’s close and not one second after it.

With that rule held firmly in mind, the features are modest and intuitive. We take the five most recent daily returns, the one that has just finished and the four before it. We add a five-day average of returns and two measures of recent volatility, over ten days and twenty-one, so the model can see both the recent drift and the recent turbulence. For volume we do not hand the model the raw figure, because volume creeps upward across the decades and a model given a steadily rising number will cheerfully mistake the mere passage of time for a signal. Instead we measure log volume against its own twenty-one-day average, which removes the trend and keeps the part that actually carries information, namely whether today was unusually busy.

The target is the next day’s return. In the code this is the return series shifted back by one position, so that the row labelled today carries tomorrow’s outcome. That single line is the precise spot at which most amateur backtests quietly cheat, and it earns both a comment in the source and a moment of your attention.

The rule that governs everything: respect time

Here is the point on which the entire series will insist, and it is worth stating without hedging. You must never shuffle financial data into a random training and testing split. The ordinary cross-validation that serves you so well on a collection of unrelated photographs is actively harmful here, because it trains on the future in order to predict the past and then reports back a fantasy.

So we split by time. The model learns on the earlier stretch of history and is judged on the most recent stretch, which it has never seen. For comparison and tuning we use walk-forward validation, expanding the training window forward through time, which is what scikit-learn’s TimeSeriesSplit provides. Any rescaling of the data is fitted on the training portion alone and then applied to the test portion, never the other way around, and we enforce this by binding the scaler and the regression together into a single pipeline, so that leakage becomes structurally impossible rather than merely something we tried to remember to avoid.

Baselines come before the model

A model is only ever as impressive as the dim-witted alternative it manages to beat. So before fitting anything, we write down three baselines. Predict zero every day. Predict the historical average return every day. Predict that tomorrow will simply repeat today. Each is trivial, and our regression must be measured against them rather than against some abstract notion of accuracy. A model that cannot beat “predict the average” has learned precisely nothing, and we want to discover that at once, rather than after we have talked ourselves into a comforting story.

What actually happened

With the apparatus in place, the results arrive quickly and without ceremony. The figures below come from a run over roughly five thousand trading days, with the final fifth held back for testing.

Begin with the baselines, because they frame everything that follows. Predicting the training mean every single day produced an out-of-sample fit of essentially zero, an R-squared of around -0.0002, which is what “no information, no error beyond the irreducible” looks like. More telling is the direction. Both the flat-zero and the flat-mean rules called the market correctly on 52.7 per cent of test days. They achieved this for the dullest of reasons: the market rose on more than half the days in the sample, so the rule “always guess up” is quietly rather good. Hold on to that number, 52.7 per cent, because it is the bar the model has to clear.

The model does not clear it. The linear regression returned an out-of-sample R-squared of -0.028. That minus sign is not a rounding artefact. It means the model performed worse than simply predicting the average, which is the signature of a model that has fitted noise: the relationships it learned on the training years did not survive into the test years, so applying them did active harm. Its directional accuracy was 50.7 per cent. The model, in other words, was beaten on direction by the most trivial rule available to us, the one that ignores the data entirely and always bets on a rise.

The walk-forward folds tell the same story, and they tell it in a more revealing way. They ran from -0.043 to -0.04 p, a mean a touch below zero and a spread more than twice the size of that mean. That instability is itself the finding. A genuine and durable signal would hold its sign from one window to the next. What we have instead is sampling noise wearing a different mask in each period.

The coefficients confirm it once more. After standardisation, every one of them sits at the third decimal place or smaller, with none standing out as carrying real weight. The largest, on yesterday’s return, is mildly negative, a faint scent of next-day mean reversion, far too weak to do anything with. The two volatility terms come out roughly equal and opposite, which is the classic appearance of two collinear features dividing a single weak effect between them rather than reporting two real ones.

Then comes the test that actually settles the matter, the one with money in it. We let the model go long whenever it predicted a rise and sit in cash otherwise, and we charged a single basis point on every change of position. The strategy earned a Sharpe ratio of 0.49. Taken on its own that looks respectable, which is exactly the trap. Buying the index and holding it returned a Sharpe of 0.65 over the same window. The strategy underperformed doing nothing, and it did so after the gentlest possible allowance for costs. It spent stretches of time sitting in cash, missing the upward drift that is the market’s most dependable feature, and paid a toll for the privilege. The two equity curves below wander around one another for years before the trading rule quietly falls behind.

The one result that looks like good news, and why it is not

There is a loose thread to tie off, and it is the most instructive part of the whole exercise. When we fit the same model across the entire dataset and ask the formal statistical question, the regression comes back significant. The full-sample R-squared is 0.019 and the F-test returns a p-value of 0.009, comfortably past the one per cent threshold at which we conventionally declare an effect real.

A reader in a hurry would seize on this and announce that the model works after all. It does not, and reconciling the two findings is the lesson of the week.

The first point is the one we built the whole apparatus to respect. That significant R-squared was measured in-sample, on the same data the model was fitted to, including the very test period it then failed on. The minus zero point zero two eight was measured out-of-sample, on data the model had never met. A relationship can be statistically detectable in-sample and completely useless out-of-sample, and the gap between those two numbers is precisely the distance between a result and a delusion.

The second point is subtler and more important. Statistical significance and economic significance are not the same thing, and conflating them is one of the most expensive habits a person can carry into markets. With five thousand observations we have enough statistical power to detect an effect that accounts for under two per cent of the variation in daily returns. The p-value of 0.009 is reporting, quite truthfully, that a faint linear structure exists. The out-of-sample R-squared and the backtest are reporting, equally truthfully, that the structure is far too small to capture cleanly or to trade profitably once the world charges you to act on it. Both statements are correct at the same time. Markets are nearly efficient rather than perfectly so, and this is what “nearly” looks like when you write it down in numbers: real, detectable, and worthless.

There is also a small warning that the software will raise about the rank of the coefficient covariance. It is telling us that two of our features are very nearly the same thing, almost certainly the five-day mean of returns standing in for the lagged returns it is built from. It does not disturb the headline R-squared or the F-test, but it does mean the individual coefficient standard errors should not be read too closely. If clean per-coefficient inference were the goal, the redundant feature would come out. For our purpose, which is the overall verdict, the headline figures stand.

Why this is the right answer

It would have been the easiest thing in the world to present a tortured version of this analysis that produced a winner. Shuffle the data so the future leaks backward, scale before splitting rather than after, quietly try a few dozen combinations of features and report only the flattering one, leave out transaction costs altogether, and you can conjure a Sharpe ratio that would make a hedge fund blush. Every one of those moves is a real mistake that real people make with real money, and every one of them is a way of asking the data to lie to you. The discipline of this week exists to make each of them either impossible or visible.

The efficient-markets view predicts very nearly what we found. The readily foreseeable part of daily index returns, the part you could hope to catch with five lagged returns and a volume ratio, is close to nothing, because it is the cheapest edge imaginable and therefore the first to be competed away. The opportunities that do exist, to whatever extent they exist at all, live in richer data, in faster reactions, or in genuinely harder modelling, and we will meet a few of them as the year goes on.

What carries forward

The technique this week was disposable. The habits are permanent. A target built so the future cannot leak backward into it. Features that respect the moment of decision. Splitting by time, and validating by walking forward through it. Baselines written down before the model is fitted. An evaluation that ends with costs and a Sharpe ratio rather than beginning and ending with R-squared. Coefficients examined for stability across periods rather than admired in a single comfortable fit. And, underpinning all of it, a refusal to mistake a significant p-value for a useful one.

Carry these forward and the rest of the series rests on solid ground. Abandon them and no amount of architecture will rescue you, because a powerful model trained on a leaky setup does not fail loudly and helpfully. It succeeds, gloriously and falsely, which is by far the more dangerous outcome.

Next week we stay with regression but give it something with a fighting chance, explaining a single stock’s returns with the Fama and French factors. We shall see what changes when the right-hand side of the equation finally holds variables with a real economic claim on the left.

The complete, runnable script accompanies this article. It tries live data first, caches it locally, and falls back to a freely mirrored daily series, and then to a simulated one with realistic statistical properties, so that it runs anywhere. The figures quoted above come from a run over roughly five thousand trading days; run it on your own machine and the exact decimals will shift, but the shape of the result will not.

The full repository, including the mlfin core package, the test suite, and this article, lives at github.com/Oyetade/ml-finance-52. Clone it, run pip install -e ., and python run.py inside the Week 1 folder to reproduce every number above.

A note on how this was made, since this newsletter is partly about exactly that.

This project, and the code beneath it, came out of a working session with Claude, an AI assistant, and the collaboration is worth describing honestly because it was neither of the two caricatures people tend to reach for. It was not the AI handing me a finished codebase to ship under my name, and it was not me using the AI as a glorified autocomplete. It was something more like working with a capable, tireless engineering colleague who needed supervising.

The division of labour fell out naturally. The model was genuinely useful at the mechanical and the structural: scaffolding the repository, refactoring the shared logic into a clean installable package with its own tests, writing the leakage-proof split and the costed backtest once and properly so they could be reused across all fifty-two weeks, and catching that a careless GARCH parameterisation would make my synthetic fallback data explode before it ever reached a chart. It worked through the unglamorous discipline that good engineering actually consists of, the validation guards, the offline fallbacks, the reproducibility, without tiring of it, which is precisely the part a human is tempted to skimp on. But every judgement that mattered stayed with me. Which technique to apply, how to structure the series so it would still cohere at week fifty, whether a result was being framed honestly, and above all whether the thing was correct: those were mine, and the model was at its most useful when I treated its output as a draft to interrogate and test rather than an answer to accept.

The most instructive moments were the corrections in both directions. When the live data source was blocked, the AI quietly substituted a synthetic series and reported results from it, and it would have been easy to let those numbers stand; catching that, and insisting on real market data even when it was inconvenient to obtain, was the human keeping the work honest. Equally, the model pushed back usefully on my own loose thinking more than once, and querying its explanations line by line is how I made sure I understood the code I was about to put my name to rather than merely trusting it. The lesson I take from it, and the reason this newsletter exists, is that the value was not in the AI replacing the engineering but in compressing the distance between an idea and a working, tested implementation, on the firm condition that a human stayed in the loop to own the judgement and carry responsibility for what was shipped. The repository carries my name because the accountability is mine. The assistance simply let me do more of the work I actually wanted to do, and less of the work I didn’t.

Leveraging Machines

Oyetade — Fri, 05 Jun 2026 16:30:34 GMT

Act I: The First Attempt

Two months ago, I decided to build a photo culling system, though “build” may be generous; I mostly vibe-coded my way through it. But I didn’t want to just delete bad photos. I wanted to understand why they were bad.

So I built a 3-stage local ML pipeline. It was ambitious. It was elegant. It was also my biggest mistake.

The Original Architecture

The pipeline ran locally with no API dependencies:

Stage 1: Technical Gatekeeping
↓ (Blur detection: Laplacian variance)
↓ (Exposure check: mean image intensity)
↓ (Duplicate detection: perceptual hashing)

Stage 2: Aesthetic Scoring
↓ (CLIP embeddings + scoring)
↓ (K-means clustering for scene diversity)

Stage 3: AI Semantic Review (Optional)
↓ (Ollama/LLaVA running locally)
↓ (Check: eyes closed? distracting elements? composition?)

The appeal was obvious: no API costs, no latency waiting for servers, everything runs on your machine.

On paper, it was beautiful. In practice, it was a nightmare.

The Dependency Hell

The pipeline required:

PyTorch + CUDA 11.8+ (or CPU mode, but slow)
transformers library (for CLIP)
CLIP model checkpoint (~1 GB download)
Ollama local LLM runtime (~5 GB for the LLaVA model)
OpenCV for Laplacian blur detection
scikit-learn for K-means clustering
PIL, pandas, numpy — and all their subversions

On a fresh machine, this could take 45 minutes to set up correctly. On an existing machine, it was chaos.

Here’s what happened when I tried to run it after a week away:

ModuleNotFoundError: No module named ‘transformers.utils.quantization_config’
ImportError: cannot import name ‘version’ from ‘importlib.metadata’
RuntimeError: CUDA out of memory (even on CPU mode)
FileNotFoundError: ~/.cache/huggingface/CLIP model not found

It took some debugging to fix. The culprit? An orphaned numpy installation from a previous project was shadowing the current one. Python was loading the wrong module, which broke transformers, which broke CLIP, which broke the entire pipeline.

I had built a system so brittle that it broke if you looked at it sideways.

But It Worked (Eventually)

Once I fixed the environment, the results were actually good. For a 7,000+ image shoot:

· Stage 1 filtered out more 3,520 blurry/duplicated images

· Stage 2 scored gave thumbs up to 88 of the remaining images with CLIP embeddings minutes

· Stage 3 reviewed the remain 88 images with Ollama and gave thumbs up to 81.

· Total time: several hours.

The output was a CSV with: - Blur metrics (Laplacian variance, center-crop sharpness, patch-wise max) - Exposure data (mean intensity, highlights, shadows) - Duplicate hashes - CLIP aesthetic scores - Scene clusters - Ollama VLM review (eyes closed, distracting elements, composition notes)

It was comprehensive. It was useful. And unleashing on thousands of images took too long.

The Moment of Clarity

Fast forward to last month. I came back from the Norwegian Fjords with over 7,000 images. I pulled up the photo_ai_workflow folder to run it. I had to leave it running overnight.

And in that moment of frustration, I realized something: I was optimizing for the wrong thing.

I had optimized for “no API costs” and “runs locally”. What I actually needed was “works reliably” and “gives me useful insights.”

I wondered: What if I just… paid Claude a few dollars instead?

Act II: The Pivot (A Complete Rewrite in 4 Hours)

What if the entire pipeline could be replaced with a single Claude call?

Not three stages. Not K-means clustering. Not LLM review of specific artifacts. Just: “Tell me if this image is stock-worthy and portfolio-worthy.”

I wrote assess_images_claude.py in an afternoon, or rather, I vibed it.

The New Architecture

send image + prompt → Claude → structured JSON response → CSV

That’s it. One API call per image. Done.

But the prompt was sophisticated:

SYSTEM_PROMPT = “”“You are two expert evaluators...

EVALUATOR 1 — STOCK REVIEWER:
Reject soft focus, motion blur, noise, poor exposure, chromatic aberration...
Approve technically clean, well-exposed, commercially useful images.

EVALUATOR 2 — PORTFOLIO JUDGE:
Assess compelling composition, distinctive light, emotional impact...
Technical perfection alone is not enough.
“”“

The response structure was JSON:

{
“technical”: {
“blur”: {”present”: bool, “severity”: “none|minor|moderate|severe”},
“exposure”: {”overall”: “correct|under|over”, “severity”: “...”},
“clipping”: {”highlights”: bool, “shadows”: bool, “severity”: “...”}
},
“stock”: {”verdict”: “APPROVE|REJECT|BORDERLINE”, “score”: 1-10},
“portfolio”: {”verdict”: “STRONG|CONSIDER|REJECT”, “score”: 1-10}
}

The Cost Problem

Submitting more than 7,000 images individually would cost ~$20. (Claude charges ~$0.003 per image at Haiku pricing.)

But Claude has a Batch API that costs half as much. Catch: each batch needs to be under ~25 MB.

So I implemented chunking:

for chunk of 50 images:
1. Resize to 768px (saves 80% of tokens)
2. Encode to base64
3. Submit batch immediately
4. Move to next 50 images

If one batch failed, I only lost ~50 images. And I could resume from saved batch IDs.

What I Gained

Local Pipeline → API Pipeline
─────────────────────────────────────────────────
45 min setup (+ env fixes) → 5 min to write
3 hours debug time → 0 debug time
Dependency management → Just `pip install anthropic pillow`
Laplacian blur variance → Claude’s reasoning
CLIP aesthetic score → Two explicit verdicts
Ollama VLM review → Structured reasoning
Fragile environment → No environment
$0 in API costs → $3 in API costs

The API pipeline was simpler, faster to write, more reliable, and only $3 more expensive because it was so much faster.

The Data: Local vs API

Here’s what surprised me:

The local pipeline gave me 11 metrics per image. The API pipeline gives me 1 verdict and reasoning.

Which is more useful?

The local pipeline told me: - Blur variance: 87.3 - Laplacian subject variance: 102.1 - Max patch variance: 156.2 - Perceptual hash: e7c3a9f

So… which images should I keep?

The API pipeline told me: - Stock: APPROVE — Sharp focus, well-balanced exposure, travel utility - Portfolio: CONSIDER — Solid composition but conventional framing

Which one actually tells me what to do with the image?

The API approach replaced low-level metrics with high-level reasoning. It wasn’t trying to measure blur; it was trying to understand the image.

The Real Difference

Here’s the thing: a Laplacian variance of 87 is meaningless. You know what’s meaningful? Claude saying: “Motion blur on the subject, hand-holding at 1/30th. Can’t use this for stock. But the composition was interesting — might fix in post for portfolio.”

That’s not a score. That’s a conversation.

The local pipeline was trying to be objective with metrics. The API pipeline is trying to be useful with explanation.

By The Numbers

On my Norwegian Fjords images:

What the local pipeline would have said:

3520 images failed Stage 1 (blur+dedup+exposure)
88 images scored > 0.70 on CLIP aesthetic
81 images passed Ollama evaluation

What the API pipeline actually said:

6172 images APPROVE for stock (26%)
306 images STRONG for portfolio (8%)
290 images are “gems” (both stock-approvable AND portfolio-strong)
203 images are “technically perfect but boring”
76 images are “risky but visually interesting”

I could act on the API results. The local pipeline metrics? I’d still be staring at them, wondering what to do next.

Why This Matters

This wasn’t just about building a better tool. It was about understanding when to use local intelligence vs remote intelligence.

Local ML is great when:

You have a clear, measurable objective (blur = bad, sharpness = good) -
You need to process terabytes without API overhead
You want to understand the mechanism (why is it blurry?)
Cost is critical and API fees would be prohibitive

API-based AI is great when:

The problem is subjective (is this portfolio-worthy?)
You want reasoning, not metrics
The problem is complex (artistic judgment vs image statistics)
You can afford to trade dollars for reliability and simplicity - You want to reason about multiple axes at once (stock vs portfolio simultaneously)

Photo culling is subjective. It requires judgment. It benefits from reasoning. The local pipeline was optimized for the wrong problem.

The Subtext

Here’s what I actually learned:

The API pipeline is:

✓ More reliable (no environment issues)

✓ Faster to develop (API > ML infrastructure)

✓ Better results (reasoning > metrics)

✓ Cheaper at scale (Batch API < GPU hardware)

✓ Easier to explain (verdicts > variance numbers)

The only advantage of the local pipeline was “$0 in API costs.” But I was spending that cost in time, complexity, and pain.

What I Did With 7,000+ Photos

1. Ran assess_images_claude.py on the full folder

2. Got a CSV with stock/portfolio verdicts for 7000+ unique images (first batch, API throttled)

3. Identified 290 “gems” (stock + portfolio strong) — straight to Lightroom

4. Postpone review of 608 “BORDERLINE” images.

5. Deleted 566 “stock rejects” — no point editing technically broken images

From 7,000+ images to 290 keepers in ~3 hours. And I understood why each image landed where it did.

The Code

If you want to steal this approach:

# Install dependencies
pip install anthropic pillow

# Run assessment
python assess_images_claude.py --input ./photos --output ./results

# Get CSV with verdicts
# → results/assessment_results.csv

The script handles everything:

Resizes images to 768px (80% token savings)
Chunks them in groups of 50 (~10 MB each)
Submits via Batch API (50% cheaper)
Polls until complete
Saves batch IDs for resume capability
Parses JSON responses
Outputs flat CSV for Excel

The Takeaway

Zero API cost not always the best approach.

The cheapest tool is the one you actually use.

A complex local pipeline that breaks every week and costs 3 hours to debug isn’t “$0 in API costs.” It extracts mental cost.”

A simple API call that costs $3 and works reliably is actually the cheaper option.

P.S.

The photo_ai_workflow project still lives on GitHub. Get it here. It’s open source. It’s a good reference for local ML pipelines, CLIP scoring, Ollama integration.

But my actual photo workflow? That’s now 500 lines of Python and a $3 API bill.

Sometimes the sophisticated solution is simpler than you think.

Want to try it? The full script is ready to go. Get it here. Takes 5 minutes to set up.

Got 5,000+ photos from a trip? Run it tonight and see what your real gems look like by morning.

What We Built Without Knowing We Were Building It

Oyetade — Sat, 02 May 2026 18:46:18 GMT

For readers curious about working with AI — no technical background required.

How It Started

A colleague asked if I could put together a practical illustration for a machine learning workshop he was running. I spent three Saturdays preparing materials, then couldn’t present due to unforeseen events. Sitting with the unused work, I wondered: could I start entirely from scratch with an AI assistant and reach the same destination?

The answer turned out to be yes, and faster. But more interestingly, the destination I reached wasn’t quite the one I had originally prepared. It was better.

Before reading further, please note this is experimental and should not be basis for making investment or trading decision.

The first request to the AI was modest: take a folder of stock price data and help me identify fifty companies that broadly represent the five hundred largest US publicly listed companies. No plan. No clear picture of where this was going.

What followed illustrates something I’ve come to think of as the defining characteristic of working with an AI collaborator: the destination becomes visible only as you walk toward it. That first request about fifty stocks was, without either of us knowing it at the time, the opening move of a project that would eventually produce a system trained on sixty years of financial history and capable of outputting a daily reading about the state of the market.

This is an account of how that happened, and what it felt like from the practitioner’s side.

Teaching a System What Normal Looks Like

Before anything could be built, there was a more fundamental question: what should the system pay attention to?

A stock price on its own tells you almost nothing useful about the broader character of the market, what practitioners call the market regime. A 2% move in a single day means something very different in 2013, a calm, rising market, than it does in March 2020, when the same move might happen before lunch and reverse by close. The raw number is meaningless without context.

What we needed was a way to describe the texture of the market on any given day, not just price levels, but the quality of movement. How volatile has this stock been lately? Is the whole market moving in lockstep (a sign of fear-driven trading) or are individual stocks behaving independently (a sign of normal conditions)? Is this stock acting like itself today, or has something changed?

We assembled thirty-six such measurements for each of the fifty stocks, updated daily. The result was something like a daily health report for the market, not a single number, but thirty-six dimensions of information that together describe whether conditions are normal or unusual.

One of those measurements deserves a brief mention because it captures the idea well. We calculated, for each stock, how unusual its behaviour was relative to its own recent history. A stock that normally moves 0.5% per day but is suddenly moving 3% is drawing attention to itself. This self-referential check, is this stock acting like itself? — turned out to be one of the more sensitive early-warning signals.

The Central Idea: Learn Normal, Then Flag Deviations

The core design choice was this: don’t try to predict what the market will do. Instead, learn what normal looks like, and measure how far today departs from it.

This approach has a compelling property for financial markets: you don’t need to define in advance what a crisis looks like. The system is never told “this was the 2008 financial crisis” or “this was a normal trading week.” It learns purely from the structure of the data itself, what patterns appear regularly, what combinations of measurements characterise ordinary market conditions.

The analogy I find useful is a doctor who has seen thousands of healthy patients. They develop an intuition for what normal looks like, normal blood pressure, normal gait, normal reflexes, that is difficult to articulate as a set of rules. When something is wrong, they often sense it before they can name it. The system works similarly: exposed to decades of market data, it builds an internal sense of what normal market conditions look like. When today’s conditions diverge significantly from that, it struggles — and that struggle is the signal.

Concretely: the system compresses each day’s thirty-six measurements down to a thirty-two-number summary, a kind of fingerprint of current market conditions. It then tries to reconstruct the original thirty-six numbers from that compressed fingerprint. Under normal conditions, it does this well. Under unusual conditions, the reconstruction is poor, and the error is large. A large reconstruction error means the system has encountered something it cannot explain using what it learned from normal markets.

The system was taught using market data from 1962 to 2017, a period spanning the dot-com bubble, Black Monday, the 2008 Global Financial Crisis, and the quiet bull market of 2013-2017. It had to learn through all of it to develop a genuine sense of what normal means across a wide range of conditions.

The Bug That Announced Itself Quietly

The teaching process appeared to go well. The charts showing the system’s progress looked right. Everything seemed fine.

It wasn’t.

Part of the teaching process involved a mechanism that was supposed to gradually slow the system’s adjustment speed as it got closer to a good solution. Think of it like a car decelerating as it approaches a destination: making large adjustments early, smaller and smaller ones as you get closer, to avoid overshooting. The mechanism was supposed to watch for progress to plateau, then reduce the adjustment speed.

The problem was that progress never quite plateaued. The system was finding tiny, fractional improvements on every single cycle, not because it was genuinely learning something new each time, but because the adjustment mechanism was always finding some marginal gain. The deceleration trigger never fired. The adjustment speed stayed at its initial fast rate for all 150 cycles.

The fix was to replace the conditional mechanism with an unconditional one: instead of waiting to detect a plateau, simply reduce the adjustment speed on a smooth, predetermined curve throughout the entire process. No conditions. Guaranteed deceleration from cycle one.

This episode matters because it illustrates something important about building with AI: problems in complex learning systems often don’t announce themselves as errors. Everything runs. Progress happens. The charts look plausible. The only sign that something is wrong is that one carefully monitored value, in this case, the adjustment speed, printed at each cycle, never changes. You have to know what to look for.

The AI could implement the fix quickly once the problem was identified. Finding the problem required a practitioner paying close attention to things that weren’t obviously broken.

Getting the Alarm Right: Three Attempts

The system produces a number every day: how hard did it find today’s market to reconstruct? The practical question is: at what level does that number constitute an alarm?

Getting this calibration right took three attempts. Each failure was instructive.

First attempt: the wrong type of number. The initial alarm threshold was set by looking at the raw daily error values from a recent checking period (a separate block of market history kept aside to test whether the system was learning real patterns, not just memorising the data it was taught on). Applied to more recent data, the alarm never fired. Not once.

The problem was a subtle category mismatch. The threshold had been set against the raw, day-to-day error values, which are noisy and volatile. But the signal being tested was a three-week rolling average of those errors , much smoother and calmer. Comparing a threshold built for noisy raw values against a smoothed average is a bit like setting a speed camera for 100 mph and pointing it at the average speed of cars over an hour-long journey. The average will never trigger it.

Second attempt: the wrong period. The fix was to match the threshold to the same smoothed signal, but the checking period used as the reference included the COVID crash of 2020, which turned out to be the most extreme event in the entire dataset. Setting the threshold against COVID implicitly says: nothing counts as a crisis unless it is at least as severe as a global pandemic. The prolonged market decline of 2022, the bank collapse of March 2023, the tariff shock of April 2025, all genuine market disruptions — were not extreme enough to clear that bar.

Third attempt: the right reference. The correct approach was to use the full fifty-five-year teaching period as the reference, which includes Black Monday (1987), the dot-com collapse (2000-2002), and the Global Financial Crisis (2008). These define what “historically unusual” actually means. Against that backdrop, two alarm levels emerged: an amber warning zone and a red crisis zone.

With this calibration, the results became economically sensible. The 2022 market decline registered as sustained amber, which is exactly what it was, a gradual, anticipated decline driven by rising interest rates rather than a sudden shock. The April 2025 tariff disruption registered as red, a sharp, broad, and sustained disruption that the system had never seen at that combination of scale and breadth.

The lesson: the choice of reference period is a substantive decision. It encodes a view about what counts as normal.

Sixty Years of Memory

One of the more important presentation decisions was to chart the system’s daily reading all the way back to 1970, he full period it was taught on, rather than only the recent years.

When you see the signal only for 2018-2025, it is hard to interpret. Is a particular spike large or small relative to history? When you see it against sixty years, each event finds its place. The Oil Embargo of 1973. The Federal Reserve raising interest rates to 20% in 1979 to break runaway inflation. Black Monday in October 1987. The Russian government debt default in 1998.

The system had to develop its sense of normal across all of these. Seeing them annotated on the chart, crash dates and recovery dates marked with vertical lines, makes the daily reading interpretable in a way no shorter window can.

One annotation was added in real time: the Liberation Day Tariff announcement of April 2025. By the time recent market data reached that point in the analysis, the system was flagging it as one of the most anomalous episodes in the entire sixty-year record.

What the System Found About Crashes

One of the most striking findings was unplanned. It emerged from asking a simple question: what were the ten worst days in each period of the dataset?

The answer was consistent across all three periods, and it contradicted the intuitive expectation.

The system’s worst days are not the crash days. They are the days that come after.

The worst days in the historical teaching period were not October 19, 1987 , the day of Black Monday, but the three weeks following it. Not September 15, 2008, the day Lehman Brothers collapsed, but the peak of the chaos that followed. In the checking period, the worst days were not February 19, 2020, when markets first began to fall on COVID news, but a thirteen-day stretch in March and April: record unemployment claims, emergency central bank interventions, violent daily reversals, and maximum uncertainty. In the most recent period, all ten worst days fell in April and May 2025, not on the day of the tariff announcement itself, but in the weeks after.

The pattern makes sense once you see it. A single-day crash produces a large reading on one or two of the system’s thirty-six measurements: how much the market moved, and how volatile it has been. The aftermath is harder to handle: it produces simultaneous extremes across many measurements at once, volatility, cross-market correlation, trading volumes, momentum signals, and more, all held at unusual levels for days or weeks together. That combination is genuinely outside the range of experience in a way a single crash day is not.

The crash is the event. The aftermath is the regime change. The system finds the weeks after a crisis harder to explain than the crisis itself, because the aftermath looks like nothing it was taught was normal.

From Experiment to Working System

At some point in the project, the question shifted from “does this work?” to “how would someone actually use it?”

That question required restructuring. Exploratory work happens in interactive workspaces where you run one piece at a time, inspect charts, and adjust on the fly. A working system needs something more disciplined: a fixed sequence of automated steps that runs the same way every time, produces the same outputs, and can be re-run six months later without needing to remember what you did.

The restructuring separated the work into clear layers: a data-processing stage that takes raw stock prices and converts them into the thirty-six daily measurements; the system itself; and a daily scoring step that runs every evening and produces a fresh reading.

One design choice worth naming: all the settings that control the system, how far back to look, where the alarm thresholds sit, live in a single shared settings document. Change one number there, re-run everything, and every part of the system inherits the updated value automatically. This matters in practice: adjusting the system for different conditions, or extending the historical data forward by a year, is a matter of editing one file.

The daily scoring step produces a colour-coded output: normal, amber, or red, and writes a small output file that other tools (a risk dashboard, an automated alert, a position-sizing system) can read directly.

Making Sure It Still Works

Once the system was restructured into separate components, a new problem appeared: how do you know it still works after you change something?

The answer was a set of 208 automated checks that run in seconds and report immediately if anything has broken. Writing them turned up two problems worth describing, because both illustrate something general about complex systems.

The first was a check that appeared to pass while doing nothing useful. One part of the system acts as a switchboard; it receives a command and routes it to the appropriate process. To test the switchboard, we replaced the real process with a harmless stand-in that simply records whether it was called. The problem was that by the time the substitution was made, the switchboard had already been wired up to the original process at startup. The stand-in was in place in name, but the switchboard still pointed to the real thing. The check appeared to pass, but it was actually running the full real system, processing all fifty stocks, taking five minutes. Everything looked fine. Nothing was fine.

The fix required understanding exactly when the wiring happens internally, and replacing the connection at the right moment rather than just the name. This is a narrow trap, but once you have seen it, it is easy to spot and easy to fix.

The second required constructing unusual situations deliberately. One component has two safety paths: what to do when there is not enough historical data yet, and what to do when today’s data has a gap. Neither situation arises naturally with clean, complete data, so to test them, we had to build artificial datasets that forced each condition. This required precise reasoning about what internal conditions trigger each path. The reward was confirmation that the system handles both gracefully, which matters, because both occur regularly with real market data.

Reading the Map

With a trained system in hand, those thirty-two-number daily fingerprints contained a compressed record of sixty years of market behaviour. The question was how to read it.

Three different approaches were applied in sequence, each asking a different question.

The first sorted all sixty years of fingerprints into groups, then asked whether the groups corresponded to different market conditions. The answer was clear: two or three groups, where one group consistently showed higher stress indicators — more volatility, sharper price declines, higher system error — than the others. Three groups proved most useful economically, separating a normal state, an elevated state, and a stress-like state.

The second used a more flexible approach that allowed overlapping groups with partial membership — a fingerprint could belong partly to one group and partly to another. The result was richer: six overlapping patterns rather than three clean groups. This is not a contradiction. When you allow the method to look for finer detail, it finds it. The correct reading is not “there are six market regimes” but “the fingerprint space has at least six distinguishable patterns, and market conditions often sit between them.”

The third incorporated time directly. Rather than asking “which group does today’s fingerprint belong to?”, it asked “given everything seen up to today, what sequence of hidden states best explains the full history?” This approach selected three states, validated by testing how well it predicted a period it had never seen during the learning phase. The three states had durations that matched economic intuition: roughly nine months in the normal state, three months in the elevated state, and two months in the stress state.

The synthesis that emerged was not three competing answers but three complementary perspectives:

What the AI Cannot Do

This is the part of the story that felt most revealing.

The AI built the analysis machinery competently. Given a specification, “evaluate these three approaches against known crisis periods and stress indicators”, it produced working code that ran the analysis and generated the outputs.

What it could not do was interpret those outputs.

One method for choosing the number of groups returned the answer: six. The code could not say whether that means “the market has six regimes” or “the underlying pattern of this data is best captured by six overlapping probability clouds.” Those are different claims. The first is a statement about economics. The second is a statement about statistical structure. Knowing which interpretation is appropriate required understanding what the method actually measures: knowledge that lived with the practitioner, not the system.

More broadly, the three-approach synthesis, use simple grouping as a benchmark, overlapping patterns for substructure, and the time-sequence approach as the main regime model, was not produced by the analysis. It was imposed on the analysis by the practitioner. The analysis produced numbers. The practitioner decided what the numbers meant and how they fitted together.

This felt like a clarifying moment. Throughout the project, there was a rough division of labour: the AI managed breadth, holding the full structure of the system in working memory, implementing across many files, catching inconsistencies, while the practitioner managed depth, deciding what mattered and what the results meant.

In the final analysis phase this became explicit. The practitioner ran the analysis independently, formed their own interpretation, and returned with conclusions already in hand. The AI’s role in that session was largely to receive the synthesis and incorporate it. The economic content, which approach was most useful, why the differing results were not contradictory, what the typical state durations implied about how markets actually behave, came entirely from domain knowledge the AI did not have.

This is probably the honest description of what the collaboration actually is: the AI handles the surface area of implementation; the practitioner handles the depth of interpretation. Both are necessary. Neither substitutes for the other.

On Working This Way

Reading this account in sequence might suggest a clean, logical progression. It was not. The adjustment-speed bug appeared after the teaching process was already complete. The alarm threshold was recalibrated three times. The sixty-year historical view was added only after a shorter chart made the signal uninterpretable. The worst-days finding came from asking a simple question: what were the actual dates?

None of this is unusual in complex knowledge work. What was unusual was the texture of iteration with an AI collaborator.

In ordinary work, changing direction carries a cost: re-explaining context, re-establishing shared understanding. With the AI, changing direction was cheap. “Actually, let’s try it this way instead” produced a working result quickly, with no loss of context about everything that had come before. The full history of the project , every measurement, every threshold, every decision , remained available at all times.

What this enabled was a faster, tighter feedback loop between observation and adjustment. “The 2022 market decline doesn’t cross the threshold” is an observation that takes seconds to make. Acting on it, recalibrating the threshold, re-running the analysis, checking whether the result is now sensible, took minutes rather than hours. The cost of being wrong and needing to adjust was low enough that trying things, seeing what happened, and adjusting became the natural mode of working.

The project began with a list of stock tickers in a spreadsheet and ended with a trained system, an automated processing sequence, a sixty-year historical analysis, calibrated alarm thresholds, 208 automated checks, and a three-approach analysis of the system’s internal regime map. None of that was in the original brief.

The brief was fifty stocks and three columns. The rest emerged from the combination of a practitioner who knew what questions to ask and a collaborator that could hold the context, implement the answers, and iterate without friction. Neither party knew where it was going until they got there together.

What It Is Not Yet

The system as described produces a real daily reading. It has been tested against sixty years of market history. It correctly identifies the COVID crash as a crisis, the 2022 market decline as sustained stress, and the April 2025 tariff aftermath as the most extreme episode in the recent record.

What it is not yet is connected to live data. It currently runs on a fixed historical archive. Connecting it to a daily data source, so that it updates automatically each evening and produces a current reading, is the remaining gap between “working on historical data” and “running in practice.”

There is also an unrealised capability worth naming: given today’s market fingerprint, the system could search the sixty-year archive for the closest historical matches and report which periods today most resembles. “Current conditions most resemble mid-2011 and late 2002” is more actionable than any single number. The underlying infrastructure for this is already in place. The capability is not yet built.

Closing

There is something worth noting about the sessions themselves.

We began with a spreadsheet. We ended with a working system and a documented analysis of how markets move through regimes, including a reading of conditions through the April 2025 tariff shock.

But the thing that stays with me is not the output. It is how the collaboration changed over time. In the early sessions, the AI and I were building together in the straightforward sense: I described what I wanted, it implemented, I reviewed, we adjusted. In the later sessions, something shifted: I ran the analysis independently, formed my own interpretation, and came back with conclusions. The AI’s role became to incorporate what I had found, not to help me find it.

That shift from co-construction to expert-with-instrument, may be the general pattern. Early on, the AI is necessary to build what the practitioner imagines. Later, the practitioner is necessary to interpret what the AI has built. The most interesting part of the collaboration may be the boundary between those two phases, where the instrument becomes capable enough that the practitioner can work with it independently, and the practitioner becomes fluent enough that they know what to look for.

That is probably not unique to this project. But this project made it visible.

Notebook and code can be downloaded from here https://github.com/Oyetade/regime_detector

Written as a reflection on an extended series of working sessions over a period of two days