The Memory Illusion: AI Doesn’t Really Remember

LLMs are rapidly reshaping modern knowledge work. What began as a tool for AI researchers and software engineers is now spreading into science, finance, consulting, engineering, law, and countless other domains.

As adoption accelerates, professionals across domains are beginning to encounter the fundamental limitations of these systems. These limitations might be very well understood by the researchers and engineers who design them, but they are far less obvious to the people integrating AI into their daily workflows. The reason is simple: the real limitations of LLMs often do not appear during short demonstrations or simple tasks. They emerge only when the models are pushed into complex, domain-specific workflows that require long-term reasoning, evolving context, and deep subject-matter expertise. A scientist designing experiments, a biotech investor valuing a pipeline, or a consultant building a strategic recommendation will inevitably stress-test the system in ways that generic benchmarks never do.

The pace of progress is astonishing, but so is the pace at which new problems emerge. Anthropic, for instance, is partnering with Goldman Sachs to co-develop AI agents that automate significant portions of knowledge work: trade accounting, client onboarding, compliance, and research (ref).

Nearly all modern Large Language Models (LLMs) and Vision Language Models (VLMs) are built on the Transformer architecture — a neural network design that revolutionized AI through mechanisms like self-attention. The math is complex, but the central idea is simple: these models learn statistical relationships between pieces of data (tokens) and use those relationships to predict what comes next.

Transformers are trained on enormous datasets, often terabytes of text and code, and capture the statistical patterns in that data. During training, they learn how one token relates to another, which lets them produce fluent language and, by extension, makes them a good fit for problems as complex as human communication. The development of Transformers is one of the most impressive engineering achievements of the last decade.
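
To make "learn statistical relationships and predict what comes next" concrete, here is a deliberately tiny sketch. It is a bigram counter, not a Transformer: it only tallies which token follows which in a made-up sentence and predicts the most frequent follower. The function names and the corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (an assumption for illustration only).
corpus = "the model reads the prompt and the model predicts the next token".split()
counts = train_bigram(corpus)
prediction = predict_next(counts, "the")  # "model" follows "the" most often here
```

A real model replaces the lookup table with billions of learned parameters and conditions on the whole context rather than one previous token, but the prediction task is the same.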

Despite this apparent intelligence, these models are fundamentally stateless.

What does that mean?

Imagine a brilliant co-worker or assistant who has severe short-term amnesia. Every time you open their office door, their memory is completely wiped clean. They forget your name, your company, and what you talked about five minutes ago.

To get them to help you, you must hand them a giant binder of your entire history every time you walk into the room. They read it instantly, give you a genius answer, and then immediately forget everything again the moment you walk out.

That is an LLM: the underlying model itself has no intrinsic, permanent memory of your specific session.

The model does not truly know you, your projects, your long-term goals, or the reasoning that led to earlier decisions. Every response is generated from whatever information is available in the current context window and any external memory systems attached to it.
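
A minimal sketch of what statelessness looks like in code. The `stateless_model` function below is a hypothetical stand-in for a real LLM call (it just pattern-matches the prompt), and `ChatSession` is an assumed name for the application layer. The point is where the history lives: in the application, which re-sends the whole transcript on every turn.

```python
import re


def stateless_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: it can only use what is in `prompt`."""
    match = re.search(r"my name is (\w+)", prompt, flags=re.IGNORECASE)
    if match:
        return f"Your name is {match.group(1)}."
    return "I do not know your name."


class ChatSession:
    """The application, not the model, keeps the conversation history."""

    def __init__(self):
        self.history = []

    def ask(self, user_message: str) -> str:
        self.history.append(f"User: {user_message}")
        prompt = "\n".join(self.history)  # the full transcript is re-sent each turn
        reply = stateless_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply


session = ChatSession()
session.ask("My name is Priya.")
with_history = session.ask("What is my name?")           # the transcript carries the name
without_history = stateless_model("What is my name?")    # same question, no transcript
```

The same question gets two different answers depending only on what the caller chose to put in the prompt.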

Modern systems try to overcome this in two ways.

Some techniques modify the model’s weights directly: LoRA, QLoRA, full fine-tuning. The changes persist inside the model, but they are batch processes used for task or domain specialisation, not for tracking the live state of an evolving workflow. You cannot fine-tune your way through a Tuesday afternoon analysis.

Others leave the weights untouched and operate around the model: Retrieval-Augmented Generation (RAG), vector databases, long-context prompting, memory buffers, conversation summarisation. These retrieve, summarise, and re-inject information into the prompt on demand. They can make the system appear persistent across sessions, but this persistence is an engineering illusion, not intrinsic memory. The model itself is not continuously learning from your interactions the way a human collaborator would.
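
The retrieve-and-re-inject pattern can be sketched in a few lines. This version ranks stored notes by plain word overlap with the question, which is a crude stand-in for the embedding similarity a vector database would use. The helper names and the notes are assumptions made up for illustration.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring . and ?"""
    return set(text.lower().replace(".", " ").replace("?", " ").split())


def retrieve(notes, question, k=1):
    """Return the k notes sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        notes,
        key=lambda note: len(tokenize(note) & question_words),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(notes, question, k=1):
    """Re-inject the retrieved notes into the prompt ahead of the question."""
    context = "\n".join(retrieve(notes, question, k))
    return f"Relevant notes:\n{context}\n\nQuestion: {question}"


# Made-up project notes held outside the model.
notes = [
    "The client prefers option B for the European market.",
    "The discount rate agreed in March was 12 percent.",
    "The kickoff meeting is scheduled for Monday.",
]
prompt = build_prompt(notes, "What discount rate did we agree?")
```

Everything the model "remembers" here arrived through `prompt`. If retrieval picks the wrong note, the model never sees the right one.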

Neither route fixes the underlying problem: the lack of a consistent internal state, carried forward turn after turn, as assumptions evolve.

How this “amnesia” disrupts workflows

For simple tasks, it doesn’t. Writing an email, generating HTML, summarising a paper, debugging a small script — none of these require persistent memory.

The problems emerge in long-running knowledge workflows.

The Consultant: The “Context Window” Trap

The Workflow: A consultant wants to paste a 200-page corporate strategy report and ask the AI to find flaws.
The Problem: Because the model is stateless, it must re-read all 200 pages on every turn. If the consultant asks a follow-up question, e.g. “What about option B?”, the system sends the full 200 pages plus the new question back to the model.
The Impact: This hits a wall called the “context window” limit. If the chat goes on too long, the oldest pages of the report fall out of the model’s temporary memory. The AI may lose track of earlier assumptions or constraints, producing inconsistent or logically incompatible outputs.
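
Here is a rough sketch of how the oldest material falls out. It assumes a simple "keep the most recent messages that fit" policy and approximates tokens by whitespace-separated words. Real systems use proper tokenizers and may use other policies, and the function name and messages are invented for illustration.

```python
def fit_to_window(messages, max_tokens):
    """Keep the newest messages whose combined size fits within max_tokens.

    Size is approximated as the number of whitespace-separated words
    (an assumption; real tokenizers count differently).
    """
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        size = len(message.split())
        if used + size > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += size
    return list(reversed(kept))  # restore chronological order


messages = [
    "Constraint: budget must stay under two million",  # 7 words, stated first
    "Option A expands into Asia",                      # 5 words
    "Option B acquires a competitor",                  # 5 words
    "What about option B?",                            # 4 words
]
window = fit_to_window(messages, max_tokens=15)
```

With a budget of 15 words, the three newest messages fit and the opening constraint is silently dropped, so the model answers the follow-up without ever seeing the budget limit.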

The Analyst: Skyrocketing Costs & Speed Bumps

The Workflow: An analyst uses an AI tool to monitor live market feeds and execute portfolio changes based on morning data.
The Problem: Processing data costs money, typically billed per token through an API. Because the model cannot hold onto data between calls, the software must re-upload the morning data, the historical charts, and the compliance rules with every new prompt.
The Impact: The bills rack up fast. Instead of paying to process a new 10-word question, the system may process 50,000 words of background data on every turn, increasing both token costs and latency. Re-sending that massive “binder” with each request slows response times, and in fast-moving markets delay costs money.
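
The arithmetic behind that claim, as a back-of-the-envelope sketch. It uses the figures above (50,000 words of background, a 10-word question), treats words as a proxy for tokens, and ignores the growing chat history and the model's replies. The turn count of 20 is an arbitrary assumption.

```python
def input_words_processed(background_words, question_words, turns):
    """Total input words when the full background is re-sent on every turn."""
    return turns * (background_words + question_words)


def overhead_ratio(background_words, question_words):
    """How many times more input is processed per turn than the question alone."""
    return (background_words + question_words) / question_words


total = input_words_processed(50_000, 10, turns=20)  # 20 * 50,010 = 1,000,200
ratio = overhead_ratio(50_000, 10)                   # 50,010 / 10 = 5001.0
```

Twenty short questions turn into roughly a million words of input, about 5,000 times what the questions alone would cost.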

The Scientist: Loss of Continuity & Logical Drift

The Workflow: A chemist uses an AI to help write a complex Python script to simulate a molecular bond, iterating through 50 versions of the code.
The Problem: As the conversation grows longer, external engineering systems (like conversation summarisers) compress the old chat history to save space. They might summarise steps 1 through 10 into a brief paragraph.
The Impact: The model loses the exact mathematical nuances discussed an hour ago. The scientist will notice the AI suddenly introducing errors into the code or contradicting a logical rule agreed upon at the start of the session. The “train of thought” is broken.
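
A small sketch of why compression loses the nuances. Truncating each older turn to its first few words is a crude stand-in for an LLM-written summary, and the turns and their numeric values are invented for illustration. The effect is the same in kind: recent turns survive verbatim, older details do not.

```python
def compress_history(turns, keep_recent=2, words_per_turn=4):
    """Replace older turns with a short digest; keep recent turns verbatim.

    Truncation stands in for a real summariser (an assumption).
    """
    split = max(len(turns) - keep_recent, 0)
    old, recent = turns[:split], turns[split:]
    if not old:
        return list(recent)
    digest = [" ".join(turn.split()[:words_per_turn]) for turn in old]
    summary = "Summary of earlier steps: " + "; ".join(digest)
    return [summary] + recent


# Invented session turns with illustrative values.
turns = [
    "Use a harmonic potential with force constant k = 570 N/m",
    "Set the equilibrium bond length to 0.74 angstrom",
    "Version 3 adds a time step of 0.5 femtoseconds",
    "Version 4 plots energy against bond length",
]
compressed = compress_history(turns, keep_recent=2, words_per_turn=4)
```

After compression the digest still says a harmonic potential was used, but the force constant and the bond length are gone, so any later code that depends on them has to guess.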

As AI automates routine cognitive work, the remaining value increasingly lies in human judgment, evaluation, and domain expertise. Knowledge workers and the organisations that employ them often work on projects that run over long stretches of time, with assumptions changing along the way. This is why the demand for specialist humans is projected to increase sharply (https://www.mckinsey.de/~/media/mckinsey/locations/europe%20and%20middle%20east/deutschland/news/presse/2024/2024%20-%2005%20-%2023%20mgi%20genai%20future%20of%20work/mgi%20report_a-new-future-of-work-the-race-to-deploy-ai.pdf). Because LLMs excel at rapidly generating human language, the primary bottleneck for organisations has shifted from generating information to evaluating it and checking the underlying assumptions.

Hypotheses shift, assumptions get refined, previous failures become important, and subtle contextual decisions accumulate. Human collaborators naturally build this shared understanding over time. LLMs do not.

So users must continuously rebuild context, verify assumptions, and check that critical information has not been lost during retrieval or summarisation. The AI may appear to remember the project, but it is reconstructing it from external sources each time.

The memory limitation is really a continuity limitation.

LLMs excel at producing brilliant local answers. Humans excel at maintaining a coherent global understanding over long periods. That is a profound difference.

How these problems are ultimately resolved will shape the coming years. We are only beginning to uncover the practical limitations of LLMs, from memory and continuity issues to the enormous computational and token costs associated with scaling these systems.

Whether AI engineers develop new architectures or continue to overcome these limitations through clever engineering techniques and external memory systems remains an open question. It is certainly one worth watching closely.
