Research·April 2026

Neutrally scores 89.4% on LongMemEval

LongMemEval is an ICLR 2025 benchmark for long-term conversational memory. We ran Neutrally's production pipeline against all 500 questions. This is what we found.

What is LongMemEval?

LongMemEval is an open benchmark published at ICLR 2025 by researchers at UCLA and Tencent AI Lab. It tests how well AI memory systems answer questions that require recalling facts from long conversation histories.

The benchmark contains 500 questions across six question types: single-session recall (of user statements, of assistant statements, and of preferences), knowledge updates, temporal reasoning, and multi-session recall across many separate conversations. Each question requires locating the right information from roughly 53 sessions totalling around 115,000 tokens.

The key constraint: systems do not get the full history as a context window. They must retrieve the right memories from storage under realistic conditions, with realistic noise. That is what makes it a useful signal.

A note on production vs. research systems

Most high-scoring systems on this benchmark were built specifically for it. They use architectures, prompting strategies, or data assumptions that work in the benchmark setting but are not deployed with real users. The scores are real; the comparisons require context.

We ran Neutrally's production pipeline. The same extraction, the same retrieval, the same ranking that handles real user conversations every day. No benchmark-specific modifications.

Three runs

Getting to 89.4% took three full runs. Each one identified a specific class of failures and informed the next version of the pipeline.

Run 1

~65.4%

Baseline production pipeline. Retrieval was working but shallow extraction missed many facts.

Run 2

84.2%

Deeper extraction and better handling of knowledge updates. Temporal reasoning remained inconsistent.

Run 3

89.4%

Dynamic routing between fast and deep analysis paths, hybrid retrieval with re-ranking, and targeted optimisation for preference questions.

Results by category

447 out of 500 questions answered correctly.

Single-session (assistant): 100% (56/56)

Single-session (user): 100% (70/70)

Knowledge update: 89.7% (70/78)

Temporal reasoning: 88% (117/133)

Multi-session: 83.5% (111/133)

Single-session preference: 76.7% (23/30)

LongMemEval-S (500 questions). Production pipeline, no benchmark-specific modifications.
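The headline figure follows directly from the per-category counts. The short Python sketch below recomputes each category's accuracy and the overall score from the numbers reported above; the variable and function names are ours, chosen for illustration only.

```python
# Per-category results as (correct, total), copied from the figures above.
results = {
    "Single-session (assistant)": (56, 56),
    "Single-session (user)": (70, 70),
    "Knowledge update": (70, 78),
    "Temporal reasoning": (117, 133),
    "Multi-session": (111, 133),
    "Single-session preference": (23, 30),
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# The overall score weights every question equally (a micro-average).
total_correct = sum(c for c, _ in results.values())
total_questions = sum(t for _, t in results.values())
overall = accuracy(total_correct, total_questions)

for name, (c, t) in results.items():
    print(f"{name}: {accuracy(c, t)}% ({c}/{t})")
print(f"Overall: {overall}% ({total_correct}/{total_questions})")
```

Because the overall score is weighted by question count, the two 133-question categories (temporal reasoning and multi-session) influence it far more than the 30 preference questions.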

How the pipeline works

Neutrally uses a five-layer memory architecture. Conversations flow through extraction, structured storage, and hybrid retrieval with re-ranking. The pipeline is cross-LLM by design: memory persists whether the user is talking to Claude, GPT, Grok, or any other provider.
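This post does not describe how the hybrid retrieval step combines its sources. As a rough illustration only, one widely used way to merge a keyword ranking with a vector-similarity ranking is reciprocal rank fusion (RRF). The sketch below assumes that technique; the function name, memory IDs, and the constant k = 60 are illustrative assumptions, not Neutrally's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of memory IDs into a single ranking.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical outputs of a keyword retriever and a vector retriever.
keyword_hits = ["m3", "m1", "m7"]
vector_hits = ["m1", "m4", "m3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

In a setup like this, a re-ranker would then rescore only the top few fused candidates, which keeps the more expensive model off the long tail of weak matches.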

At query time, a router decides between a fast retrieval path for straightforward questions and a deeper analysis path for questions that require temporal or preference reasoning. Straightforward retrieval is insufficient for temporal and preference questions; they require understanding relationships between facts across time.
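To make the routing idea concrete, here is a deliberately simple sketch that sends a question down a "deep" path when it contains temporal or preference cues and down a "fast" path otherwise. The cue lists, the path names, and the `route` function are all hypothetical; the post does not say how the production router makes this decision, and a real one would more plausibly use a trained classifier than keyword matching.

```python
import re

# Hypothetical cue patterns, for illustration only.
TEMPORAL_CUES = re.compile(
    r"\b(when|before|after|since|ago|how long|how many days|first|last|latest)\b",
    re.IGNORECASE,
)
PREFERENCE_CUES = re.compile(
    r"\b(recommend|suggest|prefer|favou?rite|should i|would i like)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return "deep" for temporal or preference questions, otherwise "fast"."""
    if TEMPORAL_CUES.search(question) or PREFERENCE_CUES.search(question):
        return "deep"
    return "fast"


for q in (
    "What is my dog's name?",
    "How long ago did I move to Berlin?",
    "Can you recommend a restaurant for tonight?",
):
    print(f"{route(q):>4}  {q}")
```

The design trade-off is latency against accuracy: the fast path answers simple lookups quickly, while the deep path spends extra computation only on the questions that need it.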

The main gap in the results is single-session preference questions at 76.7%. These require inferring preferences from behaviour rather than explicit statements. We have a hypothesis for closing this gap and it is the primary engineering focus for the next iteration.

What we are building toward

LongMemEval is one measure of how the memory layer performs. The real measure is whether it makes AI assistants genuinely more useful in people's daily work.

The goal is a memory infrastructure layer that any AI agent, on any model, can read from and write to. Cross-model, persistent, production-grade. That is what Neutrally is built around and what we are continuing to build.

Methodology

All scores reflect Neutrally's production memory pipeline evaluated against the LongMemEval-S dataset (500 questions). No benchmark-specific modifications were made. The benchmark data and evaluation code are available in the LongMemEval repository. Competitor scores are taken from published leaderboard results.
