Something I’m finding while testing SWE-context-bench for the agent memory layer I’m building: evaluating memory is harder than checking whether the agent solved the next task with fewer tokens. The setup: An agent solves a coding task. Later, it gets a related task that should benefit from the...
Source: [Hacker News](https://news.ycombinator.com/item?id=48475850)