Coding Agent Memory Benchmarks
Something I'm finding while testing SWE-context-bench for the agent memory layer I'm building: evaluating memory is harder than checking whether the agent solved the next task with fewer tokens. The setup: An agent solves a coding task. Later, it gets a related task that should benefit from the...