Research Question
Can learned attention over neighbors improve calibration and robustness versus uniform and distance-weighted kNN, with minimal compute overhead?
Status: CONCLUDED - December 2025. No further investigation needed.
The Research Journey
This research began with a hypothesis: learned attention would improve kNN classification by learning sophisticated neighbor weighting patterns. Over the course of systematic experimentation, I built a complex architecture with multi-head attention, learned temperature, label-conditioned bias, and prototype-guided scoring.
Through seven experimental iterations, I discovered something unexpected: the simplest technique, Test-Time Augmentation, provided the biggest gains, while the complex attention architecture provided no measurable benefit.
I set out to test that hypothesis, and I did. What the project ultimately demonstrates is scientific rigor paired with honest self-assessment: negative results are valuable scientific contributions.
I love conducting research, and that means being honest about my findings and contributing them to the scientific community.
Key Results
Final results from Experiment 7 (December 2025) - Research concluded:
| Method | Accuracy | ECE | NLL |
|---|---|---|---|
| Uniform kNN | 91.53% | 0.0796 | 1.225 |
| Distance kNN | 91.52% | 0.0783 | 1.225 |
| Attn-KNN | 91.55% | 0.0811 | 1.236 |
| Attn-KNN + TTA | 90.99% | 0.0267 | 0.513 |
| CNN Baseline | 96.51% | 0.0253 | 0.184 |
Test-Time Augmentation (TTA) provides dramatic calibration improvements: 67% ECE reduction (0.0811 → 0.0267) and 58% NLL reduction (1.236 → 0.513). However, TTA works with any kNN method—it's not unique to attention. Attention alone provides only +0.02% accuracy improvement (within noise margin) and no calibration benefit over distance-weighted kNN.
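For reference, the calibration metric reported throughout is Expected Calibration Error (ECE). Below is a minimal NumPy sketch of how it can be computed; the choice of 15 equal-width confidence bins is an assumption, not necessarily the repository's setting.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    per bin, weighted by the fraction of samples that fall in each bin."""
    confidences = probs.max(axis=1)                       # top-class probability per sample
    correct = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap                    # weight by bin occupancy
    return float(ece)
```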
Visualizations & Results
Calibration Analysis
Reliability diagrams showing calibration performance across different methods. TTA dramatically improves calibration, reducing ECE from 0.0811 to 0.0267 (a 67% improvement).


Left: Uniform kNN (ECE: 0.0796). Right: Attn-KNN + TTA (ECE: 0.0267) — showing 67% calibration improvement.
k Parameter Analysis
Accuracy remains stable across k ∈ [1, 50], validating robustness. NLL decreases with k as predicted by theory.



Noise Robustness
Noise robustness analysis reveals that attention is LESS robust than uniform kNN: Uniform kNN drops only 0.04% at 30% noise, while Attn-KNN drops 1.05%—invalidating the robustness claim.


Method Comparison
Visual comparison of all methods showing convergence between uniform, distance, and attention-weighted kNN.

Training Dynamics
Training curves showing loss convergence and validation performance across epochs.

Experimental Timeline
Baseline Establishment
ResNet18, 128-dim, single-head attention. Result: Attention ≈ Uniform ≈ Distance (88.61% accuracy).
Discovery: a training-evaluation disconnect (loss computed on classifier logits, evaluation on kNN predictions) was identified and fixed.
Scaling Up
ResNet50, 256-dim, 4-head attention, contrastive loss. Result: +1.08% accuracy (89.70%) but no attention advantage.
Finding: Architectural improvements don't create separation—all methods converge to identical performance.
Enhanced Training with TTA
TTA, MixUp, label smoothing, optimized config. Result: 78% ECE reduction with TTA.
Discovery: TTA is the real innovation, not attention. Attention alone provides no calibration benefit.
Reproducibility
Pattern confirmed—TTA consistently improves calibration across experiments.
Best Possible Results
ResNet50, 256-dim, 4-head attention, 50 epochs. Result: 91.55% accuracy, 0.0267 ECE (with TTA).
Critical Discovery: Attention is LESS robust to noise than uniform kNN (1.05% drop vs 0.04% drop). Project concluded - no further investigation needed.
Additional Results
k Parameter Robustness
Accuracy remains stable across k ∈ [1, 50], validating the robustness of kNN methods. Optimal k is 10-20 for CIFAR-10.
| k | Uniform | Distance | Attention |
|---|---|---|---|
| 1 | 86.82% | 86.82% | 86.82% |
| 5 | 86.83% | 86.83% | 86.86% |
| 16 | 86.82% | 86.85% | 86.85% |
| 20 | 86.83% | 86.83% | 86.85% |
| 50 | 86.80% | 86.80% | 86.84% |
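A sweep like the one above can be reproduced from precomputed neighbor lists. The sketch below assumes neighbors are already sorted by distance; all names are illustrative, not the repository's API.

```python
import numpy as np

def knn_accuracy(neigh_labels: np.ndarray, neigh_dists: np.ndarray,
                 true_labels: np.ndarray, k: int,
                 weighting: str = "uniform", n_classes: int = 10) -> float:
    """Accuracy of a kNN vote using the first k retrieved neighbors.

    neigh_labels: (N, K_max) labels of each query's neighbors, sorted by distance
    neigh_dists:  (N, K_max) corresponding distances
    """
    labels_k, dists_k = neigh_labels[:, :k], neigh_dists[:, :k]
    if weighting == "distance":
        w = 1.0 / (dists_k + 1e-8)          # inverse-distance weighting
    else:
        w = np.ones_like(dists_k)           # uniform vote
    votes = np.zeros((len(true_labels), n_classes))
    for c in range(n_classes):
        votes[:, c] = (w * (labels_k == c)).sum(axis=1)
    return float((votes.argmax(axis=1) == true_labels).mean())

# example sweep: for k in (1, 5, 16, 20, 50): print(k, knn_accuracy(nl, nd, y, k))
```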
Noise Robustness (30% label noise)
Critical finding: Attention is LESS robust to noise than uniform kNN. Uniform kNN drops only 0.04%, while Attn-KNN drops 1.05%—invalidating the robustness claim.
| Method | Accuracy Drop |
|---|---|
| Uniform kNN | 0.04% |
| Attn-KNN | 1.05% |
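The 30% corruption is assumed here to be symmetric label noise applied to the memory-bank labels before retrieval; a minimal sketch of that injection (the repository's exact corruption scheme may differ):

```python
import numpy as np

def inject_label_noise(labels: np.ndarray, noise_rate: float = 0.3,
                       n_classes: int = 10, seed: int = 0) -> np.ndarray:
    """Symmetric label noise: flip a fraction of labels to a different random class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    # add a random offset in [1, n_classes-1] so a flipped label never keeps its class
    noisy[flip] = (labels[flip] + rng.integers(1, n_classes, flip.sum())) % n_classes
    return noisy
```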
Key Findings
What Works
- Test-Time Augmentation: 67% ECE reduction (0.0811 → 0.0267), 58% NLL reduction (1.236 → 0.513)
- Minimal Overhead: <1ms additional compute per query
- Distance-weighted kNN: Simple and effective, equally good as attention
What Doesn't Work
- Attention Alone: Only +0.02% accuracy improvement (within noise margin), no calibration benefit without TTA
- Robustness: Attention is LESS robust to label noise than uniform kNN (1.05% drop vs 0.04% drop)
- Accuracy Gap: 4.96% gap to CNN persists (fundamental kNN limitation)
- Core Novelty: Attention mechanism does not provide measurable benefits over simpler baselines
Architecture Overview
1. Embedding Network
ResNet18/50 backbone (ImageNet pretrained) with projection head (256-dim). L2-normalized embeddings ensure unit sphere representation.
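A minimal PyTorch sketch of this component, assuming a torchvision ResNet-50 backbone and a two-layer projection head; exact layer sizes and names are illustrative, not the repository's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class EmbeddingNet(nn.Module):
    """ResNet backbone + projection head producing L2-normalized embeddings."""

    def __init__(self, embed_dim: int = 256, pretrained: bool = True):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1" if pretrained else None)
        feat_dim = backbone.fc.in_features      # 2048 for ResNet50
        backbone.fc = nn.Identity()             # drop the ImageNet classification head
        self.backbone = backbone
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 2, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))
        return F.normalize(z, dim=-1)           # unit-sphere embeddings
```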
2. Memory Bank (FAISS)
Efficient similarity search using FAISS. Stores embeddings of all training samples (50K for CIFAR-10). Supports both exact (Flat L2) and approximate (HNSW) search.
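A minimal sketch of the memory bank using FAISS's exact `IndexFlatL2` (swap in `IndexHNSWFlat` for approximate search); the class and method names are illustrative rather than the repository's API. Since embeddings are L2-normalized, L2 distance and cosine similarity produce the same neighbor ranking.

```python
import faiss
import numpy as np

class MemoryBank:
    """Stores training embeddings + labels and returns the k nearest neighbors."""

    def __init__(self, embeddings: np.ndarray, labels: np.ndarray, use_hnsw: bool = False):
        d = embeddings.shape[1]
        if use_hnsw:
            self.index = faiss.IndexHNSWFlat(d, 32)    # approximate search, 32 links per node
        else:
            self.index = faiss.IndexFlatL2(d)          # exact L2 search
        self.index.add(embeddings.astype(np.float32))  # e.g. 50K x 256 for CIFAR-10
        self.labels = labels

    def search(self, queries: np.ndarray, k: int = 16):
        dists, idx = self.index.search(queries.astype(np.float32), k)
        return dists, self.labels[idx]                 # (B, k) distances, (B, k) neighbor labels
```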
3. Multi-Head Attention
Query-key attention with learned temperature per head. Novel components: neighbor self-attention, label-conditioned bias, prototype-guided scoring. However, experimental results show these provide no measurable benefit over simpler distance weighting.
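A stripped-down sketch of the core idea, multi-head query-key attention with a learned per-head temperature over the k retrieved neighbors; the label-conditioned bias and prototype-guided scoring are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class NeighborAttention(nn.Module):
    """Weights each retrieved neighbor via multi-head query-key attention."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.log_temp = nn.Parameter(torch.zeros(n_heads))   # learned temperature per head

    def forward(self, query: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # query: (B, dim), neighbors: (B, k, dim)
        B, k, _ = neighbors.shape
        q = self.q_proj(query).view(B, self.n_heads, 1, self.head_dim)
        kx = self.k_proj(neighbors).view(B, k, self.n_heads, self.head_dim).transpose(1, 2)
        scores = (q * kx).sum(-1) / self.head_dim ** 0.5      # (B, heads, k)
        scores = scores * self.log_temp.exp().view(1, -1, 1)  # per-head temperature scaling
        weights = scores.softmax(dim=-1).mean(dim=1)           # average heads -> (B, k)
        return weights                                         # one weight per neighbor
```

The resulting per-neighbor weights take the place of uniform or inverse-distance weights in the class vote.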
4. Test-Time Augmentation
Averages predictions over 5 augmented views (original, horizontal flip, 3 random crops). This simple technique provides the largest improvements—67% ECE reduction (0.0811 → 0.0267).
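A minimal sketch of the averaging step, assuming a `predict_proba(images)` callable (hypothetical name) that runs embedding, retrieval, and neighbor weighting for a batch of image tensors; view choices follow the text (original, horizontal flip, three random crops).

```python
import torch
import torchvision.transforms as T

def tta_predict(predict_proba, images: torch.Tensor, n_crops: int = 3) -> torch.Tensor:
    """Average kNN class probabilities over 5 augmented views per image."""
    _, _, h, w = images.shape
    crop = T.Compose([T.Pad(4, padding_mode="reflect"), T.RandomCrop((h, w))])

    views = [images, torch.flip(images, dims=[-1])]         # original + horizontal flip
    views += [crop(images) for _ in range(n_crops)]          # 3 random crops

    probs = torch.stack([predict_proba(v) for v in views])   # (5, B, C)
    return probs.mean(dim=0)                                  # averaged prediction
```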
Honest Assessment
Core Finding: Attention-weighted kNN provides no meaningful benefit over distance-weighted kNN. The calibration improvements come from Test-Time Augmentation (TTA), not attention. Attention alone provides only +0.02% accuracy improvement (within noise margin) and is LESS robust to label noise than uniform kNN.
Recommendation: Use distance-weighted kNN with TTA for production. Attention adds complexity without meaningful benefit.
However, this is still a valuable scientific contribution:
- Negative result: Shows that attention doesn't help in this setting (saves others' time)
- TTA finding: Demonstrates TTA's effectiveness for kNN calibration (method-agnostic)
- Methodological lessons: Highlights importance of ablation studies and baseline establishment
- Honest reporting: Transparent assessment of limitations and failures
Project Status: CONCLUDED - December 2025. No further investigation needed.
Repository & Documentation
Complete codebase, experiments, and results available on GitHub.
Comprehensive documentation includes:
- RESEARCH_NARRATIVE.md: Complete chronological story of the research journey
- HONEST_ASSESSMENT.md: Transparent evaluation of limitations and failures
- EXPERIMENTS.md: Detailed experiment documentation
- ARCHITECTURE_DIAGRAMS.md: Comprehensive ASCII art diagrams