Why Accuracy Lied: Lessons from Building an NLP System

I recently rebuilt the intelligence layer of my personal finance app, Velar, with a focus on making transaction understanding both reliable and adaptable.

The initial approach was straightforward: a rule-based system mapping known merchants (like Swiggy or Netflix) to categories. While this worked for common cases, it quickly broke down when faced with noisy inputs, inconsistent phrasing, or unseen vendors.

The Limits of Pure Machine Learning

For example:

"₹1200 paid"
→ No merchant, no context → model fails

"Netflix electricity bill"
→ Conflicting signals → incorrect classification

To address this, I engineered a synthetic dataset (3,000+ samples) with controlled complexity—mixing clean, noisy, and adversarial patterns. This allowed me to evaluate the model across difficulty levels and revealed a key insight:

"High accuracy on clean data does not translate to robustness."

In controlled settings, the model achieved ~97% accuracy.

But when evaluated on realistic, noisy inputs, performance dropped to ~48%.

On highly ambiguous transactions (e.g., "₹1200 paid"), accuracy fell to 0%.

Even replacing the model with BERT only improved performance marginally (~50%).

The Hybrid Pipeline

This led me to rethink the system entirely—not as a single model, but as a pipeline.

I designed a hybrid architecture where:

deterministic rules handle high-confidence patterns
ML handles uncertain cases
low-confidence predictions are filtered out

This prevents the system from making confident but incorrect decisions.

// Architecture flow:
text
  → normalization
  → extraction (amount, merchant)
  → rules
  → ML fallback
  → confidence filtering

Conclusion

Finally, I deployed the model as a standalone inference service and integrated it with a Node.js backend, adding a feedback loop where user corrections are stored for future retraining.

The result is not just a classifier, but a continuously improving system—one that learns from real usage rather than static datasets. This shift from "model accuracy" to "system reliability" was the most important takeaway from the entire process.