NDCG Score: Fine-Tuning Recommendation System Evaluation
Introduction: Why NDCG?
NDCG (Normalized Discounted Cumulative Gain) is a ranking-centric metric. Unlike accuracy-based approaches, it emphasizes the position of relevant items, which is crucial for understanding user satisfaction in recommender systems.
The NDCG score also explains how well the top-N recommendations match the actual user interactions. DCG explains the gain of items based on their positions, and NDCG normalises it.
From an ML lead's point of view, a higher NDCG means users see more relevant items up front. This is consistent with other KPIs such as engagement and conversion.
Basic Fundamentals: DCG & IDCG
- DCG: Sum of relevance scores, each discounted by the base-2 logarithm of its position plus one, so the first position is not discounted.
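    import numpy as np

    def dcg_at_k(relevance, k):
        """Compute DCG@k for relevance scores listed in ranked order."""
        relevance = np.asarray(relevance, dtype=float)[:k]
        # Positions 1..n are discounted by log2(position + 1)
        discounts = np.log2(np.arange(2, relevance.size + 2))
        return np.sum(relevance / discounts)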
- IDCG: Ideal DCG, i.e. the DCG you would get if the items were ranked in descending order of relevance.
- NDCG: DCG divided by IDCG. This scales the score to the range 0 to 1, which makes results easy to compare across users and datasets.
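The three definitions above can be combined into a single NDCG@k function. The following is a minimal sketch, not the article's original script: it uses only the standard library so it runs on its own, the names `dcg` and `ndcg_at_k` are illustrative, and returning 0.0 when no item is relevant is an assumed convention.

```python
import math

def dcg(relevance, k):
    """DCG@k: relevance at each position divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """NDCG@k = DCG@k / IDCG@k for relevance scores given in ranked order."""
    # IDCG is the DCG of the same scores sorted from most to least relevant
    idcg = dcg(sorted(relevance, reverse=True), k)
    # Assumed convention: a list with no relevant item scores 0.0
    return dcg(relevance, k) / idcg if idcg > 0 else 0.0
```

A ranking that is already in descending order of relevance scores exactly 1.0, and any swap that pushes a relevant item down lowers the score.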
Hello World Example: Simple NDCG Demo
The table below shows how a small difference in ordering affects NDCG.
Scenario | k | NDCG Score |
---|---|---|
Bad Predictions | 5 | 0.6957 |
Better Predictions | 5 | 0.7182 |
Ideal Ranking | 5 | 1.0000 |
Truncated (k=4) | 4 | 0.3520 |
Tied Scores | 5 | 0.5000 |
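To see the effect of ordering in code, the sketch below ranks items by predicted score and computes NDCG@k against the true relevance. The relevance labels and scores are hypothetical and are not the data behind the table above, so the numbers will differ. Ties are broken by original item order here, which is an assumption; other implementations may average over tied positions.

```python
import math

def ndcg_from_scores(y_true, y_score, k):
    """Rank items by predicted score (descending), then compute NDCG@k."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    # Item indices ordered from highest to lowest predicted score
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order]
    idcg = dcg(sorted(y_true, reverse=True))
    return dcg(ranked) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevance for five items
y_true = [3, 2, 3, 0, 1]
bad_scores = [0.1, 0.2, 0.3, 0.9, 0.8]     # irrelevant item ranked first
better_scores = [0.9, 0.2, 0.8, 0.1, 0.3]  # two best items first, rest imperfect
ideal_scores = [0.9, 0.6, 0.8, 0.1, 0.3]   # same order as the true relevance

for name, scores in [("bad", bad_scores), ("better", better_scores),
                     ("ideal", ideal_scores)]:
    print(name, round(ndcg_from_scores(y_true, scores, k=5), 4))
```

Lowering `k` truncates the list before scoring, so the same predictions can receive a very different NDCG at a different cutoff.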
Handling Larger Systems: Implicit vs. Explicit Data
- Explicit: Ratings, likes, or feedback forms. NDCG evaluates how well you’re ranking items users explicitly rated highly.
🔗 Script for NDCG score on explicit dataset & algo comparison
surpriselib is used to try different algorithms.
- SVD++ performed best but took the longest to run.
- The NDCG scores calculated for all algorithms are low, which suggests that the recommendations are not great.
- NDCG is an offline metric; it doesn’t replace online metrics (KPIs), which need to be measured through an A/B test.
Algorithm | NDCG@10 | Runtime (s) |
---|---|---|
SVD | 0.0470 | 3.35 |
SVD++ | 0.0692 | 48.73 |
NMF | 0.0062 | 3.02 |
KNNBasic | 0.0004 | 28.34 |
KNNWithMeans | 0.0016 | 31.12 |
KNNWithZScore | 0.0014 | 33.73 |
KNNBaseline | 0.0010 | 37.54 |
SlopeOne | 0.0015 | 23.65 |
CoClustering | 0.0058 | 2.91 |
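A score such as NDCG@10 in the table above is typically an average of per-user values. The sketch below shows one way to compute that average from plain `(user, item, true_rating, estimated_rating)` tuples. This tuple layout and the function name `mean_ndcg_at_k` are assumptions for illustration, and the linked script may rank against a different candidate set (for example the whole catalogue), which would produce much lower values than ranking only the rated test items as done here.

```python
import math
from collections import defaultdict

def mean_ndcg_at_k(predictions, k=10):
    """Average per-user NDCG@k from (user, item, true_r, est_r) tuples."""
    by_user = defaultdict(list)
    for user, _item, true_r, est_r in predictions:
        by_user[user].append((est_r, true_r))

    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    scores = []
    for rows in by_user.values():
        # Order each user's items by estimated rating, keep the true ratings
        ranked = [t for _, t in sorted(rows, key=lambda x: x[0], reverse=True)]
        idcg = dcg(sorted((t for _, t in rows), reverse=True))
        scores.append(dcg(ranked) / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical predictions for two users
preds = [("u1", "a", 5, 4.8), ("u1", "b", 1, 2.0),
         ("u2", "a", 1, 4.0), ("u2", "b", 5, 3.0)]
print(mean_ndcg_at_k(preds, k=10))
```

Here `u1` is ranked perfectly while `u2` has the two items swapped, so the mean falls between the two per-user scores.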
- Implicit: Clicks, time spent (behavioural data), or purchases. NDCG helps interpret real-world engagement signals, highlighting the top actions you most want to rank first.
🔗 Script for NDCG score calculation on implicit dataset & comparison
The script trains a linear rec-sys model on the publicly available lastfm data (here). Running it gives the following results:
- NDCG@10 - All Products: 0.0025
- NDCG@10 - Top Played Products Only: 0.0022
These are not good NDCG scores. A low NDCG suggests weak recommendation quality, which is expected given the high sparsity and noise in implicit datasets. The model we trained is also a simple linear ALS model.
ALS (Alternating Least Squares): a matrix factorisation model which assumes structured rating behaviour. In an implicit dataset (clicks, plays, purchases, etc.), such structure is not usually captured.
An improvement could be to try:
- Weighted confidence-based ALS
- Hybrid model
- Noise filtering before trying ALS
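As a rough illustration of the first and third ideas, the sketch below filters out low-count interactions and converts the remaining play counts into a binary preference plus a confidence weight of the form `1 + alpha * count`, which is the usual formulation for confidence-weighted implicit ALS. The function name, the `min_plays` threshold, and the default `alpha` are hypothetical choices that would need tuning; this is preprocessing only, not an ALS solver.

```python
def confidence_weights(play_counts, alpha=40.0, min_plays=1):
    """Map {(user, item): count} to {(user, item): (preference, confidence)}.

    alpha and min_plays are illustrative defaults, not tuned values.
    """
    weighted = {}
    for (user, item), count in play_counts.items():
        if count < min_plays:
            continue  # noise filter: drop interactions below the threshold
        preference = 1.0 if count > 0 else 0.0  # did the user interact at all?
        confidence = 1.0 + alpha * count        # more plays, more trust
        weighted[(user, item)] = (preference, confidence)
    return weighted

# Hypothetical play counts
plays = {("u1", "artist_a"): 3, ("u1", "artist_b"): 1}
print(confidence_weights(plays, alpha=10.0, min_plays=2))
```

The model then fits the binary preferences while weighting each squared error by its confidence, so a single accidental play counts for far less than repeated listening.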
Managerial Perspective: Leveraging NDCG Insights
- Strategy: Align recommended items with high-relevance or high-value categories.
- Goal Tracking: A higher NDCG correlates with increased user satisfaction, retention, and ultimately revenue.
- Dashboard Integration: Combine NDCG with business KPIs (sales, CTR) to see the direct impact of your recommendation algorithm adjustments.
New Metrics: Hitrate NDCG & Beyond
- Hitrate NDCG: A blend of immediate user “hits” and the ranking-based viewpoint.
- Other Variants: MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), etc.
- Why Expand?: Different business objectives call for different metrics. Some teams care about quick discovery (hitrate), others about deep engagement (DCG-based).
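For comparison with NDCG, the metrics named above can be sketched for a single user with binary relevance. The function names are illustrative, and averaging `reciprocal_rank` and `average_precision` over all users gives MRR and MAP respectively. A blended "Hitrate NDCG" is not shown because its exact definition is not given here.

```python
def hit_at_k(ranked_items, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked_items[:k]) else 0.0

def reciprocal_rank(ranked_items, relevant):
    """1 / position of the first relevant item, or 0.0 if none is found."""
    for pos, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def average_precision(ranked_items, relevant):
    """Mean of precision@position over the positions of relevant items."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking and ground truth for one user
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(hit_at_k(ranked, relevant, 2),
      reciprocal_rank(ranked, relevant),
      average_precision(ranked, relevant))
```

Hit rate and reciprocal rank only look at whether and where the first relevant item appears, while average precision and NDCG reward getting every relevant item near the top.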
Conclusion
NDCG is not just a metric. It’s a direct KPI through which one can see
- ranking performance
- user engagement
and connect both to business outcomes.