NDCG Score: Fine-Tuning Recommendation System Evaluation

Introduction: Why NDCG?

NDCG (Normalized Discounted Cumulative Gain) is a ranking-centric metric. Unlike accuracy-based metrics, it rewards placing relevant items near the top of the list, which is crucial for understanding user satisfaction in recommender systems.

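The NDCG score also indicates how well the top-N recommendations match the actual user interactions.
DCG measures the gain contributed by each item, discounted by its position in the list. NDCG normalises it by the best achievable DCG, so that scores fall between 0 and 1 for non-negative relevance values.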

From an ML lead's point of view, a higher NDCG means users see more relevant items up front, which is in line with other KPIs such as engagement and conversion.

Basic Fundamentals: DCG & IDCG

DCG: The sum of relevance scores over the top k positions, each discounted by the base-2 logarithm of its position plus one.

\[\begin{aligned} DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)} \end{aligned}\]

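import numpy as np

def dcg_at_k(relevance, k):
    """Compute DCG@K for relevance scores listed in ranked order."""
    # Keep only the top-k positions
    relevance = np.asarray(relevance, dtype=float)[:k]
    # Position i (1-based) is discounted by log2(i + 1)
    dcg = np.sum(relevance / np.log2(np.arange(2, relevance.size + 2)))
    return dcg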

IDCG: Ideal DCG, i.e. the DCG you would get if the items were ranked in decreasing order of relevance (most relevant items at the top).
NDCG: DCG divided by IDCG, which normalises results for easy comparison across different datasets.

\[\begin{aligned} NDCG = \frac{DCG}{IDCG} \end{aligned}\]
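The two formulas above can be combined into a small helper. The following is a minimal sketch in plain Python (standard library only, so it runs without NumPy); the function names, the zero-IDCG convention, and the example relevance list are assumptions made for illustration.

```python
import math

def dcg_at_k(relevance, k):
    """DCG@k: sum of rel_i / log2(i + 1) over the first k positions (i starts at 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """NDCG@k = DCG@k / IDCG@k, with IDCG@k taken from the list sorted by decreasing relevance."""
    idcg = dcg_at_k(sorted(relevance, reverse=True), k)
    # Assumed convention: a list with no relevant items scores 0 instead of dividing by zero.
    return dcg_at_k(relevance, k) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevance labels, listed in the order the items were recommended
ranked = [3, 2, 3, 0, 1]
print(f"DCG@5  = {dcg_at_k(ranked, 5):.4f}")
print(f"NDCG@5 = {ndcg_at_k(ranked, 5):.4f}")
```

Note that IDCG is computed from the full relevance list before truncation, so a highly relevant item that falls outside the top k still lowers the score.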

Hello World Example: Simple NDCG Demo

How a small difference in ordering affects NDCG.

Scenario	k	NDCG Score
Bad Predictions	5	0.6957
Better Predictions	5	0.7182
Ideal Ranking	5	1.0000
Truncated (k=4)	4	0.3520
Tied Scores	5	0.5000
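To see the effect of ordering in isolation, the sketch below scores three orderings of the same four items. The relevance labels are hypothetical and are not the data behind the table above, so the numbers it prints will differ from the table.

```python
import math

def ndcg_at_k(relevance, k):
    """NDCG@k for relevance labels listed in recommended order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    idcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / idcg if idcg > 0 else 0.0

# Same items, different orderings (hypothetical labels, 3 = most relevant)
rankings = {
    "worst first": [0, 1, 2, 3],
    "top item fixed": [3, 1, 2, 0],
    "ideal": [3, 2, 1, 0],
}
for name, rels in rankings.items():
    print(f"{name:>15}: NDCG@4 = {ndcg_at_k(rels, 4):.4f}")
```

Moving only the most relevant item to the first position recovers most of the gap to the ideal ranking, because the logarithmic discount penalises the top positions the least.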

🔗 Open NDCG Sample Notebook

Handling Larger Systems: Implicit vs. Explicit Data

Explicit: Ratings, likes, or feedback forms. NDCG evaluates how well you’re ranking items users explicitly rated highly.

🔗 Script for NDCG score on explicit dataset & algo comparison
- surpriselib is used to try different algorithms.
- SVD++ achieved the highest NDCG@10 but took the longest to run.
- The NDCG scores for all algorithms are low, which suggests that the recommendations are not great.
  - NDCG is an offline metric; it does not replace online metrics (KPIs), which need to be measured through an A/B test.

Algorithm	NDCG@10	Runtime (s)
SVD	0.0470	3.35
SVD++	0.0692	48.73
NMF	0.0062	3.02
KNNBasic	0.0004	28.34
KNNWithMeans	0.0016	31.12
KNNWithZScore	0.0014	33.73
KNNBaseline	0.0010	37.54
SlopeOne	0.0015	23.65
CoClustering	0.0058	2.91
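An NDCG@10 figure like those in the table is typically an average over users. The sketch below shows one way to compute it from rating predictions; it is not the linked script. The `(user_id, item_id, true_rating, estimated_rating)` tuple layout and the toy predictions are assumptions for illustration, and a real library's prediction objects may carry different fields.

```python
import math
from collections import defaultdict

def ndcg_at_k(relevance, k):
    """NDCG@k for relevance labels listed in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    idcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / idcg if idcg > 0 else 0.0

def mean_ndcg_at_k(predictions, k=10):
    """Average per-user NDCG@k over (user_id, item_id, true_rating, estimated_rating) tuples."""
    per_user = defaultdict(list)
    for uid, _iid, true_r, est in predictions:
        per_user[uid].append((est, true_r))
    scores = []
    for pairs in per_user.values():
        # Rank each user's items by the model's estimate, then read off the true ratings.
        pairs.sort(key=lambda p: p[0], reverse=True)
        scores.append(ndcg_at_k([true_r for _est, true_r in pairs], k))
    return sum(scores) / len(scores) if scores else 0.0

# Toy predictions (hypothetical values)
preds = [
    ("u1", "a", 5, 4.8), ("u1", "b", 1, 4.9), ("u1", "c", 3, 2.0),
    ("u2", "a", 4, 4.5), ("u2", "b", 2, 1.0),
]
print(f"mean NDCG@10 = {mean_ndcg_at_k(preds, k=10):.4f}")
```

Here the true rating serves as the relevance label, so a model is penalised when it ranks a low-rated item above a high-rated one for the same user.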

Implicit: Clicks, time spent (behavioural data), or purchases. NDCG helps interpret real-world engagement signals, highlighting the top actions you most want to rank first.
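With implicit data there are usually no graded ratings, so a common simplification is binary relevance: an item counts as relevant if the user interacted with it in a held-out period. The sketch below assumes that setup; the function name, the held-out split, and the toy item ids are illustrative, and the recommended list is assumed to contain no duplicates.

```python
import math

def binary_ndcg_at_k(recommended, held_out, k=10):
    """NDCG@k with relevance 1 if a recommended item is in the held-out set, else 0."""
    held_out = set(held_out)
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, item in enumerate(recommended[:k], start=1)
        if item in held_out
    )
    # Ideal case: every held-out item (up to k of them) sits at the top of the list.
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(held_out), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: one of two held-out items is recommended, at position 2
print(f"{binary_ndcg_at_k(['a', 'b', 'c', 'd'], {'b', 'x'}, k=10):.4f}")
```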

🔗 Script for NDCG score calculation on implicit dataset & comparison

The script trains a linear rec-sys model on the publicly available lastfm data (here).

The results are:

- NDCG@10 - All Products: 0.0025                                                  
- NDCG@10 - Top Played Products Only: 0.0022

This is not a good NDCG score: a low value suggests weak recommendation quality. That is expected given the high sparsity and noise in implicit datasets. Also, the model we trained is a linear ALS model.

ALS (Alternating Least Squares): a matrix factorisation model that assumes structured rating behaviour. Implicit datasets (clicks, plays, purchases, etc.) usually do not exhibit such structure.

Possible improvements to try:

- Weighted confidence-based ALS
- Hybrid model
- Noise filtering before trying ALS
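Two of these ideas can be sketched as preprocessing steps on raw play counts. The confidence formula c = 1 + alpha * r is the linear weighting commonly used in implicit-feedback ALS; the `min_plays` threshold, the `alpha` default, and the toy counts below are assumptions to be tuned, not values taken from the script.

```python
def filter_noise(play_counts, min_plays=2):
    """Drop (user, item) pairs with fewer than min_plays interactions (hypothetical threshold)."""
    return {pair: n for pair, n in play_counts.items() if n >= min_plays}

def confidence_weights(play_counts, alpha=40.0):
    """Linear confidence c_ui = 1 + alpha * r_ui; alpha is a tunable assumption."""
    return {pair: 1.0 + alpha * n for pair, n in play_counts.items()}

# Toy play counts keyed by (user, item)
plays = {("u1", "a"): 1, ("u1", "b"): 5, ("u2", "a"): 3}
cleaned = filter_noise(plays, min_plays=2)
weights = confidence_weights(cleaned, alpha=40.0)
print(weights)
```

The weights would then be passed to a confidence-weighted ALS solver, so that frequently repeated interactions count for more than one-off plays.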

Managerial Perspective: Leveraging NDCG Insights

Strategy: Align recommended items with high-relevance or high-value categories.
Goal Tracking: A higher NDCG correlates with increased user satisfaction, retention, and ultimately revenue.
Dashboard Integration: Combine NDCG with business KPIs (sales, CTR) to see direct impact of your recommendation algorithm adjustments.

New Metrics: Hitrate NDCG & Beyond

Hitrate NDCG: A blend of immediate user “hits” and the ranking-based viewpoint.
Other Variants: MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), etc.
Why Expand?: Different business objectives call for different metrics. Some care about quick discovery (hitrate), others about deep engagement (DCG-based).
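For comparison, here is a minimal sketch of the per-user building blocks behind hit rate, MRR, and MAP, using their standard textbook definitions with binary relevance. It does not attempt to define the blended "Hitrate NDCG" metric, and the function names and toy data are illustrative. MRR and MAP are the means of the last two quantities over all users.

```python
def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0

def reciprocal_rank(recommended, relevant):
    """1 / position of the first relevant item, or 0.0 if none is recommended."""
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(recommended, relevant):
    """Mean of precision@i at each relevant hit, normalised by the number of relevant items."""
    hits, total = 0, 0.0
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical recommendation list and relevant set for one user
recs, rel = ["a", "b", "c", "d"], {"b", "d"}
print(hit_rate_at_k(recs, rel, 1), reciprocal_rank(recs, rel), average_precision(recs, rel))
```

Normalising average precision by the number of relevant items is one common convention; some implementations divide by the number of hits in the list instead.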

Conclusion

NDCG is not just a metric; it is a direct KPI through which one can see ranking performance and user engagement, and connect them to business outcomes.

References

🔗 NDCG paper
MovieLens Dataset

Nikhil Singh