Text Embeddings by Weakly-Supervised Contrastive Pre-training
Introduction
- Contrastively train text embeddings on a curated, web-scale dataset of text pairs
- Contrastive-learning recipe - in-batch negatives with a large batch size
- Call this method - Weakly Supervised Contrastive Pre-Training
Data Collection - CCPairs
Harvesting Semi-Structured Data Sources
Let (q, p) denote (query, passage) text pairs. The data sources used are:
- (post, comment) from Reddit
- (question, upvoted answer) from StackExchange
- (entity name + section title, passage) from English Wikipedia
- (title, abstract) and citation pairs from scientific papers
- (title, passage) pairs from Common Crawl
~ 1.3B text pairs (mostly from Reddit and Common Crawl)
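A minimal sketch of how such weakly-supervised pairs could be represented in code; the field and source names here are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TextPair:
    """One weakly-supervised (q, p) pair; field names are illustrative."""
    query: str    # e.g. a Reddit post, a question, or "entity name + section title"
    passage: str  # e.g. the comment, the upvoted answer, or the Wikipedia passage
    source: str   # e.g. "reddit", "stackexchange", "wikipedia", "papers", "commoncrawl"

# Example records mirroring the source types listed above (contents are made up)
pairs = [
    TextPair("Why is the sky blue?", "Rayleigh scattering ...", "reddit"),
    TextPair("How does Python's GIL work?", "The GIL is a mutex that ...", "stackexchange"),
]
```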
Consistency-Based Filtering
- A first-round model is trained on the 1.3B noisy text pairs.
- It is then used to rank each pair against 1M random passages.
- A text pair is kept only if it falls within the top-k of the ranked list.
- k = 2 is chosen based on manual inspection, which leaves ~270M pairs.
Assumption - When trained on noisy datasets, neural networks tend to memorize the clean labels first and only gradually overfit the noisy labels, so the first-round model's rankings can be used to filter out noisy pairs.
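A minimal sketch of the consistency-based filtering loop, assuming the first-round bi-encoder (trained on the noisy 1.3B pairs) is exposed as `encode_query` / `encode_passage` functions returning L2-normalized embeddings; the names and single-pair scoring loop are assumptions, not the paper's implementation:

```python
import random
import torch

@torch.no_grad()
def consistency_filter(pairs, encode_query, encode_passage, k=2, pool_size=1_000_000):
    """Keep a (q, p) pair only if p ranks in the top-k against a pool of random passages."""
    # Sample a pool of random passages and embed them once (batched in practice).
    pool = random.sample([p for _, p in pairs], min(pool_size, len(pairs)))
    pool_emb = encode_passage(pool)                        # (pool_size, d)

    kept = []
    for q, p in pairs:
        q_emb = encode_query([q])                          # (1, d)
        p_score = (q_emb @ encode_passage([p]).T).item()   # score of the paired passage
        pool_scores = (q_emb @ pool_emb.T).squeeze(0)      # scores against the random pool
        # Rank = 1 + number of pool passages that outscore the paired passage.
        rank = 1 + int((pool_scores > p_score).sum())
        if rank <= k:
            kept.append((q, p))
    return kept
```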
Training
Contrastive Pre-Training
- Distinguish relevant text pairs from other irrelevant pairs
- Given a collection of text pairs \(\{(q_i, p_i)\}_{i=1}^n\), assign a list of negative passages \(\{p_{i,j}^-\}_{j=1}^m\) for the i-th example.
- InfoNCE contrastive loss: \(\min_{\theta} L = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s_\theta(q_i,\; p_i)}}{e^{s_\theta(q_i,\; p_i)} + \sum_{j=1}^{m} e^{s_\theta(q_i,\; p_{i,j}^-)}}\), where \(s_\theta(q, p)\) is a scoring function between q and p parametrized by \(\theta\).
- Pre-trained Transformer encoder + average pooling over the output layer's token embeddings.
- Use a shared encoder and break the symmetry by adding the prefix identifiers “query: ” and “passage: ”.
- Negative sampling - in-batch negatives (the other passages in the same batch serve as negatives for each query).
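A minimal PyTorch sketch of the shared encoder with prefix identifiers, average pooling, and the in-batch-negative InfoNCE loss; the backbone name, temperature, and exact prefix strings are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"                 # assumed backbone, not the paper's exact choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)  # shared between queries and passages

def embed(texts, prefix):
    """Prefix the texts, encode them, and average-pool the last layer over real tokens."""
    batch = tokenizer([prefix + t for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def info_nce_in_batch(queries, passages, temperature=0.01):
    """InfoNCE with in-batch negatives: passage j != i acts as a negative for query i."""
    q = embed(queries, "query: ")
    p = embed(passages, "passage: ")
    scores = q @ p.T / temperature                       # (B, B) similarity matrix
    labels = torch.arange(len(queries))                  # positives sit on the diagonal
    return F.cross_entropy(scores, labels)
```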
Fine-tuning with Labeled Data
- Supervised fine-tuning with NLI (Semantic Textual Similarity) and MS-MARCO + NQ (Retrieval).
- Hard negatives mined with a cross-encoder for the MS-MARCO and NQ datasets.
- NLI - use contradiction sentences as hard negatives.
- Loss function is a combination of distillation and contrastive loss: \(\min \; D_{\mathrm{KL}}(p_{ce}\,\|\,p_{stu}) + \alpha L_{cont}\), where \(p_{ce}\) and \(p_{stu}\) are the probabilities from the cross-encoder teacher model and the student model.
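A minimal sketch of this combined objective, assuming the student bi-encoder's scores and the cross-encoder teacher's scores over the same candidate list (one positive followed by m hard negatives); the function name, `alpha`, and the temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def finetune_loss(student_scores, teacher_scores, alpha=0.2, temperature=1.0):
    """KL distillation from the cross-encoder teacher plus the contrastive loss.

    student_scores, teacher_scores: (B, 1 + m) scores over the positive (column 0)
    and m hard negatives for each of the B queries.
    """
    log_p_stu = F.log_softmax(student_scores / temperature, dim=-1)
    p_ce = F.softmax(teacher_scores / temperature, dim=-1)
    distill = F.kl_div(log_p_stu, p_ce, reduction="batchmean")      # D_KL(p_ce || p_stu)

    labels = torch.zeros(student_scores.size(0), dtype=torch.long)  # positive at index 0
    contrastive = F.cross_entropy(student_scores, labels)
    return distill + alpha * contrastive
```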