Deep Learning in Information Retrieval. Part I: Introduction and Sparse Retrieval

Information Retrieval (IR) is an important field of research in computer science, and it is not only about building big search engines like Google or Bing. We encounter IR systems when we use an online store’s search, talk to a chat-bot or receive an automatic email with recommended online courses. IR is a hot field in industry: big online marketplaces, streaming services, social media platforms, job search services and many other companies build their own information retrieval systems. The relevancy of search or recommendation results has a crucial impact on business performance in these fields, and improving search quality by a few percent can increase profits by millions of dollars. Some companies even create their own R&D departments dedicated to research problems in information retrieval. Many problems in modern information retrieval are still unsolved, and researchers work to define and solve them. For example, the TREC-COVID Information Retrieval Challenge was held in 2020: participants competed in building a pandemic document search system that helps identify answers to important questions from biomedical researchers, clinicians and policy makers. In 2022 the more general TREC 2022 Fair Ranking Track was held along with other challenges; it focused on fairly prioritizing Wikimedia articles for editing, so that articles from different groups receive fair exposure.
One promising area for information retrieval systems is medical research. The vast majority of clinical trials fail to meet their patient recruitment goals, and inefficient patient recruitment is a major barrier to medical research. An important solution to this problem is to build an information retrieval system that utilizes the vast amounts of patient data already available in the form of electronic health records (EHRs). In the TREC 2022 Clinical Trials Track, participants were challenged to build approaches for retrieving clinical trials from ClinicalTrials.gov, a required registry for clinical trials.

The field of IR has a relatively long history of research: the Text REtrieval Conference (TREC) was founded in 1992. But recent advances in deep learning have changed this field dramatically. This is the first article of the “Deep Learning in Information Retrieval” series, in which we will look at how recent deep learning methods are applied in modern information retrieval systems. We will focus mostly on textual information retrieval. Similarity search for images, sounds and other objects, as well as multi-modal information retrieval, are also important and noteworthy fields of research, but we will not dive deep into those topics in this series. In this article we start with the problem definition of textual information retrieval, the general framework for solving IR, classical IR algorithms, and ways to improve the performance of classical IR algorithms using deep learning.

What is Information Retrieval (IR)? It is reasonable to define information retrieval as a ranking problem. We have a query (also called a context), which is a formal statement of an information need. For example, a query can be a search string, a question from a chat, or data about a user for whom we need to show recommendations. And we have a set of candidates: data entities available for retrieval for a particular query. For example, candidates can be paragraphs, documents, images, videos, audio files or other entities. In this series of articles I will focus on textual data. In information retrieval a query does not identify a unique object in the collection of candidates. Instead, several candidates may match the query with different degrees of relevancy. Some queries have strong matches; others match only weakly, with relatively low relevancy, but are still related to the query. This is why most IR systems rank the results by some kind of relevancy score and return a list of top-ranked results. Thus, IR can be defined as the problem of ranking a collection of data entities by relevancy to a particular query. This is an informal definition, given for a better understanding of the IR problem.

There are different metrics that measure how good an IR algorithm or system is at the ranking task. Two of the most important are Recall@N and MRR. Recall is the fraction of relevant results that were retrieved; Recall@N is the fraction of relevant results that appear among the first N top-ranked results. The reciprocal rank of a ranked result list for a query is the multiplicative inverse of the rank of the first relevant result. The Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a dataset of queries:

MRR = (1 / |Q|) · Σᵢ (1 / rankᵢ)

where Q is a sample of queries and rankᵢ is the rank position of the first relevant candidate for the i-th query.
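Both metrics are straightforward to compute once we have, for each query, the ranked list of retrieved ids and the set of relevant ids. Here is a minimal sketch in Python; the document ids and the two toy queries are invented purely for illustration.

```python
def recall_at_n(ranked_ids, relevant_ids, n):
    """Fraction of relevant items that appear among the top-n ranked results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:n]) & set(relevant_ids)) / len(relevant_ids)


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1 / rank of the first relevant result over a set of queries."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Two toy queries: the first relevant result is at rank 2 and rank 1 respectively,
# so MRR = (1/2 + 1/1) / 2 = 0.75.
print(mean_reciprocal_rank([["d3", "d1"], ["d7", "d2"]], [{"d1"}, {"d7"}]))
# 1 of 3 relevant documents is in the top-2 results, so Recall@2 = 1/3.
print(recall_at_n(["d3", "d1", "d5"], {"d1", "d5", "d9"}, n=2))
```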

In the scientific world, different public benchmarks are available to measure the quality of IR systems. One of the largest and most popular for open-domain IR is the MS MARCO dataset. It comprises 1,010,916 anonymized questions sampled from Bing’s search query logs, each with a human-generated answer, plus 182,669 completely human-rewritten answers. In addition, the dataset contains 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing that provide the information necessary for curating the natural language answers [1]. One of the tasks proposed by the authors of MS MARCO is ranking a set of retrieved passages given a question. This task is a good benchmark for a large-scale open-domain IR system, so the dataset is often used in scientific papers to benchmark new models and approaches in modern information retrieval. There is also the TREC Complex Answer Retrieval (TREC CAR) dataset. It is smaller and based on ~7 million paragraphs from all Wikipedia pages [2], and it is also often used as a benchmark for modern IR approaches in scientific papers. In addition, there are other smaller and older public IR datasets, benchmarks for different variations of the IR problem, datasets for different languages, etc. You can check the ir_datasets repository and Python library, which provides a common interface to many IR ranking datasets.
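As a quick illustration, a hedged sketch of accessing MS MARCO passage data through the ir_datasets library might look like the following; the dataset id "msmarco-passage/dev/small" is one of the subsets the library provides, so check its catalog for the exact id you need.

```python
# A small sketch of accessing an IR benchmark via the ir_datasets library.
import ir_datasets

dataset = ir_datasets.load("msmarco-passage/dev/small")

for query in dataset.queries_iter():
    print(query.query_id, query.text)                  # e.g. a question from Bing's logs
    break

for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)  # relevance judgments
    break
```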

Typical IR system pipeline

How do we build an information retrieval system? A straightforward approach is to design an algorithm or a model which scores the relevancy of a candidate to a query, and to build a system which computes scores for all candidates and returns a ranked result list. However, it is impractical to rank the whole set of candidates for each query. This is why most information retrieval systems use a pipeline with at least two stages (see the picture above). The first stage is usually referred to as the candidate retrieval stage; the second and subsequent stages are referred to as ranking or re-ranking stages. At the candidate retrieval stage a fast but not very precise algorithm is used to retrieve a reasonable subset of candidates which probably contains relevant results. Then, at the ranking stages, more complex algorithms and machine learning models are used to rank the retrieved subset and obtain the top relevant results for the end user. In the subsequent sections of this article we will dive into the candidate retrieval stage of the IR pipeline and see how it works and how modern deep learning methods can be applied to improve its results.
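As a rough illustration of the two-stage idea, the skeleton below wires a fast first-stage retriever to a slower re-ranking model. The `bm25_index` and `reranker` objects are hypothetical stand-ins for whatever implementations a real system would use; this is a sketch of the control flow, not a working system.

```python
def search(query, bm25_index, reranker, n_candidates=1000, n_results=10):
    # Stage 1: fast but approximate retrieval over the whole collection.
    candidates = bm25_index.retrieve(query, top_k=n_candidates)

    # Stage 2: expensive scoring applied only to the small candidate subset.
    scored = [(doc, reranker.score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:n_results]]
```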

One of the central components of classic information retrieval systems is the inverted index data structure. An inverted index stores a mapping from content, such as terms, words, numbers, etc., to its locations in a collection of data entities (documents, paragraphs, etc.). The purpose of an inverted index is to enable fast searches, at the cost of increased processing time when a new data entity is added to the system. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science.
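In its simplest form an inverted index is just a mapping from each term to the set of entities that contain it, as in this toy Python sketch:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document id -> text. Returns term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "d1": "this article is about information retrieval",
    "d2": "information retrieval is the process of obtaining information resources",
}
index = build_inverted_index(docs)
print(index["information"])  # {'d1', 'd2'}: both documents contain the term
```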

One of the simplest approaches to text representation for building an inverted index is the Bag-of-Words (BoW) model. In the BoW model a data entity’s text is represented as the bag (multiset) of its words (terms or tokens), disregarding grammar and word order. Let’s look at the following example. Here are two simple texts:

This article is about information retrieval.

Information retrieval is the process of obtaining information resources that are relevant to an information need.

Based on these two texts, a list of terms is constructed as follows for each text:

"this", "article", "is", "about", "information", "retrieval"
"information", "retrieval", "is", "the", "process", "of", "obtaining", "information", "resources", "that", "are", "relevant", "to", "an", "information", "need"

If we collect all tokens used in our two-text dataset into a single vocabulary and assign an index number to each token in the vocabulary, then we can represent each bag of words as a vector using the number of occurrences of each token in the bag. For example, let’s assume we define the vocabulary the following way:

 0: "about",
1: "an",
2: "are",
3: "article",
4: "is",
5: "obtaining",
6: "of",
7: "need"
8: "process",
9: "information",
10: "relevant",
11: "retrieval"
12: "resources",
13: "the",
14: "this",
15: "that",
16: "to"

Then, using the number of occurrences of each token, the bags for these two texts can be represented as the following two vectors:

1: [1,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0,0],
2: [0,1,1,0,1,1,1,1,1,3,1,1,1,1,0,1,1]
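The same vectors can be produced with a library such as scikit-learn. Note that CountVectorizer builds its vocabulary in alphabetical order, so the element positions differ from the hand-made vocabulary above, but the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "This article is about information retrieval.",
    "Information retrieval is the process of obtaining information resources "
    "that are relevant to an information need.",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)       # sparse matrix of shape (2, vocabulary size)
print(vectorizer.get_feature_names_out())   # vocabulary in alphabetical order
print(bow.toarray())                        # per-text token counts
```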

Bag-of-Words is a simple vector space model. It does not take into account that terms can have different importance in the context of a particular text. For example, common words like “a”, “the”, “this” almost always have high frequency but carry very little information. One of the simplest ways to address this problem is the tf-idf (term frequency–inverse document frequency) model. The tf-idf score is calculated as a product of two metrics: term frequency and inverse document frequency. Here one data entity is referred to as a document. Term frequency is the relative frequency of a term within a document, usually calculated as the number of times the term occurs in the document divided by the total number of terms in the document. The inverse document frequency reflects how much information the term provides and is usually calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The set of computed inverse document frequencies for the vocabulary is usually referred to as an IDF table. For the two-text dataset above, the IDF table computed using a base-10 logarithm looks as follows:

"about": 0.30103,
"an": 0.30103,
"are": 0.30103,
"article": 0.30103,
"is": 0,
"obtaining": 0.30103,
"of": 0.30103,
"need": 0.30103
"process": 0.30103,
"information": 0,
"relevant": 0.30103,
"retrieval": 0,
"resources": 0.30103,
"the": 0.30103,
"this": 0.30103,
"that": 0.30103,
"to": 0.30103

Calculating term frequencies and then the whole tf-idf metric for each term, these two texts can be represented as the following two vectors:

1: [0.05017, 0.0, 0.0, 0.05017, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05017, 0.0, 0.0],
2: [0.0, 0.01881, 0.01881, 0.0, 0.0, 0.01881, 0.01881, 0.01881, 0.01881, 0.0, 0.01881, 0.0, 0.01881, 0.01881, 0.0, 0.01881, 0.01881]
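The numbers above can be reproduced with a few lines of Python using exactly the definitions from this section (relative term frequency and a base-10 IDF). Library implementations such as scikit-learn’s TfidfVectorizer use slightly different smoothing and normalization, so their output would not match this toy computation exactly.

```python
import math
from collections import Counter

docs = [
    "this article is about information retrieval".split(),
    "information retrieval is the process of obtaining information resources "
    "that are relevant to an information need".split(),
]
vocab = ["about", "an", "are", "article", "is", "obtaining", "of", "need",
         "process", "information", "relevant", "retrieval", "resources",
         "the", "this", "that", "to"]

n_docs = len(docs)
# idf(t) = log10(number of documents / number of documents containing t)
idf = {t: math.log10(n_docs / sum(t in d for d in docs)) for t in vocab}

for doc in docs:
    counts = Counter(doc)
    tf_idf = [counts[t] / len(doc) * idf[t] for t in vocab]
    print([round(x, 5) for x in tf_idf])
```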

Tf-idf and Bag-of-Words are both simple vector space models. In real cases (with huge vocabularies of hundreds of thousands of terms) they both produce vectors having a relatively small number of nonzero elements. Such vectors are called sparse vectors, and methods of information retrieval that use sparse vectors are usually referred to as sparse retrieval.

One of the best-known classic algorithms for sparse retrieval is Okapi BM25 (BM stands for best match). Okapi BM25 is a family of scoring functions that ranks a set of documents based on the query terms appearing in each document. One popular variation of the BM25 function is the following:

score(D, Q) = Σᵢ₌₁ⁿ IDF(qᵢ) · f(qᵢ, D) · (k₁ + 1) / (f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl))

where D is a document, Q is a query containing terms q₁, …, qₙ, k₁ and b are tunable constants, avgdl is the average number of terms in a document, |D| is the number of terms in document D, f(qᵢ, D) is the number of times that term qᵢ occurs in document D, and IDF(qᵢ) is a variation of inverse document frequency which is usually calculated as follows:

IDF(qᵢ) = ln((N − n(qᵢ) + 0.5) / (n(qᵢ) + 0.5) + 1)

where n(qᵢ) is the number of documents containing the term qᵢ and N is the total number of documents.
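To make the formula concrete, here is a direct, unoptimized implementation of this BM25 variant in Python. Real systems precompute document statistics and walk an inverted index instead of scoring every document, and libraries such as rank_bm25 or search engines like Elasticsearch provide production-ready implementations.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (given as lists of terms) against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = {t: sum(t in d for d in docs) for t in set(query_terms)}
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for t in query_terms:
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = counts[t]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "this article is about information retrieval".split(),
    "information retrieval is the process of obtaining information resources".split(),
]
print(bm25_scores("information retrieval process".split(), docs))
```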

In this article we will not dive deep into all the theory behind the BM25 algorithm. Just note that although it was implemented in the 1980s, it is still used in a lot of information retrieval systems. Old classic algorithms for sparse retrieval have some advantages over recent machine-learning-based approaches:

  • Indexing and retrieval speed. Classic algorithms are much faster than modern DL-based approaches at the indexing stage (calculating vectors) and, in most cases, at the retrieval stage.
  • Explainability. The meaning of a sparse vector is obvious. We can easily check why a particular entity was retrieved for a particular query and which terms had the greatest impact. It is even possible to influence the results by simply deleting or updating particular terms or by adding custom rules for calculating term weights.

So classic sparse retrieval approaches are still used in modern IR systems. And it is useful to know how they work. Modern deep learning methods can be used to improve even sparse retrieval. Let’s see how it works in the next section.

Deep learning language models based on variations of the Transformer architecture have achieved state-of-the-art results in the vast majority of NLP tasks across different domains. One of the first and most popular language models is BERT (Bidirectional Encoder Representations from Transformers), developed and published by Jacob Devlin and his colleagues at Google in 2018 [3]. If you are not familiar with the Transformer architecture or BERT, I recommend this article. At the moment a lot of more powerful variations of such models are available: RoBERTa, XLNet and others. But how can we apply language models to information retrieval, and particularly to sparse retrieval algorithms?
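Before looking at the specific methods, it helps to see what a “contextualized text representation” is in practice. The sketch below uses the Hugging Face transformers library to obtain one embedding vector per token from a pretrained BERT model; these per-token vectors are the input that the approaches in this section build on.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("In some cases a burn of the stomach will require surgery.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # shape: (1, number of tokens, 768)
print(token_embeddings.shape)
```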

First, let’s look at an approach called DeepCT. As we saw above, term frequency is a common method for identifying the importance of a term in a query or document. But it is a weak signal, especially when the frequency distribution is flat, as in long queries or short documents where the text is of sentence/passage length. Let’s look at the following example from the original paper:

Two passages that mention ‘stomach’ twice; only the first passage is about the topic ‘stomach’

The DeepCT paper proposes a Deep Contextualized Term Weighting framework that learns to map BERT’s contextualized text representations to context-aware term weights. When applied to passages, DeepCT-Index produces term weights that can be stored in an ordinary inverted index for retrieval. When applied to query text, DeepCT-Query generates a weighted bag-of-words query. Both types of term weights can be used directly by typical first-stage retrieval algorithms [4].

DeepCT includes two main components: generating contextualized word embeddings through BERT, and predicting term weights through a linear regression for each token. The contextualized word embedding is a feature vector that characterizes the word’s syntactic and semantic role in a given context. DeepCT linearly combines these features into a word importance score. Given a ground-truth term weight for every word in a text, DeepCT aims to minimize the mean squared error. The DeepCT model, from BERT to the regression layer, is optimized end-to-end. The BERT component is initialized with a pretrained BERT model to reduce overfitting, and it is fine-tuned to align the contextualized word embeddings with the term-prediction task. The last regression layer is learned from scratch.

Predicting Term Weights Architecture
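A minimal sketch of this architecture (my own illustration of the idea, not the authors’ code) is a single linear layer on top of BERT’s per-token outputs, trained with a mean-squared-error loss against per-token target weights:

```python
import torch
from torch import nn
from transformers import AutoModel

class DeepCTLikeModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)              # initialized from pretrained BERT
        self.regression = nn.Linear(self.bert.config.hidden_size, 1)  # learned from scratch

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.regression(hidden).squeeze(-1)                    # one importance score per token

# Training step sketch, where `target_weights` holds a ground-truth weight per token:
# loss = nn.functional.mse_loss(model(input_ids, attention_mask), target_weights)
```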

DeepCT is benchmarked on the MS MARCO and TREC CAR datasets. BM25 retrieval on a DeepCT-Index can be 25% more accurate than classic tf-based indexes, and more accurate than some widely used multi-stage retrieval systems. Variations of the DeepCT approach are also called W-index retrieval.

Similar approaches can be used for long document retrieval. For example, a work called HDCT (Context-aware Hierarchical Document Term weighting framework) presents a framework for document indexing and retrieval [5]. HDCT first estimates the semantic importance of a term in the context of each passage. These fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored in a standard inverted index for efficient retrieval using classic sparse approaches like BM25. The HDCT architecture from the original paper:

The HDCT architecture

HDCT is trained to predict an importance weight for each token of a passage, as in the DeepCT approach. The main difference is that the authors of HDCT propose two weak supervision approaches for training HDCT instead of manually labeling the importance of every term in every passage. The first approach uses metadata fields of a document (such as title, keywords and abstract) to generate weak labels based on the percentage of field instances that contain a token from the document. The second approach uses a dataset of queries and relevant documents to generate weak labels based on the percentage of relevant queries that mention a term from the document.
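To picture the passage-to-document aggregation step, here is a toy sketch of combining passage-level term weights into a document-level bag of words. The plain sum is only an illustration; HDCT itself also considers weighting the passages, for example giving passages earlier in the document more influence.

```python
from collections import defaultdict

def aggregate_passage_weights(passage_weights):
    """passage_weights: list of {term: weight} dicts, one per passage of a document."""
    doc_weights = defaultdict(float)
    for weights in passage_weights:
        for term, weight in weights.items():
            doc_weights[term] += weight
    return dict(doc_weights)

passages = [{"stomach": 0.9, "pain": 0.4}, {"stomach": 0.2, "acid": 0.7}]
print(aggregate_passage_weights(passages))
# {'stomach': 1.1, 'pain': 0.4, 'acid': 0.7} -> stored in an ordinary inverted index
```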

HDCT was evaluated on four document retrieval datasets: the MS MARCO Document Ranking dataset, ClueWeb09-B, ClueWeb09-C and ClueWeb12-C. It significantly improved sparse retrieval accuracy compared to the typical term-frequency approach.

Improved term weighting does not help when the query and the relevant text do not match by terms directly, for example when the query uses synonyms or contains terms that are only indirectly related to the text. For such cases another family of methods, called document expansion, comes into play. Document expansion is a way to improve information retrieval quality by expanding a document with relevant terms or phrases. There are many methods for document expansion, but here we consider a few deep-learning-based approaches.

The first approach is called Doc2Query. It is a simple method that predicts which queries will be issued for a given document and then expands the document with those predictions, using a vanilla sequence-to-sequence neural network trained on pairs of queries and relevant documents [6]. The idea of the method is illustrated by the following picture from the original paper:

Doc2Query Idea

The trained Doc2Query model is used to predict 10 queries using top-k random sampling, and the generated queries are appended to each document in the index. Then the BM25 algorithm is used for first-stage retrieval. The next version of this method, called DocTTTTTQuery, is trained the same way but uses the modern Transformer-based T5 model [7].
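A sketch of the expansion step with the transformers library is shown below. The checkpoint name is, to my knowledge, the one published by the DocTTTTTQuery authors on the Hugging Face hub, but treat it as an assumption and verify it before use.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint name, verify before use
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

passage = "The stomach secretes acid and enzymes that digest food."
inputs = tokenizer(passage, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=64,
    do_sample=True,            # top-k random sampling, as in the paper
    top_k=10,
    num_return_sequences=3,    # several predicted queries per document
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))

# The generated queries are appended to the passage text before building the BM25 index.
```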

Document expansion methods like Doc2Query can be used in combination with W-index retrieval. But some modern deep learning methods do both document expansion and term weighting at the same time. Let’s look at some examples.

A framework called SparTerm directly learns sparse text representations in the full vocabulary space. SparTerm comprises an importance predictor, which predicts the importance of each term in the vocabulary, and a gating controller, which controls term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, unifying term weighting and expansion in the same framework [8]. The SparTerm architecture schema from the original paper:

SparTerm Architecture

The importance predictor outputs the semantic importance of all the terms for a given passage. It is trained end-to-end by optimizing the ranking objective. Unlike DeepCT, it is not fitted directly to statistical term importance labels; instead, it treats the importance scores as intermediate variables that can be learned from a distant supervisory signal for passage ranking.

The gating controller generates a binary gating signal of which terms to activate to represent the passage. It learns to activate terms present in the passage and some other terms related to the passage topic. This is how SparTerm does document expansion: by activating additional terms that are not present in the passage originally. A passage-query/summary parallel corpus is used to train the gating controller, so the approach to document expansion is similar to Doc2Query. The difference is that the sparse representation of passages is learned directly and the expansion and term weighting modules are trained jointly. SparTerm outperforms the DeepCT method on the MS MARCO Passage Retrieval dataset.

A more modern approach called SPLADE (SParse Lexical AnD Expansion model) adds slight but essential changes to the SparTerm model that dramatically improve its performance [9]. First, it introduces a log-saturation effect on term weights, which prevents some terms from dominating and naturally ensures sparsity of the representations. Second, it introduces a loss that uses in-batch hard negatives sampled by BM25; the in-batch negative sampling strategy is widely used for training image retrieval models and has been shown to be effective for learning first-stage rankers. Third, it applies regularization to the loss to obtain a well-balanced index. These changes significantly improve on the SparTerm approach. The next version of this method, called SPLADE v2, introduces several further improvements in effectiveness and efficiency: the pooling mechanism of the model is modified and models trained with distillation are introduced [10].
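A minimal sketch of how a SPLADE-style sparse representation is computed is shown below, assuming a BERT-based masked language model head. In practice a SPLADE checkpoint trained with the ranking loss would be used; plain bert-base-uncased is loaded here only to illustrate the shape of the computation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("information retrieval with sparse representations", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits              # (1, number of tokens, vocabulary size)

# Log-saturation of the MLM logits, then pooling over token positions gives one
# weight per vocabulary term (max pooling here, as in SPLADE v2; the original
# model sums over positions).
weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)  # (vocabulary size,)

top = torch.topk(weights, k=10)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```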

There are also other deep learning approaches to sparse retrieval [11]. The general idea of applying deep learning to sparse retrieval is to find a way to obtain sparse text representations that work better with classic sparse retrieval algorithms like BM25.

In this part we have considered the basic concepts of information retrieval systems: the inverted index, bag-of-words, tf-idf, the Recall@N and MRR metrics, sparse retrieval and the BM25 algorithm. Then we had a look at different approaches to improving the performance of sparse retrieval using deep learning: W-index retrieval, document expansion models and hybrid approaches like SparTerm or SPLADE. Sparse retrieval methods have some advantages, such as explainability and the existence of efficient implementations, but other approaches can outperform them in some cases. In the next part we will look at alternative methods for first-stage retrieval that can be used in IR systems.

  1. Tri Nguyen et al., MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016, arXiv:1611.09268v3
  2. Federico Nanni, Bhaskar Mitra, Matt Magnusson, Laura Dietz, Benchmark for Complex Answer Retrieval, 2017, arXiv:1705.04803
  3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018, arXiv:1810.04805
  4. Zhuyun Dai and Jamie Callan, Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval, 2019, arXiv:1910.10687v2
  5. Zhuyun Dai and Jamie Callan, Context-Aware Document Term Weighting for Ad-Hoc Search, WWW ’20: Proceedings of The Web Conference 2020.
  6. Rodrigo Nogueira et al., Document Expansion by Query Prediction, 2019, arXiv:1904.08375
  7. Rodrigo Nogueira and Jimmy Lin, From doc2query to docTTTTTquery, 2019
  8. Yang Bai et al., SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval, 2020, arXiv:2010.00768
  9. Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant, SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2288–2292, 2021, doi: 10.1145/3404835.3463098
  10. Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, 2021, arXiv:2109.10086
  11. SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval, CIKM ’22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, doi:10.1145/3511808.3557456

Andrei Khobnia

https://www.linkedin.com/in/andreikhobnia/ Blog about building NLP systems: chat-bots, search engines, recommenders, summarization tools and smart assistants