At this link you will find a Python notebook that implements a simple search engine. You can open the notebook in Colab by clicking the following link:
This search engine is set up to parse and index AdHoc8 [1], a well-known test collection in the information retrieval community; search it using TF-IDF or BM25; and evaluate the search results with trec_eval.
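To make the ranking step concrete, here is a minimal sketch of BM25 scoring over a tiny in-memory collection. It is not the notebook's implementation: the function names (`tokenize`, `build_index`, `bm25_search`), the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real engine would also
    # strip punctuation, remove stop words, and possibly stem.
    return text.lower().split()


def build_index(docs):
    """docs maps doc_id -> text. Returns per-document term counts,
    document frequencies, and the average document length."""
    tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())  # each document counts once per term
    avgdl = sum(sum(tf.values()) for tf in tfs.values()) / len(tfs)
    return tfs, df, avgdl


def bm25_search(query, tfs, df, avgdl, k1=1.2, b=0.75):
    """Score every document against the query and return
    (doc_id, score) pairs sorted by decreasing score."""
    n_docs = len(tfs)
    scores = {}
    for doc_id, tf in tfs.items():
        dl = sum(tf.values())  # document length in tokens
        score = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Smoothed IDF variant that is always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy documents (hypothetical, not taken from the collection).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "dogs chase cats",
    "d3": "the stock market fell",
}
tfs, df, avgdl = build_index(docs)
ranking = bm25_search("stock market", tfs, df, avgdl)
```

Here `ranking` contains only `d3`, because it is the only document that shares a token with the query. A TF-IDF ranker could reuse the same index and swap in a different scoring formula.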
This test collection consists of:
- a collection of documents,
- a set of topics, and
- a set of relevance assessments.
The documents are news articles from four newswires: Financial Times, Federal Register, Foreign Broadcast Information Service, and LA Times. You can access this collection through the Linguistic Data Consortium (LDC) by following this link.
The topics and relevance assessments can instead be downloaded from the Text REtrieval Conference (TREC) website by following this link.
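As a rough illustration of how relevance assessments are used, the sketch below parses lines in the standard TREC qrels layout (`topic iteration docno relevance`) and computes precision at k for one ranked list. The helper names `parse_qrels` and `precision_at_k` and the document identifiers are hypothetical; in the notebook this step is handled by trec_eval, which reports many more measures.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines of the form 'topic iteration docno relevance'
    into a nested dict: topic -> docno -> relevance grade."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, rel = parts
        qrels[topic][docno] = int(rel)
    return qrels


def precision_at_k(ranked_docnos, judged, k=10):
    """Fraction of the top-k results judged relevant. Documents missing
    from the assessments are treated as non-relevant, and the
    denominator is always k."""
    top = ranked_docnos[:k]
    hits = sum(1 for docno in top if judged.get(docno, 0) > 0)
    return hits / k


# Hypothetical assessments and run; the document ids are made up.
qrels_lines = [
    "401 0 DOC-A 1",
    "401 0 DOC-B 0",
    "402 0 DOC-C 1",
]
qrels = parse_qrels(qrels_lines)
run_for_401 = ["DOC-A", "DOC-B", "DOC-X", "DOC-Y"]
p_at_2 = precision_at_k(run_for_401, qrels["401"], k=2)
```

In this toy case one of the top two results for topic 401 is judged relevant, so `p_at_2` is 0.5.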