We’re going to try to build the start of a document search engine.
Our search engine should take in a static list of documents — long alphanumeric, lower-cased, and space-separated strings.
e.g. ['this is a document', 'this is another document']
Our search engine needs only to support queries that are alphanumeric, lower-cased, and space-separated strings.
e.g. 'some query'
We should expose this function to our users:
searchDocuments(query: string)
which returns a list of search results that each have a document index and a relevance score.
We should not include documents in the search results if they do not match at all. Our list should be sorted by highest relevance score first.
Ex: Calling searchDocuments with 'some query' might return a result that looks like [{index: 5, score: 2.828}, {index: 1, score: 1.414}]
Scoring
Document scores depend on the following components:
tf= term frequency =sqrt(number of times the term appears in the document)idf= inverse document frequency =1 + log(numDocs / (numDocsContainTerm + 1))dn= document length norm =1 / sqrt(numWordsInDocument)
We can calculate relevance scores utilizing those components:
Document score for query = sum((tf * idf² * dn) for each term in the query)
Ex:
scoring document:
'this is a long document about a dog and a cat'
with query:
'dog cat dog'
would return score:
2 * (tf_for_dog * idf_for_dog² * dn_for_dog) + (tf_for_cat * idf_for_cat² * dn_for_cat)
这道题本质上是在实现一个简化版的文档搜索引擎排序:给定一组静态文档和一个查询词列表,需要为每个匹配文档计算相关性分数,并按分数从高到低返回。核心思路是把每个词的贡献拆成三部分:词频 tf 反映该词在文档中出现得有多集中,idf 用来降低高频“泛词”的权重,dn 则对长文档做归一化,避免长文档天然占优。实现时通常会先预处理每篇文档的词频、文档长度,以及每个词出现在哪些文档中,再对查询词逐项累加分数。这个题重点考察哈希统计、倒排索引、打分公式和排序处理。