Unicorns / VO 面试真题解析:Document Search Engine Scoring,检索排序与 TF-IDF 风格打分

17次阅读
没有评论

We’re going to try to build the start of a document search engine.

Our search engine should take in a static list of documents — long alphanumeric, lower-cased, and space-separated strings.

e.g. ['this is a document', 'this is another document']

Our search engine needs only to support queries that are alphanumeric, lower-cased, and space-separated strings.

e.g. 'some query'

We should expose this function to our users:

searchDocuments(query: string)

which returns a list of search results that each have a document index and a relevance score.

We should not include documents in the search results if they do not match at all. Our list should be sorted by highest relevance score first.

Ex: Calling searchDocuments with 'some query' might return a result that looks like [{index: 5, score: 2.828}, {index: 1, score: 1.414}]

Scoring

Document scores depend on the following components:

  • tf = term frequency = sqrt(number of times the term appears in the document)
  • idf = inverse document frequency = 1 + log(numDocs / (numDocsContainTerm + 1))
  • dn = document length norm = 1 / sqrt(numWordsInDocument)

We can calculate relevance scores utilizing those components:

Document score for query = sum((tf * idf² * dn) for each term in the query)

Ex:

scoring document:

'this is a long document about a dog and a cat'

with query:

'dog cat dog'

would return score:

2 * (tf_for_dog * idf_for_dog² * dn_for_dog) + (tf_for_cat * idf_for_cat² * dn_for_cat)

这道题本质上是在实现一个简化版的文档搜索引擎排序:给定一组静态文档和一个查询词列表,需要为每个匹配文档计算相关性分数,并按分数从高到低返回。核心思路是把每个词的贡献拆成三部分:词频 tf 反映该词在文档中出现得有多集中,idf 用来降低高频“泛词”的权重,dn 则对长文档做归一化,避免长文档天然占优。实现时通常会先预处理每篇文档的词频、文档长度,以及每个词出现在哪些文档中,再对查询词逐项累加分数。这个题重点考察哈希统计、倒排索引、打分公式和排序处理。

正文完
 0