Unicorns / VO Interview Problem: Document Search Engine Scoring

13 Views
No Comments

We’re going to try to build the start of a document search engine.

Our search engine should take in a static list of documents — long alphanumeric, lower-cased, and space-separated strings.

e.g. ['this is a document', 'this is another document']

Our search engine needs only to support queries that are alphanumeric, lower-cased, and space-separated strings.

e.g. 'some query'

We should expose this function to our users:

searchDocuments(query: string)

which returns a list of search results that each have a document index and a relevance score.

We should not include documents in the search results if they do not match at all. Our list should be sorted by highest relevance score first.

Ex: Calling searchDocuments with 'some query' might return a result that looks like [{index: 5, score: 2.828}, {index: 1, score: 1.414}]

Scoring

Document scores depend on the following components:

  • tf = term frequency = sqrt(number of times the term appears in the document)
  • idf = inverse document frequency = 1 + log(numDocs / (numDocsContainTerm + 1))
  • dn = document length norm = 1 / sqrt(numWordsInDocument)

We can calculate relevance scores utilizing those components:

Document score for query = sum((tf * idf² * dn) for each term in the query)

Ex:

scoring document:

'this is a long document about a dog and a cat'

with query:

'dog cat dog'

would return score:

2 * (tf_for_dog * idf_for_dog² * dn_for_dog) + (tf_for_cat * idf_for_cat² * dn_for_cat)

This problem asks you to build a simple search ranking engine over a static set of documents. For each query, you compute a relevance score for every matching document using term frequency, inverse document frequency, and a document-length normalization factor, then return the results sorted by score descending. The key implementation idea is to precompute word counts, document lengths, and document frequency information so each query can be scored efficiently. It is a classic information retrieval design problem that tests hashing, indexing, and ranking logic.

END
 0