Pinterest OA Interview Question Walkthrough: Labels (Naive Bayes Classification)


Implement the missing code, denoted by ellipses. You may not modify the pre-existing code.

Your task is to implement parts of a Naive Bayes algorithm from scratch (i.e., without importing any libraries or packages beyond numpy or math). As a reminder, Naive Bayes classification is an application of Bayes’ Theorem that predicts the probability that a case, or set of data points, belongs to one or more classes. It comprises four major steps:

  • Calculate feature descriptive statistics by class.
  • Calculate prior probabilities.
  • Implement a Gaussian density function.
  • Calculate posterior probabilities according to Bayes’ Theorem, and make a prediction, which is the class with the largest predicted probability.
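As a rough illustration of Step 3, the Gaussian density can be written with nothing but `math`. This is a minimal sketch, not the skeleton's own code: the name `gaussian_density` and its argument order are assumptions, and it expects a strictly positive variance.

```python
import math


def gaussian_density(x, mean, var):
    """Density of the normal distribution N(mean, var) evaluated at x.

    Hypothetical helper: the name and signature are not taken from the
    original skeleton. Assumes var > 0.
    """
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = -((x - mean) ** 2) / (2.0 * var)
    return coeff * math.exp(exponent)
```

Under the naive independence assumption, the likelihood of a whole case for one class is the product of this density over its features, each evaluated with that class's mean and variance.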

To validate the algorithm implementation, you will need to use it for some classification tasks. Specifically, you will be given a two-dimensional array of float values x_train as training data, where each subarray x_train[i] represents a unique case. You will also be given a one-dimensional array y_train, where each element y_train[i] is the true class label of the corresponding subarray x_train[i]. In addition, you will be given x_test as test data, with the same format as the training data but without class labels.

Note: It is guaranteed that all training and test data will be float values. Some skeleton code for the algorithm has already been created, so please do not edit it. You should only implement code in the sections marked # implement this.

Example

For

x_train = [[-2.6, 1.9, 2.0, 1.0],
           [-2.8, 1.7, -1.2, 1.5],
           [2.0, -0.9, 0.3, 2.3],
           [-1.5, -0.1, -1.6, -1.1],
           [-1.0, -0.6, -1.2, -0.7],
           [-0.3, 1.2, 2.6, 0.2],
           [-1.8, -1.3, -0.1, -1.2],
           [0.2, 1.2, -0.6, -1.3],
           [-5.2, 0.3, 0.2, 2.2],
           [-0.8, -0.1, 1.5, -0.1],
           [-2.3, 0.3, 0.8, 0.7],
           [0.2, 3.0, 3.6, -0.9],
           [1.7, -0.8, -0.0, 2.0],
           [2.8, 0.8, 1.8, -0.7]]
y_train = [1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 2, 1, 0, 2]

and

x_test = [[-0.1, 1.4, 0.4, -1.0],
          [-1.3, 0.2, -1.3, -0.8],
          [-1.1, 1.5, -2.3, -2.5]]

the output should be solution(x_train, y_train, x_test) = [1, 0, 1].

The Naive Bayes Classifier should calculate the mean and variance of the features in x_train for each label in y_train (Step 1). This information will be used to calculate the prior probability for each label in Step 2. The Gaussian Density Function (Step 3) and the prior probability estimates will be used to calculate a posterior probability and predicted label for each case in x_test (Step 4).
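Steps 1 and 2 might be sketched as follows in plain Python. The function name `class_statistics` and the returned dictionary layout are assumptions made for illustration. The use of population variance (dividing by n rather than n - 1) is also an assumption, since the problem statement does not say which one the skeleton expects.

```python
def class_statistics(x_train, y_train):
    """Return ({label: (means, variances)}, {label: prior}).

    Hypothetical helper, not part of the original skeleton.
    """
    # Group the training rows by their class label.
    grouped = {}
    for row, label in zip(x_train, y_train):
        grouped.setdefault(label, []).append(row)

    stats, priors = {}, {}
    total = len(y_train)
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple per feature
        means = [sum(col) / n for col in columns]
        # Assumption: population variance (divide by n).
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(columns, means)
        ]
        stats[label] = (means, variances)
        # Prior = share of training cases carrying this label.
        priors[label] = n / total
    return stats, priors
```

For the example `y_train` above, label 0 appears 6 times and labels 1 and 2 appear 4 times each out of 14 cases, so the priors would be 6/14, 4/14 and 4/14.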

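This Pinterest OA question asks you to complete a Naive Bayes classifier: first compute the per-class mean and variance of each feature in the training set, then compute the prior probability of each class, then use the Gaussian density function to obtain the likelihood of a given sample under each class, and finally combine the likelihood with the prior to obtain the posterior probability, choosing the class with the highest probability as the prediction. The problem stresses that you may only fill in the designated implementation regions and that all inputs are floats, so the implementation needs care in three places: grouping by class, numerical stability when a variance is 0, and producing one predicted label for each test sample.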
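One possible way to handle the zero-variance case in Step 4 is to clamp each variance to a small floor and compare classes in log space. Note that this replaces the product of densities with a sum of log densities, which picks the same class but avoids underflow. It is a sketch under assumptions: `VAR_FLOOR`, `predict_one`, `predict`, and the `stats`/`priors` dictionary layout are all hypothetical and not part of the original skeleton.

```python
import math

VAR_FLOOR = 1e-9  # hypothetical floor that keeps a zero variance usable


def predict_one(x, stats, priors):
    """Return the label with the largest unnormalized log posterior.

    stats:  {label: (means, variances)}  (assumed layout)
    priors: {label: prior probability}   (assumed layout)
    """
    best_label, best_score = None, -math.inf
    for label, (means, variances) in stats.items():
        # log P(class) + sum over features of log N(x_j | mean_j, var_j)
        score = math.log(priors[label])
        for value, mean, var in zip(x, means, variances):
            var = max(var, VAR_FLOOR)  # avoid division by zero
            score += -0.5 * math.log(2.0 * math.pi * var)
            score -= (value - mean) ** 2 / (2.0 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def predict(x_test, stats, priors):
    """Predict one label per test case."""
    return [predict_one(x, stats, priors) for x in x_test]
```

The evidence term in Bayes' Theorem is the same for every class, so it can be dropped when only the arg max is needed. If the skeleton expects actual posterior probabilities, the per-class scores would have to be exponentiated and normalized instead.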
