Affirm VO Interview Question Walkthrough: Format Data for Modelling (Data Preprocessing and Feature Engineering)

def format_data_for_modelling(schema, training_data, label_data):

Takes in a list of headers, a training data set and info on the eventual outcome (a.k.a label/dependent variable/flag), merges and reformats the data to generate a list of feature vectors on which a binary logistic regression model may be built. All feature values must be real numbers.

:param schema: list of dictionaries representing column types, names and other info.

Every entry will have at least a 'type' and 'name' key and potentially others. Types will be one of {id, date, datetime, str, float, int, bool}. Names will be unique across entries.

e.g. [{'type': 'int', 'name': 'my_field'}, ...]

:param training_data: list of lists where each internal list represents a record to model.

Columns follow types from schema in the same order.

:param label_data: dict of {id: bool} representing the dependent variable (label) of the data set.

:return: list of lists of all numbers representing features capable of having logistic regression run on them.

The first col should have the id and the final col should have the dependent variable/label flag.

e.g.

[[1, 17, 3, 95.0, ..., 1],
 [2, 34, 5, 96.0, ..., 0],
 ...
]

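The task is to take mixed-type training data and, guided by the schema, turn it into a numeric feature matrix suitable for logistic regression. The core difficulty is to first align the training records with their labels by id, then convert each field according to its type: the id is kept as the first column, numeric fields (int, float) are used as-is, bool maps to 0/1, date and datetime are typically converted to a timestamp or another comparable number, and str fields need categorical encoding or some other numeric representation. Every row of the output must consist solely of numbers, with the label value appended as the last column.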
