Affirm VO Interview Coding Question: Format Data for Modelling


def format_data_for_modelling(schema, training_data, label_data):

Takes a list of column headers (the schema), a training data set, and information on the eventual outcome (a.k.a. label/dependent variable/flag), then merges and reformats the data into a list of feature vectors on which a binary logistic regression model may be built. All feature values must be real numbers.

:param schema: list of dictionaries representing column types, names and other info.

Every entry will have at least a 'type' and 'name' key and potentially others. Types will be one of {id, date, datetime, str, float, int, bool}. Names will be unique across entries.

e.g. [{'type': 'int', 'name': 'my_field'}, ...]

:param training_data: list of lists where each internal list represents a record to model.

Columns follow types from schema in the same order.

:param label_data: dict of {id: bool} representing the dependent variable (label) of the data set.

:return: list of lists of all numbers representing features capable of having logistic regression run on them.

The first col should have the id and the final col should have the dependent variable/label flag.

e.g.

[[1, 17, 3, 95.0, ..., 1],
 [2, 34, 5, 96.0, ..., 0],
 ...
]

This problem asks you to transform heterogeneous records into a fully numeric feature matrix suitable for binary logistic regression. The main steps are to align each training row with its label by id, preserve the id in the first column, convert numeric and boolean fields directly, encode string fields into numeric form, and turn date/datetime values into real numbers such as timestamps. The result should be a list of numeric vectors with the label appended as the final column.
