We’ve provided a labeled training set for a fraudulent booking model.
Each record is a historical Airbnb booking and its corresponding characteristics (or features). These features are:
- price: dollar amount paid for the reservation
- nights: number of nights
- market_avg_price_per_night: average price per night paid in the market
- past_delta_checkin: days between past reservations by the guest
- listing_market: market of the listing
- host_past_nights: number of prior nights hosted by the host
- ds: date-stamp of the reservation
- label: whether the reservation is fake (1) or not (0)
- pred_score: risk score from a model trained on this data set
Question:
Explore the data set and propose a good imputation strategy for missing values.
This Airbnb data-handling question asks candidates to inspect the missing-data pattern and design an imputation strategy that respects feature type, grouping, and leakage risk. A strong answer typically separates numeric and categorical fields, uses group-aware statistics such as market-level or time-aware medians where appropriate, adds missingness indicators, and treats absence as potentially informative for fraud detection. The key is to preserve signal while avoiding unrealistic global fill values and future-data leakage.
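The strategy described above can be sketched in pandas. This is a minimal illustration under stated assumptions, not a reference solution: it assumes the data is loaded into a DataFrame whose columns match the feature list, and the helper names (fit_imputer, apply_imputer), the "unknown" category, the cutoff date, and the toy rows are all made up for the example. Fill values are learned only from rows dated before the cutoff, so later bookings never influence the statistics, and label and pred_score are deliberately left out of the imputation inputs.

```python
import numpy as np
import pandas as pd


def fit_imputer(train: pd.DataFrame, numeric_cols: list) -> dict:
    """Learn fill values from the training window only (no future data)."""
    market = train["listing_market"].fillna("unknown")
    return {
        # One median per market and per numeric column.
        "market_median": train[numeric_cols].groupby(market).median(),
        # Fallback for markets that are unseen or entirely missing.
        "global_median": train[numeric_cols].median(),
    }


def apply_imputer(df: pd.DataFrame, stats: dict, numeric_cols: list) -> pd.DataFrame:
    """Add missingness flags, then fill using the fitted statistics."""
    out = df.copy()
    # Treat a missing market as its own category rather than guessing one.
    out["listing_market"] = out["listing_market"].fillna("unknown")
    for col in numeric_cols:
        # Record missingness before filling; absence may itself signal fraud.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        per_market = out["listing_market"].map(stats["market_median"][col])
        out[col] = out[col].fillna(per_market).fillna(stats["global_median"][col])
    return out


# Toy data, invented purely for illustration.
df = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                          "2020-01-04", "2020-02-01", "2020-02-02"]),
    "listing_market": ["Paris", "Paris", "Austin", None, "Paris", "Tokyo"],
    "price": [100.0, 300.0, 80.0, 50.0, np.nan, np.nan],
    "nights": [2.0, np.nan, 3.0, 1.0, np.nan, 4.0],
})
numeric_cols = ["price", "nights"]

# Inspect the missing-data pattern first: share of missing values per column.
missing_share = df.isna().mean()

# Time-aware split: fit on the past, apply to both past and future rows.
cutoff = pd.Timestamp("2020-02-01")
train, test = df[df["ds"] < cutoff], df[df["ds"] >= cutoff]

stats = fit_imputer(train, numeric_cols)
train_imp = apply_imputer(train, stats, numeric_cols)
test_imp = apply_imputer(test, stats, numeric_cols)
```

In this toy run, the missing Paris price in the test window is filled with the Paris median from the training window, while the Tokyo booking, whose market never appears in training, falls back to the global training median. Medians are used here because prices and night counts tend to be skewed; whether market-level medians are the right grouping for every feature is something the exploration step should confirm.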