Capital One OA Interview Question: Assigned Driver Data Collection

You are given access to the data containing information about taxi drivers and their rides, created by April 15th, 2023. When calculating any time features, consider April 15th, 2023 as today. The data is distributed across 6 different files:

drivers.csv

  • driver_id (type: int) — unique driver identifier
  • car_id (type: int)
  • age (type: int)
  • started_driving_year (type: int)
  • second_language (type: str). If a driver doesn’t have a second language, the value is "no"
  • rating (type: float)
  • net_worth_of_tips (type: float)
  • driver_class (type: str). One of the following: ["A class", "B class"]

rides_{i}.csv, split into 4 files:

  • ride_id (type: int)
  • driver_id (type: int)
  • passenger_id (type: int)
  • date (type: str)
  • status (type: str). One of the following: ["Rejected by the driver", "Cancelled by the passenger", "Success"]
  • car_clearness_upvote_given (type: bool)
  • politeness_upvote_given (type: bool)
  • communication_upvote_given (type: bool)
  • punctuality_upvote_given (type: bool)
  • complaint_given (type: bool)
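
Since the rides are split across four files with the same columns, a natural first step is to stack them into one table. The sketch below is one way to do that with pandas; the `load_rides` helper and the `rides_*.csv` glob pattern are assumptions for illustration, since the statement only gives the name template `rides_{i}.csv` and not the exact indices.

```python
from pathlib import Path

import pandas as pd


def load_rides(data_dir="."):
    """Read every rides_*.csv in data_dir and stack them into one DataFrame.

    Hypothetical helper: the glob pattern is an assumption based on the
    rides_{i}.csv naming in the task statement.
    """
    paths = sorted(Path(data_dir).glob("rides_*.csv"))
    if not paths:
        raise FileNotFoundError(f"no rides_*.csv files found in {data_dir!r}")
    # ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

Globbing rather than hard-coding four file names keeps the helper working whether the files are numbered from 0 or from 1.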

cars.csv

  • car_id (type: int)
  • model (type: str)
  • manufacture_year (type: int)
  • last_inspection_date (type: str)

Your task is to retrieve the needed information from the data about each driver and store it in the collected.csv file.

Your goal is to obtain a table with the following columns. You may order rows and columns in any way you find comfortable to work with; tests are designed to be order-agnostic:

  • driver_id (type: int) — unique driver identifier
  • car_model (type: str) — driver’s car model
  • car_manufacture_year (type: int) — driver’s car’s year of manufacture
  • days_since_inspection (type: int) — number of days passed since the last inspection of the driver’s car. It is guaranteed that each car had an inspection
  • age (type: int) — driver’s age
  • experience (type: int) — driver’s number of years of driving experience, calculated as 2023 - started_driving_year
  • second_language (type: str) — driver’s second language
  • rating (type: float) — driver’s average rating
  • net_worth_of_tips (type: float) — driver’s net worth of tips received
  • number_of_upvotes (type: int) — total number of various upvotes that the driver has received
  • driver_class (type: str) — class assigned to the driver
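
The two time-based columns in this list, days_since_inspection and experience, can be derived as in the following minimal sketch. It assumes the input already has last_inspection_date and started_driving_year in one table, and that the date strings are in a format `pd.to_datetime` can parse (the statement does not specify the format). The `add_time_features` name and the `TODAY` constant are illustrative, not part of the task.

```python
import pandas as pd

# Reference date given in the task statement
TODAY = pd.Timestamp("2023-04-15")


def add_time_features(df):
    """Return a copy of df with days_since_inspection and experience added."""
    out = df.copy()
    # Assumption: the date strings are parseable by pandas without a format hint
    inspected = pd.to_datetime(out["last_inspection_date"])
    out["days_since_inspection"] = (TODAY - inspected).dt.days
    # The statement defines experience as 2023 - started_driving_year
    out["experience"] = TODAY.year - out["started_driving_year"]
    return out


example = pd.DataFrame({
    "last_inspection_date": ["2023-04-01", "2022-04-15"],
    "started_driving_year": [2010, 2020],
})
result = add_time_features(example)
print(result[["days_since_inspection", "experience"]])
```

In this example the first row comes out as 14 days and 13 years, the second as 365 days and 3 years.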

This problem asks you to build a per-driver summary table by joining driver, car, and ride data spread across multiple CSV files. The key steps are to merge drivers with cars via car_id, combine all ride files, compute time-based features using April 15, 2023 as the reference date, and aggregate ride-level booleans into counts such as total upvotes. A clean pandas workflow with joins, date parsing, and groupby aggregation is the most natural approach.
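
The join and aggregation steps described above might look like the sketch below. It is one possible reading of the task, not a reference solution: it assumes the four upvote columns are parsed as booleans, counts upvotes on every ride regardless of status, and gives drivers with no rides a count of 0. The function names `count_upvotes` and `join_driver_data` are hypothetical.

```python
import pandas as pd

UPVOTE_COLS = [
    "car_clearness_upvote_given",
    "politeness_upvote_given",
    "communication_upvote_given",
    "punctuality_upvote_given",
]


def count_upvotes(rides):
    """Total upvotes per driver_id, summed over the four upvote columns."""
    # Assumption: the columns are boolean, so True counts as 1 when summed
    per_ride = rides[UPVOTE_COLS].sum(axis=1)
    totals = per_ride.groupby(rides["driver_id"]).sum()
    return totals.rename("number_of_upvotes").reset_index()


def join_driver_data(drivers, cars, rides):
    """Attach car details and upvote totals to each driver row."""
    cars = cars.rename(columns={
        "model": "car_model",
        "manufacture_year": "car_manufacture_year",
    })
    out = drivers.merge(cars, on="car_id", how="left")
    out = out.merge(count_upvotes(rides), on="driver_id", how="left")
    # Left join leaves NaN for drivers without rides; treat that as 0 upvotes
    out["number_of_upvotes"] = out["number_of_upvotes"].fillna(0).astype(int)
    return out


drivers = pd.DataFrame({"driver_id": [1, 2], "car_id": [10, 11]})
cars = pd.DataFrame({
    "car_id": [10, 11],
    "model": ["A", "B"],
    "manufacture_year": [2015, 2018],
})
rides = pd.DataFrame({
    "driver_id": [1, 1],
    "car_clearness_upvote_given": [True, False],
    "politeness_upvote_given": [True, True],
    "communication_upvote_given": [False, False],
    "punctuality_upvote_given": [True, False],
})
summary = join_driver_data(drivers, cars, rides)
print(summary)
```

Once the time-based columns are added and the required columns are selected, the result can be written with `summary.to_csv("collected.csv", index=False)`.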
