Capital One OA 面试真题解析:司机数据汇总分析(Data Collection)

18次阅读
没有评论

You are given access to the data containing information about taxi drivers and their rides, created by April 15th, 2023. When calculating any time features, consider April 15th, 2023 as today. The data is distributed across 6 different files:

drivers.csv

  • driver_id (type: int) — unique driver identifier
  • car_id (type: int)
  • age (type: int)
  • started_driving_year (type: int)
  • second_language (type: str). If a driver doesn’t have a second language, the value is "no"
  • rating (type: float)
  • net_worth_of_tips (type: float)
  • driver_class (type: str). One of the following: ["A class", "B class"]

rides_{i}.csv, split into 4 files:

  • ride_id (type: int)
  • driver_id (type: int)
  • passenger_id (type: int)
  • date (type: str)
  • status (type: str). One of the following: ["Rejected by the driver", "Cancelled by the passenger", "Success"]
  • car_clearness_upvote_given (type: bool)
  • politeness_upvote_given (type: bool)
  • communication_upvote_given (type: bool)
  • punctuality_upvote_given (type: bool)
  • complaint_given (type: bool)

cars.csv

  • car_id (type: int)
  • model (type: str)
  • manufacture_year (type: int)
  • last_inspection_date (type: str)

Your task is to retrieve the needed information from the data about each driver and store it in the collected.csv file.

Your goal is to obtain a table with the following columns. You may order rows and columns in any way you find comfortable to work with; tests are designed to be order-agnostic:

  • driver_id (type: int) — unique driver identifier
  • car_model (type: str) — driver’s car model
  • car_manufacture_year (type: int) — driver’s car’s year of manufacture
  • days_since_inspection (type: int) — number of days passed since the last inspection of the driver’s car. It is guaranteed that each car had an inspection
  • age (type: int) — driver’s age
  • experience (type: int) — driver’s number of years of driving experience, calculated as 2023 - started_driving_year
  • second_language (type: str) — driver’s second language
  • rating (type: float) — driver’s average rating
  • net_worth_of_tips (type: float) — driver’s net worth of tips received
  • number_of_upvotes (type: int) — total number of various upvotes that the driver has received
  • driver_class (type: str) — class assigned to the driver

这道题是一个典型的多表数据整理题,核心是把 drivers.csv、cars.csv 以及按文件拆分的 rides_{i}.csv 关联起来,最终为每位司机生成一张汇总表。需要注意的是,时间相关字段要以 2023-04-15 为基准来计算,例如车辆距上次检查的天数,以及驾驶年限 experience = 2023 – started_driving_year。统计类字段则需要对多次 ride 记录做聚合,例如把四个 rides 文件合并后统计各类 upvote 的总数。解题时通常会先用 driver_id 和 car_id 做 join,再用 pandas 对日期转换、布尔值求和、字符串字段保留原值,最后按要求输出 collected.csv。

正文完
 0