AppliedIntuition VO Interview Question: Database Schema Design for a Multi-Tenant Lakehouse Platform

17 Views
No Comments

You are a new engineer at a Databricks competitor startup named DataTricks, which is implementing its own Lakehouse platform. They have assigned you to the Admin Experience team, where you are building out setup and administration flows for the product.

Your PM and TL tell you that DataTricks wants to set up their Lakehouse platform as follows:

  • A Customer of DataTricks signs up for the product.
  • Each Customer can have multiple Users, as many different people at the company will want to use DataTricks (everyone from analysts to data engineers to ML practitioners). However, since Users only work for one company, each User only belongs to a single Customer.
  • Each Customer can have 0 or more Repos, where a Repo contains all of your data science notebooks, BI dashboards, and ML assets. Further, a Customer can assign 0 or more Users to any given Repo, and each User can belong to 0 or more Repos.
  • A Repo is required to be associated with exactly 1 ComputeEnvironment, which is a collection of cloud resources (a network for running Apache Spark, cloud storage for storing data, etc.) to use for data processing. One ComputeEnvironment can be shared by many different Repos, but only from a single customer.

Part 1: Design the database schema

Please write up a database schema that can properly model the relationships above. Please write your thoughts below.

Part 2: ComputeEnvironment API

Now that you’ve designed the database, you are ready to split up the work amongst yourself and your team. You are in charge of the APIs to manage (e.g. Create / Read / Delete) ComputeEnvironments, and you leave user, repo, and customer management to your teammate. Let’s go ahead and design the APIs for ComputeEnvironments below.

Part 3: Dealing with slow clouds

You’ve started to implement your API, and you realize that CreateComputeEnv takes more than 10 minutes. Turns out creating cloud infrastructure is not very fast. You are frustrated, as when you make an API request, it takes forever to return, to the point that you wonder if it’s even really working. How do you update your system?

This is a classic multi-tenant SaaS data modeling problem. The key is to map Customers, Users, Repos, and ComputeEnvironments into relational tables with the correct one-to-many and many-to-many relationships. A solid design usually includes separate tables for each entity, a join table for the User-Repo relationship, and foreign keys that enforce each User belonging to exactly one Customer and each Repo belonging to exactly one ComputeEnvironment, with ComputeEnvironments scoped to a single Customer. The later parts extend the problem into API design and system design, especially handling long-running cloud provisioning with an asynchronous workflow and status tracking.

END
 0