You are a new engineer at a Databricks competitor startup named DataTricks, which is implementing its own Lakehouse platform. They have assigned you to the Admin Experience team, where you are building out setup and administration flows for the product.
Your PM and TL tell you that DataTricks wants to set up their Lakehouse platform as follows:
- A
Customerof DataTricks signs up for the product. - Each
Customercan have multipleUsers, as many different people at the company will want to use DataTricks (everyone from analysts to data engineers to ML practitioners). However, sinceUsersonly work for one company, eachUseronly belongs to a singleCustomer. - Each
Customercan have 0 or moreRepos, where aRepocontains all of your data science notebooks, BI dashboards, and ML assets. Further, aCustomercan assign 0 or moreUsersto any givenRepo, and eachUsercan belong to 0 or moreRepos. - A
Repois required to be associated with exactly 1ComputeEnvironment, which is a collection of cloud resources (a network for running Apache Spark, cloud storage for storing data, etc.) to use for data processing. OneComputeEnvironmentcan be shared by many differentRepos, but only from a single customer.
Part 1: Design the database schema
Please write up a database schema that can properly model the relationships above. Please write your thoughts below.
Part 2: ComputeEnvironment API
Now that you’ve designed the database, you are ready to split up the work amongst yourself and your team. You are in charge of the APIs to manage (e.g. Create / Read / Delete) ComputeEnvironments, and you leave user, repo, and customer management to your teammate. Let’s go ahead and design the APIs for ComputeEnvironments below.
Part 3: Dealing with slow clouds
You’ve started to implement your API, and you realize that CreateComputeEnv takes more than 10 minutes. Turns out creating cloud infrastructure is not very fast. You are frustrated, as when you make an API request, it takes forever to return, to the point that you wonder if it’s even really working. How do you update your system?
This is a classic multi-tenant SaaS data modeling problem. The key is to map Customers, Users, Repos, and ComputeEnvironments into relational tables with the correct one-to-many and many-to-many relationships. A solid design usually includes separate tables for each entity, a join table for the User-Repo relationship, and foreign keys that enforce each User belonging to exactly one Customer and each Repo belonging to exactly one ComputeEnvironment, with ComputeEnvironments scoped to a single Customer. The later parts extend the problem into API design and system design, especially handling long-running cloud provisioning with an asynchronous workflow and status tracking.