Robinhood Interview #1 – Job Scheduler / Cron Service Design (System Design, Reliability, SLA, Observability)

Problem Statement: Build a service/system that can support defining and running jobs on a schedule.

Requirements:

Should be able to create jobs
Should be able to schedule and run jobs
Should be able to report failures and successes
Should be reliable and have strong guarantees about its job runs
Should be able to view logs and status of running jobs, as well as previously finished jobs
Should be able to handle a job that takes longer to run than expected (SLA)

Summary (what to design & key choices)

Goal: A horizontally scalable, highly reliable job orchestration service (cron-as-a-service).

Core components

API + Auth: Create/update/disable jobs, trigger-now, fetch runs/logs.
Metadata DB (SQL): Jobs, schedules (cron/interval), next_run_at, concurrency policy, retries, SLA, owner.
Scheduler: Time-wheel/priority queue that computes due jobs; writes enqueue records idempotently with a dedupe token.
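Queue (Kafka/SQS/PubSub): Durable, ordered per job (keyed by job_id, e.g. a Kafka partition key, an SQS FIFO message group, or a Pub/Sub ordering key); supports DLQ for poison messages.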
Workers: Pull tasks, enforce concurrency (per-job / per-owner), execute (HTTP, script, container, workflow), emit heartbeats.
Result Store (SQL/TSDB): JobRun state machine (QUEUED→RUNNING→SUCCESS/FAILED/TIMED_OUT), timings, retry_count.
Logs: Stream stdout/stderr to object storage + index to log service (ELK/Cloud Logging).
SLA/Timeouts: Per-run watchdog; if heartbeat missing or runtime > SLA → mark TIMED_OUT, trigger alerts, optional kill.
Retries/Backoff: Configurable attempts with exponential backoff + jitter; idempotent handlers required.
Exactly-once semantics: Use lease + heartbeat; a worker claims a run under a lease and renews it while executing. On expiry, another worker may re-claim the run, so delivery is at-least-once; the executor side must be idempotent (job-specific dedupe key), which gives effectively-once execution.
High availability: Scheduler active/passive with leader election (ZK/etcd), workers stateless & autoscaled, multi-AZ queues.
Observability: Metrics (scheduled/queued/latency, success rate, retry rate), traces per run, dashboards + alerts (SLA breach, backlog growth).
Backfill & pause windows: Support manual backfill; maintenance windows to suppress firing.
Security: Per-tenant isolation, secrets via vault; RBAC on jobs.
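
As an illustration of the Retries/Backoff item above, here is a minimal sketch of exponential backoff with jitter. The function name, the default base and cap values, and the choice of "full jitter" (a uniform delay between zero and the exponential ceiling) are assumptions for this example, not something the design prescribes.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a retry delay in seconds for a 1-based attempt number.

    The ceiling doubles with every attempt (base, 2*base, 4*base, ...) and is
    clamped to `cap`. The returned delay is drawn uniformly from [0, ceiling]
    so that many failing jobs do not retry at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    rng = rng or random.Random()
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(1, 6):
        print(attempt, round(backoff_delay(attempt), 2))
```

A worker would sleep (or requeue with a visibility delay) for the returned number of seconds before the next attempt, and stop once the job's configured attempt limit is reached.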

Key APIs

POST /jobs (name, schedule, target, payload, retries, timeout, concurrency, owner)
PATCH /jobs/{id} enable/disable/update
POST /jobs/{id}:runNow
GET /jobs/{id}
GET /runs?jobId=&status=&timeRange=
GET /runs/{runId}/logs
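
To make the POST /jobs body concrete, the sketch below models the listed fields as a dataclass and validates them before a job is stored. The field types, the defaults, and the rule that a schedule is a 5-field cron expression are assumptions for illustration; the design above also allows interval schedules, which this sketch does not cover.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CreateJobRequest:
    name: str
    schedule: str                 # assumed: 5-field cron expression
    target: str                   # e.g. a URL or container image reference
    owner: str
    payload: Dict[str, Any] = field(default_factory=dict)
    retries: int = 0              # extra attempts after the first failure
    timeout: int = 3600           # seconds before the run counts as timed out
    concurrency: int = 1          # max simultaneous runs of this job


def validate(req: CreateJobRequest) -> List[str]:
    """Return a list of human-readable problems; empty means the request is valid."""
    errors: List[str] = []
    if not req.name.strip():
        errors.append("name is required")
    if len(req.schedule.split()) != 5:
        errors.append("schedule must be a 5-field cron expression")
    if not req.target.strip():
        errors.append("target is required")
    if not req.owner.strip():
        errors.append("owner is required")
    if req.retries < 0:
        errors.append("retries must be >= 0")
    if req.timeout <= 0:
        errors.append("timeout must be > 0 seconds")
    if req.concurrency < 1:
        errors.append("concurrency must be >= 1")
    return errors
```

Returning all problems at once, rather than failing on the first, lets the API answer with a single 400 response that lists everything the caller needs to fix.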

Data model (simplified)

jobs(id, name, schedule, next_run_at, status, concurrency_limit, retry_policy, timeout_sla, owner, updated_at)
job_runs(id, job_id, scheduled_at, started_at, finished_at, status, attempt, worker_id, sla_violation, metrics_json)
job_events(id, run_id, ts, type, message) (append-only for audit)
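
The three tables above could be declared roughly as follows. This is a sketch using SQLite through Python's sqlite3 module so it runs anywhere; the column types, the defaults, the integer epoch timestamps, and the UNIQUE (job_id, scheduled_at) constraint used for dedupe are assumptions layered on top of the column lists given above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    schedule          TEXT NOT NULL,
    next_run_at       INTEGER,                     -- epoch seconds
    status            TEXT NOT NULL DEFAULT 'ENABLED',
    concurrency_limit INTEGER NOT NULL DEFAULT 1,
    retry_policy      TEXT,                        -- JSON blob
    timeout_sla       INTEGER,                     -- seconds
    owner             TEXT NOT NULL,
    updated_at        INTEGER
);
-- Lets the scheduler find due jobs without a full table scan.
CREATE INDEX idx_jobs_next_run_at ON jobs(next_run_at);

CREATE TABLE job_runs (
    id            INTEGER PRIMARY KEY,
    job_id        INTEGER NOT NULL REFERENCES jobs(id),
    scheduled_at  INTEGER NOT NULL,
    started_at    INTEGER,
    finished_at   INTEGER,
    status        TEXT NOT NULL DEFAULT 'QUEUED',
    attempt       INTEGER NOT NULL DEFAULT 1,
    worker_id     TEXT,
    sla_violation INTEGER NOT NULL DEFAULT 0,
    metrics_json  TEXT,
    UNIQUE (job_id, scheduled_at)                  -- dedupe key for enqueue
);

CREATE TABLE job_events (
    id      INTEGER PRIMARY KEY,
    run_id  INTEGER NOT NULL REFERENCES job_runs(id),
    ts      INTEGER NOT NULL,
    type    TEXT NOT NULL,
    message TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database at `path` and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With the unique constraint in place, a second insert for the same (job_id, scheduled_at) pair raises sqlite3.IntegrityError, which is what makes a repeated enqueue detectable.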

Scheduling strategy

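Compute next_run_at on write. The scheduler scans for jobs due within the next N minutes using an index on next_run_at and enqueues each one with a dedupe key of (job_id, scheduled_at). After enqueueing, it advances next_run_at to the next occurrence in a transaction to avoid double scheduling; the dedupe key makes a repeated enqueue harmless if the scheduler crashes between the two steps.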

Failure & SLA handling

Workers send heartbeats; if heartbeats stop, the coordinator marks the run LOST and requeues it.
The SLA watchdog checks whether now - started_at > timeout_sla; if so, it marks the run TIMED_OUT, emits an alert, and optionally calls a kill hook.
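
The two checks above can be sketched as a single classification function that a watchdog loop would call for every RUNNING run. The RunningRun type, the 30-second heartbeat tolerance, and the decision to test for a lost worker before testing the SLA are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningRun:
    run_id: int
    started_at: float          # epoch seconds
    last_heartbeat_at: float   # epoch seconds
    timeout_sla: float         # allowed runtime in seconds


def check_run(run: RunningRun, now: float, heartbeat_ttl: float = 30.0) -> str:
    """Classify a run as 'LOST', 'TIMED_OUT', or still 'RUNNING'."""
    # A silent worker is checked first: the run should be requeued, not killed.
    if now - run.last_heartbeat_at > heartbeat_ttl:
        return "LOST"
    # The worker is alive but the run has exceeded its SLA.
    if now - run.started_at > run.timeout_sla:
        return "TIMED_OUT"
    return "RUNNING"


if __name__ == "__main__":
    run = RunningRun(run_id=1, started_at=0, last_heartbeat_at=95, timeout_sla=120)
    for now in (100, 124, 130):
        print(now, check_run(run, now))
```

The caller would then act on the result: requeue a LOST run, and for a TIMED_OUT run record the state, emit the alert, and invoke the kill hook if one is configured.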

Scaling

Partition by job_id hash to ensure per-job ordering; use multiple scheduler shards if needed.
Worker pools per target type (HTTP vs batch). Autoscale on queue depth & wait latency.
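
Partitioning by job_id hash needs a hash that is stable across processes and restarts; Python's built-in hash() is salted per process for strings, so it is not suitable. A minimal sketch, assuming string job IDs and SHA-256 as the hash (both choices are illustrative):

```python
import hashlib


def partition_for(job_id: str, num_partitions: int) -> int:
    """Map a job ID to a partition index in [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the count.
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for job_id in ("job-1", "job-2", "job-3"):
        print(job_id, partition_for(job_id, 8))
```

Because the mapping depends only on job_id and the partition count, every run of a given job lands on the same partition, which is what preserves per-job ordering. Changing the partition count remaps most jobs, so resizing needs a migration plan or a consistent-hashing scheme.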
