Robinhood Interview #1 – Job Scheduler / Cron Service Design (System Design, Reliability, SLA, Observability)


Problem Statement: Build a service/system that can support defining and running jobs on a schedule.

Requirements:

  1. Should be able to create jobs
  2. Should be able to schedule and run jobs
  3. Should be able to report failures and successes
  4. Should be reliable and have strong guarantees about its job runs
  5. Should be able to view logs and status of running jobs, as well as previously finished jobs
  6. Should be able to handle jobs that run longer than expected (SLA)

Summary (what to design & key choices)

Goal: A horizontally scalable, highly reliable job orchestration service (cron-as-a-service).

Core components

  • API + Auth: Create/update/disable jobs, trigger-now, fetch runs/logs.
  • Metadata DB (SQL): Jobs, schedules (cron/interval), next_run_at, concurrency policy, retries, SLA, owner.
  • Scheduler: Time-wheel/priority queue that computes due jobs; writes enqueue records idempotently with a dedupe token.
  • Queue (Kafka/SQS/PubSub): Durable, ordered per job; supports DLQ for poison messages.
  • Workers: Pull tasks, enforce concurrency (per-job / per-owner), execute (HTTP, script, container, workflow), emit heartbeats.
  • Result Store (SQL/TSDB): JobRun state machine (QUEUED→RUNNING→SUCCESS/FAILED/TIMED_OUT), timings, retry_count.
  • Logs: Stream stdout/stderr to object storage + index to log service (ELK/Cloud Logging).
  • SLA/Timeouts: Per-run watchdog; if heartbeat missing or runtime > SLA → mark TIMED_OUT, trigger alerts, optional kill.
  • Retries/Backoff: Configurable attempts with exponential backoff + jitter; idempotent handlers required.
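  • Exactly-once semantics: In practice, aim for effectively-once. A worker claims a run with a lease and renews it via heartbeats. If the lease expires, another worker may re-claim the run, so delivery is at-least-once. The executor must therefore be idempotent (job-specific dedupe key), which makes a repeated delivery harmless.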
  • High availability: Scheduler active/passive with leader election (ZK/etcd), workers stateless & autoscaled, multi-AZ queues.
  • Observability: Metrics (scheduled/queued/latency, success rate, retry rate), traces per run, dashboards + alerts (SLA breach, backlog growth).
  • Backfill & pause windows: Support manual backfill; maintenance windows to suppress firing.
  • Security: Per-tenant isolation, secrets via vault; RBAC on jobs.
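
The Retries/Backoff bullet can be illustrated with a minimal sketch. It assumes the "full jitter" variant (delay drawn uniformly between zero and the exponential ceiling); the notes do not say which jitter scheme is intended, and the names `backoff_delay`, `base`, and `cap` are hypothetical.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the delay in seconds before retry number `attempt` (0-based).

    Assumption: full jitter, i.e. uniform in [0, min(cap, base * 2**attempt)].
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Print a few sample delays; values differ on every execution.
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 2))
```

The cap keeps late attempts from waiting unboundedly, and the jitter spreads out retries from many runs that failed at the same moment.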

Key APIs

  • POST /jobs (name, schedule, target, payload, retries, timeout, concurrency, owner)
  • PATCH /jobs/{id} enable/disable/update
  • POST /jobs/{id}:runNow
  • GET /jobs/{id}
  • GET /runs?jobId=&status=&timeRange=
  • GET /runs/{runId}/logs

Data model (simplified)

  • jobs(id, name, schedule, next_run_at, status, concurrency_limit, retry_policy, timeout_sla, owner, updated_at)
  • job_runs(id, job_id, scheduled_at, started_at, finished_at, status, attempt, worker_id, sla_violation, metrics_json)
  • job_events(id, run_id, ts, type, message) (append-only for audit)
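
The status column of job_runs follows the state machine listed under Result Store (QUEUED→RUNNING→SUCCESS/FAILED/TIMED_OUT). A small sketch of enforcing those transitions is below. It assumes terminal states have no outgoing transitions and that a retry is recorded as a new attempt rather than a transition back to QUEUED; the names `RunStatus`, `ALLOWED`, and `transition` are hypothetical.

```python
from enum import Enum


class RunStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"


# Allowed transitions, taken from the state machine in the notes.
# Assumption: SUCCESS, FAILED, and TIMED_OUT are terminal.
ALLOWED = {
    RunStatus.QUEUED: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.SUCCESS, RunStatus.FAILED, RunStatus.TIMED_OUT},
    RunStatus.SUCCESS: set(),
    RunStatus.FAILED: set(),
    RunStatus.TIMED_OUT: set(),
}


def transition(current: RunStatus, new: RunStatus) -> RunStatus:
    """Return `new` if the move is legal, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


if __name__ == "__main__":
    status = RunStatus.QUEUED
    status = transition(status, RunStatus.RUNNING)
    status = transition(status, RunStatus.SUCCESS)
    print(status.value)
```

Rejecting illegal transitions at write time guards against a late or duplicated worker report overwriting a run that has already finished.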

Scheduling strategy

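  • Compute next_run_at when a job is created or updated. The scheduler scans for jobs due within the next N minutes using an index on next_run_at and enqueues each one with the dedupe key (job_id, scheduled_at). Write the enqueue record and advance next_run_at to the following occurrence in the same transaction to avoid double scheduling.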
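
One way to picture this strategy is a single scheduler tick against SQLite. This is a sketch under simplifying assumptions: fixed intervals instead of cron expressions, integer timestamps, only jobs already due (no look-ahead window), and the job_runs table doubling as the enqueue record. The reduced schema and the names `SCHEMA`, `tick`, and `interval_s` are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT,
    interval_s INTEGER NOT NULL,
    next_run_at INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'ENABLED'
);
CREATE INDEX jobs_due ON jobs (status, next_run_at);
CREATE TABLE job_runs (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    scheduled_at INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'QUEUED',
    UNIQUE (job_id, scheduled_at)  -- dedupe key from the notes
);
"""


def tick(conn: sqlite3.Connection, now: int) -> int:
    """Enqueue every due job once and advance its next_run_at.

    Returns the number of runs actually inserted.
    """
    enqueued = 0
    with conn:  # commits on success, rolls back if an exception is raised
        due = conn.execute(
            "SELECT id, interval_s, next_run_at FROM jobs "
            "WHERE status = 'ENABLED' AND next_run_at <= ?",
            (now,),
        ).fetchall()
        for job_id, interval_s, scheduled_at in due:
            # The UNIQUE constraint turns a duplicate enqueue into a no-op.
            cur = conn.execute(
                "INSERT OR IGNORE INTO job_runs (job_id, scheduled_at) VALUES (?, ?)",
                (job_id, scheduled_at),
            )
            enqueued += cur.rowcount
            # Advance only if nobody else has already moved next_run_at.
            conn.execute(
                "UPDATE jobs SET next_run_at = ? WHERE id = ? AND next_run_at = ?",
                (scheduled_at + interval_s, job_id, scheduled_at),
            )
    return enqueued


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO jobs (id, name, interval_s, next_run_at) VALUES (1, 'demo', 60, 100)"
    )
    conn.commit()
    print(tick(conn, now=100))
```

Because the insert and the update commit together, a crash between them cannot leave a run enqueued with next_run_at unchanged, and the unique key protects against a second scheduler instance racing on the same occurrence.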

Failure & SLA handling

  • Workers send heartbeats; if heartbeats stop arriving, the coordinator marks the run LOST and requeues it.
  • SLA watchdog checks whether now - started_at > timeout_sla; if so, it marks the run TIMED_OUT, emits an alert, and optionally invokes a kill hook.
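
The two checks above can be sketched as one pure function that a watchdog loop would call for every RUNNING run. The heartbeat threshold `heartbeat_ttl`, the `RunningRun` record, and the choice to test for LOST before TIMED_OUT are assumptions for illustration; the notes do not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningRun:
    run_id: str
    started_at: float         # seconds since epoch
    last_heartbeat_at: float  # seconds since epoch
    timeout_sla: float        # allowed runtime in seconds


def check_run(run: RunningRun, now: float, heartbeat_ttl: float = 30.0) -> Optional[str]:
    """Return "LOST", "TIMED_OUT", or None if the run looks healthy.

    Assumption: a silent worker is treated as LOST (requeue) even if the
    SLA has also been exceeded, since the run may no longer be executing.
    """
    if now - run.last_heartbeat_at > heartbeat_ttl:
        return "LOST"
    if now - run.started_at > run.timeout_sla:
        return "TIMED_OUT"
    return None


if __name__ == "__main__":
    run = RunningRun("r1", started_at=0.0, last_heartbeat_at=95.0, timeout_sla=60.0)
    print(check_run(run, now=100.0))
```

Keeping the decision free of I/O makes it easy to unit test; the caller is then responsible for updating job_runs, emitting the alert, and invoking any kill hook.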

Scaling

  • Partition by job_id hash to ensure per-job ordering; use multiple scheduler shards if needed.
  • Worker pools per target type (HTTP vs batch). Autoscale on queue depth & wait latency.
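
Partitioning by job_id hash might look like the sketch below. It uses SHA-256 rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process and would not agree across scheduler instances. The function name `partition_for` and the string form of job_id are assumptions.

```python
import hashlib


def partition_for(job_id: str, num_partitions: int) -> int:
    """Map a job_id to a stable partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for job_id in ("job-1", "job-2", "job-3"):
        print(job_id, partition_for(job_id, 8))
```

All runs of one job land on the same partition, which preserves per-job ordering. Changing `num_partitions` remaps most jobs, so a resize needs a planned migration or a consistent-hashing scheme instead of plain modulo.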
