Problem Statement: Build a service/system that can support defining and running jobs on a schedule.
Requirements:
- Should be able to create jobs
- Should be able to schedule and run jobs
- Should be able to report failures and successes
- Should be reliable and have strong guarantees about its job runs
- Should be able to view logs and status of running jobs, as well as previously finished jobs
- Should be able to handle when a job takes longer to run than expected (SLA)
Summary (what to design & key choices)
Goal: A horizontally scalable, highly reliable job orchestration service (cron-as-a-service).
Core components
- API + Auth: Create/update/disable jobs, trigger-now, fetch runs/logs.
- Metadata DB (SQL): Jobs, schedules (cron/interval), next_run_at, concurrency policy, retries, SLA, owner.
- Scheduler: Timing wheel or priority queue that computes due jobs; writes enqueue records idempotently with a dedupe token.
- Queue (Kafka/SQS/PubSub): Durable, ordered per job; supports DLQ for poison messages.
- Workers: Pull tasks, enforce concurrency (per-job / per-owner), execute (HTTP, script, container, workflow), emit heartbeats.
- Result Store (SQL/TSDB): JobRun state machine (QUEUED→RUNNING→SUCCESS/FAILED/TIMED_OUT), timings, retry_count.
- Logs: Stream stdout/stderr to object storage + index to log service (ELK/Cloud Logging).
- SLA/Timeouts: Per-run watchdog; if heartbeat missing or runtime > SLA → mark TIMED_OUT, trigger alerts, optional kill.
- Retries/Backoff: Configurable attempts with exponential backoff + jitter; idempotent handlers required.
- Delivery semantics (effectively-once): Workers claim a run under a lease and renew it with heartbeats. If the lease expires, another worker may re-claim the run, so delivery is at-least-once; the executor side must be idempotent (job-specific dedupe key), which gives effectively-once results.
- High availability: Scheduler active/passive with leader election (ZK/etcd), workers stateless & autoscaled, multi-AZ queues.
- Observability: Metrics (scheduled/queued/latency, success rate, retry rate), traces per run, dashboards + alerts (SLA breach, backlog growth).
- Backfill & pause windows: Support manual backfill; maintenance windows to suppress firing.
- Security: Per-tenant isolation, secrets via vault; RBAC on jobs.
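The Retries/Backoff item above can be illustrated with a minimal sketch of exponential backoff with "full jitter". The function name `backoff_delay` and the `base`, `cap`, and `rng` parameters are assumptions made for this example; they do not come from any particular scheduler.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a retry delay in seconds for a 0-based attempt number.

    The ceiling doubles with every attempt and is clamped to `cap`; the
    actual delay is drawn uniformly from [0, ceiling] so that many failed
    runs do not all retry at the same instant.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 2))
```

Retried handlers still need to be idempotent, as the list notes: backoff only spreads retries out in time, it does not prevent a run from executing twice.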
Key APIs
- POST /jobs (name, schedule, target, payload, retries, timeout, concurrency, owner)
- PATCH /jobs/{id} (enable/disable/update)
- POST /jobs/{id}:runNow
- GET /jobs/{id}
- GET /runs?jobId=&status=&timeRange=
- GET /runs/{runId}/logs
Data model (simplified)
- jobs(id, name, schedule, next_run_at, status, concurrency_limit, retry_policy, timeout_sla, owner, updated_at)
- job_runs(id, job_id, scheduled_at, started_at, finished_at, status, attempt, worker_id, sla_violation, metrics_json)
- job_events(id, run_id, ts, type, message) (append-only for audit)
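The status column of job_runs follows the JobRun state machine listed under Result Store (QUEUED→RUNNING→SUCCESS/FAILED/TIMED_OUT). One way to enforce it in application code is a small transition table, sketched below. The names `RunStatus`, `ALLOWED`, and `transition` are hypothetical, and the LOST state used for requeueing is left out to keep the sketch short.

```python
from enum import Enum


class RunStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"


# Legal moves only; terminal states have no outgoing transitions.
ALLOWED = {
    RunStatus.QUEUED: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.SUCCESS, RunStatus.FAILED, RunStatus.TIMED_OUT},
}


def transition(current: RunStatus, new: RunStatus) -> RunStatus:
    """Return `new` if the move is legal, otherwise raise ValueError."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


if __name__ == "__main__":
    status = transition(RunStatus.QUEUED, RunStatus.RUNNING)
    status = transition(status, RunStatus.SUCCESS)
    print(status.name)
```

Rejecting illegal transitions at write time keeps a late or duplicate worker report from overwriting a terminal status.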
Scheduling strategy
- Compute next_run_at on write; the scheduler scans the next N minutes using an index on next_run_at and enqueues each due run with the dedupe key (job_id, scheduled_at). After enqueue, set the next occurrence in a transaction to avoid double scheduling.
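A minimal sketch of one scheduler pass, using SQLite from the Python standard library as a stand-in for the metadata DB. The trimmed-down schema, the fixed `interval_s` column (instead of a cron expression), and the function name `tick` are assumptions made for illustration. The unique constraint on (job_id, scheduled_at) acts as the dedupe key, and the enqueue and the next_run_at update share one transaction.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    interval_s INTEGER NOT NULL,      -- assumption: fixed interval, not cron
    next_run_at INTEGER NOT NULL      -- epoch seconds
);
CREATE INDEX idx_jobs_next_run_at ON jobs(next_run_at);
CREATE TABLE job_runs (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    scheduled_at INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'QUEUED',
    UNIQUE (job_id, scheduled_at)     -- dedupe key
);
"""


def tick(conn: sqlite3.Connection, now: int, horizon_s: int = 60) -> int:
    """Enqueue runs for jobs due within the horizon; return how many were new."""
    created = 0
    with conn:  # commits on success, rolls back if anything raises
        due = conn.execute(
            "SELECT id, interval_s, next_run_at FROM jobs WHERE next_run_at <= ?",
            (now + horizon_s,),
        ).fetchall()
        for job_id, interval_s, scheduled_at in due:
            cur = conn.execute(
                "INSERT OR IGNORE INTO job_runs (job_id, scheduled_at) VALUES (?, ?)",
                (job_id, scheduled_at),
            )
            created += cur.rowcount  # 0 when the run was already enqueued
            # Advance only if nobody else advanced it first.
            conn.execute(
                "UPDATE jobs SET next_run_at = ? WHERE id = ? AND next_run_at = ?",
                (scheduled_at + interval_s, job_id, scheduled_at),
            )
    return created


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO jobs (id, interval_s, next_run_at) VALUES (1, 60, 1000)")
    print(tick(conn, now=1000, horizon_s=0))
```

If a second scheduler instance, or a retry after a crash, sees the same due job again, the insert is ignored and no duplicate run is created. In a real deployment the job_runs row would typically also be published to the queue, which this sketch omits.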
Failure & SLA handling
- Worker sends heartbeats; if missed, coordinator marks LOST and requeues.
- SLA watchdog checks now - started_at > timeout_sla → mark TIMED_OUT, emit alert, optional kill hook.
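The two checks above can be sketched as a single periodic sweep over running runs. The `Run` dataclass, the `sweep` function, and the `heartbeat_grace` threshold are hypothetical names chosen for this example; requeueing, alerting, and the kill hook are left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Run:
    run_id: str
    started_at: float         # epoch seconds
    last_heartbeat_at: float  # epoch seconds
    timeout_sla: float        # allowed runtime in seconds
    status: str = "RUNNING"


def sweep(runs: Iterable[Run], now: float,
          heartbeat_grace: float = 30.0) -> Tuple[List[str], List[str]]:
    """Mark overdue runs; return (lost_ids, timed_out_ids).

    The SLA check runs first, so a run that is both silent and over its
    SLA is reported as TIMED_OUT rather than LOST.
    """
    lost: List[str] = []
    timed_out: List[str] = []
    for run in runs:
        if run.status != "RUNNING":
            continue
        if now - run.started_at > run.timeout_sla:
            run.status = "TIMED_OUT"
            timed_out.append(run.run_id)
        elif now - run.last_heartbeat_at > heartbeat_grace:
            run.status = "LOST"
            lost.append(run.run_id)
    return lost, timed_out


if __name__ == "__main__":
    runs = [Run("a", started_at=0, last_heartbeat_at=95, timeout_sla=50)]
    print(sweep(runs, now=100))
```

The caller would requeue the LOST ids and raise alerts (and optionally invoke a kill hook) for the TIMED_OUT ids.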
Scaling
- Partition by job_id hash to ensure per-job ordering; use multiple scheduler shards if needed.
- Worker pools per target type (HTTP vs batch). Autoscale on queue depth & wait latency.
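Partitioning by job_id hash can be sketched as below. The function name `partition_for` and the choice of SHA-256 are assumptions for this example. A stable digest is used instead of Python's built-in `hash()`, which is salted per process for strings and would route the same job differently on different machines.

```python
import hashlib


def partition_for(job_id: str, num_partitions: int) -> int:
    """Map a job id to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # The first 8 bytes give a large, stable integer to reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for job_id in ("job-1", "job-2", "job-3"):
        print(job_id, partition_for(job_id, 8))
```

Because every run of a given job lands on the same partition, per-job ordering is preserved as long as each partition is consumed in order. Plain modulo hashing remaps most jobs when the partition count changes, so a resize needs to be planned for.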