Developer Tools / Ops

Upwatch

An API uptime monitoring service that checks your endpoints every 30 seconds, alerts your team on failures, and tracks response time trends over months.

Solo Developer / Open Source / 2023

Overview

Upwatch monitors HTTP endpoints and alerts teams when something goes down. It checks URLs every 30 seconds from multiple regions, tracks response times, and sends alerts through Slack, email, or webhooks. The status page is public-facing, so users can check service health without contacting support.

The Problem

Downtime costs money and trust. Most uptime monitors are either too basic (just ping a URL) or too expensive for small teams ($30+/month for 50 monitors). Developers running side projects or small SaaS products need monitoring that's cheap to operate, easy to configure, and reliable enough to wake them up at 3am when something breaks.

Approach

Distributed health checks

Monitors run from 3 regions (US, EU, Asia). A check only triggers an alert if it fails from at least 2 regions, which filters out false positives caused by regional network issues. Each check records the status code, response time, TLS certificate expiry, and a hash of the response body (to detect content changes).
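The 2-of-3 rule can be sketched as a small pure function. This is a minimal illustration, not Upwatch's actual code: the `RegionResult` shape, its field names, and `shouldAlert` are all assumptions made for the example.

```typescript
// Hypothetical types: region identifiers and field names are illustrative only.
type Region = "us" | "eu" | "asia";

interface RegionResult {
  region: Region;
  ok: boolean;             // true if the check passed from this region
  statusCode?: number;     // absent when the request never completed
  responseTimeMs?: number;
  tlsExpiresAt?: string;   // certificate expiry, ISO 8601 (assumed format)
  bodyHash?: string;       // hash of the response body, used to spot content changes
}

// Returns true when at least `quorum` regions report a failure.
// With 3 regions and quorum = 2, a single-region failure never alerts.
function shouldAlert(results: RegionResult[], quorum = 2): boolean {
  const failures = results.filter((r) => !r.ok).length;
  return failures >= quorum;
}
```

Keeping the decision separate from the HTTP request itself makes the rule easy to unit test with hand-built results.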

Alerting with escalation policies

Alerts route through configurable channels: Slack, email, or webhook. Escalation policies define who gets notified and when. If the primary on-call doesn't acknowledge within 5 minutes, the alert escalates to the next person. Alerts auto-resolve when the endpoint recovers, with a recovery notification.
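As a rough sketch of how an escalation policy like this could be evaluated: the `EscalationStep` shape and the `contactsToNotify` helper below are hypothetical, and the function assumes the alert has not been acknowledged yet.

```typescript
// Hypothetical policy shape; the names are assumptions for illustration.
interface EscalationStep {
  contact: string;           // who to notify at this step
  ackTimeoutMinutes: number; // how long to wait for an acknowledgement before escalating
}

// Returns every contact that should have been notified `elapsedMinutes`
// after the alert fired, assuming nobody has acknowledged it so far.
function contactsToNotify(policy: EscalationStep[], elapsedMinutes: number): string[] {
  const notified: string[] = [];
  let stepStart = 0; // minute at which the current step becomes active
  for (const step of policy) {
    if (elapsedMinutes < stepStart) break;
    notified.push(step.contact);
    stepStart += step.ackTimeoutMinutes;
  }
  return notified;
}

// Example policy matching the 5-minute escalation described above.
const examplePolicy: EscalationStep[] = [
  { contact: "primary-on-call", ackTimeoutMinutes: 5 },
  { contact: "secondary-on-call", ackTimeoutMinutes: 5 },
];
```

With `examplePolicy`, only the primary is notified for the first 5 minutes; from minute 5 onward the secondary is included as well.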

Public status pages

Each project gets a hosted status page showing uptime percentage, response time graphs, and incident history. Status pages are customizable with the project's branding and domain. They update in real time when incidents are created or resolved.
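The uptime percentage on a status page can be derived from incident history. A minimal sketch, assuming incidents are stored as start and end timestamps in epoch milliseconds and do not overlap; the `Incident` type and `uptimePercent` function are illustrative names, not part of Upwatch's schema.

```typescript
// Hypothetical incident record; `resolvedAt` is null while the incident is ongoing.
interface Incident {
  startedAt: number;         // epoch milliseconds
  resolvedAt: number | null; // epoch milliseconds, or null if still open
}

// Uptime over [windowStart, windowEnd) as a percentage.
// Assumes incidents do not overlap; overlapping ones would be double-counted.
function uptimePercent(incidents: Incident[], windowStart: number, windowEnd: number): number {
  const windowMs = windowEnd - windowStart;
  if (windowMs <= 0) throw new RangeError("window must have positive length");
  let downMs = 0;
  for (const inc of incidents) {
    // Clip each incident to the reporting window.
    const start = Math.max(inc.startedAt, windowStart);
    const end = Math.min(inc.resolvedAt ?? windowEnd, windowEnd);
    if (end > start) downMs += end - start;
  }
  return (1 - downMs / windowMs) * 100;
}
```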

Response time analytics

Historical response time data is stored in ClickHouse and visualized as percentile charts (p50, p95, p99) over configurable time ranges. Teams can spot gradual performance degradation before it becomes an outage. Anomaly detection flags sudden response time spikes even if the endpoint is still returning 200.
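To make the percentile and spike ideas concrete, here is a small in-memory sketch. In Upwatch the percentiles are computed from data in ClickHouse; the nearest-rank method and the "multiple of the baseline median" spike rule below are assumptions chosen for illustration, since the source does not specify the exact method.

```typescript
// Nearest-rank percentile over a list of response times (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank exact for integer p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical spike rule: flag the latest sample when it exceeds
// `factor` times the median of the recent baseline.
function isSpike(latestMs: number, baselineMs: number[], factor = 3): boolean {
  if (baselineMs.length === 0) return false; // nothing to compare against yet
  return latestMs > factor * percentile(baselineMs, 50);
}
```

A rule like this can fire while the endpoint still returns 200, because it looks only at latency and not at the status code.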

Challenges

Minimizing false positive alerts

Network hiccups trigger momentary failures that aren't real incidents. The multi-region check reduces false positives, but timing matters too. Built a confirmation window: after an initial failure, Upwatch retries 3 times over 90 seconds before declaring an incident. This filters out transient issues while keeping the added delay on real alerts bounded.
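The confirmation window can be modeled as a small decision function over the retry outcomes. This is a sketch under assumptions: the retries are spaced evenly across the window, and a single successful retry is treated as recovery. The names `retrySchedule` and `evaluateConfirmation` are hypothetical.

```typescript
type Decision = "pending" | "recovered" | "incident";

// Offsets (ms after the initial failure) at which retries would run,
// assuming they are spread evenly over the window.
function retrySchedule(retries = 3, windowMs = 90000): number[] {
  return Array.from({ length: retries }, (_, i) => ((i + 1) * windowMs) / retries);
}

// Decide what to do given the retry results collected so far
// (true = the endpoint responded healthily on that retry).
function evaluateConfirmation(retryResults: boolean[], retries = 3): Decision {
  if (retryResults.some((ok) => ok)) return "recovered"; // transient failure
  return retryResults.length >= retries ? "incident" : "pending";
}
```

With the defaults, retries run 30, 60, and 90 seconds after the first failure, and an incident is declared only once all three have failed.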

Keeping check intervals consistent under load

With thousands of monitors, scheduling checks every 30 seconds requires careful timing. Used a distributed scheduler backed by Redis that assigns check slots across worker processes. Each worker handles a subset of monitors, and rebalancing happens automatically when workers join or leave.
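One way to get stable assignment with cheap rebalancing is rendezvous (highest-random-weight) hashing, sketched below. This is a swapped-in technique for illustration: the source does not say which assignment scheme Upwatch uses, and in a real deployment the live worker list would come from Redis rather than a plain array. All function names here are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units (good enough for illustration;
// not byte-exact FNV for non-ASCII input).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Rendezvous hashing: each monitor goes to the worker with the highest
// score. When a worker leaves, only the monitors it owned move.
function ownerOf(monitorId: string, workers: string[]): string {
  if (workers.length === 0) throw new Error("no workers available");
  let best = workers[0];
  let bestScore = -1;
  for (const worker of workers) {
    const score = fnv1a(`${worker}:${monitorId}`);
    if (score > bestScore) {
      best = worker;
      bestScore = score;
    }
  }
  return best;
}

// Spread checks across the interval so they do not all fire at once.
function slotOffsetMs(monitorId: string, intervalMs = 30000): number {
  return fnv1a(monitorId) % intervalMs;
}
```

Each worker would run only the monitors for which `ownerOf` returns its own ID, firing each check at its `slotOffsetMs` within every 30-second cycle.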

Results

Upwatch monitors 3,000+ endpoints for 200+ users, with a median alert latency of 45 seconds from failure to notification.

3,000+ endpoints monitored

45s median alert latency

99.99% monitoring uptime (self)

Tech Stack

Next.js: Dashboard, status pages, and API routes for monitor configuration
TypeScript: Type-safe health check logic and alert routing
PostgreSQL: Monitor configs, incident history, and user accounts
ClickHouse: Response time series storage and percentile analytics
Redis: Distributed check scheduling and real-time alert state
Vitest: Tests for alerting logic, escalation policies, and false positive filtering