Upwatch
An API uptime monitoring service that checks your endpoints every 30 seconds, alerts your team on failures, and tracks response time trends over months.
Overview
Upwatch monitors HTTP endpoints and alerts teams when something goes down. It checks URLs every 30 seconds from multiple regions, tracks response times, and sends alerts through Slack, email, or webhooks. The status page is public-facing, so users can check service health without contacting support.
The Problem
Downtime costs money and trust. Most uptime monitors are either too basic (just ping a URL) or too expensive for small teams ($30+/month for 50 monitors). Developers running side projects or small SaaS products need monitoring that's cheap to operate, easy to configure, and reliable enough to wake them up at 3am when something breaks.
Approach
Distributed health checks
Monitors run from 3 regions (US, EU, Asia). A check only triggers an alert if it fails from at least 2 regions, which filters out false positives caused by regional network issues. Each check records the status code, response time, TLS certificate expiry, and a hash of the response body (to detect content changes).
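The 2-of-3 rule can be sketched in a few lines. This is a minimal illustration, not Upwatch's actual code: the CheckResult fields mirror what the text says each check records, while the names, the choice of SHA-256 for the body hash, and the rule that connection errors and 5xx responses count as failures are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    region: str                     # e.g. "us", "eu", "asia"
    status_code: Optional[int]      # None if the request never completed
    response_ms: Optional[float]
    tls_days_left: Optional[int]    # days until the TLS certificate expires
    body_sha256: Optional[str]      # hash of the response body

    @property
    def failed(self) -> bool:
        # Assumption: connection errors and 5xx responses count as failures.
        return self.status_code is None or self.status_code >= 500


def body_hash(body: bytes) -> str:
    """Hash the response body so content changes can be detected cheaply."""
    return hashlib.sha256(body).hexdigest()


def should_alert(results: list[CheckResult], quorum: int = 2) -> bool:
    """Alert only when at least `quorum` distinct regions report a failure."""
    failed_regions = {r.region for r in results if r.failed}
    return len(failed_regions) >= quorum
```

Counting distinct regions rather than raw results means a duplicate report from one region cannot satisfy the quorum on its own.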
Alerting with escalation policies
Alerts route through configurable channels: Slack, email, or webhook. Escalation policies define who gets notified and when. If the primary on-call doesn't acknowledge within 5 minutes, the alert escalates to the next person. Alerts auto-resolve when the endpoint recovers, with a recovery notification.
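One way to model the escalation rule is as a pure function of elapsed time. The sketch below is hypothetical: the Alert class, the current_responder function, and the idea that the last person in the chain stays responsible once the chain is exhausted are assumptions. Only the 5-minute acknowledgement timeout comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

ACK_TIMEOUT_S = 5 * 60  # from the text: escalate after 5 minutes without an ack


@dataclass
class Alert:
    opened_at: float            # seconds since epoch
    acknowledged: bool = False
    resolved: bool = False      # set when the endpoint recovers


def current_responder(alert: Alert, chain: list[str], now: float) -> Optional[str]:
    """Return who is responsible for the alert at time `now`, or None."""
    if alert.acknowledged or alert.resolved or not chain:
        return None
    # One escalation level per elapsed timeout window.
    level = max(0, int((now - alert.opened_at) // ACK_TIMEOUT_S))
    # Assumption: once the chain is exhausted, the last person stays on the hook.
    return chain[min(level, len(chain) - 1)]
```

Keeping this logic free of timers and I/O makes it easy to test. A scheduler would call it periodically and notify whenever the returned responder changes.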
Public status pages
Each project gets a hosted status page showing uptime percentage, response time graphs, and incident history. Status pages are customizable with the project's branding and domain. They update in real time when incidents are created or resolved.
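The uptime percentage on a status page can be derived from incident history. The sketch below assumes incidents are stored as non-overlapping (start, end) timestamp pairs in seconds. The function name and data shape are illustrative and not taken from Upwatch.

```python
def uptime_percent(window_start: float, window_end: float,
                   incidents: list[tuple[float, float]]) -> float:
    """Percentage of the window during which no incident was open.

    Assumes incidents do not overlap each other.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window must have positive length")
    downtime = 0.0
    for start, end in incidents:
        # Clip each incident to the reporting window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime += clipped_end - clipped_start
    return 100.0 * (total - downtime) / total
```

For example, 864 seconds of downtime in a 24-hour window (86,400 seconds) gives 99% uptime.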
Response time analytics
Historical response time data is stored in ClickHouse and visualized as percentile charts (p50, p95, p99) over configurable time ranges. Teams can spot gradual performance degradation before it becomes an outage. Anomaly detection flags sudden response time spikes even if the endpoint is still returning 200.
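To make the percentile charts and spike detection concrete, here is a small in-memory sketch. In practice the percentiles would more likely be computed by the database. The nearest-rank method and the "latest sample exceeds a multiple of the recent median" rule are assumptions chosen for simplicity, since the text does not say which anomaly detection method is used.

```python
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def is_spike(history_ms: list[float], latest_ms: float, factor: float = 3.0) -> bool:
    """Flag a sample that exceeds `factor` times the recent median.

    Assumption: a median-multiple threshold stands in for the real detector.
    """
    baseline = statistics.median(history_ms)
    return latest_ms > factor * baseline
```

Because the spike check looks only at response time, it can fire while the endpoint still returns 200, which matches the behavior described above.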
Challenges
Minimizing false positive alerts
Network hiccups cause momentary failures that aren't real incidents. The multi-region check reduces false positives, but timing matters too. Upwatch applies a confirmation window: after an initial failure, it retries 3 times over 90 seconds before declaring an incident. This filters out transient issues while keeping the delay before a real alert bounded.
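The confirmation window could look roughly like the following. This is a sketch under assumptions: the retries are spaced evenly across the window (the text gives only "3 times over 90 seconds"), any successful retry cancels the incident, and the confirm_failure name and injectable sleep parameter are hypothetical.

```python
import time
from typing import Callable


def confirm_failure(check: Callable[[], bool],
                    retries: int = 3,
                    window_s: float = 90.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the failure persists through every retry.

    `check` returns True when the endpoint is healthy.
    """
    interval = window_s / retries  # assumption: evenly spaced retries
    for _ in range(retries):
        sleep(interval)
        if check():
            return False  # endpoint recovered, so the failure was transient
    return True  # still failing after the full window: declare an incident
```

Passing `sleep` as a parameter lets tests run the whole window instantly with a fake clock.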
Keeping check intervals consistent under load
With thousands of monitors, scheduling checks every 30 seconds requires careful timing. Upwatch uses a distributed scheduler backed by Redis that assigns check slots across worker processes. Each worker handles a subset of monitors, and rebalancing happens automatically when workers join or leave.
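The text does not describe how slots are assigned, so the following is only one plausible scheme: rendezvous hashing to map each monitor to a worker, plus a hash-derived offset that spreads checks across the 30-second interval. The live worker list would presumably come from Redis. Here it is passed in as a plain list, and all names are hypothetical.

```python
import hashlib

INTERVAL_S = 30  # check interval from the text


def _score(monitor_id: str, key: str) -> int:
    """Stable 64-bit score derived from a monitor ID and a key."""
    digest = hashlib.sha256(f"{monitor_id}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(monitor_id: str, workers: list[str]) -> str:
    """Rendezvous hashing: the worker with the highest score owns the monitor."""
    return max(workers, key=lambda w: _score(monitor_id, w))


def slot_offset(monitor_id: str, interval_s: int = INTERVAL_S) -> int:
    """Second within each interval at which this monitor's check fires."""
    return _score(monitor_id, "slot") % interval_s
```

With this scheme, removing a worker only reassigns the monitors that worker owned, so rebalancing does not reshuffle everything. The per-monitor offset avoids firing every check at the same instant.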
Results
Upwatch monitors 3,000+ endpoints for 200+ users, with a median alert latency of 45 seconds from failure to notification.
3,000+ endpoints monitored
45s median alert latency
99.99% monitoring uptime (self)