How I Approached Monitoring Lodestone
Architecture deep-dive into Lodestone's metrics collection — from HTTP middleware to Grafana dashboards
Quick Architecture Overview
Lodestone runs as a single Go binary, orchestrated by Nomad on one large Hetzner bare-metal node. A SQLite database is backed up to Cloudflare R2 via Litestream. It's a relatively simple architecture, one I arrived at because I wanted something simple, cheap, and easy to maintain. Lodestone doesn't have a lot of users yet, and there is only one SRE (myself) maintaining it. The infrastructure costs around $50 USD a month.
On the same box that Lodestone is running, Nomad also controls the lifecycle of: Prometheus, Traefik, Loki, Gitea, Promtail, Grafana, and Bugsink.
Traefik is the reverse proxy for the setup; it is the entrypoint for the box and routes traffic to the various applications based on their subdomains. For example, there is a grafana.lodestonehikes.com that naturally routes to my Grafana dashboards. Security is multi-layered, but access to internal sites depends on Tailscale networking. In other words, only my laptop with Tailscale installed has a happy network path to internal sites.
Okay, with that layout established we can talk about why monitoring in general is important.
Why Metrics are the Best
Visibility into application performance and business-related data is crucial for maintaining a stable platform and for reasoning about user growth.
With metrics I get to answer questions like:
- How many users do I currently have?
- Is my application slow? If so, for how many users? Which parts are slow? Is it a specific URL? A database transaction?
- What is the success rate for AI inference of hiking gear photos?
- How much battery does my application draw when doing offline navigation?
- Are my users completing the onboarding funnel? Or are they dismissing it?
- What features are my users engaging with? Are they using my application in a way that I didn't expect?
And so, so many more.
In my mind, a decision made without data to back it up is essentially gambling.
The Metric Taxonomy
Lodestone currently registers over a hundred metrics that are organized into five layers:
HTTP Request Metrics
The foundation. Three metrics cover the golden signals:
- http_requests_total — counter by method, path, status (rate)
- http_request_duration_seconds — histogram by method, path (latency)
- http_requests_in_flight — gauge (saturation)
Business / Product Metrics
Counters for every meaningful user action:
- Resources: trips, gear items, routes, loadouts, provision kits — created and deleted
- Social: friend requests (sent/accepted/rejected), group invitations, trip invitations, messages, comments
- Engagement: navigation sessions, route discovery searches, GPX imports/exports, AI trip analyses, gear recognition attempts
- Lifecycle: registrations by method, account deletions, password resets, profile updates, onboarding completion
This layer is less about performance and reliability and more about business intelligence.
External Service SLIs
Every third-party API call is instrumented with three metrics:
externalServiceCalls // counter by service × status
externalServiceErrors // counter by service
externalServiceDuration // histogram by service
Services tracked:
- Mapbox (geocoding, directions)
- Weather API
- Gemini AI (trip analysis, gear recognition)
- Email (transactional)
When some part of my application starts to misbehave, I want to know if it's an upstream provider that's failing and not my own code.
Payment Metrics
Lodestone has two payment providers: Stripe and Apple. There will eventually be a third when I roll out Android support, but for now it's just these two. Payment metrics are split between reliability (payment providers are also external services) and business metrics.
Stripe:
- Checkout sessions
- Portal sessions
- Customer creation
- Webhook events by type
- Subscription activations/deactivations by reason
- Payment failures
- API call duration
- Subscription lifetime in days
Apple StoreKit:
- App Store Server Notifications by event type
- Subscription activations/deactivations
- JWS signature failures
- API duration
- Subscription lifetime
The subscription lifetime histograms (lodestone_stripe_subscription_lifetime_days, lodestone_apple_subscription_lifetime_days) use buckets at 1, 7, 14, 30, 60, 90, 180, 365, and 730 days. These tell you where churn concentrates — the difference between "users leave after the free trial" and "users leave after three months" demands entirely different responses.
Infrastructure
Monitoring the infrastructure that Lodestone runs on is as critical as monitoring Lodestone itself.
So I also track:
- Litestream backup operations
- Gitea, Loki, and underlying host metrics
Collecting and Displaying Metrics
Now that we've established what gets measured, let's talk about how it all flows together.
The Pipeline
The collection pipeline is straightforward:
- Lodestone exposes a /metrics endpoint using the standard Prometheus Go client. Every metric I described above is registered at startup via promauto and updated in real time as requests flow through the system.
- Prometheus scrapes that endpoint every 15 seconds. It uses Consul service discovery to find the Lodestone service — no hardcoded IPs. When Nomad reschedules the container and it gets a new port, Prometheus just follows along.
- Grafana queries Prometheus and renders everything into dashboards.
For logs, the pipeline is similar:
- Lodestone writes structured JSON logs to stdout. This is the standard approach for containerized applications — don't write to files, let the orchestrator handle it.
- Promtail discovers containers via the Docker socket, reads their log output, and labels each line with the Nomad job name, task group, and allocation ID.
- Loki ingests and indexes those logs with a 30-day retention window.
- Grafana queries Loki alongside Prometheus, so I can correlate a latency spike with the log lines that caused it.
The whole thing is self-contained on the same box and deployed via a combination of Nomad job definitions and an Ansible playbook.
The HTTP Middleware
The HTTP metrics are collected through a Chi middleware that wraps every request:
func PrometheusMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
httpRequestsInFlight.Inc()
defer httpRequestsInFlight.Dec()
wrapped := newResponseWriter(w)
next.ServeHTTP(wrapped, r)
routePattern := getRoutePattern(r)
duration := time.Since(start).Seconds()
status := strconv.Itoa(wrapped.statusCode)
httpRequestsTotal.WithLabelValues(r.Method, routePattern, status).Inc()
httpRequestDuration.WithLabelValues(r.Method, routePattern).Observe(duration)
})
}
One thing worth calling out: getRoutePattern(r) extracts the route pattern from Chi's context (e.g., /api/trips/{id}) rather than the actual path (/api/trips/42). This is important. If you label metrics with raw paths, you end up with unbounded cardinality — every unique trip ID becomes its own time series. That's a fast way to blow up Prometheus's memory.
Business Metrics
Business metrics are simpler. At the point where a meaningful action happens in a handler or service, I call a helper:
func (s *TripsService) CreateTrip(...) {
// ... create the trip ...
middleware.IncrementTripsCreated()
}
There's a helper function for every metric in middleware/metrics.go. It's not glamorous, but it's explicit and easy to grep for.
Background Monitors
Some metrics can't be derived from request flow — they need periodic calculation. These run as background goroutines:
- Active users — every 5 minutes, query users with a login in the last 24 hours and set the lodestone_active_users gauge.
- Total users — same cadence, simple count.
- Pro subscribers — count of active paid subscriptions.
- WAL size — every minute, stat the SQLite write-ahead log file and set the lodestone_sqlite_wal_size_bytes gauge. If the WAL exceeds 40MB, it logs a warning.
All of these are wrapped in utils.SafeGo() for panic recovery and respect context cancellation for clean shutdown.
Securing the Metrics Endpoint
The /metrics endpoint is sensitive. It reveals internal structure, naming conventions, and operational state. I protect it with two layers:
- Application layer: An internalOnly() middleware rejects any request that arrives with an X-Forwarded-For header (which means it was proxied through Traefik, which means it came from the internet). It also allowlists localhost and Docker/Nomad bridge IPs.
- Reverse proxy layer: Traefik has a high-priority router that catches any request to /metrics or /health on public hosts and rewrites the path to /__blocked, which returns a 404.
Defense in depth. If either layer fails, the other still blocks external access.
The Dashboards
I currently maintain seven Grafana dashboards, each focused on a different operational concern:
- SRE Overview — the golden signals at a glance. Request rate, error rate, P50/P99 latency, in-flight requests, active users.
- Database (SQLite) — query latency distributions, WAL file size over time, transaction throughput. SQLite is remarkably predictable, so when something moves on this dashboard, it's worth investigating.
- External Dependencies — latency and error rates for Mapbox, WeatherKit, Gemini, and email. When my app is slow and it's not my code, this dashboard tells me who to blame.
- Auth & Security — login attempts, auth failures by reason, OAuth provider breakdown (Apple vs. Google), token refreshes, rate limit hits. Useful for spotting brute force attempts or OAuth misconfigurations.
- Stripe Subscriptions — checkout sessions, subscription lifecycle, payment failures, churn analysis via the lifetime histogram. This is where I watch the business.
- Product Insights — feature adoption and engagement. Trips created, gear items added, GPX imports, navigation sessions, social interactions. This tells me what people are actually using.
- Apple StoreKit — same as Stripe, but for iOS in-app purchases. Webhook events, subscription activations, JWS verification status.
Each dashboard auto-refreshes every 30 seconds and is provisioned from JSON files that live in version control. If I accidentally break a dashboard in the Grafana UI, I can just redeploy and it resets.
iOS Client Metrics
The server-side pipeline handles everything that flows through the API, but there's a whole category of metrics that only the iOS client can capture.
TelemetryDeck
For product analytics on the client side, I use TelemetryDeck. It's a privacy-focused analytics SDK — no PII collection, no user tracking in the creepy sense. I chose it because TelemetryDeck's philosophy aligns better with the kind of app I'm building.
The AnalyticsManager tracks over 40 distinct events: app lifecycle, authentication flows, feature usage across gear, loadouts, trips, groups, navigation, and discovery. It gives me a client-side view of engagement that complements the server-side counters.
Bugsink (Self-Hosted Sentry)
For crash reporting, I run Bugsink — a self-hosted Sentry-compatible service — on the same Hetzner box. The Sentry SDK in the iOS app captures crashes, attaches session data, and sends it to bugsink.lodestonehikes.com. Auto session tracking is enabled, but I've intentionally set tracesSampleRate to zero. I don't need distributed tracing on the client right now, and it's not free in terms of payload size.
Every log call in the iOS app also creates a Sentry breadcrumb, so when a crash does happen, I get the trail of events leading up to it.
What's Next
I feel like I'm in a comfortable place right now, with a nice amount of observability and without going overboard. There is still work to do, though. For example, I know that iOS and Apple provide a lot more metric-collection utilities that I haven't quite tapped into yet; I'm not an iOS expert and am still learning a lot. My initial foray relies on server-side writes to capture telemetry data, but there isn't much of a metrics engine inside the app itself to capture metrics in a more holistic and comprehensive way.
But until then I hope that my existing metrics can protect and facilitate the growth of my lil' app.