Mill: Content Fetching, Storage, and Publishing System

Overview

Mill is a single-user system for collecting, persisting, and publishing technical content from the web. It is designed to ingest content from feeds, aggregators, and direct submissions; store it in a durable, inspectable format; and publish derived outputs such as RSS feeds, static pages, and summaries.

Mill is intentionally local-first, Git-backed, and incrementally extensible. It favors simple, explicit data models and filesystem storage over early database or infrastructure commitments.

The system is implemented in Go and organized as a mono-repository containing multiple cooperating sub-projects.

Goals

Primary goals

Fetch content from multiple technical sources (feeds, aggregators, direct URLs)
Deduplicate content across sources using stable identity rules
Persist raw and derived data in a Git repository
Publish results as static artifacts (RSS, HTML, etc.)
Support incremental enrichment (metadata, summaries, text-to-speech)

Non-goals (initially)

Multi-user support
Real-time ingestion
Strong consistency guarantees across processes
Complex query APIs
Database or search infrastructure

Core Design Principles

1. Canonical identity over source identity

Mill distinguishes between:

The thing itself (the target content)
Where it was discussed (HN, Lobsters, etc.)
Where it was discovered (feed, tag page, daily index)

This separation enables deduplication across sources while preserving provenance.

2. Filesystem as the primary datastore

All persistent state lives on disk in a structured directory layout, which keeps it:

Easy to inspect
Easy to version with Git
Easy to migrate later if needed

Databases are considered an implementation detail that may be introduced later, not a foundational dependency.

3. Append-and-merge, not replace

Fetchers do not “own” items. They emit observations that are merged into existing records. The system is designed to:

Re-fetch safely
Be idempotent
Accumulate context over time

4. Separation of discovery and persistence

Fetch logic discovers what exists. Content logic decides what it means and how it is stored.

Mono-Repository Structure

Mill is a mono-repo containing several long-lived packages and sub-projects.

mill/
├── cmd/
│   └── mill/            # Main CLI / runner
├── internal/
│   ├── fetcher/         # Source-specific fetching logic
│   │   └── sources/     # Individual source implementations
│   ├── content/         # Canonical data model + persistence
│   ├── rss/             # RSS generation
│   ├── site/            # Static page generation
│   └── scheduler/       # Periodic execution
├── content/             # Persisted items (Git-backed)
└── public/              # Published artifacts (RSS, HTML)

Core Concepts

Item

An Item represents a single, canonical piece of content (e.g. an article or page).

Identified by a stable hash of its target URL
Persisted on disk under a deterministic path
Accumulates metadata over time

Items do not belong to a single source.
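
As a minimal sketch of the identity rule above, the following derives an Item ID and on-disk path from a target URL. The normalization rules (lowercase host, drop the fragment), the choice of SHA-256, the two-character shard directory, and the names CanonicalURL, ItemID, and ItemPath are all assumptions made for illustration; the document only states that identity is a stable hash of the target URL and that the path is deterministic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"path/filepath"
	"strings"
)

// CanonicalURL normalizes a target URL so trivially different spellings map
// to the same identity. The rules here are assumed, not Mill's actual rules.
func CanonicalURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = "" // fragments do not change which document is addressed
	return u.String(), nil
}

// ItemID returns the hex-encoded SHA-256 of the canonical target URL.
func ItemID(raw string) (string, error) {
	c, err := CanonicalURL(raw)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256([]byte(c))
	return hex.EncodeToString(sum[:]), nil
}

// ItemPath maps an ID to a deterministic location under root, sharded by the
// first two hex characters (a hypothetical layout).
func ItemPath(root, id string) string {
	return filepath.Join(root, id[:2], id, "item.json")
}

func main() {
	id, err := ItemID("https://Example.com/post#comments")
	if err != nil {
		panic(err)
	}
	fmt.Println(ItemPath("content", id))
}
```

Because the ID depends only on the canonical URL, two sources that link to the same target resolve to the same path without coordinating with each other.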

Observation

An Observation represents source-specific context about an Item:

A discussion thread
A score or comment count
A title as presented by a source

Observations are keyed by discussion/permalink URLs and are updated in place as sources are re-fetched.
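
The keyed, update-in-place behaviour can be sketched as a map upsert. The Observation and Record types and their field names are illustrative assumptions rather than Mill's actual schema; the point is only that merging by permalink makes re-fetching safe.

```go
package main

import "fmt"

// Observation is source-specific context about an Item (assumed shape).
type Observation struct {
	Permalink string // discussion URL; acts as the merge key
	Source    string // e.g. "hn" or "lobsters"
	Title     string // title as presented by the source
	Score     int
	Comments  int
}

// Record is a stored Item together with its observations, keyed by permalink.
type Record struct {
	TargetURL    string
	Observations map[string]Observation
}

// Merge upserts obs into the record. Merging the same observation twice
// leaves the record unchanged; a re-fetch replaces only its own entry.
func (r *Record) Merge(obs Observation) {
	if r.Observations == nil {
		r.Observations = make(map[string]Observation)
	}
	r.Observations[obs.Permalink] = obs
}

func main() {
	rec := &Record{TargetURL: "https://example.com/post"}
	rec.Merge(Observation{Permalink: "https://news.example/item?id=1", Source: "hn", Score: 10})
	rec.Merge(Observation{Permalink: "https://news.example/item?id=1", Source: "hn", Score: 42}) // re-fetch
	rec.Merge(Observation{Permalink: "https://forum.example/s/abc", Source: "lobsters", Score: 7})
	fmt.Println(len(rec.Observations), rec.Observations["https://news.example/item?id=1"].Score)
}
```

Here the second merge updates the score for the first discussion instead of adding a duplicate, while the third adds context from a different source to the same Item.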

Source provenance

Source provenance records where an Item was discovered:

Feed URL
Tag page
Daily index
Submission endpoint

This allows Mill to answer questions like:

“Which feeds surfaced this item?”
“Why did this item enter the system?”
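
One possible way to record provenance and answer the first question is sketched below. The Provenance struct, its Kind strings, and the helper names are hypothetical; only the four discovery categories come from the list above.

```go
package main

import (
	"fmt"
	"sort"
)

// Provenance records one way an Item entered the system (assumed shape).
type Provenance struct {
	Kind string // "feed", "tag_page", "daily_index", or "submission"
	URL  string // where the Item was discovered
}

// AddProvenance appends p unless an identical entry already exists, so
// repeated fetches do not duplicate discovery records.
func AddProvenance(existing []Provenance, p Provenance) []Provenance {
	for _, e := range existing {
		if e == p {
			return existing
		}
	}
	return append(existing, p)
}

// FeedsThatSurfaced answers "Which feeds surfaced this item?" by returning
// the sorted URLs of all feed-type provenance entries.
func FeedsThatSurfaced(ps []Provenance) []string {
	var feeds []string
	for _, p := range ps {
		if p.Kind == "feed" {
			feeds = append(feeds, p.URL)
		}
	}
	sort.Strings(feeds)
	return feeds
}

func main() {
	var ps []Provenance
	ps = AddProvenance(ps, Provenance{Kind: "feed", URL: "https://b.example/feed.xml"})
	ps = AddProvenance(ps, Provenance{Kind: "feed", URL: "https://b.example/feed.xml"}) // repeated fetch
	ps = AddProvenance(ps, Provenance{Kind: "submission", URL: "https://mill.example/submit"})
	ps = AddProvenance(ps, Provenance{Kind: "feed", URL: "https://a.example/feed.xml"})
	fmt.Println(FeedsThatSurfaced(ps))
}
```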

Sub-Projects

Fetcher

Responsible for discovering content from external systems.

Each source implementation knows how to:
- Fetch its data
- Interpret links and metadata
- Emit observations

Fetchers do not persist data
Fetchers do not assign identities

Fetcher interface (conceptual):

Fetch(context) → Observations
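
In Go, the conceptual signature above might translate to something like the following. The Observation fields, the staticFetcher stand-in, and the FetchAll and example helpers are assumptions for illustration; a real source would make network requests and parse the response.

```go
package main

import (
	"context"
	"fmt"
)

// Observation is what a source emits (assumed fields).
type Observation struct {
	TargetURL string // the thing being linked to
	Permalink string // where it was discussed
	Title     string // title as presented by the source
}

// Fetcher discovers content. It neither persists data nor assigns identities.
type Fetcher interface {
	Fetch(ctx context.Context) ([]Observation, error)
}

// staticFetcher is a stand-in source that returns fixed data.
type staticFetcher struct{ obs []Observation }

func (s staticFetcher) Fetch(ctx context.Context) ([]Observation, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before doing any work
	}
	return s.obs, nil
}

// FetchAll runs each fetcher in turn and concatenates what they emit.
func FetchAll(ctx context.Context, fetchers ...Fetcher) ([]Observation, error) {
	var all []Observation
	for _, f := range fetchers {
		obs, err := f.Fetch(ctx)
		if err != nil {
			return nil, err
		}
		all = append(all, obs...)
	}
	return all, nil
}

// example wires two stand-in sources that both link to the same target.
func example() ([]Observation, error) {
	hn := staticFetcher{obs: []Observation{{TargetURL: "https://example.com/post", Permalink: "https://news.example/item?id=1", Title: "A post"}}}
	lob := staticFetcher{obs: []Observation{{TargetURL: "https://example.com/post", Permalink: "https://forum.example/s/abc", Title: "A Post (2024)"}}}
	return FetchAll(context.Background(), hn, lob)
}

func main() {
	obs, err := example()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(obs), "observations emitted")
}
```

Both observations carry the same target URL but nothing here decides that they describe the same Item; that decision is left to the content package.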

Content

Responsible for canonicalization, identity, and persistence.

Owns:
- Identity rules
- Merge semantics
- On-disk layout

Provides idempotent upsert operations
Is the only package allowed to write to content/

Content is intentionally conservative and stable, as it defines the long-term shape of stored data.
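
An idempotent upsert against the filesystem could look roughly like this: load the existing record if there is one, merge, and write the result via a temporary file and rename. The Record shape, JSON encoding, file name, and write-then-rename strategy are all assumptions; the document does not specify the on-disk format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Record is a placeholder for the stored Item shape (assumed for this sketch).
type Record struct {
	TargetURL string            `json:"target_url"`
	Titles    map[string]string `json:"titles"` // source name -> title seen there
}

// Load reads a record, returning an empty one if the file does not exist yet.
func Load(path string) (Record, error) {
	var rec Record
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return rec, nil
	}
	if err != nil {
		return Record{}, err
	}
	if err := json.Unmarshal(data, &rec); err != nil {
		return Record{}, err
	}
	return rec, nil
}

// Upsert merges one (source, title) pair into the record at path, then
// writes a temp file in the same directory and renames it into place.
func Upsert(path, targetURL, source, title string) error {
	rec, err := Load(path)
	if err != nil {
		return err
	}
	rec.TargetURL = targetURL
	if rec.Titles == nil {
		rec.Titles = map[string]string{}
	}
	rec.Titles[source] = title

	data, err := json.MarshalIndent(rec, "", "  ") // map keys are emitted sorted
	if err != nil {
		return err
	}
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, ".item-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if _, err := tmp.Write(append(data, '\n')); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo upserts the same data twice and reports whether the file is unchanged.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "mill-sketch-")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "ab", "abc123", "item.json") // placeholder ID
	if err := Upsert(path, "https://example.com/post", "hn", "A post"); err != nil {
		return false, err
	}
	first, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	if err := Upsert(path, "https://example.com/post", "hn", "A post"); err != nil {
		return false, err
	}
	second, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	return string(first) == string(second), nil
}

func main() {
	same, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("unchanged after repeat upsert:", same)
}
```

Byte-identical output for repeated upserts matters in a Git-backed store, since it keeps re-fetches from producing empty or noisy commits.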

RSS

Responsible for publishing minimal RSS feeds from stored Items.

Reads from persisted content
Does not participate in fetching or merging
Produces stable feed output suitable for readers and podcast clients

RSS entries are intentionally small and link to full static pages.
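
A minimal feed of this kind can be produced with the standard encoding/xml package, as sketched here. The Entry type, the choice of the Item ID as a non-permalink GUID, and the RenderFeed name are assumptions; the element names follow RSS 2.0.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Entry is the small slice of an Item that the feed needs (assumed shape).
type Entry struct {
	ID      string // canonical Item ID
	Title   string
	PageURL string // link to the full static page
}

type rssGUID struct {
	IsPermaLink bool   `xml:"isPermaLink,attr"`
	Value       string `xml:",chardata"`
}

type rssItem struct {
	Title string  `xml:"title"`
	Link  string  `xml:"link"`
	GUID  rssGUID `xml:"guid"`
}

type rssChannel struct {
	Title       string    `xml:"title"`
	Link        string    `xml:"link"`
	Description string    `xml:"description"`
	Items       []rssItem `xml:"item"`
}

type rssDoc struct {
	XMLName xml.Name   `xml:"rss"`
	Version string     `xml:"version,attr"`
	Channel rssChannel `xml:"channel"`
}

// RenderFeed serializes entries in the order given; callers that want stable
// output across runs should pass them in a deterministic order.
func RenderFeed(title, siteURL string, entries []Entry) (string, error) {
	doc := rssDoc{
		Version: "2.0",
		Channel: rssChannel{Title: title, Link: siteURL, Description: title},
	}
	for _, e := range entries {
		doc.Channel.Items = append(doc.Channel.Items, rssItem{
			Title: e.Title,
			Link:  e.PageURL,
			GUID:  rssGUID{IsPermaLink: false, Value: e.ID},
		})
	}
	out, err := xml.MarshalIndent(doc, "", "  ")
	if err != nil {
		return "", err
	}
	return xml.Header + string(out) + "\n", nil
}

func main() {
	feed, err := RenderFeed("Mill", "https://mill.example/", []Entry{
		{ID: "abc123", Title: "A post", PageURL: "https://mill.example/items/abc123.html"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(feed)
}
```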

Static Site

Responsible for generating per-item pages.

One page per Item
Displays canonical content, metadata, and provenance
Links back to source discussions where applicable

Static pages are treated as derived artifacts, not canonical data.
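
Per-item page generation might be sketched with html/template as follows. The Page and Discussion types and the template markup are assumptions for illustration; html/template is used here because it escapes untrusted titles and URLs by context.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// Discussion links back to a source thread (assumed shape).
type Discussion struct {
	Source string
	URL    string
}

// Page is the data rendered for one Item (assumed shape).
type Page struct {
	Title       string
	TargetURL   string
	Discussions []Discussion
}

var pageTmpl = template.Must(template.New("item").Parse(`<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>{{.Title}}</title></head>
<body>
<h1><a href="{{.TargetURL}}">{{.Title}}</a></h1>
{{if .Discussions}}<ul>
{{range .Discussions}}<li><a href="{{.URL}}">{{.Source}}</a></li>
{{end}}</ul>{{end}}
</body>
</html>
`))

// RenderPage produces the HTML for a single Item. Since the page is derived,
// it can be regenerated from stored content at any time.
func RenderPage(p Page) (string, error) {
	var b strings.Builder
	if err := pageTmpl.Execute(&b, p); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	html, err := RenderPage(Page{
		Title:     "A post",
		TargetURL: "https://example.com/post",
		Discussions: []Discussion{
			{Source: "hn", URL: "https://news.example/item?id=1"},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(html)
}
```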

Scheduler

Responsible for orchestration.

Periodic fetches (e.g. twice daily)
Triggered runs for direct submissions
Keeps execution policy separate from business logic
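
The split between policy and work can be illustrated with a loop that knows only when to run a job, not what the job does. The Run signature, the trigger channel, and the demo helper are assumptions for this sketch.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Run invokes job on every tick of interval and whenever trigger fires,
// until ctx is cancelled. It knows nothing about fetching or storage.
func Run(ctx context.Context, interval time.Duration, trigger <-chan struct{}, job func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			job(ctx) // periodic fetch
		case <-trigger:
			job(ctx) // triggered run, e.g. after a direct submission
		}
	}
}

// demo starts the loop with a twice-daily interval, fires one manual
// trigger, then shuts down and reports how many times the job ran.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	trigger := make(chan struct{})
	ran := make(chan struct{})
	stopped := make(chan struct{})
	count := 0

	go func() {
		defer close(stopped)
		Run(ctx, 12*time.Hour, trigger, func(context.Context) {
			count++
			ran <- struct{}{}
		})
	}()

	trigger <- struct{}{} // simulate a direct submission
	<-ran                 // wait for the job to finish
	cancel()
	<-stopped // the loop has exited; count is safe to read
	return count
}

func main() {
	fmt.Println("runs:", demo())
}
```

Because the job runs inline in the loop, runs never overlap; a tick that arrives while a job is in progress is simply handled afterwards.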

Direct Submission (Sister Project)

A closely related sister project accepts direct URL submissions.

Simple authenticated endpoint
Immediate ingestion
Shares internal packages with Mill
Treated as just another source of observations

This keeps ingestion paths uniform, regardless of origin.
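
A submission endpoint along these lines could be a small net/http handler that validates a token and hands the URL to the same ingest path other sources use. The bearer-token scheme, the "url" form field, the /submit path, the Submission type, and the ingest callback are all assumptions; the document says only that the endpoint is simple and authenticated.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// Submission is the observation emitted for a submitted URL (assumed shape).
type Submission struct{ TargetURL string }

// authorized checks an "Authorization: Bearer <token>" header value using a
// constant-time comparison.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// SubmitHandler accepts POST requests carrying a "url" form value and passes
// valid submissions to ingest.
func SubmitHandler(token string, ingest func(Submission) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		u, err := url.Parse(r.FormValue("url"))
		if err != nil || u.Scheme == "" || u.Host == "" {
			http.Error(w, "invalid url", http.StatusBadRequest)
			return
		}
		if err := ingest(Submission{TargetURL: u.String()}); err != nil {
			http.Error(w, "ingest failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})
}

// demo sends one authenticated request through the handler in memory and
// returns the HTTP status code.
func demo() int {
	var got []Submission
	h := SubmitHandler("s3cret", func(s Submission) error {
		got = append(got, s)
		return nil
	})
	req := httptest.NewRequest(http.MethodPost, "/submit?url=https%3A%2F%2Fexample.com%2Fpost", nil)
	req.Header.Set("Authorization", "Bearer s3cret")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("status:", demo())
}
```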

Long-Term Extensions

Mill is designed to support, without redesign:

AI-generated summaries (content, social, metadata)
Text-to-speech artifacts
Alternate publication formats
Database-backed indexing
Multiple fetch schedules per source

These are layered on top of the existing content model rather than replacing it.

Summary

Mill is a deliberately constrained system:

One user
One repo
One canonical data model

By focusing early on identity, provenance, and merge semantics, Mill aims to make later features easy to add without rewriting the core.

The design favors clarity and durability over cleverness, and treats persistence formats as long-term contracts rather than implementation details.