Mill: Content Fetching, Storage, and Publishing System
Overview
Mill is a single-user system for collecting, persisting, and publishing technical content from the web. It is designed to ingest content from feeds, aggregators, and direct submissions; store it in a durable, inspectable format; and publish derived outputs such as RSS feeds, static pages, and summaries.
Mill is intentionally local-first, Git-backed, and incrementally extensible. It favors simple, explicit data models and filesystem storage over early database or infrastructure commitments.
The system is implemented in Go and organized as a mono-repository containing multiple cooperating sub-projects.
Goals
Primary goals
- Fetch content from multiple technical sources (feeds, aggregators, direct URLs)
- Deduplicate content across sources using stable identity rules
- Persist raw and derived data in a Git repository
- Publish results as static artifacts (RSS, HTML, etc.)
- Support incremental enrichment (metadata, summaries, text-to-speech)
Non-goals (initially)
- Multi-user support
- Real-time ingestion
- Strong consistency guarantees across processes
- Complex query APIs
- Premature database or search infrastructure
Core Design Principles
1. Canonical identity over source identity
Mill distinguishes between:
- The thing itself (the target content)
- Where it was discussed (HN, Lobsters, etc.)
- Where it was discovered (feed, tag page, daily index)
This separation enables deduplication across sources while preserving provenance.
2. Filesystem as the primary datastore
All persistent state lives on disk in a structured directory layout:
- Easy to inspect
- Easy to version with Git
- Easy to migrate later if needed
Databases are considered an implementation detail that may be introduced later, not a foundational dependency.
3. Append-and-merge, not replace
Fetchers do not “own” items. They emit observations that are merged into existing records. The system is designed to:
- Re-fetch safely
- Be idempotent
- Accumulate context over time
4. Separation of discovery and persistence
Fetch logic discovers what exists. Content logic decides what it means and how it is stored.
Mono-Repository Structure
Mill is a mono-repo containing several long-lived packages and sub-projects.
mill/
├── cmd/
│   └── mill/           # Main CLI / runner
├── internal/
│   ├── fetcher/        # Source-specific fetching logic
│   │   └── sources/    # Individual source implementations
│   ├── content/        # Canonical data model + persistence
│   ├── rss/            # RSS generation
│   ├── site/           # Static page generation
│   └── scheduler/      # Periodic execution
├── content/            # Persisted items (Git-backed)
└── public/             # Published artifacts (RSS, HTML)
Core Concepts
Item
An Item represents a single, canonical piece of content (e.g. an article or page).
- Identified by a stable hash of its target URL
- Persisted on disk under a deterministic path
- Accumulates metadata over time
Items do not belong to a single source.
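As a rough illustration of "stable hash of its target URL" and "deterministic path", the sketch below normalizes a URL, hashes it, and derives a file location. The normalization rules, the choice of SHA-256, the two-character shard directory, and the names CanonicalURL, ItemID, and ItemPath are all assumptions made for this example, not Mill's confirmed identity rules.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"path/filepath"
	"strings"
)

// CanonicalURL normalizes a target URL so that trivially different spellings
// share one identity. The rules here (lowercase scheme and host, drop the
// fragment, trim one trailing slash) are illustrative assumptions.
func CanonicalURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", raw, err)
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment, u.RawFragment = "", ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// ItemID returns the hex-encoded SHA-256 of a canonical URL (64 characters).
func ItemID(canonical string) string {
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:])
}

// ItemPath maps an ID to a file under root, sharded by the first two hex
// characters so no single directory grows too large (hypothetical layout).
// It expects an ID produced by ItemID.
func ItemPath(root, id string) string {
	return filepath.Join(root, id[:2], id+".json")
}
```

Under these assumed rules, `HTTPS://Example.COM/post/#top` and `https://example.com/post` resolve to the same ID and therefore the same file, which is what lets two sources that link to the same article converge on one Item.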
Observation
An Observation represents source-specific context about an Item:
- A discussion thread
- A score or comment count
- A title as presented by a source
Observations are keyed by discussion/permalink URLs and are updated in place as sources are re-fetched.
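One way to picture "keyed by permalink, updated in place" is a map on the Item whose keys are discussion URLs. The sketch below is a minimal in-memory version; the field names and the Merge method are hypothetical and chosen only to illustrate the idempotent merge described above.

```go
package main

// Observation is source-specific context about an Item.
// All field names are assumptions for this sketch.
type Observation struct {
	TargetURL string // the content being discussed
	Permalink string // discussion URL; used as the merge key
	Source    string // e.g. "hn" or "lobsters"
	Title     string // title as presented by the source
	Score     int
	Comments  int
}

// Item is the canonical record that accumulates observations.
type Item struct {
	ID           string
	TargetURL    string
	Observations map[string]Observation // keyed by Permalink
}

// Merge records obs on the item and reports whether anything changed.
// Merging an identical observation again is a no-op, so re-fetching a source
// is safe; merging a newer observation for the same permalink overwrites the
// previous one in place rather than appending a duplicate.
func (it *Item) Merge(obs Observation) bool {
	if it.Observations == nil {
		it.Observations = make(map[string]Observation)
	}
	if prev, ok := it.Observations[obs.Permalink]; ok && prev == obs {
		return false
	}
	it.Observations[obs.Permalink] = obs
	return true
}
```

Returning a changed flag is a design choice of the sketch: it would let a caller skip rewriting a file, and so avoid an empty Git commit, when a re-fetch brings nothing new.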
Source provenance
Source provenance records where an Item was discovered:
- Feed URL
- Tag page
- Daily index
- Submission endpoint
This allows Mill to answer questions like:
- “Which feeds surfaced this item?”
- “Why did this item enter the system?”
Sub-Projects
Fetcher
Responsible for discovering content from external systems.
- Each source implementation knows how to:
  - Fetch its data
  - Interpret links and metadata
  - Emit observations
- Fetchers do not persist data
- Fetchers do not assign identities
Fetcher interface (conceptual):
Fetch(context) → Observations
Content
Responsible for canonicalization, identity, and persistence.
- Owns:
  - Identity rules
  - Merge semantics
  - On-disk layout
- Provides idempotent upsert operations
- Is the only package allowed to write to content/
Content is intentionally conservative and stable, as it defines the long-term shape of stored data.
RSS
Responsible for publishing minimal RSS feeds from stored Items.
- Reads from persisted content
- Does not participate in fetching or merging
- Produces stable feed output suitable for readers and podcast clients
RSS entries are intentionally small and link to full static pages.
Static Site
Responsible for generating per-item pages.
- One page per Item
- Displays canonical content, metadata, and provenance
- Links back to source discussions where applicable
Static pages are treated as derived artifacts, not canonical data.
Scheduler
Responsible for orchestration.
- Periodic fetches (e.g. twice daily)
- Triggered runs for direct submissions
- Keeps execution policy separate from business logic
Direct Submission (Sister Project)
A closely related sub-project allows direct URL submission.
- Simple authenticated endpoint
- Immediate ingestion
- Shares internal packages with Mill
- Treated as just another source of observations
This keeps ingestion paths uniform, regardless of origin.
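A "simple authenticated endpoint" could be as small as one handler that checks a shared bearer token and hands the URL to the same ingestion path the fetchers use. Everything specific here is an assumption: the bearer-token scheme, the `url` form field, the status codes, and the `ingest` callback that stands in for whatever turns a submission into an observation.

```go
package main

import (
	"crypto/subtle"
	"net/http"
	"net/url"
	"strings"
)

// authorized reports whether header carries the expected bearer token.
// An empty configured token never authorizes, so a missing setting fails closed.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	// Constant-time comparison avoids leaking how much of the token matched.
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// SubmitHandler accepts POST requests with a "url" form field and passes the
// URL to ingest, which is expected to treat it like any other observation.
func SubmitHandler(token string, ingest func(targetURL string) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		target := r.FormValue("url")
		u, err := url.Parse(target)
		if err != nil || u.Scheme == "" || u.Host == "" {
			http.Error(w, "url must be absolute", http.StatusBadRequest)
			return
		}
		if err := ingest(target); err != nil {
			http.Error(w, "ingest failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})
}
```

The handler contains no storage logic of its own. It only validates the request and forwards the URL, so a submitted link goes through the same identity and merge rules as a fetched one.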
Long-Term Extensions
Mill is designed to support, without redesign:
- AI-generated summaries (content, social, metadata)
- Text-to-speech artifacts
- Alternate publication formats
- Database-backed indexing
- Multiple fetch schedules per source
These are layered on top of the existing content model rather than replacing it.
Summary
Mill is a deliberately constrained system:
- One user
- One repo
- One canonical data model
By focusing early on identity, provenance, and merge semantics, Mill aims to make later features easy to add without rewriting the core.
The design favors clarity and durability over cleverness, and treats persistence formats as long-term contracts rather than implementation details.