Mill: Content Fetching, Storage, and Publishing System

Overview

Mill is a single-user system for collecting, persisting, and publishing technical content from the web. It is designed to ingest content from feeds, aggregators, and direct submissions; store it in a durable, inspectable format; and publish derived outputs such as RSS feeds, static pages, and summaries.

Mill is intentionally local-first, Git-backed, and incrementally extensible. It favors simple, explicit data models and filesystem storage over early database or infrastructure commitments.

The system is implemented in Go and organized as a mono-repository containing multiple cooperating sub-projects.


Goals

Primary goals

  • Fetch content from multiple technical sources (feeds, aggregators, direct URLs)
  • Deduplicate content across sources using stable identity rules
  • Persist raw and derived data in a Git repository
  • Publish results as static artifacts (RSS, HTML, etc.)
  • Support incremental enrichment (metadata, summaries, text-to-speech)

Non-goals (initially)

  • Multi-user support
  • Real-time ingestion
  • Strong consistency guarantees across processes
  • Complex query APIs
  • Premature database or search infrastructure

Core Design Principles

1. Canonical identity over source identity

Mill distinguishes between:

  • The thing itself (the target content)
  • Where it was discussed (HN, Lobsters, etc.)
  • Where it was discovered (feed, tag page, daily index)

This separation enables deduplication across sources while preserving provenance.

2. Filesystem as the primary datastore

All persistent state lives on disk in a structured directory layout, which keeps it:

  • Easy to inspect
  • Easy to version with Git
  • Easy to migrate later if needed

Databases are considered an implementation detail that may be introduced later, not a foundational dependency.

3. Append-and-merge, not replace

Fetchers do not “own” items. They emit observations that are merged into existing records. The system is designed to:

  • Re-fetch safely
  • Be idempotent
  • Accumulate context over time

4. Separation of discovery and persistence

Fetch logic discovers what exists. Content logic decides what it means and how it is stored.


Mono-Repository Structure

Mill is a mono-repo containing several long-lived packages and sub-projects.

mill/
├── cmd/
│   └── mill/            # Main CLI / runner
├── internal/
│   ├── fetcher/         # Source-specific fetching logic
│   │   └── sources/     # Individual source implementations
│   ├── content/         # Canonical data model + persistence
│   ├── rss/             # RSS generation
│   ├── site/            # Static page generation
│   └── scheduler/       # Periodic execution
├── content/             # Persisted items (Git-backed)
└── public/              # Published artifacts (RSS, HTML)

Core Concepts

Item

An Item represents a single, canonical piece of content (e.g. an article or page).

  • Identified by a stable hash of its target URL
  • Persisted on disk under a deterministic path
  • Accumulates metadata over time

Items do not belong to a single source.
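
The identity rule above can be sketched in a few lines. This is only an illustration: the normalization steps (lowercase host, no fragment, no trailing slash), the choice of SHA-256, and the two-character fan-out directory are assumptions, and canonicalURL, itemID, and itemPath are hypothetical names rather than Mill's actual API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"path/filepath"
	"strings"
)

// canonicalURL applies a small, assumed normalization so that trivially
// different spellings of one URL hash alike: lowercase host, no fragment,
// no trailing slash.
func canonicalURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// itemID returns the hex-encoded SHA-256 of the canonical target URL.
func itemID(targetURL string) (string, error) {
	c, err := canonicalURL(targetURL)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256([]byte(c))
	return hex.EncodeToString(sum[:]), nil
}

// itemPath maps an ID to a deterministic location under content/.
// The fan-out directory and file name are hypothetical layout choices.
func itemPath(id string) string {
	return filepath.Join("content", id[:2], id, "item.json")
}

func main() {
	for _, raw := range []string{
		"https://Example.com/post/#comments",
		"https://example.com/post",
	} {
		id, err := itemID(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		// Both spellings print the same path.
		fmt.Println(itemPath(id))
	}
}
```

Whatever normalization is chosen becomes part of the on-disk contract, since changing it later changes every Item's path.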

Observation

An Observation represents source-specific context about an Item:

  • A discussion thread
  • A score or comment count
  • A title as presented by a source

Observations are keyed by discussion/permalink URLs and are updated in place as sources are re-fetched.
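
A minimal sketch of that merge rule, assuming an in-memory map keyed by permalink. The field names and the example URLs are hypothetical; the point is only that re-fetching a source overwrites the matching observation instead of adding a duplicate.

```go
package main

import "fmt"

// Observation is source-specific context about an Item.
type Observation struct {
	Permalink string // discussion URL; used as the merge key
	Source    string // e.g. "hn", "lobsters"
	Title     string // title as presented by the source
	Score     int
	Comments  int
}

// Item holds the canonical target plus everything observed about it.
type Item struct {
	TargetURL    string
	Observations map[string]Observation // keyed by Permalink
}

// Merge records obs on the item, replacing any earlier observation of the
// same discussion. Re-applying an identical observation changes nothing.
func (it *Item) Merge(obs Observation) {
	if it.Observations == nil {
		it.Observations = make(map[string]Observation)
	}
	it.Observations[obs.Permalink] = obs
}

func main() {
	it := &Item{TargetURL: "https://example.com/post"}

	it.Merge(Observation{Permalink: "https://hn.example/item/1", Source: "hn", Title: "A post", Score: 10, Comments: 2})
	it.Merge(Observation{Permalink: "https://lobsters.example/s/abc", Source: "lobsters", Title: "A post (2024)", Score: 5})
	// A later re-fetch of the first source updates the same entry in place.
	it.Merge(Observation{Permalink: "https://hn.example/item/1", Source: "hn", Title: "A post", Score: 42, Comments: 17})

	fmt.Println(len(it.Observations))                             // 2
	fmt.Println(it.Observations["https://hn.example/item/1"].Score) // 42
}
```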

Source provenance

Source provenance records where an Item was discovered:

  • Feed URL
  • Tag page
  • Daily index
  • Submission endpoint

This allows Mill to answer questions like:

  • “Which feeds surfaced this item?”
  • “Why did this item enter the system?”
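
One way to make those questions answerable is to keep a set of discovery records on each Item. The sketch below assumes a hypothetical Discovery type with Kind and URL fields; none of these names come from Mill itself.

```go
package main

import (
	"fmt"
	"sort"
)

// Discovery records one place where an Item was found.
type Discovery struct {
	Kind string // assumed values: "feed", "tag-page", "daily-index", "submission"
	URL  string
}

// Item tracks every distinct discovery as a set.
type Item struct {
	TargetURL    string
	DiscoveredAt map[Discovery]bool
}

// RecordDiscovery adds d to the set; recording it again is a no-op.
func (it *Item) RecordDiscovery(d Discovery) {
	if it.DiscoveredAt == nil {
		it.DiscoveredAt = make(map[Discovery]bool)
	}
	it.DiscoveredAt[d] = true
}

// SurfacedBy lists, in sorted order, the URLs of the given kind that
// surfaced this item.
func (it *Item) SurfacedBy(kind string) []string {
	var urls []string
	for d := range it.DiscoveredAt {
		if d.Kind == kind {
			urls = append(urls, d.URL)
		}
	}
	sort.Strings(urls)
	return urls
}

func main() {
	it := &Item{TargetURL: "https://example.com/post"}
	it.RecordDiscovery(Discovery{Kind: "feed", URL: "https://feeds.example/go.xml"})
	it.RecordDiscovery(Discovery{Kind: "feed", URL: "https://feeds.example/go.xml"}) // duplicate
	it.RecordDiscovery(Discovery{Kind: "tag-page", URL: "https://agg.example/t/go"})

	fmt.Println(it.SurfacedBy("feed")) // [https://feeds.example/go.xml]
}
```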

Sub-Projects

Fetcher

Responsible for discovering content from external systems.

  • Each source implementation knows how to:
    • Fetch its data
    • Interpret links and metadata
    • Emit observations
  • Fetchers do not persist data
  • Fetchers do not assign identities

Fetcher interface (conceptual):

Fetch(context) → Observations
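
In Go, the conceptual signature above might look like the following. This is a sketch, not the project's actual interface: the Observation fields, the staticFetcher stand-in, and the fetchAll and runOnce helpers are all assumptions made for illustration. Note that nothing here persists data or computes an identity.

```go
package main

import (
	"context"
	"fmt"
)

// Observation is what a fetcher emits: raw, source-specific facts.
type Observation struct {
	TargetURL string // the thing itself
	Permalink string // where it was discussed
	Source    string
	Title     string
}

// Fetcher discovers content from one external system.
type Fetcher interface {
	Fetch(ctx context.Context) ([]Observation, error)
}

// staticFetcher is a stand-in source that returns canned observations.
type staticFetcher struct {
	obs []Observation
}

func (f staticFetcher) Fetch(ctx context.Context) ([]Observation, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before doing any work
	}
	return f.obs, nil
}

// fetchAll runs each fetcher in turn and concatenates the results.
func fetchAll(ctx context.Context, fetchers ...Fetcher) ([]Observation, error) {
	var all []Observation
	for _, f := range fetchers {
		obs, err := f.Fetch(ctx)
		if err != nil {
			return nil, fmt.Errorf("fetch: %w", err)
		}
		all = append(all, obs...)
	}
	return all, nil
}

// runOnce performs a single fetch pass with a background context.
func runOnce(fetchers ...Fetcher) ([]Observation, error) {
	return fetchAll(context.Background(), fetchers...)
}

func main() {
	a := staticFetcher{obs: []Observation{{TargetURL: "https://example.com/post", Permalink: "https://hn.example/item/1", Source: "hn", Title: "A post"}}}
	b := staticFetcher{obs: []Observation{{TargetURL: "https://example.com/post", Permalink: "https://lobsters.example/s/abc", Source: "lobsters", Title: "A post (2024)"}}}

	obs, err := runOnce(a, b)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(obs)) // 2
}
```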

Content

Responsible for canonicalization, identity, and persistence.

  • Owns:
    • Identity rules
    • Merge semantics
    • On-disk layout
  • Provides idempotent upsert operations
  • Is the only package allowed to write to content/

Content is intentionally conservative and stable, as it defines the long-term shape of stored data.
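
An idempotent upsert against the filesystem could be sketched as load, merge, write. Everything specific here is an assumption: the JSON encoding, the item.json file name, the fan-out directory, and the Store type are hypothetical, and the temporary-file-then-rename step is just one common way to avoid leaving a half-written file behind.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

type Observation struct {
	Permalink string `json:"permalink"`
	Title     string `json:"title"`
	Score     int    `json:"score"`
}

type Item struct {
	ID           string                 `json:"id"`
	TargetURL    string                 `json:"target_url"`
	Observations map[string]Observation `json:"observations"`
}

// Store writes items beneath Root. IDs are assumed to be at least two
// characters long (for example, a hex digest).
type Store struct{ Root string }

// newTempStore creates a throwaway store for demonstration purposes.
func newTempStore() (Store, error) {
	dir, err := os.MkdirTemp("", "mill-content-")
	return Store{Root: dir}, err
}

func (s Store) path(id string) string {
	return filepath.Join(s.Root, id[:2], id, "item.json")
}

// Load returns the stored item, or an empty one if it does not exist yet.
func (s Store) Load(id string) (Item, error) {
	it := Item{ID: id, Observations: map[string]Observation{}}
	data, err := os.ReadFile(s.path(id))
	if errors.Is(err, fs.ErrNotExist) {
		return it, nil
	}
	if err != nil {
		return it, err
	}
	err = json.Unmarshal(data, &it)
	return it, err
}

// Upsert merges obs into the item and rewrites its file.
func (s Store) Upsert(id, targetURL string, obs Observation) error {
	it, err := s.Load(id)
	if err != nil {
		return err
	}
	if it.Observations == nil {
		it.Observations = map[string]Observation{}
	}
	it.TargetURL = targetURL
	it.Observations[obs.Permalink] = obs

	// encoding/json sorts map keys, so equal state yields equal bytes.
	data, err := json.MarshalIndent(it, "", "  ")
	if err != nil {
		return err
	}
	p := s.path(id)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	tmp := p + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, p)
}

func main() {
	s, err := newTempStore()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(s.Root)

	obs := Observation{Permalink: "https://hn.example/item/1", Title: "A post", Score: 10}
	for i := 0; i < 2; i++ { // the second call leaves the file unchanged
		if err := s.Upsert("ab12cd", "https://example.com/post", obs); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	it, _ := s.Load("ab12cd")
	fmt.Println(len(it.Observations)) // 1
}
```

Because equal state serializes to equal bytes, a repeated fetch with no new information produces no Git diff.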


RSS

Responsible for publishing minimal RSS feeds from stored Items.

  • Reads from persisted content
  • Does not participate in fetching or merging
  • Produces stable feed output suitable for readers and podcast clients

RSS entries are intentionally small and link to full static pages.
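
A minimal feed of this shape can be produced with the standard encoding/xml package. The sketch below is an assumption about how such output might be built: the struct names, the choice to use the Item ID as a non-permalink guid, and sorting by ID for reproducible output are illustrative (a real feed would more likely order entries by date).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sort"
)

// Item is the subset of stored data the feed needs.
type Item struct {
	ID      string
	Title   string
	PageURL string // link to the item's static page
}

type rssDoc struct {
	XMLName xml.Name   `xml:"rss"`
	Version string     `xml:"version,attr"`
	Channel rssChannel `xml:"channel"`
}

type rssChannel struct {
	Title       string     `xml:"title"`
	Link        string     `xml:"link"`
	Description string     `xml:"description"`
	Entries     []rssEntry `xml:"item"`
}

type rssEntry struct {
	Title string  `xml:"title"`
	Link  string  `xml:"link"`
	GUID  rssGUID `xml:"guid"`
}

// rssGUID marks the identifier as an opaque ID rather than a URL.
type rssGUID struct {
	IsPermaLink bool   `xml:"isPermaLink,attr"`
	Value       string `xml:",chardata"`
}

// renderFeed builds an RSS 2.0 document. Entries are sorted by ID so the
// same set of items always yields the same bytes.
func renderFeed(title, link string, items []Item) ([]byte, error) {
	sorted := append([]Item(nil), items...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	doc := rssDoc{
		Version: "2.0",
		Channel: rssChannel{Title: title, Link: link, Description: title},
	}
	for _, it := range sorted {
		doc.Channel.Entries = append(doc.Channel.Entries, rssEntry{
			Title: it.Title,
			Link:  it.PageURL,
			GUID:  rssGUID{Value: it.ID},
		})
	}
	body, err := xml.MarshalIndent(doc, "", "  ")
	if err != nil {
		return nil, err
	}
	return append([]byte(xml.Header), body...), nil
}

func main() {
	out, err := renderFeed("Mill", "https://mill.example/", []Item{
		{ID: "b2", Title: "Second", PageURL: "https://mill.example/items/b2/"},
		{ID: "a1", Title: "First", PageURL: "https://mill.example/items/a1/"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```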


Static Site

Responsible for generating per-item pages.

  • One page per Item
  • Displays canonical content, metadata, and provenance
  • Links back to source discussions where applicable

Static pages are treated as derived artifacts, not canonical data.
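
Per-item pages can be rendered with the standard html/template package, which escapes source-supplied titles and URLs automatically. The Page type, its fields, and the markup below are hypothetical; this only illustrates deriving a page from stored data.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// Page is the view model for a single item page.
type Page struct {
	Title       string
	TargetURL   string
	Discussions []string // source discussion permalinks
	FoundVia    []string // provenance: feeds, tag pages, and so on
}

var pageTmpl = template.Must(template.New("item").Parse(`<!doctype html>
<html>
<head><meta charset="utf-8"><title>{{.Title}}</title></head>
<body>
<h1><a href="{{.TargetURL}}">{{.Title}}</a></h1>
<h2>Discussions</h2>
<ul>{{range .Discussions}}
  <li><a href="{{.}}">{{.}}</a></li>{{end}}
</ul>
<h2>Discovered via</h2>
<ul>{{range .FoundVia}}
  <li>{{.}}</li>{{end}}
</ul>
</body>
</html>
`))

// renderPage returns the HTML for one item.
func renderPage(p Page) (string, error) {
	var buf bytes.Buffer
	if err := pageTmpl.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	html, err := renderPage(Page{
		Title:       "Parsing & Lexing",
		TargetURL:   "https://example.com/post",
		Discussions: []string{"https://hn.example/item/1"},
		FoundVia:    []string{"https://feeds.example/go.xml"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(html)
}
```

Since the page is a pure function of the stored Item, public/ can be deleted and regenerated at any time.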


Scheduler

Responsible for orchestration.

  • Periodic fetches (e.g. twice daily)
  • Triggered runs for direct submissions
  • Keeps execution policy separate from business logic
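
The split between execution policy and business logic can be sketched as a loop that only decides when to run, while the job itself is passed in. The names runLoop and simulate are hypothetical, and the channels are fed by hand here so the example finishes immediately; a long-running process would more plausibly pass time.NewTicker(12 * time.Hour).C as the tick channel.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runLoop invokes job on every tick and on every trigger until ctx is done.
// It knows nothing about fetching or storage.
func runLoop(ctx context.Context, tick <-chan time.Time, trigger <-chan struct{}, job func(reason string)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick:
			job("scheduled")
		case <-trigger:
			job("triggered")
		}
	}
}

// simulate drives runLoop with one periodic tick and one direct-submission
// trigger, then stops it and reports why each run happened.
func simulate() []string {
	ctx, cancel := context.WithCancel(context.Background())
	tick := make(chan time.Time)
	trigger := make(chan struct{})
	done := make(chan struct{})

	var reasons []string
	go func() {
		defer close(done)
		runLoop(ctx, tick, trigger, func(reason string) {
			reasons = append(reasons, reason)
		})
	}()

	tick <- time.Now()    // unbuffered: returns once the loop has received it
	trigger <- struct{}{} // received only after the first job has finished
	cancel()
	<-done // wait for the loop to exit before reading reasons
	return reasons
}

func main() {
	fmt.Println(simulate()) // [scheduled triggered]
}
```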

Direct Submission (Sister Project)

A closely related sub-project allows direct URL submission.

  • Simple authenticated endpoint
  • Immediate ingestion
  • Shares internal packages with Mill
  • Treated as just another source of observations

This keeps ingestion paths uniform, regardless of origin.
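
As an illustration only, a submission endpoint of this kind could be a single net/http handler that checks a bearer token and hands the URL to the same pipeline the fetchers feed. The JSON body shape, the /submit path, the token scheme, and the enqueue callback are all assumptions; the post helper exists only to exercise the handler in memory.

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Submission is the assumed request body: {"url": "..."}.
type Submission struct {
	URL string `json:"url"`
}

// submitHandler accepts POSTed submissions guarded by a bearer token and
// passes each accepted one to enqueue.
func submitHandler(token string, enqueue func(Submission)) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		got := []byte(r.Header.Get("Authorization"))
		if subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		var s Submission
		if err := json.NewDecoder(r.Body).Decode(&s); err != nil || s.URL == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		enqueue(s)
		w.WriteHeader(http.StatusAccepted)
	})
}

// post sends one in-memory request through h and returns the status code.
func post(h http.Handler, token, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/submit", strings.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	var queue []Submission
	h := submitHandler("s3cret", func(s Submission) { queue = append(queue, s) })

	fmt.Println(post(h, "wrong", `{"url":"https://example.com/post"}`))  // 401
	fmt.Println(post(h, "s3cret", `{"url":"https://example.com/post"}`)) // 202
	fmt.Println(len(queue))                                              // 1
}
```

From here the queued URL would be turned into an observation with a "submission" provenance, so the content package sees no difference between a submitted item and a fetched one.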


Long-Term Extensions

Mill is designed to support the following without a redesign of the core:

  • AI-generated summaries (content, social, metadata)
  • Text-to-speech artifacts
  • Alternate publication formats
  • Database-backed indexing
  • Multiple fetch schedules per source

These are layered on top of the existing content model rather than replacing it.


Summary

Mill is a deliberately constrained system:

  • One user
  • One repo
  • One canonical data model

By focusing early on identity, provenance, and merge semantics, Mill aims to make later features easy to add without rewriting the core.

The design favors clarity and durability over cleverness, and treats persistence formats as long-term contracts rather than implementation details.