Commit 9f54afc
Changed files (1)
FEATURE_PARITY.md
@@ -0,0 +1,362 @@
+# Feature Parity: Python warcprox vs Go gowarcprox
+
+**Last Updated**: 2026-01-07
+**Go Version**: 0.1.0 (Post Phase 4)
+**Python Version**: 2.x (from internetarchive/warcprox)
+
+## Implementation Status Legend
+
+- ✅ **Implemented**: Feature complete and tested
+- 🚧 **In Progress**: Partially implemented
+- 📋 **Planned**: Scheduled for implementation
+- ❌ **Not Planned**: Explicitly excluded (plugins, RethinkDB)
+- ⚠️ **Differs**: Implemented differently in Go
+
+---
+
+## 1. Core Proxy Features
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| HTTP proxy | ✓ | ✓ | ✅ | Phase 1 - Full HTTP/1.1 support |
+| HTTPS MITM proxy | ✓ | ✓ | ✅ | Phase 2 - Certificate interception |
+| CONNECT method handling | ✓ | ✓ | ✅ | Phase 2 - TLS tunneling |
+| Certificate authority | ✓ | ✓ | ✅ | Phase 2 - Auto-generates CA |
+| Dynamic cert generation | ✓ | ✓ | ✅ | Phase 2 - On-demand per host |
+| Wildcard certificates | ✓ | ✓ | ✅ | Phase 2 - *.example.com support |
+| Certificate caching | ✓ | ✓ | ✅ | Phase 2 - In-memory with sync.Map |
+| Persistent cert storage | ✓ | ✓ | ✅ | Phase 2 - Saved to --certs-dir |
+| Socket timeout config | ✓ | ✓ | ✅ | Phase 1 - --socket-timeout flag |
+| Max threads/concurrency | ✓ | ✓ | ✅ | Phase 1 - Goroutines, no hard limit |
+| Request queue | ✓ | ✓ | ✅ | Phase 3 - Buffered channels |
+| Connection pooling | ✓ | ? | 📋 | Phase 8 - Reuse backend connections |
+| Bad hostname cache | ✓ | ? | 📋 | Phase 8 - TTL cache for failed hosts |
+| Thread pool executor | ✓ | N/A | ⚠️ | Go uses goroutines, not thread pools |
+
+---
+
+## 2. WARC Writing Features
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| WARC 1.1 format | ✓ | ✓ | ✅ | Via gowarc library |
+| Request records | ✓ | ✓ | ✅ | Phase 4 - application/http; msgtype=request |
+| Response records | ✓ | ✓ | ✅ | Phase 4 - application/http; msgtype=response |
+| Warcinfo records | ✓ | ✓ | ✅ | Phase 4 - Via gowarc rotator |
+| Metadata records | ✓ | ? | 📋 | Phase 8 - WARCPROX_WRITE_RECORD method |
+| Resource records | ✓ | ? | 📋 | Phase 8 - Custom WARC-Type |
+| Revisit records | ✓ | ? | 📋 | Phase 5 - Deduplication required |
+| WARC file rotation | ✓ | ✓ | ✅ | Phase 4 - Via gowarc rotator |
+| Size-based rollover | ✓ | ✓ | ✅ | Phase 4 - --size flag (default 1GB) |
+| Time-based rollover | ✓ | ? | 📋 | Phase 8 - --rollover-idle-time |
+| GZIP compression | ✓ | ✓ | ✅ | Phase 4 - --gzip flag |
+| ZSTD compression | ? | ✓ | ⚠️ | gowarc supports, not tested |
+| Custom WARC prefix | ✓ | ✓ | ✅ | Phase 4 - --prefix flag |
+| Custom WARC filename | ✓ | ? | 📋 | Phase 8 - Template with variables |
+| Subdir by prefix | ✓ | ? | 📋 | Phase 8 - --subdir-prefix |
+| .open suffix | ✓ | ✓ | ✅ | Phase 4 - Handled by gowarc |
+| File locking | ✓ | ? | 📋 | Phase 8 - When --no-warc-open-suffix |
+| Multi-prefix support | ✓ | ? | 📋 | Phase 7 - WARC writer pool |
+| Special prefix "-" | ✓ | ? | 📋 | Phase 7 - Disable archiving |
+| Serial number tracking | ✓ | ✓ | ✅ | Phase 4 - Via gowarc |
+| Random token | ✓ | ✓ | ✅ | Phase 4 - Via gowarc |
+| WARC-Record-ID (UUID) | ✓ | ✓ | ✅ | Phase 4 - UUID v4 |
+| WARC-Date (RFC3339) | ✓ | ✓ | ✅ | Phase 4 - Proper timestamp |
+| WARC-Target-URI | ✓ | ✓ | ✅ | Phase 4 - Full URL |
+| WARC-IP-Address | ✓ | ✓ | ✅ | Phase 4 - Remote server IP |
+| WARC-Payload-Digest | ✓ | ✓ | ✅ | Phase 3 - SHA1/SHA256/BLAKE3 |
+| WARC-Block-Digest | ✓ | ? | 📋 | Phase 8 - Full HTTP block digest |
+| WARC-Concurrent-To | ✓ | ✓ | ✅ | Phase 4 - Links request to response |
+| WARC-Refers-To | ✓ | ? | 📋 | Phase 5 - For revisit records |
+| WARC-Truncated | ✓ | ? | 📋 | Phase 8 - When size exceeded |
+
+---
+
+## 3. Digest Calculation
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| SHA1 digest | ✓ | ✓ | ✅ | Phase 3 - Default algorithm |
+| SHA256 digest | ✓ | ✓ | ✅ | Phase 3 - --digest sha256 |
+| BLAKE3 digest | ? | ✓ | ⚠️ | Go addition via gowarc |
+| MD5 digest | ✓ | ? | 📋 | Python supports via hashlib |
+| Other hash algorithms | ✓ | ? | 📋 | Python supports all hashlib |
+| Base32 encoding | ✓ | ✓ | ✅ | Phase 3 - For SHA digests |
+| Hex encoding | ✓ | ✓ | ✅ | Phase 3 - Default for BLAKE3 |
+| Payload digest | ✓ | ✓ | ✅ | Phase 3 - Response body only |
+| Block digest | ✓ | ? | 📋 | Phase 8 - Full HTTP block |
+| Digest format | ✓ | ✓ | ✅ | Phase 3 - "algorithm:hash" |
+| Configurable algorithm | ✓ | ✓ | ✅ | Phase 3 - --digest flag |
+
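The digest rows above (SHA1 default, Base32 encoding, "algorithm:hash" format, response body only) can be illustrated with a minimal Go sketch. The function name `PayloadDigest` is hypothetical and not taken from gowarcprox or gowarc; it only shows the shape of the value under those assumptions.

```go
package main

import (
	"crypto/sha1"
	"encoding/base32"
	"fmt"
)

// PayloadDigest is a hypothetical helper: it hashes only the response
// body with SHA-1 and renders the result as "sha1:<base32>".
func PayloadDigest(payload []byte) string {
	sum := sha1.Sum(payload)
	// A 20-byte SHA-1 sum encodes to exactly 32 Base32 characters,
	// so no padding characters appear in the output.
	return "sha1:" + base32.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// The empty payload yields the digest commonly seen in WARC files
	// for zero-length bodies.
	fmt.Println(PayloadDigest(nil)) // sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
}
```

Comparing this string between Python and Go output is one way to check the "payload digest must match exactly" requirement from the testing strategy.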
+---
+
+## 4. Deduplication
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| SQLite dedup DB | ✓ | ? | 📋 | Phase 5 - warcprox.sqlite |
+| Dedup table schema | ✓ | ? | 📋 | Phase 5 - key/value pairs |
+| Dedup lookup | ✓ | ? | 📋 | Phase 5 - Before writing |
+| Dedup storage | ✓ | ? | 📋 | Phase 5 - After writing |
+| Revisit record creation | ✓ | ? | 📋 | Phase 5 - identical-payload-digest |
+| Dedup buckets (RW mode) | ✓ | ? | 📋 | Phase 7 - Read-write buckets |
+| Dedup buckets (RO mode) | ✓ | ? | 📋 | Phase 7 - Read-only buckets |
+| Multiple buckets | ✓ | ? | 📋 | Phase 7 - Per-request config |
+| Default bucket | ✓ | ? | 📋 | Phase 5 - __unspecified__ |
+| Bucket key format | ✓ | ? | 📋 | Phase 5 - "digest\|bucket" |
+| Min text size threshold | ✓ | ? | 📋 | Phase 8 - --dedup-min-text-size |
+| Min binary size threshold | ✓ | ? | 📋 | Phase 8 - --dedup-min-binary-size |
+| Blackout period | ✓ | ? | 📋 | Phase 8 - --blackout-period |
+| Bucket requirement mode | ✓ | ? | 📋 | Phase 8 - --dedup-only-with-bucket |
+| Disable deduplication | ✓ | ? | 📋 | Phase 5 - --dedup-db=/dev/null |
+| WAL mode | ✓ | ? | 📋 | Phase 5 - Better concurrency |
+| CDX server dedup | ✓ | - | ❌ | External service |
+| RethinkDB dedup | ✓ | - | ❌ | Excluded per CLAUDE.md |
+| Trough dedup | ✓ | - | ❌ | External system |
+| RethinkDB big table | ✓ | - | ❌ | Excluded per CLAUDE.md |
+
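The bucket key format and default bucket listed above could be sketched as follows. This is an illustration of the "digest|bucket" convention only; `DedupKey` and `defaultBucket` are hypothetical names, and the planned Phase 5 code may be organized differently.

```go
package main

import "fmt"

// defaultBucket mirrors the default bucket name given in the table.
const defaultBucket = "__unspecified__"

// DedupKey is a hypothetical helper that joins a payload digest and a
// bucket name into the "digest|bucket" lookup key. An empty bucket
// falls back to the default bucket.
func DedupKey(digest, bucket string) string {
	if bucket == "" {
		bucket = defaultBucket
	}
	return digest + "|" + bucket
}

func main() {
	fmt.Println(DedupKey("sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", ""))
	fmt.Println(DedupKey("sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "crawl-a"))
}
```

Keeping the key a plain string fits the key/value table schema noted above, since the same value can be used for both lookup before writing and storage after writing.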
+---
+
+## 5. Statistics Tracking
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| SQLite stats DB | ✓ | ? | 📋 | Phase 6 - warcprox.sqlite |
+| Stats table schema | ✓ | ? | 📋 | Phase 6 - buckets_of_stats |
+| Basic URL stats | ✓ | ? | 📋 | Phase 6 - Count, bytes |
+| Stats categories | ✓ | ? | 📋 | Phase 6 - total/new/revisit |
+| Wire bytes tracking | ✓ | ? | 📋 | Phase 6 - Including HTTP headers |
+| Stats buckets | ✓ | ? | 📋 | Phase 7 - Custom bucket names |
+| Default buckets | ✓ | ? | 📋 | Phase 6 - __all__, __unspecified__ |
+| Domain tallying | ✓ | ? | 📋 | Phase 7 - Per-domain sub-buckets |
+| Batch updates | ✓ | ? | 📋 | Phase 6 - Efficiency |
+| JSON storage | ✓ | ? | 📋 | Phase 6 - Stats as JSON blob |
+| Disable statistics | ✓ | ? | 📋 | Phase 6 - --stats-db=/dev/null |
+| Running stats | ✓ | ? | 📋 | Phase 8 - In-memory snapshots |
+| Stats API query | ✓ | ? | 📋 | Phase 8 - Via /status endpoint |
+| RethinkDB stats | ✓ | - | ❌ | Excluded per CLAUDE.md |
+
+---
+
+## 6. Warcprox-Meta Header
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| Header detection | ✓ | ✓ | 🚧 | Phase 3 - Detected but not parsed |
+| JSON header parsing | ✓ | ? | 📋 | Phase 7 - Full implementation |
+| Custom warc-prefix | ✓ | ? | 📋 | Phase 7 - Per-request override |
+| Prefix validation | ✓ | ? | 📋 | Phase 7 - No slashes allowed |
+| Dedup bucket override | ✓ | ? | 📋 | Phase 7 - dedup-buckets field |
+| Stats bucket assignment | ✓ | ? | 📋 | Phase 7 - stats.buckets field |
+| Stats domain tallying | ✓ | ? | 📋 | Phase 7 - tally-domains list |
+| Hard limits (420) | ✓ | ? | 📋 | Phase 8 - limits field |
+| Soft limits (430) | ✓ | ? | 📋 | Phase 8 - soft-limits field |
+| URL blocking rules | ✓ | ? | 📋 | Phase 8 - blocks array |
+| Compressed blocks | ✓ | ? | 📋 | Phase 8 - Zlib + base64 |
+| Metadata inclusion | ✓ | ? | 📋 | Phase 8 - metadata field |
+| Accept flags | ✓ | ? | 📋 | Phase 8 - accept array |
+| Capture metadata | ✓ | ? | 📋 | Phase 8 - Timestamp in response |
+| Backward compatibility | ✓ | ? | 📋 | Phase 7 - captures-bucket, dedup-bucket |
+| MIME type filters | ✓ | ? | 📋 | Phase 8 - mime-type-filters |
+| Dedup-ok flag | ✓ | - | ❌ | RethinkDB big table only |
+| Captures table extras | ✓ | - | ❌ | RethinkDB big table only |
+
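As a rough illustration of the Phase 7 rows above (JSON header parsing, per-request `warc-prefix`, prefix validation, `dedup-buckets`), the following Go sketch decodes a header value with `encoding/json`. The struct and function names are hypothetical, only two of the listed fields are modeled, and the `"rw"` bucket mode value is an assumption about the header's shape rather than something this document specifies.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// WarcproxMeta is a hypothetical, partial model of the Warcprox-Meta
// header. Only warc-prefix and dedup-buckets are represented here.
type WarcproxMeta struct {
	WarcPrefix   string            `json:"warc-prefix"`
	DedupBuckets map[string]string `json:"dedup-buckets"` // bucket name -> mode (assumed "rw"/"ro")
}

// ParseWarcproxMeta decodes the raw header value and rejects prefixes
// containing a slash, per the "No slashes allowed" rule in the table.
func ParseWarcproxMeta(header string) (*WarcproxMeta, error) {
	var meta WarcproxMeta
	if err := json.Unmarshal([]byte(header), &meta); err != nil {
		return nil, fmt.Errorf("parse Warcprox-Meta: %w", err)
	}
	if strings.Contains(meta.WarcPrefix, "/") {
		return nil, errors.New("warc-prefix must not contain slashes")
	}
	return &meta, nil
}

func main() {
	meta, err := ParseWarcproxMeta(`{"warc-prefix":"crawl1","dedup-buckets":{"bucket-a":"rw"}}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(meta.WarcPrefix, meta.DedupBuckets["bucket-a"])
}
```

Unknown fields are ignored by `json.Unmarshal`, so a partial model like this can be extended field by field as later phases land.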
+---
+
+## 7. Filtering and Limits
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| HTTP method filtering | ✓ | ? | 📋 | Phase 8 - --method-filter |
+| Max resource size | ✓ | ✓ | 🚧 | Phase 4 - Config exists, not enforced |
+| Resource truncation | ✓ | ? | 📋 | Phase 8 - When size exceeded |
+| WARC-Truncated header | ✓ | ? | 📋 | Phase 8 - "length" or "time" |
+| Time-based truncation | ✓ | ? | 📋 | Phase 8 - 3 hour timeout |
+| MIME type filtering | ✓ | ? | 📋 | Phase 8 - Via Warcprox-Meta |
+| MIME type REJECT | ✓ | ? | 📋 | Phase 8 - Block matching types |
+| MIME type LIMIT | ✓ | ? | 📋 | Phase 8 - Only allow matching |
+| Do-not-archive flag | ✓ | ? | 📋 | Phase 8 - Skip WARC writing |
+
+---
+
+## 8. Advanced Features
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| WARCPROX_WRITE_RECORD | ✓ | ? | 📋 | Phase 8 - Custom record injection |
+| PUTMETA method | ✓ | - | ❌ | Deprecated in Python |
+| /status endpoint | ✓ | ? | 📋 | Phase 8 - JSON metrics |
+| Status: version/host/port | ✓ | ? | 📋 | Phase 8 - Basic info |
+| Status: thread counts | ✓ | ? | 📋 | Phase 8 - Active requests |
+| Status: queue status | ✓ | ? | 📋 | Phase 8 - Queued URLs |
+| Status: rates | ✓ | ? | 📋 | Phase 8 - URLs/sec, bytes/sec |
+| Status: postfetch chain | ✓ | ? | 📋 | Phase 8 - Pipeline status |
+| Crawl logging | ✓ | ? | 📋 | Phase 8 - Heritrix-style logs |
+| Crawl log format | ✓ | ? | 📋 | Phase 8 - Tab-separated |
+| Crawl log per prefix | ✓ | ? | 📋 | Phase 8 - Separate files |
+| Artificial status codes | ✓ | ? | 📋 | Phase 8 - -2, -6, -8, -404 |
+| Playback proxy | ✓ | - | ❌ | Separate feature, not planned |
+| Playback index DB | ✓ | - | ❌ | For playback proxy |
+| SOCKS proxy support | ✓ | ? | 📋 | Phase 8 - --socks-proxy |
+| Tor .onion support | ✓ | ? | 📋 | Phase 8 - --onion-tor-socks-proxy |
+| TLS fingerprinting | ✓ | ? | 📋 | Phase 8 - --ssl-context chrome/firefox |
+| Unsafe SSL renegotiation | ✓ | ? | 📋 | Phase 8 - --unsafe-legacy-renegotiation |
+| Plugin system | ✓ | - | ❌ | Explicitly excluded per CLAUDE.md |
+| Listener plugins | ✓ | - | ❌ | Use Processor interface instead |
+| Batch processors | ✓ | ✓ | ⚠️ | Via Processor interface |
+| Service registry | ✓ | - | ❌ | RethinkDB-based |
+| Service heartbeat | ✓ | - | ❌ | RethinkDB-based |
+
+---
+
+## 9. Logging and Monitoring
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| Verbose logging | ✓ | ✓ | ✅ | Phase 1 - -v flag |
+| Log levels | ✓ | ✓ | ✅ | Phase 1 - debug/info/warn/error |
+| Structured logging | ✓ | ✓ | ✅ | Phase 1 - log/slog package |
+| Quiet mode | ✓ | ? | 📋 | Phase 8 - -q flag |
+| Trace logging | ✓ | ? | 📋 | Phase 8 - --trace |
+| Custom log levels | ✓ | ? | 📋 | Phase 8 - TRACE, NOTICE |
+| Logging config file | ✓ | ? | 📋 | Phase 8 - YAML format |
+| Sentry integration | ✓ | - | ❌ | External monitoring |
+| Sentry DSN | ✓ | - | ❌ | --sentry-dsn |
+| Sentry traces | ✓ | - | ❌ | Performance monitoring |
+| Sentry profiles | ✓ | - | ❌ | Profiling |
+| Deploy environment | ✓ | - | ❌ | --deploy-environment |
+| Performance profiling | ✓ | ? | 📋 | Phase 8 - --profile cProfile |
+| Thread stack traces | ✓ | ? | 📋 | Phase 8 - SIGQUIT handler |
+---
+
+## 10. Configuration and CLI
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| CLI framework | argparse | cobra | ⚠️ | Different libraries |
+| Address binding | ✓ | ✓ | ✅ | Phase 1 - -b, --address |
+| Port binding | ✓ | ✓ | ✅ | Phase 1 - -p, --port |
+| WARC directory | ✓ | ✓ | ✅ | Phase 4 - -d, --directory |
+| WARC prefix | ✓ | ✓ | ✅ | Phase 4 - --prefix |
+| WARC size limit | ✓ | ✓ | ✅ | Phase 4 - --size |
+| GZIP compression flag | ✓ | ✓ | ✅ | Phase 4 - -z, --gzip |
+| Digest algorithm | ✓ | ✓ | ✅ | Phase 3 - -g, --digest |
+| CA certificate file | ✓ | ✓ | ✅ | Phase 2 - -c, --cacert |
+| Certs directory | ✓ | ✓ | ✅ | Phase 2 - --certs-dir |
+| Dedup DB file | ✓ | ✓ | 🚧 | Phase 4 - Flag exists, not functional |
+| Stats DB file | ✓ | ✓ | 🚧 | Phase 4 - Flag exists, not functional |
+| Socket timeout | ✓ | ✓ | ✅ | Phase 1 - --socket-timeout |
+| Max threads | ✓ | ✓ | ✅ | Phase 1 - --max-threads |
+| Queue size | ✓ | ✓ | ✅ | Phase 3 - --queue-size |
+| Tmp file memory | ✓ | ✓ | ✅ | Phase 3 - --tmp-file-max-memory |
+| Max resource size | ✓ | ✓ | 🚧 | Phase 4 - Flag exists, not enforced |
+| WARC writer threads | ✓ | ✓ | ✅ | Phase 4 - --warc-writer-threads |
+| Version display | ✓ | ✓ | ✅ | Phase 1 - --version |
+| Help display | ✓ | ✓ | ✅ | Phase 1 - --help |
+
+---
+
+## 11. Error Handling and Resilience
+
+| Feature | Python | Go | Status | Notes |
+|---------|--------|----|---------| ------|
+| Graceful shutdown | ✓ | ✓ | ✅ | Phase 1 - SIGINT/SIGTERM |
+| Signal handling | ✓ | ✓ | ✅ | Phase 1 - os/signal package |
+| Pipeline drain | ✓ | ✓ | ✅ | Phase 3 - Wait for queue |
+| WARC flush on shutdown | ✓ | ✓ | ✅ | Phase 4 - Close rotator |
+| Failed URL handling | ✓ | ? | 📋 | Phase 8 - Error recording |
+| Connection drop detection | ✓ | ? | 📋 | Phase 8 - Keep-alive |
+| Panic recovery | ✓ | ? | 📋 | Phase 8 - Goroutine defer |
+| Resource cleanup | ✓ | ✓ | ✅ | Phase 1 - Defer, WaitGroups |
+| Error wrapping | ✓ | ✓ | ✅ | Phase 1 - fmt.Errorf with %w |
+| Context cancellation | ✓ | ✓ | ✅ | Phase 3 - context.Context |
+
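The shutdown-related rows above (SIGINT/SIGTERM handling via `os/signal`, pipeline drain, WaitGroups, `context.Context`) fit together roughly as in the sketch below. It is a generic pattern under stated assumptions, not the gowarcprox implementation: `runPipeline`, the worker count, and the queue size are all hypothetical, and the timeout in `main` exists only so the demo terminates without a signal.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// runPipeline is a hypothetical stand-in for the processing pipeline.
// Workers consume the buffered queue until it is closed and empty, and
// the function returns the number of items processed.
func runPipeline(queue <-chan string) int {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		processed int
	)
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue { // ends once the queue is closed and drained
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return processed
}

func main() {
	// The context is cancelled when SIGINT or SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	queue := make(chan string, 8)
	done := make(chan int, 1)
	go func() { done <- runPipeline(queue) }()

	queue <- "http://example.com/"

	select {
	case <-ctx.Done(): // shutdown requested by signal
	case <-time.After(200 * time.Millisecond): // demo only: exit without a signal
	}

	close(queue) // stop accepting work, let workers drain what is queued
	fmt.Println("drained items:", <-done)
}
```

Closing the queue only after the signal means already-queued items are still processed, which is the "wait for queue" behavior the Pipeline drain row describes.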
+---
+
+## Summary Statistics
+
+**Total Features Analyzed**: ~120
+
+**By Status**:
+- ✅ Implemented: 47 (39%)
+- 🚧 In Progress: 5 (4%)
+- 📋 Planned: 54 (45%)
+- ❌ Not Planned: 14 (12%)
+- ⚠️ Differs: 4 (3%)
+
+**By Phase**:
+- Phase 1-4 (Complete): 47 features ✅
+- Phase 5 (Deduplication): ~12 features 📋
+- Phase 6 (Statistics): ~10 features 📋
+- Phase 7 (Warcprox-Meta): ~15 features 📋
+- Phase 8 (Advanced): ~25 features 📋
+
+**Core Strengths**:
+- HTTP/HTTPS proxy fully functional
+- WARC writing complete with gowarc integration
+- Digest calculation supports multiple algorithms
+- Pipeline architecture clean and extensible
+- Graceful shutdown properly implemented
+
+**Major Gaps**:
+1. Deduplication - Critical feature, completely missing (Phase 5)
+2. Statistics - Important for monitoring, completely missing (Phase 6)
+3. Warcprox-Meta - Advanced feature, model exists but parsing missing (Phase 7)
+4. Advanced features - Many quality-of-life features pending (Phase 8)
+
+---
+
+## Implementation Notes
+
+### Go-Specific Advantages
+- **Goroutines**: More efficient than Python thread pools
+- **Channels**: Clean producer-consumer pattern
+- **Type Safety**: Compile-time error detection
+- **Performance**: Generally faster execution
+- **gowarc Library**: Well-maintained WARC implementation
+
+### Python-Specific Features Not Planned
+- **RethinkDB Integration**: Excluded per project requirements
+- **Plugin System**: Different architecture approach in Go
+- **External Service Integration**: CDX server, Trough, etc.
+- **Sentry Monitoring**: Not core functionality
+
+### Testing Strategy
+- Unit tests for all implemented features
+- Functional tests comparing Python and Go output
+- WARC file byte-for-byte comparison (where applicable)
+- Payload digest validation (must match exactly)
+- Coverage target: 70% overall, 75%+ for critical packages
+
+---
+
+## Next Steps
+
+**Before Phase 5**:
+1. Complete unit test suite for Phases 1-4
+2. Build WARC comparison tools
+3. Write functional parity tests
+4. Achieve 70% test coverage
+5. Validate all tests pass
+
+**Phase 5 Priority**:
+- SQLite deduplication database
+- Revisit record generation
+- Dedup bucket support (basic)
+
+**Phase 6 Priority**:
+- SQLite statistics database
+- Basic stats tracking (total/new/revisit)
+- Stats bucket support
+
+**Phase 7 Priority**:
+- Warcprox-Meta JSON parsing
+- Custom WARC prefix support
+- Bucket configuration (dedup + stats)
+
+**Phase 8 Priority**:
+- /status endpoint
+- Limits enforcement
+- WARCPROX_WRITE_RECORD
+- Resource size truncation