Feature Parity: Python warcprox vs Go gowarcprox
Last Updated: 2026-01-07
Go Version: 0.1.0 (Post Phase 4)
Python Version: 2.x (from internetarchive/warcprox)
Implementation Status Legend
- ✅ Implemented: Feature complete and tested
- 🚧 In Progress: Partially implemented
- 📋 Planned: Scheduled for implementation
- ❌ Not Planned: Explicitly excluded (plugins, RethinkDB)
- ⚠️ Differs: Implemented differently in Go
1. Core Proxy Features
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| HTTP proxy | ✅ | ✅ | ✅ | Phase 1 - Full HTTP/1.1 support |
| HTTPS MITM proxy | ✅ | ✅ | ✅ | Phase 2 - Certificate interception |
| CONNECT method handling | ✅ | ✅ | ✅ | Phase 2 - TLS tunneling |
| Certificate authority | ✅ | ✅ | ✅ | Phase 2 - Auto-generates CA |
| Dynamic cert generation | ✅ | ✅ | ✅ | Phase 2 - On-demand per host |
| Wildcard certificates | ✅ | ✅ | ✅ | Phase 2 - *.example.com support |
| Certificate caching | ✅ | ✅ | ✅ | Phase 2 - In-memory with sync.Map |
| Persistent cert storage | ✅ | ✅ | ✅ | Phase 2 - Saved to --certs-dir |
| Socket timeout config | ✅ | ✅ | ✅ | Phase 1 - --socket-timeout flag |
| Max threads/concurrency | ✅ | ✅ | ✅ | Phase 1 - Goroutines, no hard limit |
| Request queue | ✅ | ✅ | ✅ | Phase 3 - Buffered channels |
| Connection pooling | ✅ | ? | 📋 | Phase 8 - Reuse backend connections |
| Bad hostname cache | ✅ | ? | 📋 | Phase 8 - TTL cache for failed hosts |
| Thread pool executor | ✅ | N/A | ⚠️ | Go uses goroutines, not thread pools |
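
The "Certificate caching" row mentions an in-memory cache built on sync.Map. The following is a minimal sketch of that pattern, not gowarcprox's actual code: the `certCache` type, the `getOrCreate` method, and the string placeholder standing in for a certificate are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// certCache is a hypothetical per-host cache. The value is a placeholder
// string here; a real proxy would more likely store a *tls.Certificate.
type certCache struct {
	m sync.Map // host -> string
}

// getOrCreate returns the cached entry for host and calls gen only on a miss.
// If two goroutines miss at the same time, gen may run twice, but
// LoadOrStore guarantees that both callers receive the same stored value.
func (c *certCache) getOrCreate(host string, gen func(string) string) string {
	if v, ok := c.m.Load(host); ok {
		return v.(string)
	}
	v, _ := c.m.LoadOrStore(host, gen(host))
	return v.(string)
}

func main() {
	var cache certCache
	calls := 0
	gen := func(host string) string {
		calls++
		return "cert-for-" + host
	}
	fmt.Println(cache.getOrCreate("example.com", gen))
	fmt.Println(cache.getOrCreate("example.com", gen))
	fmt.Println("generator calls:", calls)
}
```

In this single-goroutine run the generator is invoked once and the second lookup is served from the cache. sync.Map suits this workload because entries are written once per host and read many times.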
2. WARC Writing Features
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| WARC 1.1 format | ✅ | ✅ | ✅ | Via gowarc library |
| Request records | ✅ | ✅ | ✅ | Phase 4 - application/http; msgtype=request |
| Response records | ✅ | ✅ | ✅ | Phase 4 - application/http; msgtype=response |
| Warcinfo records | ✅ | ✅ | ✅ | Phase 4 - Via gowarc rotator |
| Metadata records | ✅ | ? | 📋 | Phase 8 - WARCPROX_WRITE_RECORD method |
| Resource records | ✅ | ? | 📋 | Phase 8 - Custom WARC-Type |
| Revisit records | ✅ | ? | 📋 | Phase 5 - Deduplication required |
| WARC file rotation | ✅ | ✅ | ✅ | Phase 4 - Via gowarc rotator |
| Size-based rollover | ✅ | ✅ | ✅ | Phase 4 - --size flag (default 1GB) |
| Time-based rollover | ✅ | ? | 📋 | Phase 8 - --rollover-idle-time |
| GZIP compression | ✅ | ✅ | ✅ | Phase 4 - --gzip flag |
| ZSTD compression | ? | ✅ | ⚠️ | gowarc supports, not tested |
| Custom WARC prefix | ✅ | ✅ | ✅ | Phase 4 - --prefix flag |
| Custom WARC filename | ✅ | ? | 📋 | Phase 8 - Template with variables |
| Subdir by prefix | ✅ | ? | 📋 | Phase 8 - --subdir-prefix |
| .open suffix | ✅ | ✅ | ✅ | Phase 4 - Handled by gowarc |
| File locking | ✅ | ? | 📋 | Phase 8 - When --no-warc-open-suffix |
| Multi-prefix support | ✅ | ? | 📋 | Phase 7 - WARC writer pool |
| Special prefix "-" | ✅ | ? | 📋 | Phase 7 - Disable archiving |
| Serial number tracking | ✅ | ✅ | ✅ | Phase 4 - Via gowarc |
| Random token | ✅ | ✅ | ✅ | Phase 4 - Via gowarc |
| WARC-Record-ID (UUID) | ✅ | ✅ | ✅ | Phase 4 - UUID v4 |
| WARC-Date (RFC3339) | ✅ | ✅ | ✅ | Phase 4 - Proper timestamp |
| WARC-Target-URI | ✅ | ✅ | ✅ | Phase 4 - Full URL |
| WARC-IP-Address | ✅ | ✅ | ✅ | Phase 4 - Remote server IP |
| WARC-Payload-Digest | ✅ | ✅ | ✅ | Phase 3 - SHA1/SHA256/BLAKE3 |
| WARC-Block-Digest | ✅ | ? | 📋 | Phase 8 - Full HTTP block digest |
| WARC-Concurrent-To | ✅ | ✅ | ✅ | Phase 4 - Links request to response |
| WARC-Refers-To | ✅ | ? | 📋 | Phase 5 - For revisit records |
| WARC-Truncated | ✅ | ? | 📋 | Phase 8 - When size exceeded |
3. Digest Calculation
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| SHA1 digest | ✅ | ✅ | ✅ | Phase 3 - Default algorithm |
| SHA256 digest | ✅ | ✅ | ✅ | Phase 3 - --digest sha256 |
| BLAKE3 digest | ? | ✅ | ⚠️ | Go addition via gowarc |
| MD5 digest | ✅ | ? | 📋 | Python supports via hashlib |
| Other hash algorithms | ✅ | ? | 📋 | Python supports all hashlib |
| Base32 encoding | ✅ | ✅ | ✅ | Phase 3 - For SHA digests |
| Hex encoding | ✅ | ✅ | ✅ | Phase 3 - Default for BLAKE3 |
| Payload digest | ✅ | ✅ | ✅ | Phase 3 - Response body only |
| Block digest | ✅ | ? | 📋 | Phase 8 - Full HTTP block |
| Digest format | ✅ | ✅ | ✅ | Phase 3 - "algorithm:hash" |
| Configurable algorithm | ✅ | ✅ | ✅ | Phase 3 - --digest flag |
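
The table describes payload digests in the form "algorithm:hash", with Base32 encoding for the SHA family. A rough standard-library sketch of that formatting is shown here. The function name `payloadDigest` is hypothetical, BLAKE3 is omitted because it needs a third-party package, and the use of padded standard Base32 is an assumption rather than something the table states.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// payloadDigest hashes the response body only and formats the result as
// "algorithm:hash", using standard (padded) Base32 for the hash part.
func payloadDigest(algorithm string, payload []byte) (string, error) {
	var sum []byte
	switch algorithm {
	case "sha1":
		s := sha1.Sum(payload)
		sum = s[:]
	case "sha256":
		s := sha256.Sum256(payload)
		sum = s[:]
	default:
		return "", fmt.Errorf("unsupported digest algorithm %q", algorithm)
	}
	return algorithm + ":" + base32.StdEncoding.EncodeToString(sum), nil
}

func main() {
	// An empty payload yields the familiar "empty body" SHA-1 digest.
	d, err := payloadDigest("sha1", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
}
```

A 20-byte SHA-1 sum encodes to exactly 32 Base32 characters, so no padding appears. A 32-byte SHA-256 sum does end in padding characters under this encoding, which matters when comparing digests byte for byte between the two implementations.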
4. Deduplication
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| SQLite dedup DB | ✅ | ? | 📋 | Phase 5 - warcprox.sqlite |
| Dedup table schema | ✅ | ? | 📋 | Phase 5 - key/value pairs |
| Dedup lookup | ✅ | ? | 📋 | Phase 5 - Before writing |
| Dedup storage | ✅ | ? | 📋 | Phase 5 - After writing |
| Revisit record creation | ✅ | ? | 📋 | Phase 5 - identical-payload-digest |
| Dedup buckets (RW mode) | ✅ | ? | 📋 | Phase 7 - Read-write buckets |
| Dedup buckets (RO mode) | ✅ | ? | 📋 | Phase 7 - Read-only buckets |
| Multiple buckets | ✅ | ? | 📋 | Phase 7 - Per-request config |
| Default bucket | ✅ | ? | 📋 | Phase 5 - unspecified |
| Bucket key format | ✅ | ? | 📋 | Phase 5 - "digest\|bucket" |
| Min text size threshold | ✅ | ? | 📋 | Phase 8 - --dedup-min-text-size |
| Min binary size threshold | ✅ | ? | 📋 | Phase 8 - --dedup-min-binary-size |
| Blackout period | ✅ | ? | 📋 | Phase 8 - --blackout-period |
| Bucket requirement mode | ✅ | ? | 📋 | Phase 8 - --dedup-only-with-bucket |
| Disable deduplication | ✅ | ? | 📋 | Phase 5 - --dedup-db=/dev/null |
| WAL mode | ✅ | ? | 📋 | Phase 5 - Better concurrency |
| CDX server dedup | ✅ | - | ❌ | External service |
| RethinkDB dedup | ✅ | - | ❌ | Excluded per CLAUDE.md |
| Trough dedup | ✅ | - | ❌ | External system |
| RethinkDB big table | ✅ | - | ❌ | Excluded per CLAUDE.md |
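
The planned flow in this table (look up before writing, store after writing, key on digest plus bucket) can be sketched with an in-memory map standing in for the SQLite key/value table. Everything named here (`dedupKey`, `dedupEntry`, `memDedup`) is hypothetical, the stored fields are an assumption about what a revisit record needs, and the map is not safe for concurrent use.

```go
package main

import "fmt"

// dedupKey joins digest and bucket in the "digest|bucket" form from the table.
func dedupKey(digest, bucket string) string {
	return digest + "|" + bucket
}

// dedupEntry is a hypothetical value: enough to point a revisit record back
// at the original capture.
type dedupEntry struct {
	RecordID string
	URL      string
	Date     string
}

// memDedup is an in-memory stand-in for the planned SQLite table.
type memDedup struct {
	entries map[string]dedupEntry
}

func newMemDedup() *memDedup {
	return &memDedup{entries: make(map[string]dedupEntry)}
}

// lookup runs before writing: a hit means a revisit record can be written.
func (d *memDedup) lookup(digest, bucket string) (dedupEntry, bool) {
	e, ok := d.entries[dedupKey(digest, bucket)]
	return e, ok
}

// store runs after a full response record has been written.
func (d *memDedup) store(digest, bucket string, e dedupEntry) {
	d.entries[dedupKey(digest, bucket)] = e
}

func main() {
	db := newMemDedup()
	digest := "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"
	bucket := "unspecified" // default bucket name from the table

	for i := 0; i < 2; i++ {
		if orig, ok := db.lookup(digest, bucket); ok {
			fmt.Println("revisit of", orig.RecordID)
			continue
		}
		fmt.Println("new capture, writing response record")
		db.store(digest, bucket, dedupEntry{
			RecordID: "<urn:uuid:example>", // placeholder, not a real ID
			URL:      "http://example.com/",
			Date:     "2026-01-07T00:00:00Z",
		})
	}
}
```

The first iteration reports a new capture and the second a revisit. Because the bucket is part of the key, the same digest in a different bucket is treated as a new capture.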
5. Statistics Tracking
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| SQLite stats DB | ✅ | ? | 📋 | Phase 6 - warcprox.sqlite |
| Stats table schema | ✅ | ? | 📋 | Phase 6 - buckets_of_stats |
| Basic URL stats | ✅ | ? | 📋 | Phase 6 - Count, bytes |
| Stats categories | ✅ | ? | 📋 | Phase 6 - total/new/revisit |
| Wire bytes tracking | ✅ | ? | 📋 | Phase 6 - Including HTTP headers |
| Stats buckets | ✅ | ? | 📋 | Phase 7 - Custom bucket names |
| Default buckets | ✅ | ? | 📋 | Phase 6 - all, unspecified |
| Domain tallying | ✅ | ? | 📋 | Phase 7 - Per-domain sub-buckets |
| Batch updates | ✅ | ? | 📋 | Phase 6 - Efficiency |
| JSON storage | ✅ | ? | 📋 | Phase 6 - Stats as JSON blob |
| Disable statistics | ✅ | ? | 📋 | Phase 6 - --stats-db=/dev/null |
| Running stats | ✅ | ? | 📋 | Phase 8 - In-memory snapshots |
| Stats API query | ✅ | ? | 📋 | Phase 8 - Via /status endpoint |
| RethinkDB stats | ✅ | - | ❌ | Excluded per CLAUDE.md |
6. Warcprox-Meta Header
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| Header detection | ✅ | ✅ | 🚧 | Phase 3 - Detected but not parsed |
| JSON header parsing | ✅ | ? | 📋 | Phase 7 - Full implementation |
| Custom warc-prefix | ✅ | ? | 📋 | Phase 7 - Per-request override |
| Prefix validation | ✅ | ? | 📋 | Phase 7 - No slashes allowed |
| Dedup bucket override | ✅ | ? | 📋 | Phase 7 - dedup-buckets field |
| Stats bucket assignment | ✅ | ? | 📋 | Phase 7 - stats.buckets field |
| Stats domain tallying | ✅ | ? | 📋 | Phase 7 - tally-domains list |
| Hard limits (420) | ✅ | ? | 📋 | Phase 8 - limits field |
| Soft limits (430) | ✅ | ? | 📋 | Phase 8 - soft-limits field |
| URL blocking rules | ✅ | ? | 📋 | Phase 8 - blocks array |
| Compressed blocks | ✅ | ? | 📋 | Phase 8 - Zlib + base64 |
| Metadata inclusion | ✅ | ? | 📋 | Phase 8 - metadata field |
| Accept flags | ✅ | ? | 📋 | Phase 8 - accept array |
| Capture metadata | ✅ | ? | 📋 | Phase 8 - Timestamp in response |
| Backward compatibility | ✅ | ? | 📋 | Phase 7 - captures-bucket, dedup-bucket |
| MIME type filters | ✅ | ? | 📋 | Phase 8 - mime-type-filters |
| Dedup-ok flag | ✅ | - | ❌ | RethinkDB big table only |
| Captures table extras | ✅ | - | ❌ | RethinkDB big table only |
7. Filtering and Limits
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| HTTP method filtering | ✅ | ? | 📋 | Phase 8 - --method-filter |
| Max resource size | ✅ | ✅ | 🚧 | Phase 4 - Config exists, not enforced |
| Resource truncation | ✅ | ? | 📋 | Phase 8 - When size exceeded |
| WARC-Truncated header | ✅ | ? | 📋 | Phase 8 - "length" or "time" |
| Time-based truncation | ✅ | ? | 📋 | Phase 8 - 3 hour timeout |
| MIME type filtering | ✅ | ? | 📋 | Phase 8 - Via Warcprox-Meta |
| MIME type REJECT | ✅ | ? | 📋 | Phase 8 - Block matching types |
| MIME type LIMIT | ✅ | ? | 📋 | Phase 8 - Only allow matching |
| Do-not-archive flag | ✅ | ? | 📋 | Phase 8 - Skip WARC writing |
8. Advanced Features
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| WARCPROX_WRITE_RECORD | ✅ | ? | 📋 | Phase 8 - Custom record injection |
| PUTMETA method | ✅ | - | ❌ | Deprecated in Python |
| /status endpoint | ✅ | ? | 📋 | Phase 8 - JSON metrics |
| Status: version/host/port | ✅ | ? | 📋 | Phase 8 - Basic info |
| Status: thread counts | ✅ | ? | 📋 | Phase 8 - Active requests |
| Status: queue status | ✅ | ? | 📋 | Phase 8 - Queued URLs |
| Status: rates | ✅ | ? | 📋 | Phase 8 - URLs/sec, bytes/sec |
| Status: postfetch chain | ✅ | ? | 📋 | Phase 8 - Pipeline status |
| Crawl logging | ✅ | ? | 📋 | Phase 8 - Heritrix-style logs |
| Crawl log format | ✅ | ? | 📋 | Phase 8 - Tab-separated |
| Crawl log per prefix | ✅ | ? | 📋 | Phase 8 - Separate files |
| Artificial status codes | ✅ | ? | 📋 | Phase 8 - -2, -6, -8, -404 |
| Playback proxy | ✅ | - | ❌ | Separate feature, not planned |
| Playback index DB | ✅ | - | ❌ | For playback proxy |
| SOCKS proxy support | ✅ | ? | 📋 | Phase 8 - --socks-proxy |
| Tor .onion support | ✅ | ? | 📋 | Phase 8 - --onion-tor-socks-proxy |
| TLS fingerprinting | ✅ | ? | 📋 | Phase 8 - --ssl-context chrome/firefox |
| Unsafe SSL renegotiation | ✅ | ? | 📋 | Phase 8 - --unsafe-legacy-renegotiation |
| Plugin system | ✅ | - | ❌ | Explicitly excluded per CLAUDE.md |
| Listener plugins | ✅ | - | ❌ | Use Processor interface instead |
| Batch processors | ✅ | ✅ | ⚠️ | Via Processor interface |
| Service registry | ✅ | - | ❌ | RethinkDB-based |
| Service heartbeat | ✅ | - | ❌ | RethinkDB-based |
9. Logging and Monitoring
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| Verbose logging | ✅ | ✅ | ✅ | Phase 1 - -v flag |
| Log levels | ✅ | ✅ | ✅ | Phase 1 - debug/info/warn/error |
| Structured logging | ✅ | ✅ | ✅ | Phase 1 - log/slog package |
| Quiet mode | ✅ | ? | 📋 | Phase 8 - -q flag |
| Trace logging | ✅ | ? | 📋 | Phase 8 - --trace |
| Custom log levels | ✅ | ? | 📋 | Phase 8 - TRACE, NOTICE |
| Logging config file | ✅ | ? | 📋 | Phase 8 - YAML format |
| Sentry integration | ✅ | - | ❌ | External monitoring |
| Sentry DSN | ✅ | - | ❌ | --sentry-dsn |
| Sentry traces | ✅ | - | ❌ | Performance monitoring |
| Sentry profiles | ✅ | - | ❌ | Profiling |
| Deploy environment | ✅ | - | ❌ | --deploy-environment |
| Performance profiling | ✅ | ? | 📋 | Phase 8 - --profile cProfile |
| Thread stack traces | ✅ | ? | 📋 | Phase 8 - SIGQUIT handler |
10. Configuration and CLI
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| CLI framework | argparse | cobra | ⚠️ | Different libraries |
| Address binding | ✅ | ✅ | ✅ | Phase 1 - -b, --address |
| Port binding | ✅ | ✅ | ✅ | Phase 1 - -p, --port |
| WARC directory | ✅ | ✅ | ✅ | Phase 4 - -d, --directory |
| WARC prefix | ✅ | ✅ | ✅ | Phase 4 - --prefix |
| WARC size limit | ✅ | ✅ | ✅ | Phase 4 - --size |
| GZIP compression flag | ✅ | ✅ | ✅ | Phase 4 - -z, --gzip |
| Digest algorithm | ✅ | ✅ | ✅ | Phase 3 - -g, --digest |
| CA certificate file | ✅ | ✅ | ✅ | Phase 2 - -c, --cacert |
| Certs directory | ✅ | ✅ | ✅ | Phase 2 - --certs-dir |
| Dedup DB file | ✅ | ✅ | 🚧 | Phase 4 - Flag exists, not functional |
| Stats DB file | ✅ | ✅ | 🚧 | Phase 4 - Flag exists, not functional |
| Socket timeout | ✅ | ✅ | ✅ | Phase 1 - --socket-timeout |
| Max threads | ✅ | ✅ | ✅ | Phase 1 - --max-threads |
| Queue size | ✅ | ✅ | ✅ | Phase 3 - --queue-size |
| Tmp file memory | ✅ | ✅ | ✅ | Phase 3 - --tmp-file-max-memory |
| Max resource size | ✅ | ✅ | 🚧 | Phase 4 - Flag exists, not enforced |
| WARC writer threads | ✅ | ✅ | ✅ | Phase 4 - --warc-writer-threads |
| Version display | ✅ | ✅ | ✅ | Phase 1 - --version |
| Help display | ✅ | ✅ | ✅ | Phase 1 - --help |
11. Error Handling and Resilience
| Feature | Python | Go | Status | Notes |
|---|---|---|---|---|
| Graceful shutdown | ✅ | ✅ | ✅ | Phase 1 - SIGINT/SIGTERM |
| Signal handling | ✅ | ✅ | ✅ | Phase 1 - os/signal package |
| Pipeline drain | ✅ | ✅ | ✅ | Phase 3 - Wait for queue |
| WARC flush on shutdown | ✅ | ✅ | ✅ | Phase 4 - Close rotator |
| Failed URL handling | ✅ | ? | 📋 | Phase 8 - Error recording |
| Connection drop detection | ✅ | ? | 📋 | Phase 8 - Keep-alive |
| Panic recovery | ✅ | ? | 📋 | Phase 8 - Goroutine defer |
| Resource cleanup | ✅ | ✅ | ✅ | Phase 1 - Defer, WaitGroups |
| Error wrapping | ✅ | ✅ | ✅ | Phase 1 - fmt.Errorf with %w |
| Context cancellation | ✅ | ✅ | ✅ | Phase 3 - context.Context |
Summary Statistics
Total Features Analyzed: ~120
By Status:
- ✅ Implemented: 47 (39%)
- 🚧 In Progress: 5 (4%)
- 📋 Planned: 54 (45%)
- ❌ Not Planned: 14 (12%)
- ⚠️ Differs: 4 (~3%)
By Phase:
- Phase 1-4 (Complete): 47 features ✅
- Phase 5 (Deduplication): ~12 features 📋
- Phase 6 (Statistics): ~10 features 📋
- Phase 7 (Warcprox-Meta): ~15 features 📋
- Phase 8 (Advanced): ~25 features 📋
Core Strengths:
- HTTP/HTTPS proxy fully functional
- WARC writing complete with gowarc integration
- Digest calculation supports multiple algorithms
- Pipeline architecture clean and extensible
- Graceful shutdown properly implemented
Major Gaps:
- Deduplication - Critical feature, completely missing (Phase 5)
- Statistics - Important for monitoring, completely missing (Phase 6)
- Warcprox-Meta - Advanced feature, model exists but parsing missing (Phase 7)
- Advanced features - Many quality-of-life features pending (Phase 8)
Implementation Notes
Go-Specific Advantages
- Goroutines: More efficient than Python thread pools
- Channels: Clean producer-consumer pattern
- Type Safety: Compile-time error detection
- Performance: Generally faster execution
- gowarc Library: Well-maintained WARC implementation
Python-Specific Features Not Planned
- RethinkDB Integration: Excluded per project requirements
- Plugin System: Different architecture approach in Go
- External Service Integration: CDX server, Trough, etc.
- Sentry Monitoring: Not core functionality
Testing Strategy
- Unit tests for all implemented features
- Functional tests comparing Python and Go output
- WARC file byte-for-byte comparison (where applicable)
- Payload digest validation (must match exactly)
- Coverage target: 70% overall, 75%+ for critical packages
Next Steps
Before Phase 5:
- Complete unit test suite for Phases 1-4
- Build WARC comparison tools
- Write functional parity tests
- Achieve 70% test coverage
- Validate all tests pass
Phase 5 Priority:
- SQLite deduplication database
- Revisit record generation
- Dedup bucket support (basic)
Phase 6 Priority:
- SQLite statistics database
- Basic stats tracking (total/new/revisit)
- Stats bucket support
Phase 7 Priority:
- Warcprox-Meta JSON parsing
- Custom WARC prefix support
- Bucket configuration (dedup + stats)
Phase 8 Priority:
- /status endpoint
- Limits enforcement
- WARCPROX_WRITE_RECORD
- Resource size truncation