Research
This document is auto-generated from the README in the research folder.
Research outputs represent the discovery phase of the NewsFork pipeline.
Philosophy
“Research Engine discovers WHERE to look.”
Research doesn’t judge content quality, fetch methods, or validity. It simply discovers URLs that might be relevant news sources.
Structure
```
research/
├── datasets/                # Research datasets (immutable snapshots)
│   └── country=sg/
│       └── category=news/
│           ├── 2026-01-24_0001.json
│           └── 2026-01-25_0001.json
├── blocked/                 # Blocked domains (403, captcha)
│   └── country=sg/
│       └── 2026-01-24.json
├── dead/                    # Dead domains (DNS fail, unreachable)
│   └── country=sg/
│       └── 2026-01-24.json
├── liveness/                # Liveness check results
│   └── country=sg/
│       └── 2026-01-24.json
└── README.md
```
Storage
Research data is stored in Cloudflare R2 for production use:
- Primary Storage: R2 Bucket (DATASETS_BUCKET)
- Path Format: research/datasets/country={code}/category={type}/{date}_{chunk}.json
- Backup/Audit: GitHub (via metadata sync)
Path Convention
Uses Hive-style partitioning for compatibility with data lake tools:
```
research/datasets/country=sg/category=news/2026-01-25_0001.json
research/liveness/country=sg/2026-01-25.json
research/blocked/country=sg/2026-01-25.json
research/dead/country=sg/2026-01-25.json
```
This format works with:
- BigQuery
- Delta Lake
- AWS Athena
- Cloudflare R2
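To make the convention concrete, here is a minimal TypeScript sketch of how such a partition path can be assembled. The helper name, parameters, and zero-padded chunk format are illustrative assumptions, not the codebase's createDatasetPath implementation.

```ts
// Illustrative sketch of Hive-style partition path construction.
// The function name, parameters, and zero-padded chunk format are assumptions
// based on the path examples above, not the actual createDatasetPath code.
function buildDatasetPath(
  country: string,   // e.g. "sg"
  category: string,  // e.g. "news"
  date: string,      // e.g. "2026-01-25"
  chunk: number      // e.g. 1 -> "0001"
): string {
  const chunkId = String(chunk).padStart(4, "0");
  return `research/datasets/country=${country}/category=${category}/${date}_${chunkId}.json`;
}

// buildDatasetPath("sg", "news", "2026-01-25", 1)
// => "research/datasets/country=sg/category=news/2026-01-25_0001.json"
```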
Enhanced Research Dataset Schema
```json
{
  "meta": {
    "dataset_id": "sg-news-2026-01-25-0001",
    "country": "SG",
    "category": "news",
    "discovered_at": "2026-01-25T03:12:00Z",
    "research_methods": ["google_search", "crtsh"],
    "queries": ["Singapore government news site:.gov.sg"],
    "engine": { "name": "research-engine", "version": "1.0.0" },
    "record_count": 8
  },
  "records": [
    {
      "raw_url": "https://www.mom.gov.sg/newsroom",
      "normalized_domain": "mom.gov.sg",
      "domain_id": "gov:sg:mom.gov.sg",
      "source_type": "gov",
      "discovery_method": "google_search",
      "confidence": 0.95,
      "content_hints": ["news", "government_content"]
    }
  ]
}
```
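For orientation, the same schema can be written out as TypeScript types. This is a sketch derived from the JSON example above, not the project's actual type definitions.

```ts
// Illustrative types derived from the example dataset above;
// the real definitions in the codebase may be stricter or more detailed.
interface ResearchDatasetMeta {
  dataset_id: string;
  country: string;
  category: string;
  discovered_at: string;        // ISO 8601 timestamp
  research_methods: string[];   // e.g. ["google_search", "crtsh"]
  queries: string[];
  engine: { name: string; version: string };
  record_count: number;
}

interface ResearchRecord {
  raw_url: string;
  normalized_domain: string;
  domain_id: string;            // e.g. "gov:sg:mom.gov.sg"
  source_type: string;
  discovery_method: string;
  confidence: number;           // 0..1
  content_hints: string[];
}

interface ResearchDataset {
  meta: ResearchDatasetMeta;
  records: ResearchRecord[];
}
```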
What Research Does
- ✅ Discover URLs via multiple methods (Google Search, crt.sh, etc.)
- ✅ Normalize domains and generate domain_id
- ✅ Check liveness (Phase 1-A)
- ✅ Create immutable dataset snapshots
- ✅ Track blocked/dead domains separately
- ✅ Store datasets in R2 for fast access
- ✅ Update metadata in D1 for querying
What Research Does NOT Do
- ❌ Determine content type (RSS/HTML/API)
- ❌ Classify content nature
- ❌ Extract metadata
- ❌ Create seed contracts
These are the Seed Engine’s responsibilities.
Pipeline Flow
```
[API Request]
      │
      │ POST /api/v1/queues/research
      ▼
[Queue Batch Creation]
      │
      │ Batch metadata stored in D1
      ▼
[Queue Consumer]
      │
      │ Process messages in batches
      ▼
[Domain Functions]
      │
      │ discoverUrlsFromSource()
      │ createResearchOutput()
      │ generateDatasetId()
      ▼
[Storage]
      │
      ├──→ R2 (Raw datasets)
      ├──→ D1 (Metadata, batch state)
      └──→ GitHub (Audit trail via sync)
```
API Endpoints
Research Operations
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/research | List research outputs |
| GET | /api/v1/research/index | Get research index |
| GET | /api/v1/research/:country/:category/:date | Get specific research |
| GET | /api/v1/research/:country/:category/today | Get today’s research |
| POST | /api/v1/research | Create research output |
Queue Operations
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/queues/research | Create research batch |
| GET | /api/v1/queues/batch/:batchId | Get batch status |
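As an illustration of how a client might drive these two endpoints together, the TypeScript sketch below creates a batch and then polls its status. The response fields (batchId, status) and the polling loop are assumptions; the actual response contract is not documented here.

```ts
// Illustrative client flow: create a research batch, then poll its status.
// Field names (batchId, status) are assumptions, not a documented contract.
const API = "https://api.example.com";

async function runResearchBatch(urls: string[]): Promise<void> {
  const createRes = await fetch(`${API}/api/v1/queues/research`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ country: "SG", category: "news", urls, chunkSize: 100 }),
  });
  const { batchId } = await createRes.json() as { batchId: string };

  // Poll batch status until the queue consumer has processed all messages.
  let status = "pending";
  while (status !== "completed" && status !== "failed") {
    await new Promise((resolve) => setTimeout(resolve, 5_000));
    const statusRes = await fetch(`${API}/api/v1/queues/batch/${batchId}`);
    ({ status } = await statusRes.json() as { status: string });
  }
  console.log(`Batch ${batchId} finished with status: ${status}`);
}
```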
Queue Processing
Research queue processing:
- Batch Creation: POST /api/v1/queues/research with URLs
- Queue Consumer: Automatically processes messages
- URL Discovery: Domain functions discover and normalize URLs
- Dataset Creation: Creates immutable dataset snapshots
- Storage: Saves to R2 and updates D1 metadata
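The consumer side of this flow follows the standard Cloudflare Queues handler shape. The sketch below is illustrative only: the ResearchMessage fields, the Env bindings, and the processResearchMessage helper are assumptions, not the project's actual code.

```ts
// Simplified sketch of a Cloudflare Queues consumer for the research queue.
// Message body shape, Env bindings, and processResearchMessage are assumptions.
// Types such as MessageBatch, R2Bucket, and D1Database come from @cloudflare/workers-types.
interface ResearchMessage {
  country: string;
  category: string;
  urls: string[];
}

interface Env {
  DATASETS_BUCKET: R2Bucket;
  DB: D1Database;
}

// Hypothetical helper wrapping the domain functions and storage writes.
declare function processResearchMessage(msg: ResearchMessage, env: Env): Promise<void>;

export default {
  async queue(batch: MessageBatch<ResearchMessage>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await processResearchMessage(message.body, env);
        message.ack();   // success: remove the message from the queue
      } catch {
        message.retry(); // retried up to max_retries, then sent to the DLQ
      }
    }
  },
};
```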
Queue Configuration
Research queue settings (from wrangler.jsonc):
{ "queue": "newsfork-research-staging", "max_batch_size": 10, "max_batch_timeout": 30, "max_retries": 3, "dead_letter_queue": "newsfork-dlq-staging"}Service Layer
Service Layer
The Research Service (src/services/research.service.ts) orchestrates:
- Domain Functions: Pure business logic (no Cloudflare dependencies)
- Infra Adapters: Cloudflare R2, D1, GitHub storage
- Queue Integration: Batch processing via Cloudflare Queues
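As a purely hypothetical illustration of this layering (none of the names below come from the codebase; the real entry points are listed under Key Functions):

```ts
// Hypothetical illustration of the domain/infra split described above;
// these names are not from the codebase.
type UrlRecord = { rawUrl: string; normalizedDomain: string };

// Domain function: pure and synchronous, so it is trivially unit-testable.
function normalizeUrls(rawUrls: string[]): UrlRecord[] {
  return rawUrls.map((rawUrl) => ({
    rawUrl,
    normalizedDomain: new URL(rawUrl).hostname.replace(/^www\./, ""),
  }));
}

// Infra adapter: the only place that touches Cloudflare bindings.
interface DatasetStore {
  put(path: string, body: unknown): Promise<void>;
}

// Service: orchestrates domain logic and infra, but contains neither.
async function saveDiscoveredUrls(store: DatasetStore, rawUrls: string[], path: string): Promise<void> {
  const records = normalizeUrls(rawUrls);
  await store.put(path, { records });
}
```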
Key Functions
```ts
// Domain functions (pure, testable)
discoverUrlsFromSource(input: DiscoverUrlsInput): DiscoverUrlsOutput
createResearchOutput(...): ResearchOutput
generateDatasetId(...): string
createDatasetPath(...): string

// Service layer (orchestrates domain + infra)
ResearchService.list(params): Promise<ResearchListResult>
ResearchService.get(country, category, date): Promise<ResearchOutput>
ResearchService.create(request): Promise<ResearchOutput>
```
Liveness Checks
Liveness checks are performed separately via the Liveness Queue:
- Queue Creation: POST /api/v1/queues/liveness with domains
- Health Check: Domain functions check domain accessibility
- Result Storage: Saves to research/liveness/ in R2
- Status Update: Updates dataset liveness status in D1
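A minimal sketch of what a single liveness probe could look like; the status names mirror the blocked/dead/liveness folders above, but the classification rules and the timeout are assumptions.

```ts
// Illustrative liveness probe; classification rules and timeout are assumptions
// mirroring the research/liveness, research/blocked, and research/dead folders.
type LivenessStatus = "live" | "blocked" | "dead";

async function checkDomain(domain: string): Promise<LivenessStatus> {
  try {
    const res = await fetch(`https://${domain}/`, {
      method: "HEAD",
      signal: AbortSignal.timeout(10_000), // give up after 10 seconds
    });
    // 403 or rate limiting is treated as "blocked", not "dead".
    if (res.status === 403 || res.status === 429) return "blocked";
    return "live";
  } catch {
    // DNS failure, TLS error, or timeout: the domain is unreachable.
    return "dead";
  }
}
```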
Metadata Sync
Research metadata is synced to GitHub for an audit trail:
- Trigger: Scheduled cron job (every 6 hours) or manual sync
- Endpoint: POST /api/v1/metadata/sync
- Process: D1 metadata → GitHub commit
- Location: metadata/snapshot.json in GitHub
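Conceptually, the sync reduces to serializing the D1 metadata and committing it through the GitHub contents API. The sketch below is an assumption-heavy illustration: the table name, repository coordinates, and the omission of the existing file's sha are placeholders, not the actual implementation.

```ts
// Hedged sketch of the D1 -> GitHub sync; the table name, OWNER/REPO, and
// handling of the existing file's sha are placeholders, not real values.
async function syncMetadataToGitHub(env: { DB: D1Database; GITHUB_TOKEN: string }): Promise<void> {
  // 1. Read the current metadata out of D1.
  const { results } = await env.DB.prepare("SELECT * FROM research_datasets").all();
  const snapshot = JSON.stringify(
    { generated_at: new Date().toISOString(), datasets: results },
    null,
    2,
  );

  // 2. Commit it to metadata/snapshot.json via the GitHub contents API.
  await fetch("https://api.github.com/repos/OWNER/REPO/contents/metadata/snapshot.json", {
    method: "PUT",
    headers: {
      "Authorization": `Bearer ${env.GITHUB_TOKEN}`,
      "Content-Type": "application/json",
      "User-Agent": "newsfork-metadata-sync",
    },
    body: JSON.stringify({
      message: "chore: sync research metadata snapshot",
      content: btoa(snapshot), // the contents API expects base64
      // updating an existing file also requires its current sha; omitted here for brevity
    }),
  });
}
```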
Environment-Specific Paths
Research data paths are prefixed by environment:
- Development: dev/research/datasets/...
- Staging: staging/research/datasets/...
- Production: prod/research/datasets/...
This ensures complete isolation between environments.
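For example, the prefix could be applied once at the storage boundary, as in this small sketch (the ENVIRONMENT binding name is an assumption):

```ts
// Illustrative: prepend the environment prefix at the storage boundary.
// The ENVIRONMENT binding name is an assumption.
function withEnvPrefix(env: { ENVIRONMENT: "dev" | "staging" | "prod" }, path: string): string {
  return `${env.ENVIRONMENT}/${path}`;
}

// withEnvPrefix({ ENVIRONMENT: "staging" }, "research/datasets/country=sg/category=news/2026-01-25_0001.json")
// => "staging/research/datasets/country=sg/category=news/2026-01-25_0001.json"
```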
Data Access
Via API
```bash
# List research outputs
curl "https://api.example.com/api/v1/research?country=SG&category=news"

# Get specific research
curl https://api.example.com/api/v1/research/SG/news/2026-01-25

# Create research batch
curl -X POST https://api.example.com/api/v1/queues/research \
  -H "Content-Type: application/json" \
  -d '{
    "country": "SG",
    "category": "news",
    "urls": ["https://example.com"],
    "chunkSize": 100
  }'
```
Via R2 Direct Access
Research datasets can be accessed directly from R2:
```ts
// Using R2 Storage Service
const r2 = createR2StorageService(env.DATASETS_BUCKET);
const dataset = await r2.getDataset(country, category, date, chunk);
```
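For orientation only, getDataset can be thought of as roughly the following; this is an assumption about what the storage service wraps, not its actual implementation.

```ts
// Rough sketch of what getDataset might do under the hood -- an assumption,
// not the actual implementation of createR2StorageService.
async function getDatasetSketch(
  bucket: R2Bucket,
  country: string,
  category: string,
  date: string,
  chunk: string,
): Promise<unknown | null> {
  const key = `research/datasets/country=${country}/category=${category}/${date}_${chunk}.json`;
  const object = await bucket.get(key);
  return object ? await object.json() : null;
}
```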
Related Documentation
- Project README
- [Seed Engine Guide](/ko/v1/guides/seeds/)
- Environment Guide
- Architecture Guidelines