Building a World-Monitor News Engine for Traders: From ~115 Public Sources to a Daily Brief

Engineering · 5 min read

Introduction

I work at Saras—a platform that helps India’s traders make sense of fragmented market information (advisor performance, sentiment, screeners, and news). Traders live in a firehose: central banks, statistical offices, ministries, exchanges, and global bodies all publish material that can move sectors and commodities. Reading it all, every day, is not human-scale.

So we built an internal news intelligence engine—codenamed SARAS (Strategic Agentic Research and Analysis System) in our stack—that aggregates public web sources across geographies, normalizes them, deduplicates aggressively, and turns raw pages into structured enrichment, trading signals, and a daily chain-of-summaries brief. This post is a technical walkthrough of that system: what problem it solves, how data flows, and where Scrapling fits in the scraping layer.

Scope and ethics: We only ingest content that is already published on public websites (and similar public endpoints). No paywalled bypass, no authenticated “inside” data—just automation and structure on what anyone could open in a browser, at scale.


The problem: attention is the bottleneck

Professional trading research is not only “faster charts.” It is synthesis under time pressure:

  • Policy and data releases are distributed across dozens of official sites.
  • The same story reappears in near-duplicate form across portals.
  • What matters for India-linked equities or commodities is often two hops away (global macro → channel → local sector).

A “World Monitor” for a desk is not another RSS reader. It needs change detection (did this source’s listing actually move?), deduplication (is this the same circular as yesterday?), structured extraction (entities, importance, sentiment), and downstream products (signals + a brief you can scan before the open).


Architecture at a glance

The pipeline is async end-to-end:

  • Scheduled jobs trigger a full run; ScraperFactory picks a strategy per source (static, dynamic, RSS, PDF, document, API).
  • Unchanged pages are short-circuited; new items are deduplicated before storage.
  • Stored articles are enriched with Claude via Pydantic-typed outputs, and high-importance pieces get a signal-extraction pass.
  • Category-parallel summaries are merged into one daily brief.
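The factory step can be sketched as a small registry that maps a source's declared type to a scraping strategy. This is an illustrative sketch, not our production code: `SourceType`, `ScraperFactory`, and the strategy class names are assumptions for this post.

```python
from enum import Enum

class SourceType(str, Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"
    RSS = "rss"
    PDF = "pdf"
    DOCUMENT = "document"
    API = "api"

class ScraperFactory:
    """Resolves a per-source scraping strategy from a registry."""
    _registry: dict = {}

    @classmethod
    def register(cls, source_type):
        def wrap(strategy_cls):
            cls._registry[source_type] = strategy_cls
            return strategy_cls
        return wrap

    @classmethod
    def for_source(cls, source_type):
        # Fall back to the static strategy when a type has no dedicated scraper
        strategy_cls = cls._registry.get(source_type, cls._registry[SourceType.STATIC])
        return strategy_cls()

@ScraperFactory.register(SourceType.STATIC)
class StaticScraper:
    async def fetch(self, url: str) -> str:
        raise NotImplementedError  # wraps the Scrapling path in production

@ScraperFactory.register(SourceType.RSS)
class RssScraper:
    async def fetch(self, url: str) -> str:
        raise NotImplementedError
```

The registry keeps source onboarding cheap: adding a scraper type is one decorated class, and unknown types degrade to the static path instead of failing the run.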


Scraping: why Scrapling is in the hot path

Government and exchange sites are the worst of both worlds: high value and hostile ergonomics—rate limits, bot checks, heavy JavaScript, and occasional Cloudflare challenges. For static-style fetches we standardized on Scrapling, a Python framework that ships fetch tiers from basic HTTP to StealthyFetcher (stealth-oriented retrieval suited to protected pages).

In our stack, StealthyFetcher.async_fetch is used for static scraping and is wrapped in a hard timeout (asyncio.wait_for) so one sticky domain cannot stall the whole run. If Scrapling is unavailable or a fetch fails, we fall back to lighter HTTP paths—operational resilience matters more than purity.
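The timeout-plus-fallback shape looks roughly like this. A minimal sketch: `fetch_with_deadline` and its parameters are names invented for this post, and the fetchers are passed in so the deadline logic stays independent of Scrapling itself.

```python
import asyncio

FETCH_TIMEOUT = 30  # hard per-URL deadline in seconds; tune per domain

async def fetch_with_deadline(url, primary, fallback, timeout=FETCH_TIMEOUT):
    """Try the stealth fetcher first; on timeout or error, fall back to a
    plain HTTP path so one sticky domain cannot stall the whole run."""
    try:
        return await asyncio.wait_for(primary(url), timeout=timeout)
    except asyncio.TimeoutError:
        pass  # stealth fetch hung past the deadline
    except Exception:
        pass  # stealth fetch failed outright
    return await fallback(url)

# In production the primary would be Scrapling's stealth path, e.g.:
#   from scrapling.fetchers import StealthyFetcher
#   html = await fetch_with_deadline(url, StealthyFetcher.async_fetch, plain_http_get)
```

Passing the fetchers as coroutines also makes the deadline behavior trivially testable with fakes, without touching the network.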

Credit where it’s due: Scrapling is maintained by the community around D4Vinci/Scrapling on GitHub; docs live on Read the Docs. We are not affiliated—we’re a production user grateful for a focused abstraction over “open browser / solve challenge / get HTML.”


Deduplication: four layers

Duplication is not “annoying”—it pollutes embeddings, summaries, and trader attention. We stack:

  1. Page-level change detection — hash the set of extracted items; if nothing moved, skip the expensive work.
  2. Exact dedup — content hash (e.g. SHA-256) with batched DB lookups per source.
  3. Near-dedup — 64-bit SimHash with a Hamming threshold for “almost the same” press releases.
  4. Freshness — time-window filtering so stale reprints do not dominate.

This is the difference between “we ingested 10,000 rows” and “we ingested net-new information.”
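Layers 1 and 3 can be sketched in a few lines. This is a simplified illustration, assuming whitespace tokenization (a production SimHash would typically shingle tokens and batch the Hamming lookups):

```python
import hashlib

def listing_fingerprint(item_urls: list[str]) -> str:
    """Layer 1: hash the set of extracted items; if unchanged, skip the page."""
    return hashlib.sha256("\n".join(sorted(item_urls)).encode()).hexdigest()

def simhash64(text: str) -> int:
    """Layer 3: 64-bit SimHash over whitespace tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    out = 0
    for bit in range(64):
        if weights[bit] > 0:
            out |= 1 << bit
    return out

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Two texts are 'almost the same' if their SimHashes are within threshold bits."""
    return hamming(simhash64(a), simhash64(b)) <= threshold
```

The Hamming threshold is the knob: tighter catches only trivial edits, looser starts collapsing genuinely distinct press releases, so it is worth calibrating against a labeled sample per source.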


Enrichment, signals, and the daily brief

Structured enrichment

Raw text is not enough for a desk. We use Claude with messages.parse() and Pydantic models so the model returns typed fields—sentiment, importance, entities, tags—rather than fragile free-form JSON. That choice pays off in downstream prompts: you filter and route on numbers and enums, not vibes.
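The schema side of that contract looks roughly like this (Pydantic v2). Field names, the enum values, and the 1–10 importance scale are illustrative assumptions, not our exact production schema:

```python
from enum import Enum
from pydantic import BaseModel, Field

class Sentiment(str, Enum):
    BULLISH = "bullish"
    NEUTRAL = "neutral"
    BEARISH = "bearish"

class Enrichment(BaseModel):
    """Typed fields the model must return; downstream code routes on these."""
    sentiment: Sentiment
    importance: int = Field(ge=1, le=10)  # high scores trigger signal extraction
    entities: list[str] = []
    tags: list[str] = []

# With a typed schema, malformed model output fails validation loudly
# instead of silently flowing downstream as ad-hoc JSON:
raw = {"sentiment": "bullish", "importance": 8,
       "entities": ["RBI"], "tags": ["monetary-policy"]}
enriched = Enrichment.model_validate(raw)
```

An out-of-range `importance` or an unknown `sentiment` string raises a `ValidationError` at the boundary, which is exactly where you want to catch it.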

Trading signals

High-importance articles feed a second structured pass: affected sectors, tickers, commodities, direction, horizon, confidence. This is explicitly global-to-India aware in prompt design—macro abroad often matters through transmission channels at home.
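A signal record from that second pass might look like the following sketch (again Pydantic v2; the field names, enum, and thresholds are assumptions for illustration):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Direction(str, Enum):
    LONG = "long"
    SHORT = "short"
    NEUTRAL = "neutral"

class TradingSignal(BaseModel):
    """One structured signal per high-importance article; fields illustrative."""
    sectors: list[str]
    tickers: list[str] = []
    commodities: list[str] = []
    direction: Direction
    horizon_days: int = Field(ge=1)
    confidence: float = Field(ge=0.0, le=1.0)

sig = TradingSignal(sectors=["metals"], commodities=["copper"],
                    direction=Direction.LONG, horizon_days=30, confidence=0.6)

# Because direction and confidence are typed, routing is a comparison,
# not string matching on free-form model output:
actionable = sig.confidence >= 0.5 and sig.direction != Direction.NEUTRAL
```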

Chain-of-summaries brief

Eleven category buckets (trade, industrial, sector, IBEF, US macro, China, EU, energy, commodities, Asia-Pacific, international) each get a parallel summary call; results merge into one coherent daily brief. Empty categories can carry forward from the previous day with clear attribution—better an honest “no delta” than hallucinated filler.
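The fan-out/merge step is a natural fit for `asyncio.gather`. A minimal sketch with a stand-in summarizer (the real call goes to the LLM; `daily_brief` and its carry-forward wording are illustrative):

```python
import asyncio

CATEGORIES = ["trade", "industrial", "sector", "ibef", "us_macro", "china",
              "eu", "energy", "commodities", "asia_pacific", "international"]

async def summarize(category, articles):
    """Stand-in for one LLM summary call per category."""
    if not articles:
        return None  # signal "no delta" rather than inventing filler
    return f"{category}: {len(articles)} items"

async def daily_brief(articles_by_category, previous_brief):
    # One summary call per category, all in flight at once
    results = await asyncio.gather(
        *(summarize(c, articles_by_category.get(c, [])) for c in CATEGORIES)
    )
    brief = {}
    for category, summary in zip(CATEGORIES, results):
        if summary is None:
            # Carry forward yesterday's section with clear attribution
            prev = previous_brief.get(category)
            brief[category] = f"[carried forward] {prev}" if prev else "No updates."
        else:
            brief[category] = summary
    return brief
```

Running the eleven categories concurrently means the brief's wall-clock time is bounded by the slowest single summary call, not the sum of all of them.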


Operations: async, Postgres, API

  • PostgreSQL is the system of record (JSONB for flexible AI payloads, timestamptz everywhere).
  • Repository pattern keeps SQL out of business logic.
  • FastAPI exposes briefs, articles, signals, and operational endpoints; a small dashboard consumes the same JSON the rest of the org could wire into internal tools.
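The repository pattern mentioned above amounts to business logic depending on an interface rather than SQL. A sketch with a `Protocol` and an in-memory test double (the production implementation would target Postgres; these class and method names are assumptions):

```python
from typing import Protocol

class ArticleRepository(Protocol):
    """Business logic depends on this interface, never on SQL directly."""
    async def save(self, article: dict) -> int: ...
    async def recent(self, limit: int) -> list[dict]: ...

class InMemoryArticleRepository:
    """Test double; the real one writes JSONB payloads to Postgres."""
    def __init__(self):
        self._rows: list[dict] = []

    async def save(self, article: dict) -> int:
        self._rows.append(article)
        return len(self._rows)

    async def recent(self, limit: int) -> list[dict]:
        return self._rows[-limit:]
```

Swapping the in-memory double for the Postgres implementation in tests is what keeps the enrichment and signal code free of SQL strings.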

Lessons learned

  1. Stealth + timeouts — Anti-bot tech is real; so are hung connections. Wrap your stealth fetcher with hard deadlines and backoff per domain.
  2. Skip work early — Page snapshots that detect “no change” save LLM budget and DB churn.
  3. Typed AI outputs — Pydantic + parse() reduced an entire class of production bugs we used to see with ad-hoc JSON.
  4. Public-only is a product constraint — It keeps the compliance story simple: we automate reading, not breaking into private data.

Conclusion

Markets reward speed of understanding, not speed of tab-switching. This engine exists so a small team can monitor the world with the same public sources a diligent human would use—just with automation, deduplication, and structured intelligence layered on top.

If you’re building something similar: invest first in change detection and dedup, then in typed enrichment. The LLM is the glitter; the ingestion contract is the foundation.


DHRUV.DEV