Google Research has introduced Groundsource, a Gemini-powered pipeline that converts unstructured global news reports into structured, geo-referenced disaster datasets. Published on March 12, 2026, the methodology ingests articles across 80 languages, normalizes them through Google's Cloud Translation API, and then applies a multi-stage Gemini prompt pipeline to classify events, extract timestamps, assign severity ratings, and spatially map occurrences using Google Maps Platform. The first dataset released covers 2.6 million urban flash flood events spanning more than 150 countries from 2000 to 2025 — roughly 260 times the size of the GDACS inventory, which Google Research's paper estimates at approximately 10,000 entries. The dataset is being made openly available to researchers.
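Google has not released the pipeline's code, but the described stages (translate, model-based field extraction, geocoding) suggest a simple staged architecture. The sketch below is purely illustrative: `extract_event`, `FloodEvent`, and the injected `translate`/`classify`/`geocode` callables are all hypothetical stand-ins, not Groundsource APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class FloodEvent:
    event_type: str   # e.g. "flash_flood" (hypothetical label)
    date: str         # ISO date resolved from the article text
    severity: int     # ordinal severity rating
    lat: float
    lon: float
    source_lang: str  # language of the original article

def extract_event(article: str,
                  lang: str,
                  translate: Callable[[str, str], str],
                  classify: Callable[[str], Optional[dict]],
                  geocode: Callable[[str], Tuple[float, float]]) -> Optional[FloodEvent]:
    """Normalize a news article to English, run a model stage that
    classifies the event and extracts structured fields, then resolve
    the extracted place name to coordinates. Returns None when the
    article does not report a flood event."""
    text = article if lang == "en" else translate(article, lang)
    fields = classify(text)              # model stage: type, date, severity, place
    if fields is None:
        return None
    lat, lon = geocode(fields["place"])  # spatial anchoring stage
    return FloodEvent(fields["type"], fields["date"], fields["severity"],
                      lat, lon, lang)
```

Passing the translation, classification, and geocoding stages in as callables keeps the skeleton testable with toy stand-ins while leaving room for the real services (translation API, LLM prompt, geocoder) in production.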

Validation results show 82% of extracted events are practically useful for real-world analysis, and spatiotemporal recall against GDACS — the joint UN and European Commission disaster benchmark — reached 85 to 100% for severe flood events. The work was led by Google Research software engineers Oleg Zlydenko and Rotem Mayo alongside research scientist Deborah Cohen. Compared to legacy archives like the Global Flood Database and the Dartmouth Flood Observatory, which struggle with cloud interference, slow satellite revisit times, and a bias toward large, long-duration events, Groundsource is designed specifically to capture the localized, fast-moving flash floods that traditional monitoring infrastructure tends to miss.
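The paper does not spell out how spatiotemporal recall is computed, but a standard formulation counts a benchmark event as recovered if at least one extracted event falls within a distance and time window of it. The matching thresholds (100 km, 3 days) below are illustrative assumptions, not Groundsource's published parameters.

```python
import math
from datetime import date

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatiotemporal_recall(benchmark: list, extracted: list,
                          max_km: float = 100.0, max_days: int = 3) -> float:
    """Fraction of benchmark events matched by at least one extracted
    event within max_km and max_days. Events are dicts with
    'lat', 'lon', and 'date' (datetime.date) keys."""
    if not benchmark:
        return 0.0
    hits = 0
    for b in benchmark:
        for e in extracted:
            close = haversine_km(b["lat"], b["lon"], e["lat"], e["lon"]) <= max_km
            near = abs((b["date"] - e["date"]).days) <= max_days
            if close and near:
                hits += 1
                break
    return hits / len(benchmark)
```

Recall measured this way is sensitive to the window sizes, which is one reason the reported figure spans a range (85 to 100%) rather than a single number.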

The most direct precedent for Groundsource's news-as-sensor-network approach is the GDELT Project, whose coded event database, mined from global media, reaches back to 1979. Groundsource differentiates itself by applying Gemini's multi-step reasoning to translation, relative date resolution, and sub-administrative spatial anchoring, capabilities that GDELT's older NLP pipelines lack. On the commercial side, firms such as Fathom, Jupiter Intelligence, and First Street Foundation sell proprietary flood risk analytics to insurers and municipalities, but their historical event layers are typically modeled rather than empirically observed, making the 2.6-million-record corpus a potentially significant input they cannot easily replicate without equivalent news-ingestion infrastructure.
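Relative date resolution, one of the capabilities named above, means anchoring phrases like "yesterday" or "last Tuesday" to the article's publication date. Groundsource delegates this to Gemini; the deterministic sketch below is only meant to show what the task involves, and its phrase coverage and "last weekday" semantics are my own assumptions.

```python
from datetime import date, timedelta

# Monday = 0 ... Sunday = 6, matching datetime.date.weekday()
WEEKDAYS = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
            "friday": 4, "saturday": 5, "sunday": 6}

def resolve_relative_date(phrase: str, published: date) -> date:
    """Resolve a small set of relative date phrases against the
    article's publication date."""
    p = phrase.strip().lower()
    if p == "today":
        return published
    if p == "yesterday":
        return published - timedelta(days=1)
    if p.startswith("last "):
        target = WEEKDAYS[p.split()[1]]
        # Days back to the most recent past occurrence of that weekday;
        # if publication day IS that weekday, go a full week back.
        delta = (published.weekday() - target) % 7 or 7
        return published - timedelta(days=delta)
    raise ValueError(f"unhandled phrase: {phrase!r}")
```

An LLM-based resolver handles far messier cases (vague phrases, non-English idioms, multi-event articles), but the anchoring logic it must get right is the same: every relative expression is only meaningful relative to a known publication timestamp.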

The immediate downstream application is Google Flood Hub, where Groundsource data now powers near-global 24-hour advance urban flash flood forecasts, placing it in direct competition with the EU's Copernicus-operated GloFAS system and national meteorological agencies. Google Research explicitly frames the pipeline architecture as extensible to other hazard types, with droughts and landslides cited as near-term targets. By releasing the flood dataset openly while keeping the continuously updated Gemini inference pipeline and Maps Platform integration in-house, Google follows a pattern it has used previously with TensorFlow and AlphaFold: open contributions that build research community goodwill while the operationally valuable pipeline remains proprietary infrastructure.