Hacker News Full Archive (47M+ Items, 11.6GB) Available as Parquet on HuggingFace, Updated Every 5 Minutes

OpenIndex, an open data infrastructure organization on Hugging Face, has published a complete archive of Hacker News comprising 47.3 million items totaling 11.6 gigabytes in Parquet format. The dataset, available at huggingface.co/datasets/open-index/hacker-news, spans the platform's entire history from its earliest posts in October 2006 through the present and is refreshed every five minutes, making it one of the most current bulk mirrors of Y Combinator's community platform publicly available. The archive captures all item types, including stories, comments, jobs, polls, and poll options, and is licensed under the Open Data Commons Attribution License (ODC-BY), which permits both commercial and research use provided the source is attributed.

Each row includes fields for item ID, type, author, UTC timestamp, text body, URL, score, title, descendant count, parent and poll linkages, and a pre-tokenized word list. The Parquet columnar format makes the dataset compatible with DuckDB, Spark, Pandas, Polars, and the Hugging Face Datasets library. Hugging Face's built-in Dataset Viewer SQL console lets users run queries without downloading the full corpus. The pre-computed words field reduces setup time for text classification pipelines, and the 11.6 GB compressed size makes the nearly two-decade corpus workable on modest hardware.
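
To make the parent linkage concrete, here is a minimal Python sketch of how reply threads could be rebuilt from rows like these. The field names (id, type, by, parent, text, title) are assumptions based on the description above, not the dataset's confirmed column names, and the sample rows are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    # Assumed field names; check the dataset card for the real schema.
    id: int
    type: str
    by: Optional[str]
    parent: Optional[int]
    text: Optional[str] = None
    title: Optional[str] = None


def build_children_index(items):
    """Map each parent ID to the IDs of its direct replies, in ID order."""
    children = defaultdict(list)
    for item in sorted(items, key=lambda it: it.id):
        if item.parent is not None:
            children[item.parent].append(item.id)
    return dict(children)


def count_descendants(root_id, children):
    """Count every comment below root_id by walking the links iteratively."""
    total = 0
    stack = list(children.get(root_id, []))
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(children.get(node, []))
    return total


# Invented rows for illustration only; not taken from the dataset.
sample = [
    Item(id=1, type="story", by="alice", parent=None, title="Example story"),
    Item(id=2, type="comment", by="bob", parent=1, text="First reply"),
    Item(id=3, type="comment", by="carol", parent=2, text="Nested reply"),
    Item(id=4, type="comment", by="dave", parent=1, text="Second reply"),
]

children = build_children_index(sample)
print(children)                        # {1: [2, 4], 2: [3]}
print(count_descendants(1, children))  # 3
```

The walk uses an explicit stack instead of recursion, so very deep threads would not hit Python's recursion limit. On the full corpus the same grouping would more likely be done inside DuckDB or Polars than in plain Python.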

For teams building AI agents that process or monitor online technical discourse, the five-minute refresh cadence makes the dataset viable as a near-real-time ingestion source, reducing reliance on the live HN Firebase API for bulk operations. The full historical archive is deep enough to train and evaluate language models on startup and technology community discourse, track how narratives around AI and open source have shifted over time, or build community-behavior classifiers at scale.
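
One way such an ingestion loop might work is to keep a high-water mark on item ID and, after each refresh, process only rows above it. The sketch below assumes that item IDs increase over time and that each row exposes an `id` field. `select_new_items` is a hypothetical helper, not part of any library, and the snapshots are invented stand-ins for successive reads of the dataset.

```python
def select_new_items(rows, last_seen_id):
    """Return rows with an id above the watermark, plus the updated watermark."""
    fresh = sorted(
        (row for row in rows if row["id"] > last_seen_id),
        key=lambda row: row["id"],
    )
    new_mark = fresh[-1]["id"] if fresh else last_seen_id
    return fresh, new_mark


# Invented snapshots standing in for two successive reads of the dataset.
snapshot_1 = [{"id": 101}, {"id": 102}, {"id": 103}]
snapshot_2 = snapshot_1 + [{"id": 104}, {"id": 105}]

watermark = 0
batch_1, watermark = select_new_items(snapshot_1, watermark)
batch_2, watermark = select_new_items(snapshot_2, watermark)
batch_3, watermark = select_new_items(snapshot_2, watermark)  # nothing new

print([row["id"] for row in batch_1])  # [101, 102, 103]
print([row["id"] for row in batch_2])  # [104, 105]
print(batch_3, watermark)              # [] 105
```

An ID watermark only catches new items. Fields that change after posting, such as score and descendant count, would need a separate re-scan of recent rows.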

Nobody has publicly identified who runs or funds OpenIndex, which matters for teams considering a long-term dependency on the dataset. The organization's naming convention (open-index/hacker-news) suggests an intent to index other large internet communities in the same format, but no other datasets had been confirmed at the time of writing. The ODC-BY license and continuous update infrastructure suggest the project is built for ML pipelines, not just archival.