Cool Sample Datasets for Projects To Make Your Portfolio Stand Out

The right dataset can make or break your data analytics project.

If you’re creating your first (or hundredth) data analytics portfolio, you want real-world relevance.

You want data that has enough complexity to demonstrate your skills, and a clean enough structure that you’re not spending 90% of your time untangling missing values.

The problem?

It can be hard to find the right public datasets…if you don’t know where to look.

That’s why I’ve created this list of (mostly) free datasets that you can use for your analytics projects.

Each section focuses on a category or source (e.g., social media APIs, sports analytics, healthcare, finance, and more), complete with summaries so you can quickly scan for what’s useful to you.

Use the table of contents to jump around or scroll down to view them all.

Contents

Google Dataset Search

General & Multi-Domain Dataset Websites

Kaggle Datasets Summary

FiveThirtyEight Datasets

Data.gov

Data.europa.eu

GitHub Datasets

DataHub

Hugging Face Datasets

Papers With Code Datasets

Maven Analytics Data Playground

Data.gov.uk

Canada.ca Open Data

Data Ontario (Canada)

Registry of Open Data on AWS

Azure Open Datasets

Data Commons

UC Irvine Machine Learning Repository

University of Missouri Public Data Sets

Carnegie Mellon University AI Datasets

Linked Open Data Cloud

Pew Research Center Datasets

ClickHouse Example Datasets

S&P Global Datasets

Florida Atlantic University Data Sets

Healthcare & Medical Datasets

HealthData.gov

VAERS Data

Mount Sinai Health Data Resources

Health Data NY

Florida Department of Health Data

Climate, Environment & Earth Science Data Sets

NASA Earthdata Search

NOAA National Centers for Environmental Information (NCEI)

USDA NASS Datasets

Housing & Real Estate Datasets

Zillow Housing Data

Realtor.com Housing Data

Population & Demographics Datasets

U.S. Census Bureau Data

NCES (National Center for Education Statistics) Data

Bureau of Justice Statistics (BJS) Summary

FBI Crime Data Explorer (CDE) Summary

Internet & Web Analytics Data

BuiltWith Datasets

ORCAS Dataset

Ookla Open Data

IEEE DataPort RF Signal Dataset

Social Media & Digital Platform Datasets

YouTube-8M Dataset

Wikimedia Datasets

Meta Research Tools & Datasets

Meta Ad Library

X (formerly Twitter) Academic Research

Yelp Open Dataset

Sports & Fitness Data Sets

SURE Sports Data Sources

Sports-Statistics.com Datasets

Ohio State University Sports Data Sets

SCORE Sports Data Repository

Ticket Sales & Event Datasets

SeatData.io

StubHub API

Food & Hospitality Data Sets

FoodData Central

University of Tennessee Hospitality & Tourism Datasets

Where to Find Datasets: Communities & Forums

Reddit r/datasets

Wrapping It Up

Either way, this list is built to help you spend less time searching and more time analyzing.

You’ll find:

Hard-to-find sources like sales data, ad libraries, public health info, and sports tracking.
APIs that give you better access to dynamic data for more involved projects.
Dataset-rich platforms that are perfect for experimentation and idea generation.
Niche & specialty datasets to help your data analyst portfolio stand out from the crowd.

Whether you’re into machine learning, visualization, storytelling, or statistical modeling, you’ll find something here that works for you.

Before we get into the categories, let’s start with Google Dataset Search.

Google Dataset Search

Google Dataset Search works like a search engine built specifically for data, pulling in results from thousands of repositories across the web.

What makes it work is standardized metadata (like schema.org), which means every result includes useful info like the dataset’s title, publisher, date, and licensing terms. You can filter by topic, file type, update date, and more to zero in on exactly what you need.
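To make the metadata idea concrete, here is a minimal sketch of what a schema.org `Dataset` record can look like when written as JSON-LD. Every value below (names, URLs, dates) is a placeholder I made up for illustration, and the exact set of fields a given publisher fills in will vary.

```python
import json

# Hypothetical record: every value below is a placeholder, not a real dataset.
dataset_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example city bike trips",
    "description": "Placeholder description of an illustrative dataset.",
    "url": "https://example.org/datasets/bike-trips",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Transit Agency"},
    "datePublished": "2024-01-15",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/datasets/bike-trips.csv",
        }
    ],
}

# Serialize to the JSON-LD text a publisher could embed in a dataset page.
json_ld = json.dumps(dataset_record, indent=2)
print(json_ld)
```

Fields like `name`, `creator`, `datePublished`, and `license` line up with the title, publisher, date, and licensing info mentioned above.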

Whether I’m doing research, building models, or just poking around for a new project idea, I always start with Google Dataset Search because it cuts out a ton of noise and helps me find clean, usable data fast.

Explore datasets at: Google Dataset Search

General & Multi-Domain Dataset Websites

Perfect for beginners and seasoned analysts alike—these platforms cover everything from finance to sports to social issues.

Kaggle Datasets Summary

Kaggle is a popular platform for data science and machine learning, offering a massive collection of free, public datasets across a wide range of topics. Users can explore, analyze, and even share their own data, making it a hub for both beginners and professionals working on data projects.

The datasets span categories like computer science, business, education, healthcare, sports, natural language processing (NLP), computer vision, and more.

Each listing includes details like file size, format (CSV, JSON, SQL, etc.), and a usability score to help you gauge how clean and ready-to-use the dataset is.
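If you want your own quick read on how clean a downloaded CSV is, one rough check is the share of empty cells per column. The sketch below is not Kaggle's usability score, just a simple stand-in; the sample rows and column names are made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """job_title,company,salary,state
ML Engineer,Acme,150000,CA
Data Scientist,,135000,NY
ML Engineer,Globex,,
"""

def missing_share(csv_text):
    """Return the fraction of empty cells for each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            # Treat None, empty strings, and whitespace-only cells as missing.
            if not (row[name] or "").strip():
                counts[name] += 1
    return {name: (missing / total if total else 0.0) for name, missing in counts.items()}

shares = missing_share(SAMPLE_CSV)
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

For a real file, you could pass the file's text to `missing_share` instead of the sample string.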

New and trending datasets are continuously added, often tied to real-world use cases or active data competitions.

Featured Datasets for Projects:

Wikipedia Structured Contents – A 25 GB JSON dataset of structured data extracted from Wikipedia, ideal for natural language processing and information retrieval projects.
Machine Learning Job Postings in the US – A compact CSV file featuring job listing data, useful for trend analysis in the data science job market.
Football Data: European Top 5 Leagues – Detailed data on teams and matches across Europe’s major leagues, perfect for sports analytics and time-series projects.
Diabetes Prediction Dataset – A clean, compact dataset frequently used for classification modeling and machine learning practice.
15-Year Stock Data: NVDA, AAPL, MSFT, GOOGL & AMZN – Historical stock prices for major tech companies, great for financial modeling and forecasting.

Explore more datasets at: kaggle.com/datasets

FiveThirtyEight Datasets

FiveThirtyEight is a data journalism site known for its evidence-based reporting in politics, sports, science, economics, and culture.

As part of its transparency efforts, the site publishes the datasets and code behind its articles, allowing analysts, students, and journalists to explore or replicate their findings.

Most datasets are hosted on GitHub in CSV format and come with brief context and links to the original articles. The focus leans heavily toward American politics and sports but includes gems in public health, entertainment, and social science.

They’re especially valuable for storytelling with data, building dashboards, and practicing real-world analytics using clean, well-documented sources.
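Since most of these files live on GitHub as CSVs, it can help to know how GitHub's raw-content URLs are put together. The helper below is a small sketch; the owner, repository, branch, and path in the example call are placeholders, not a real FiveThirtyEight file.

```python
from urllib.parse import quote

def github_raw_url(owner, repo, branch, path):
    """Build the raw-content URL for a file in a public GitHub repository."""
    # quote() keeps "/" intact and escapes characters such as spaces.
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{quote(path)}"

# Placeholder coordinates: swap in the real owner, repo, branch, and file path.
url = github_raw_url("example-org", "example-data", "main", "polls/sample polls.csv")
print(url)
```

A URL built this way can usually be handed straight to whatever CSV reader you prefer.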

Featured Datasets for Projects:

NBA Player RAPTOR Ratings – A unique player-value model that evaluates offensive and defensive contributions for current and historical NBA players. Great for sports analytics and regression projects.
Generic Congressional Ballot Polls – A time-series dataset tracking partisan preferences for Congress. Useful for trend analysis and visualizing political shifts over time.
MLB Elo Ratings – Game-by-game predictions and team ratings updated throughout each season. Excellent for time series modeling and sports forecasting.
Pollster Ratings – FiveThirtyEight’s own scoring system for polling firms, with metrics on bias, predictive accuracy, and transparency.
Redlining and Home Loans – A historical dataset showing how government practices in the 1930s shaped long-term racial disparities in housing. Ideal for social science and GIS projects.
Super Bowl Ad Ratings – Survey data on how viewers perceived Super Bowl ads across categories like humor, patriotism, and sex appeal. Fun for sentiment analysis or clustering.
COVID-19 Polls – Public opinion data on pandemic response, vaccines, and restrictions. Strong for time-series or geographic comparison analysis.
World Cup Predictions – Match-level data including SPI (Soccer Power Index) scores, probabilities, and team ratings. Good for predictive modeling and international comparisons.

Explore more datasets at: data.fivethirtyeight.com

Data.gov

Data.gov is the U.S. government’s official open data portal, offering public access to hundreds of thousands of datasets from federal agencies.

With over 310,000 free datasets across sectors like agriculture, climate, education, energy, health, science, and public safety, Data.gov is a goldmine for analysts, developers, journalists, and civic technologists.

Datasets are typically provided in machine-readable formats (CSV, JSON, XML, GeoJSON) and often include metadata, geospatial components, and links to agency documentation or APIs.
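GeoJSON tends to be the least familiar of these formats, so here is a short sketch of flattening a GeoJSON `FeatureCollection` into plain rows with Python's standard library. The snippet, station names, and numbers are placeholders I made up, not an actual Data.gov file.

```python
import json

# Hypothetical GeoJSON snippet; names and coordinates are placeholders.
GEOJSON_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-77.03, 38.90]},
     "properties": {"name": "Station A", "riders": 1200}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-122.42, 37.77]},
     "properties": {"name": "Station B", "riders": 800}}
  ]
}
"""

collection = json.loads(GEOJSON_TEXT)

# Flatten each point feature into a plain dict that is easy to tabulate.
rows = []
for feature in collection["features"]:
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order is [longitude, latitude]
    rows.append({"lon": lon, "lat": lat, **feature["properties"]})

for row in rows:
    print(row)
```

This only handles `Point` geometries; polygons and lines nest their coordinates more deeply and would need extra handling.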

Featured Datasets for Projects:

NOAA Global Historical Climatology Network – A massive climate dataset with decades of weather observations from thousands of stations worldwide. Ideal for time-series forecasting and climate trend analysis.
Consumer Complaint Database (CFPB) – Records over a million complaints submitted by consumers about financial products and services. Great for sentiment analysis, topic modeling, and regulatory research.
USDA Food Access Research Atlas – Geospatial data on food deserts and access to healthy food options in U.S. communities. Strong use case for mapping, GIS projects, and public health studies.
Fatality Analysis Reporting System (FARS) – A national database of fatal motor vehicle crashes from the National Highway Traffic Safety Administration. Useful for safety research and predictive modeling.
National Transit Database – Operational and financial performance metrics from public transit agencies. A go-to source for transportation planning and urban infrastructure analysis.
Medicare Hospital Compare Data – Publicly reported hospital performance data on measures like patient outcomes, satisfaction, and readmission rates. Perfect for healthcare benchmarking and visualization.
2020 Decennial Census Summary Files – Detailed demographic data from the U.S. Census Bureau, broken down by region, age, sex, race, and more. A foundational dataset for policy, equity research, and market segmentation.

Explore more datasets at: data.gov

Data.europa.eu

Data.europa.eu is the official open data portal of the European Union, providing centralized access to over 1.9 million free datasets from across EU institutions, member states, and affiliated agencies.

The portal includes data across 35 countries and 200 catalogs, covering domains such as agriculture, economy, environment, health, justice, education, and transportation.

Most datasets are available in open formats like CSV, JSON, XML, RDF, and GeoJSON, and come with multilingual metadata. The platform also features data stories, learning modules, and visualization guides to help users derive insights and develop impactful projects.

Featured Datasets for Projects:

EU Financial Sanctions List – A consolidated database of individuals and entities subject to EU financial sanctions, used in compliance, security analysis, and geopolitical research.
Pesticide Active Substances & Residue Levels – Information on approved plant protection substances and maximum residue levels across the EU. Useful for food safety, agriculture, and regulatory studies.
CORDIS Horizon Europe Projects (2021–2027) – Data on funded research projects, partners, and outcomes from the EU’s flagship research program. Ideal for innovation tracking and R&D network analysis.
Charging Stations for Electric Vehicles – Location and availability of EV charging infrastructure across European cities. A key dataset for sustainability, smart city planning, and transportation modeling.
European Air Quality Data – Time-series air pollution data from monitoring stations across the EU. Suitable for health impact assessments, climate studies, and environmental policy modeling.
EU Trade in Goods – Monthly trade statistics by product and partner country, enabling economic analysis, market research, and policy forecasting.
Urban Audit – Quality of Life in European Cities – Socioeconomic, environmental, and demographic indicators from over 900 cities in Europe. Great for comparative urban analysis and data storytelling.

Explore more datasets at: data.europa.eu

GitHub Datasets

GitHub hosts a vast and diverse ecosystem of public datasets shared by individuals, academic researchers, open source communities, and tech companies.

With over 500,000 repositories tagged with “dataset,” GitHub serves as a powerful discovery platform for niche, high-quality datasets often accompanied by code, documentation, and practical use cases.

Unlike curated data portals, GitHub datasets are project-driven and frequently updated, making it a valuable source for machine learning, computer vision, NLP, medical imaging, and domain-specific research. These datasets often come in raw or semi-structured formats (CSV, JSON, XML, SQL) and are ideal for developers, data scientists, and educators looking to dive deep into real-world data challenges.

One standout is the Awesome Public Datasets repository, which is essentially a community-curated directory of links to public datasets.

It doesn’t host data itself—instead, it categorizes external datasets across a wide range of topics like agriculture, economics, computer vision, social science, and transport.

Each entry includes a brief description and a link to the original source. What makes it “awesome” is its breadth and the fact that it’s constantly updated by contributors. This is a great starting point for project ideas or exploring domains you might not have considered.

You don’t need a GitHub account to use it, but if you’re a contributor or want to bookmark datasets via GitHub stars, logging in is helpful.

Featured Datasets & Repositories

Hugging Face – A hub of preprocessed, ready-to-use datasets for machine learning, NLP, and computer vision with streamlined integration for model training and evaluation.
Awesome Public Datasets by awesomedata – A massive, topic-centric list of high-quality public datasets across domains such as economics, health, science, and technology.
TensorFlow (TFDS) – A library of ML-ready datasets optimized for TensorFlow and JAX, ideal for deep learning workflows in computer vision, NLP, and reinforcement learning.
OpenImages – A massive dataset of annotated images for computer vision research, featuring object detection, segmentation, and visual relationship annotations.
Google Creative Lab QuickDraw Dataset – A fun and extensive collection of doodle sketches from the “Quick, Draw!” game, used in classification and generative model projects.
Linhandev – A curated index of open medical imaging datasets including MRI, CT, and segmentation data, widely used for health AI applications.

Explore more datasets at: GitHub dataset repositories

DataHub

DataHub is a modern open data platform built by Datopian that makes it easy to find, share, and publish quality datasets. It offers both free and premium datasets, with tools designed for data professionals, developers, and researchers.

Users can browse curated collections, publish datasets using GitHub integration, or request customized data through the Premium Data Service.

The platform includes thousands of datasets organized by topic—from environment and health to economics and machine learning.

Most datasets are available in CSV, JSON, or GeoJSON formats, and the Awesome Data section highlights high-quality, handpicked collections curated by the community.

Featured Datasets for Projects:

S&P 500 Companies – A clean, frequently updated list of S&P 500 firms with metadata, perfect for finance and investing projects.
Country Codes (ISO 3166-1) – Standardized country codes for geographic labeling, compatible with most data tools.
Airport Codes Worldwide – A comprehensive dataset for travel, shipping, or logistics applications.
Country Polygons (GeoJSON) – Detailed boundary files for mapping and geospatial analysis.
Core Economic Indicators – Key macroeconomic data including inflation, employment, and GDP metrics across countries.

Explore more datasets at: datahub.io

Hugging Face Datasets

Hugging Face is a leading platform in natural language processing and open-source AI, offering a massive repository of over 390,000 datasets for training, evaluating, and fine-tuning machine learning models. It is especially popular in the NLP and deep learning communities.

The datasets span multiple modalities, including text, tabular, audio, image, video, time-series, 3D, and geospatial data. Each dataset page includes an interactive preview, metadata, license info, task tags, and usage examples.

Whether you’re building language models, chatbots, or multimodal systems, Hugging Face is a go-to hub for real-world, large-scale, and benchmark datasets.

Featured Datasets for Projects:

openai/gsm8k – A benchmark dataset of about 8.5K grade-school math word problems designed for evaluating language model reasoning.
HuggingFaceFW/fineweb – One of the largest web-scale text datasets with over 15 trillion tokens, optimized for pretraining large language models.
nvidia/OpenMathReasoning – A dataset of over 5 million advanced math problems used for training and benchmarking reasoning models.
tokyotech-llm/swallow-code – A 145M-sample dataset focused on code generation and reasoning, great for building coding agents.
fka/awesome-chatgpt-prompts – A community-driven dataset of curated ChatGPT prompts for instruction-tuning and fine-tuning LLMs.

Explore more datasets at: huggingface.co/datasets

Papers With Code Datasets

Papers With Code is a comprehensive resource for machine learning and artificial intelligence research, featuring over 11,000 public datasets linked directly to academic papers, benchmarks, and task leaderboards.

It’s designed for both researchers and practitioners who want to replicate, compare, or improve upon state-of-the-art results across a huge range of ML tasks and modalities.

The site supports datasets across domains like image classification, NLP, speech, reinforcement learning, 3D vision, tabular data, and time series, with filtering options for modality, task type, and language.

Each dataset is linked to published papers, benchmarks, and model performance scores—making it one of the best places to find evaluation-ready, academically grounded datasets.

Featured Datasets for Projects:

ImageNet – The foundational large-scale image classification and object detection dataset, with over 14 million labeled images across thousands of categories.
MS COCO (Common Objects in Context) – A benchmark dataset for object detection, segmentation, keypoint recognition, and captioning.
CIFAR-10 / CIFAR-100 – Small, color image datasets often used in classification benchmarks and deep learning experiments.
GLUE Benchmark – A collection of 9 natural language understanding tasks used to evaluate the performance of language models like BERT and RoBERTa.
MIMIC-III – A de-identified dataset of medical records from intensive care units, widely used in clinical NLP and healthcare machine learning.

Explore more datasets at: paperswithcode.com/datasets

Maven Analytics Data Playground

The Maven Analytics Data Playground is a curated collection of real-world datasets designed for hands-on learning and portfolio development. Created by the team behind Maven Analytics’ popular data courses, the playground offers high-quality sample data sets ideal for practicing Excel, SQL, Power BI, and Tableau skills.

Each dataset includes metadata such as the number of records and fields, file format (typically CSV), and whether it contains multiple tables or time series data.
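If SQL practice is the goal, one low-setup approach is to load a CSV into an in-memory SQLite database and query it from Python. This is a sketch under assumptions: the table, column names, and rows are placeholders, not an actual Maven Analytics file.

```python
import csv
import io
import sqlite3

# Hypothetical rows; the column names and values are placeholders.
FILMS_CSV = """title,release_year,budget_musd
Film A,2001,90
Film B,2010,150
Film C,2019,175
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER, budget_musd REAL)")

# DictReader yields one mapping per row, which matches the named placeholders.
reader = csv.DictReader(io.StringIO(FILMS_CSV))
conn.executemany(
    "INSERT INTO films VALUES (:title, :release_year, :budget_musd)",
    list(reader),
)

# Average budget per decade, the kind of query a portfolio project might start with.
query = """
    SELECT (release_year / 10) * 10 AS decade, AVG(budget_musd) AS avg_budget
    FROM films
    GROUP BY decade
    ORDER BY decade
"""
results = conn.execute(query).fetchall()
print(results)
conn.close()
```

For a multi-table dataset, the same pattern works with one `CREATE TABLE` per file, which then lets you practice joins.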

The topics are engaging and varied—ranging from entertainment and transportation to art, government, and business—making it a great place to find clean, beginner-friendly datasets that still reflect real analytical challenges.

Featured Datasets for Projects:

Himalayan Expeditions – Over 11,000 records detailing expeditions from 1905 to 2024, including climber details, weather conditions, and summit outcomes.
Spotify Streaming History – A full export of one user’s listening history, great for time series, personal analytics, and trend analysis.
Pixar Films – Clean and structured data on every Pixar film through 2024, including release dates, budgets, box office returns, and more.
MoMA Collection – Artworks and acquisitions from the Museum of Modern Art, with metadata like artist, classification, and acquisition year.
MTA Daily Ridership – Daily ridership and service usage across New York’s MTA system with pre- and post-COVID comparisons.

Explore more datasets at: mavenanalytics.io/data-playground

Data.gov.uk

Data.gov.uk is the United Kingdom’s official open data portal, offering access to over 55,000 datasets from public sector organizations, research institutions, and local authorities. It serves as a comprehensive hub for data across a wide range of domains, including environment, land use, government spending, infrastructure, education, and public health.

The platform aggregates data from major publishers such as the Environment Agency, Natural England, the British Geological Survey, and numerous regional councils. Dataset categories span environmental monitoring, land and resource management, geospatial mapping, procurement and economic activity, agriculture, and public services.

With support for multiple formats—CSV, XLSX, GeoJSON, KML, ZIP, PDF, and more—users can filter datasets by format, license, topic tag, or publisher, making it easy to locate data for research, analytics, or application development.

Most free datasets are released under the UK Open Government Licence, ensuring broad reusability.

Featured Datasets for Projects:

UK Public Procurement Notices (2021–2025) – Monthly XML files detailing public sector procurement activity, ideal for economic analysis and contract tracking.
Natural England Designated Sites – Official data on conservation areas, including SSSIs and protected habitats, valuable for environmental research and policy.
British Geological Survey Deposited Data – High-resolution geological data and soil profiles, supporting infrastructure planning and scientific research.
Environment Agency Flood Risk Zones – Geospatial boundary data showing areas at risk of flooding across England.
City of York Council Open Data – Local-level datasets including planning applications, transportation, waste services, and public facility usage.

Explore more datasets at: data.gov.uk

Canada.ca Open Data

Canada.ca provides centralized access to the Government of Canada’s public data, statistics, and archival information, supporting research, analysis, and innovation across various sectors. Public data sets span a wide range of categories, including census and demographic data, economic indicators, government spending, science and technology statistics, and historical archives.

Users can explore resources from Statistics Canada, Library and Archives Canada, and the Treasury Board Secretariat, among others.

The platform enables access to structured datasets, statistical studies, digital collections, and technical papers, many of which are freely available for reuse.

It includes themes like science and innovation, environment, health, economics, and Indigenous affairs, making it valuable for policy analysts, researchers, educators, and developers.

Featured Resources

Canadian Census Data – Detailed demographic and socioeconomic information updated every five years, essential for urban planning, social research, and policy development.
Open Data Portal – A searchable interface for accessing thousands of government datasets, including formats like CSV, XLSX, and XML.
Science and Technology Statistics – Data on R&D spending, innovation metrics, patents, and tech sector employment, useful for economic and academic research.
Library and Archives Canada – Digitized historical records, photographs, census rolls, and genealogical information available for personal and scholarly use.
Data Tables and Statistical Studies – Ready-to-use socioeconomic data and in-depth research papers from Statistics Canada, suitable for academic, business, and public use.

Explore more datasets at: open.canada.ca

Data Ontario (Canada)

Data Ontario is the official open data portal for the Government of Ontario, offering public access to nearly 3,000 datasets across government services, business, health, environment, and more.

Datasets can be filtered by topic, ministry, format (e.g., CSV, XLSX, PDF), access level, and update frequency. While some datasets are open and ready to download, others are restricted or under review.

Data Ontario is particularly useful if you want to look at a subset of Canadian government data focused on the most populous province, especially for policy, community services, and environmental trends.

Featured Datasets for Projects:

School Information & Student Demographics – Data from Ontario’s Ministry of Education covering EQAO test results, school and board profiles, and academic performance trends.
Sediment Chemistry in Great Lakes – Environmental data on sediment samples across lake stations, useful for monitoring pollution and conservation efforts.
Bumble Bee Diversity and Abundance Survey – Survey results detailing species counts and diversity across Ontario, supporting environmental research and species protection.
Where to Buy Alcoholic Beverages – Retailer location data from the Alcohol and Gaming Commission of Ontario for licensed alcohol sales across the province.
Assistive Devices Program (ADP) Vendor Listings – A set of datasets listing vendors for mobility aids, prosthetics, hearing devices, and more under Ontario’s ADP.

Explore more datasets at: data.ontario.ca

Registry of Open Data on AWS

The Registry of Open Data on AWS (RODA) is a powerful platform for discovering open and commercial datasets hosted in the Amazon Web Services cloud.

Now integrated with AWS Data Exchange, it gives users access to thousands of free and premium datasets from leading institutions, including NASA, Meta’s Data for Good, NOAA, NIH, and the Allen Institute for Cell Science. RODA spans industries like genomics, environmental science, medicine, machine learning, and satellite imagery.

Data is typically stored in cloud-optimized formats (like Parquet, JSON, and CRAM) and ready for immediate use with AWS analytics tools such as SageMaker, Athena, and Redshift.

Each dataset includes documentation, example notebooks, and references to academic research or use cases.

Featured Datasets for Projects:

The Human Sleep Project – Clinical sleep physiology recordings from 15K+ patients, used to train deep learning models for sleep diagnostics and brain aging research.
Common Crawl – An enormous corpus of 50+ billion web pages used for web-scale NLP, language modeling, and multilingual embeddings.
The Cancer Genome Atlas (TCGA) – Genomic data from 11,000 cancer patients across 33 cancer types, enabling breakthroughs in bioinformatics and precision medicine.
Folding@Home COVID-19 – Large-scale protein folding simulations from the exascale Folding@Home network, focused on SARS-CoV-2 vulnerabilities.
Sentinel-2 Satellite Imagery – High-resolution global optical imagery updated every 5 days, ideal for land cover, vegetation, and environmental monitoring.

Explore more datasets at: registry.opendata.aws

Azure Open Datasets

Azure Open Datasets is Microsoft’s collection of curated, ready-to-use public datasets designed to streamline machine learning workflows. Hosted on Azure, these datasets save time on data sourcing and preparation, making them ideal for data science, AI, and analytics projects in transportation, healthcare, economics, and more.

Datasets are provided in formats optimized for Azure services (e.g., Parquet, CSV), and are often updated daily. Many collections are integrated with Azure Machine Learning Studio.

Featured Datasets for Projects:

NYC Taxi Trip Records – Includes yellow, green, and for-hire vehicle data with pickup/drop-off times, fares, and geolocations—perfect for time-series analysis, fare prediction, and urban mobility models.
COVID-19 Data Lake – A consolidated data lake of COVID-related data including mobility trends, hospital capacity, and policy interventions.
US Labor Force & Employment Data – Employment, earnings, and participation statistics at national, state, and local levels.
City Safety 311 Data – Public safety and service request data from cities like NYC, Boston, Seattle, and San Francisco, updated daily.
Microsoft News Dataset (MIND) – A benchmark dataset for news recommendation systems, including user behavior and news article metadata.

Explore more datasets at: azure.microsoft.com/en-us/services/open-datasets

Data Commons

Data Commons is a public knowledge graph that unifies open datasets from around the world, making it easier to explore, analyze, and compare information across domains like health, education, demographics, and the economy.

Built by Google, it aggregates data from trusted sources such as the U.S. Census Bureau, World Bank, OECD, and national statistics offices from over 30 countries.

Datasets are organized into categories such as agriculture, biomedical, crime, demographics, economy, education, energy, environment, health, housing, and transportation. Each dataset is harmonized into a common schema, enabling comparisons across geographies and time.

Most data is accessible via an easy-to-use web interface, APIs, and tools for querying and visualization.

Featured Datasets for Projects:

American Community Survey (ACS) – U.S. Census Bureau data covering population, housing, income, and employment, available at state, county, and city levels.
OECD Regional Demographics – Population, mortality, and life expectancy statistics for OECD countries at multiple geographic levels.
World Bank Development Indicators – Global data on health, energy, economy, labor, and education.
Opportunity Atlas – U.S. neighborhood-level social mobility data, including income, education, and life outcomes by race and parental background.
India Census Tables – District-level demographic data on literacy, religion, housing, and work, with extensive geographic coverage.
CDC Wonder: Mortality & Natality – U.S. mortality and birth data broken down by age, race, sex, and geography.

Explore more datasets at: datacommons.org

UC Irvine Machine Learning Repository

The UC Irvine Machine Learning Repository (UCI ML Repository) is one of the oldest and most widely used sources of datasets for teaching, research, and experimentation in machine learning.

Maintained by the University of California, Irvine, it hosts over 675 datasets contributed by the global research community. The datasets span a wide range of domains including biology, finance, healthcare, marketing, energy, and physics.

Most are structured in CSV or TXT format, and include rich metadata like number of instances, number of features, task type (classification, regression, clustering), and a brief description of the problem.
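You can reproduce that kind of summary yourself once a file is downloaded. The sketch below counts instances, features, and class labels for a handful of rows laid out like the Iris data file; it assumes the last column holds the target label, which is not true of every dataset.

```python
import csv
import io
from collections import Counter

# Six rows in the layout of the Iris data file (four measurements, then the class label).
IRIS_SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Skip blank lines, which csv.reader returns as empty lists.
rows = [row for row in csv.reader(io.StringIO(IRIS_SAMPLE)) if row]

n_instances = len(rows)
n_features = len(rows[0]) - 1  # assumes the last column is the target label
class_counts = Counter(row[-1] for row in rows)

print(f"instances={n_instances}, features={n_features}")
print(dict(class_counts))
```

A balanced class count like this one is a quick sign that a dataset suits a first classification project.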

Featured Datasets for Projects:

Iris – A classic dataset introduced in 1936, containing flower measurements used to demonstrate classification algorithms.
Adult (Census Income) – Predict whether income exceeds $50K/year based on census data. Common for binary classification tasks.
Breast Cancer Wisconsin (Diagnostic) – Features derived from digital images of breast mass tissue samples to predict diagnosis.
Wine Quality – Data on physicochemical properties of Portuguese wines, used to model perceived wine quality.
Bank Marketing – Predict term deposit subscription based on customer data from marketing campaigns.

Explore more datasets at: archive.ics.uci.edu

University of Missouri Public Data Sets

The University of Missouri Libraries maintain a robust guide to public use datasets ideal for quantitative research across disciplines. This curated resource provides direct access to major repositories and thematic datasets, with a focus on social sciences, health, economics, education, environment, and more.

Many datasets are freely accessible, while some require institutional access or are available through affiliated platforms like ICPSR and the Roper Center.

Dataset categories span key research areas including political science, demographics, business and economics, health, sociology, education, and science/environmental studies.

Featured Datasets for Projects:

American National Election Studies (ANES) – A foundational political science resource providing public opinion, voting behavior, and election data since 1948.
IPUMS (Integrated Public Use Microdata Series) – High-quality demographic microdata from U.S. and international censuses.
Consumer Expenditure Surveys (CES) – U.S. household data on expenditures, income, and demographics, ideal for economic analysis.
Demographic and Health Surveys (DHS) – Large-scale surveys from developing countries focusing on health and population trends.
General Social Survey (GSS) – Long-running U.S. sociological survey covering social attitudes, behaviors, and demographics.
National Center for Education Statistics (NCES) – Comprehensive data on American education, including schools, students, and degrees.

Explore more datasets at: University of Missouri Libraries – Public Use Datasets Guide

Carnegie Mellon University AI Datasets

Carnegie Mellon University is home to one of the best computer science programs in the world, and its Libraries provide a comprehensive guide to datasets for artificial intelligence and machine learning research.

The collection emphasizes open-access, high-quality datasets curated from trusted academic, governmental, and industry sources. It supports a wide range of AI domains, including computer vision, natural language processing (NLP), deep learning, and data mining.

Dataset categories span machine learning benchmarks, computer vision, text mining, COVID-19, and generalist repositories for multidisciplinary research. Specialized repositories like Hugging Face, UCI Machine Learning Repository, and Papers with Code are highlighted, along with text analysis platforms like ProQuest TDM Studio and Gale Digital Scholar Lab.

Their website also links to global registries such as re3data.org, FAIRsharing, and WorldData.AI, supporting exploratory and reproducible research in AI.

Featured Datasets & Platforms

UCI Machine Learning Repository – A classic collection of structured datasets used to test and benchmark ML algorithms.
Hugging Face Datasets – An open library of text, audio, and multimodal datasets for NLP and machine learning.
ImageNet – A foundational computer vision dataset with millions of labeled images organized by WordNet categories.
CORD-19 (COVID-19 Open Research Dataset) – A continually updated repository of scholarly articles related to coronavirus research.
Johns Hopkins University COVID-19 Data – Widely used for tracking global case counts and trends, aggregated from multiple public health sources.
KiltHub – CMU’s institutional repository for datasets, theses, and scholarly outputs.
Open Science Framework – A collaborative platform for project management and open data sharing across research domains.
Papers With Code – Merges AI research papers with associated code and datasets for transparency and benchmarking.

Explore more datasets at: guides.library.cmu.edu/artificial-intelligence

Linked Open Data Cloud

The Linked Open Data (LOD) Cloud is a visual and navigable network of datasets published using the principles of Linked Data. Maintained by the Insight Centre for Data Analytics, it serves as a centralized map of RDF-based datasets that are interlinked, accessible via SPARQL or RDF dumps, and structured for semantic web applications.

With over 1,350 public data sets, it provides a snapshot of the evolving open data ecosystem for researchers, developers, and data scientists interested in semantic technologies and data integration.

Dataset categories in the LOD Cloud include Cross-Domain, Geography, Government, Life Sciences, Linguistics, Media, Publications, Social Networking, and User-Generated Content.

Each dataset is required to have at least 1,000 RDF triples and be interlinked with others, creating a robust, machine-readable web of data.
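
To make the SPARQL access route concrete, here is a minimal sketch that encodes a triple-counting query as a GET request, which is one of the request styles the SPARQL 1.1 Protocol allows. The endpoint URL is a placeholder, since every dataset in the cloud publishes its own, and nothing is sent over the network here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: each LOD dataset lists its own SPARQL URL.
ENDPOINT = "https://example.org/sparql"

# Counts every triple in the dataset, e.g. to compare against the 1,000-triple minimum.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_sparql_url(endpoint, query):
    """Encode a query as a GET request using the protocol's `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_url(ENDPOINT, COUNT_QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be passed to any HTTP client. Large endpoints may time out on a full count, so narrower queries are usually a safer first step.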

Featured Elements

LOD Diagram Viewer – Interactive visualization of all linked datasets and their relationships.
SPARQL Access & RDF Dumps – Datasets are made fully accessible for semantic querying and analysis.
Historical Snapshots – Archived versions of the LOD cloud from 2007 to present allow exploration of growth trends.
IPFS Cloud – Experimental decentralized hosting of LOD datasets via the InterPlanetary File System (IPFS).
Submission Portal – Researchers can propose new datasets by adhering to Linked Data best practices and interlinking requirements.

Explore the live LOD Cloud and datasets at: lod-cloud.net

Pew Research Center Datasets

Pew Research Center is a leading nonpartisan think tank known for its rigorous public opinion polling, demographic research, and data-driven analysis on major social, political, and cultural issues.

As part of its commitment to transparency and accessibility, Pew offers free access to case-level microdata from many of its surveys for secondary analysis. These free datasets are typically released after a short embargo and are ideal for researchers, educators, and data analysts seeking high-quality survey data.

Pew’s datasets span a wide range of topics including politics, religion, technology, news consumption, economics, demographics, and global trends. Users can explore both U.S.-based and international survey data, with resources designed for both individuals and classrooms (like typology quizzes and interactive tools).

Featured Datasets & Tools

Religious Landscape Survey – Responses from over 35,000 U.S. adults covering religious affiliation, belief, practice, and political attitudes.
Global Religious Futures – Comparative international data on religious demographics and restrictions across countries.
Local News Dynamics – Data exploring how Americans consume and evaluate local news sources by region.
Survey Question Search – Access over three decades of Pew survey questions through the Roper iPoll database.
Political and Religious Typology Quizzes – Tools to compare yourself or your group to national surveys of thousands of adults.
Pew Research Methods R Package – Open-source tools on GitHub to help users work with Pew survey data using R.

Explore more datasets and tools at: pewresearch.org/datasets

ClickHouse Example Datasets

ClickHouse is a high-performance, open-source columnar database built for real-time analytics at scale.

To help users learn and benchmark its powerful features, ClickHouse offers a wide variety of public example datasets across domains such as e-commerce, geospatial analysis, social media, climate science, aviation, and web analytics. Many datasets contain billions of rows and are optimized for hands-on experimentation with real-world data.

Beyond building a data analytics portfolio, use cases span machine learning, business intelligence, observability, log analysis, and data warehousing.

Each dataset is accompanied by sample queries, integration guides, or tutorials to demonstrate best practices.

Featured Datasets for Projects:

Amazon Customer Reviews – 150M+ reviews on Amazon products, great for sentiment and text analysis.
New York Taxi Data – Billions of ride records (yellow cabs, Uber, etc.), widely used for geospatial, time-series, and performance benchmarks.
Reddit Comments Dataset – 14B+ rows of public Reddit comments from 2005–2023, perfect for text mining and user behavior analysis.
Environmental Sensors Data – 20B sensor readings from a global air quality network, useful for time-series and IoT applications.
LAION-400M Dataset – 400 million image-text pairs for training multimodal models (great for GenAI).
Foursquare Places – 100M+ venue records for mapping and spatial analysis.
UK Property Prices – Historical home sale data for England and Wales.
COVID-19 Open Data – Epidemiological and economic data for pandemic-related studies.
GitHub Events – 3.1B GitHub activity logs for studying developer behavior or repository trends.

Explore more datasets at: clickhouse.com/docs/en/getting-started/example-datasets

S&P Global Datasets

S&P Global offers a robust suite of high-quality, institutional-grade datasets covering global markets, industries, and ESG metrics. Designed primarily for financial analysts, researchers, and professionals, these datasets deliver point-in-time, real-time, and forecasted data across a wide array of sectors.

S&P Global’s data spans everything from equities, credit risk, and ownership structures to climate impact, automotive trends, and supply chain relationships.

Dataset categories include market intelligence, mobility, sustainability, and commodity insights, with deep coverage in finance, energy, insurance, bio/pharma, and ESG analytics.

Many datasets are updated frequently, offer global scope, and can be delivered via cloud platforms or APIs for seamless integration.

Featured Datasets for Projects:

Retail Advertised Inventory – Real-time inventory data from 19,000+ US vehicle dealers, including 40+ vehicle attributes.
OTC Derivatives Data – Price transparency and valuation insights from 60+ market makers and major exchanges.
Visible Alpha BioPharma – Drug pipeline data and consensus analyst estimates across biotech firms.
Trucost Environmental Data – ESG metrics for over 16,800 public and 4.5M private companies worldwide.
AutoCreditInsight – Detailed insights into automotive financing, lending trends, and consumer behavior.
ETF Factors – Fundamental and factor exposures for 4,000+ global exchange-traded funds.
Nature & Biodiversity Risk – Company-level analysis of nature-related risks and dependencies.
SNL RateWatch – Banking product interest rate and fee data across US institutions.
Securities Finance Short Interest – Data on borrow costs, short seller demand, and squeeze risk for global equities and fixed income.
Historical & Forecast LNG Prices – Deep commodity analytics on global liquefied natural gas markets.

Explore more datasets at: www.marketplace.spglobal.com

Florida Atlantic University Data Sets

The Florida Atlantic University Libraries’ guide to Florida Data and Statistics compiles a comprehensive list of publicly available datasets that reflect the state’s demographic, economic, environmental, health, transportation, and educational landscape.

These datasets are sourced from state agencies, universities, and federal databases, offering researchers, policymakers, and citizens a rich foundation for evidence-based analysis and decision-making.

Categories include agriculture, crime, education, economy, energy, environment, health, housing, GIS, transportation, tourism, and public safety, with many available in geospatial formats. Special emphasis is placed on Florida-specific statistics—ranging from boating accidents and crash reports to public school demographics and environmental metrics.

Many datasets are also accessible through platforms like data.gov, data.world, Florida CHARTS, Florida Geographic Data Library, and the Florida Department of Health.

Featured Datasets for Projects:

AERMET Meteorological Data – Regulatory-grade datasets from 32 ASOS stations used for air quality modeling.
Florida CHARTS – Community health assessment tools and public health statistics down to the county level.
Florida Geographic Data Library (FGDL) – Over 600 geospatial data layers covering land use, hydrography, soils, transportation, and more.
Boating Accident Statistical Reports – Historical accident data compiled by Florida’s Fish and Wildlife Commission.
Florida Housing Data Clearinghouse – Data on affordable housing supply, demand, and household demographics.
Florida HSMV Crash & Citation Reports – Traffic incident, citation, and trend data collected by the Department of Highway Safety.
Florida Legislative Datasets – Weekly snapshots of state legislative activity available in CSV and JSON formats.
Florida Climate Center – Historical weather and precipitation data from long-term observation stations across the state.
Florida Labor Market Statistics – Employment, wage, and industry trend data from the Florida Department of Commerce.
USGS Water Data for Florida – Real-time streamflow, groundwater, and water quality data for Florida from the U.S. Geological Survey.

Explore more datasets at: https://library.fau.edu

Healthcare & Medical Datasets

Ideal for health informatics, epidemiology, and public health analysis.

HealthData.gov

HealthData.gov is the U.S. Department of Health and Human Services’ (HHS) flagship open data portal, created to make health-related government datasets easily accessible to the public. It was designed to support innovation, research, and informed decision-making, and aggregates data from HHS agencies such as the CDC, NIH, FDA, and CMS.

Dataset categories span public health surveillance, healthcare cost and utilization, Medicare and Medicaid data, adverse drug events, cancer research, child health, and more. Many of these datasets are openly accessible, while some require data use agreements or registration, especially when involving potentially sensitive or restricted health data.

HealthData.gov also links to specialized repositories and tools tailored for specific research communities, such as cancer epidemiology or child abuse prevention.

HealthData.gov serves as a central hub for both downloadable datasets and API access, empowering users across government, academia, journalism, and the private sector.

Featured Data Resources

Data.CDC.gov – Over 21 categories of public health data including tobacco use, injury, maternal health, and COVID-19 surveillance.
OpenFDA – API and downloadable datasets from FDA reporting systems like drug adverse events and device recalls.
Data.CMS.gov – Public use files and dashboards from Medicare & Medicaid, including provider payment data, quality metrics, and utilization statistics.
NIH Data Repositories – Domain-specific biomedical datasets from NIH-funded studies, including genomic, clinical, and behavioral data.
HCUP (Healthcare Cost and Utilization Project) – Federally supported hospital care datasets used widely in health economics and policy research.
CDC WONDER – An online tool for customized analysis of CDC’s public health datasets, often used in epidemiologic research.
HRSA Data Warehouse – Maps, dashboards, and query tools that explore healthcare access and outcomes in underserved communities.

Explore more datasets at: https://healthdata.gov

VAERS Data

The Vaccine Adverse Event Reporting System (VAERS), co-managed by the CDC and FDA, is a national early warning system for identifying potential safety issues with vaccines. The platform now offers expanded public access to both primary and secondary adverse event reports through downloadable CSV files and the CDC WONDER database.

VAERS data spans from 1990 to the present, with separate downloadable files for each year and for non-domestic reports. Users can access data on vaccine types, reported symptoms, patient demographics, and outcomes, all while maintaining patient privacy.

Recent updates have enhanced dataset completeness by including multiple reports per vaccine-event case, helping researchers track follow-up submissions and data corrections.

Dataset Categories

VAERS Data (Core Dataset) – Contains case-level information such as age, sex, vaccination date, onset of symptoms, outcomes, and administrative details.
VAERS Symptoms – Lists up to five symptoms per row, coded with MedDRA terms; a single report can span multiple rows.
VAERS Vaccine – Includes vaccine product names, manufacturers, doses, and lot numbers associated with each report.
Non-Domestic Reports – International cases submitted to VAERS, with some fields redacted to comply with European regulations.
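
Because each year's download is split across these files, most analyses start by joining them on the report identifier. The sketch below does that with tiny inline samples. The column names (VAERS_ID, AGE_YRS, SEX, VAX_TYPE, VAX_MANU, SYMPTOM1 through SYMPTOM5) are assumed to match the published layout, and the symptoms file is assumed to allow more than one row per VAERS_ID, so check both against the VAERS data use guide before relying on this.

```python
import csv
import io

# Illustrative rows only; real files are far wider. Column names are assumed
# to follow the published VAERS layout.
DATA = "VAERS_ID,AGE_YRS,SEX\n1001,34,F\n1002,61,M\n"
VAX = "VAERS_ID,VAX_TYPE,VAX_MANU\n1001,FLU3,MAKER A\n1002,COVID19,MAKER B\n"
SYMPTOMS = (
    "VAERS_ID,SYMPTOM1,SYMPTOM2,SYMPTOM3,SYMPTOM4,SYMPTOM5\n"
    "1001,Headache,Fatigue,,,\n"
    "1002,Pyrexia,Chills,Myalgia,Nausea,Dizziness\n"
    "1002,Arthralgia,,,,\n"
)


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def join_reports(data_text, vax_text, symptoms_text):
    """Attach vaccine types and symptom terms to each core report by VAERS_ID."""
    reports = {
        row["VAERS_ID"]: {**row, "vaccines": [], "symptoms": []}
        for row in read_rows(data_text)
    }
    for vax in read_rows(vax_text):
        if vax["VAERS_ID"] in reports:
            reports[vax["VAERS_ID"]]["vaccines"].append(vax["VAX_TYPE"])
    for sym in read_rows(symptoms_text):
        if sym["VAERS_ID"] in reports:
            # Keep only the non-empty symptom slots from this row.
            terms = [sym[f"SYMPTOM{i}"] for i in range(1, 6) if sym[f"SYMPTOM{i}"]]
            reports[sym["VAERS_ID"]]["symptoms"].extend(terms)
    return reports


reports = join_reports(DATA, VAX, SYMPTOMS)
```

Keeping vaccines and symptoms as lists avoids the row duplication that a flat join would produce when one report has several vaccines or symptom rows.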

Access the full dataset archive at: vaers.hhs.gov

Mount Sinai Health Data Resources

Mount Sinai’s Levy Library offers an extensive and sophisticated suite of databases for healthcare research, medical education, clinical decision-making, and data analysis. These resources support clinicians, students, and researchers across various disciplines within medicine and biomedical sciences.

While many databases are open access or freely available with institutional credentials, some are restricted to Mount Sinai faculty, staff, and trainees, offering advanced tools and proprietary datasets not accessible to the general public.

The data and tools available span clinical trials, public health, pharmacology, genetics, epidemiology, consumer health, healthcare administration, and patient outcomes. Key resources include ClinicalTrials.gov, GHDx (Global Health Data Exchange), Health Services Research Queries, and proprietary clinical platforms like AccessMedicine and ClinicalKey.

Additional tools aid with systematic reviews, genomic analysis, point-of-care decisions, and medical education through immersive video tutorials, simulations, and evidence-based guidelines.

Featured Datasets & Resources

ClinicalTrials.gov – A public registry of clinical trials covering a wide range of conditions, treatments, and geographic regions.
Global Health Data Exchange (GHDx) – A catalog of health and demographic datasets globally, including surveys, indicators, and administrative health data.
Covidence – A restricted-access platform for conducting and managing systematic reviews and meta-analyses efficiently.
CINAHL Complete & HAPI – Comprehensive databases of healthcare and psychosocial instruments, measurements, and research articles for allied health professionals.
AccessMedicine & Specialty Platforms – Robust collections of searchable medical textbooks, case files, images, videos, and board prep materials across specialties like neurology, OB/GYN, surgery, and pediatrics.
Epistemonikos – Open-access database for systematic reviews and scientific evidence relevant to healthcare decision-making.
Dynamic Health & DynaMed – Point-of-care tools that offer clinical summaries, evidence-based interventions, and continuing education modules.
Health Technology Assessment Database – International evaluations of healthcare interventions and technologies, freely accessible via INAHTA.
Dietary Supplement Label Database (DSLD) – NIH’s searchable collection of information from dietary supplement product labels.
Infoshare Online – Aggregated demographic and health statistics for New York State and NYC, including hospitalizations, births, deaths, and socioeconomics.
Human Gene Mutation Database (HGMD) – Includes both a restricted academic-access version and a public version for disease-related genetic mutation data.

Explore more datasets at: https://libguides.mssm.edu

Health Data NY

Health Data NY is a public data portal operated by the New York State Department of Health, providing downloadable datasets that support transparency, public health research, and data-driven decision-making. The platform offers access to public use files (PUFs) across multiple health domains, with an emphasis on de-identified datasets that can be freely explored and analyzed by researchers, journalists, policy analysts, and the general public.

The datasets are organized into major categories, including insurance claims, hospital discharges, COVID-19, and other health-related administrative data.

Many of these datasets are refreshed regularly and are designed to support secondary analysis of trends in healthcare access, cost, outcomes, and public health metrics across New York State.

Featured Datasets for Projects:

All-Payer Claims Data PUFs – Aggregated and de-identified data from commercial, Medicaid, and Medicare payers for examining healthcare utilization, expenditures, and service patterns.
Hospital Discharge Data PUFs – Detailed information on inpatient, outpatient, and emergency room visits to New York hospitals.
COVID-19 Data PUFs – Datasets capturing testing, hospitalization, vaccination, and death metrics across New York during the pandemic.
All Department Data PUFs – A consolidated catalog that spans various DOH program areas including environmental health, long-term care, substance use, and more.

Explore more datasets at: https://health.data.ny.gov

Florida Department of Health Data

The Florida Department of Health (FDOH) provides a wide range of health datasets and statistical resources covering population health, disease surveillance, environmental hazards, emergency care, and prescription drug monitoring.

FDOH’s data offerings include public dashboards, survey results, and raw data files, with many tools hosted on FLHealthCharts (the department’s central data portal). It supports data exploration by county, region, or statewide metrics, empowering users to conduct detailed geographic and demographic analysis.

Dataset Categories and Programs

FLHealthCharts – Florida’s central hub for health indicators, visualizations, and downloadable datasets on topics such as birth outcomes, chronic disease, and preventive care.
Vital Statistics – Annual data and research files on births, deaths, marriages, and divorces.
BRFSS (Behavioral Risk Factor Surveillance System) – State-level survey tracking behavioral health risks, chronic conditions, and access to care.
Florida Injury Surveillance System – Emergency department and hospitalization data related to injuries, including falls, violence, and motor vehicle crashes.
Pregnancy Risk Assessment Monitoring System (PRAMS) – Monitors maternal attitudes and experiences before, during, and shortly after pregnancy.
Trauma Registry – Records data from trauma centers on patient outcomes, injury mechanisms, and care interventions.
FL-DOSE – Real-time surveillance on drug overdose events and trends.
E-FORCSE – The state’s prescription drug monitoring program, providing data to combat opioid misuse.
Environmental Public Health Tracking – Data on air quality, water safety, and other environmental risk factors.
Immunization Coverage Survey – Reports on vaccination rates across different age groups and regions.

Explore more datasets at: floridahealth.gov/statistics-and-data or access dashboards via FLHealthCharts.com.

Climate, Environment & Earth Science Data Sets

Explore environmental, weather, and climate datasets from top institutions.

NASA Earthdata Search

NASA Earthdata Search is a gateway to over 10,000 open climate, Earth science, and environmental monitoring datasets. The platform, managed by NASA’s EOSDIS (Earth Observing System Data and Information System), offers global-scale satellite data on weather, clouds, temperature, land cover, ocean conditions, and atmospheric composition.

Data is available through NASA’s Distributed Active Archive Centers (DAACs) and supports scientific research in environmental modeling, remote sensing, and long-term climate analysis.

Many datasets are hosted in the Earthdata Cloud and can be customized by spatial or temporal parameters before download.

Featured Datasets for Projects:

SENTINEL-1A_SLC / SENTINEL-1B_SLC – Synthetic Aperture Radar data in slant-range format used for land surface monitoring, flood mapping, and infrastructure analysis.
Aqua AIRS-MODIS Matchup Indexes (Aqua_AIRS_MODIS1km_IND) – Provides 1-km atmospheric observations for studies on water vapor, cloud properties, and temperature trends.
AIRS-CloudSat Collocation Indexes (AIRS_CPR_IND) – Aligns observations from AIRS, AMSU, and CloudSat for multi-instrument atmospheric profiling and scene classification.
AIRS-CloudSat Cloud Classification Matchups (AIRS_CPR_MAT) – Merged data including radar reflectivity, radiance, cloud classification, and elevation for climate data records.
NOAA Global Historical Climatology Network – Over 2.5 billion rows of long-term weather observations, useful for studying climate variability and trends.

Explore more interesting data sets at: earthdata.nasa.gov

NOAA National Centers for Environmental Information (NCEI)

The NCEI is the United States’ central hub for official climate and environmental data, housing one of the world’s largest archives of atmospheric, oceanic, and geophysical information.

Through the Climate Data Online (CDO) portal, users can search, map, and download thousands of high-quality datasets for free—ideal for scientific research, forecasting, and policy development. Datasets cover historical and near real-time records across land, ocean, and atmosphere, with varying time resolutions (hourly, daily, monthly, seasonal).

NCEI also provides climate normals, radar data, and legacy climate summaries, accessible via search tools, FTP downloads, or mapping interfaces.

Featured Datasets for Projects:

Daily Summaries (GHCNd) – Sub-daily to daily observations (temperature, precipitation, snow depth) from thousands of global land stations.
Global Marine Data – Ship and buoy data spanning from 1662 to present, covering weather, sea surface temperatures, cloud cover, and wave activity.
Local Climatological Data (LCD) – Detailed monthly summaries including hourly observations from ~1,000 U.S. weather stations since 2005.
Global Summary of the Day/Month/Year (GSOD/GSOM/GSOY) – Aggregated weather data computed from hourly readings, available from 1763 onward.
Weather Radar (Level II & III) – Dual-polarization NEXRAD data for real-time and retrospective radar reflectivity and atmospheric tracking.
Climate Normals – Annual, monthly, daily, and hourly climatological averages for thousands of U.S. locations (e.g., 1991–2020 baseline).
National Solar Radiation Database (NSRDB) – Solar energy and irradiance data modeled across 1,400+ U.S. sites, useful for renewable energy planning.
Precipitation Archives – Historical records at 15-minute and hourly intervals from select U.S. weather stations.
Regional Snowfall Index (RSI) – A historical scale that measures snowstorm societal impact across the U.S. from 1900 onward.
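
To show the general idea behind summaries that are "computed from hourly readings", here is a minimal sketch that rolls hourly temperatures up into daily statistics. The readings are made up, and this is not NCEI's actual processing, which also applies quality control and completeness rules.

```python
from collections import defaultdict
from datetime import datetime

# Made-up hourly readings: (ISO timestamp, air temperature in degrees Celsius).
HOURLY = [
    ("2024-01-01T00:00", -2.0),
    ("2024-01-01T12:00", 4.0),
    ("2024-01-01T18:00", 1.0),
    ("2024-01-02T06:00", -5.0),
    ("2024-01-02T15:00", 3.0),
]


def daily_summary(readings):
    """Group readings by calendar day and compute mean, min, max, and count."""
    by_day = defaultdict(list)
    for stamp, temp in readings:
        by_day[datetime.fromisoformat(stamp).date()].append(temp)
    return {
        day.isoformat(): {
            "mean": sum(temps) / len(temps),
            "min": min(temps),
            "max": max(temps),
            "count": len(temps),
        }
        for day, temps in sorted(by_day.items())
    }


summary = daily_summary(HOURLY)
```

Keeping the observation count alongside each day makes it easy to drop days with too few readings before comparing them.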

Explore more datasets at: ncei.noaa.gov

USDA NASS Datasets

The U.S. Department of Agriculture’s National Agricultural Statistics Service (USDA NASS) provides one of the most comprehensive collections of agricultural data in the world.

Through its Quick Stats Database and Census of Agriculture, NASS offers detailed, customizable datasets covering every state and county in the U.S. Topics range from crop production and livestock inventory to farm economics, environmental data, demographics, and disaster analysis.

Datasets are updated frequently and are available in tabular, visual, and geospatial formats. Users can explore data by commodity, geography, or time period, and download pre-defined tables or run custom queries.

NASS also offers interactive tools like VegScape, Crop-CASMA, and CroplandCROS for real-time monitoring and satellite-based analysis.

Featured Datasets for Projects:

Quick Stats Database – A searchable interface to explore U.S. crop yields, acreage, livestock inventory, economic indicators, and more, with filtering by year, state, and county.
Census of Agriculture – Conducted every five years, this dataset includes historical and current information on farm size, ownership, income, practices, and demographics dating back to 1997.
Crop Condition & Soil Moisture Analytics (Crop-CASMA) – Remotely sensed data on U.S. soil moisture and vegetation health using NASA satellite feeds.
Cropland Data Layer (CDL) – High-resolution geospatial crop-specific land cover data for mapping and analysis.
County-Level Estimates – Detailed breakdowns of production statistics, including livestock counts and yield estimates, available at the county level.
Disaster Analysis Maps – Geospatial assessments of agricultural damage due to natural disasters, updated in near real-time.

Explore more datasets at: www.nass.usda.gov

Housing & Real Estate Datasets

Perfect for real estate investors, urban planners, and housing market analysis.

Zillow Housing Data

Zillow Research provides one of the most comprehensive, publicly available datasets on U.S. housing markets.

Updated monthly, these datasets offer insight into home values, rents, affordability, for-sale inventory, and sales trends across metro areas and nationally. Zillow’s indices and forecasts are built using proprietary models like the Zestimate and reflect millions of real estate transactions and listing engagements.

This resource is invaluable for real estate analysts, economists, data scientists, and housing policy researchers seeking granular, time-series data that is both smoothed and seasonally adjusted.

Featured Datasets for Projects:

Zillow Home Value Index (ZHVI) – Tracks the typical home value across price tiers, home types, and regions. Available as a smoothed, seasonally adjusted time series.
Zillow Home Value Forecast (ZHVF) – Month-, quarter-, and year-ahead forecasts for home values using ZHVI.
Zillow Observed Rent Index (ZORI) – A repeat-rent index estimating typical market rate rent; available for single-family and multifamily homes.
Zillow Observed Renter Demand Index (ZORDI) – Measures engagement with rental listings to estimate regional rental demand.
For-Sale Inventory and Listings – Includes active listings, new listings, pending listings, and median list prices.
Sales Data (Nowcast) – Estimates monthly sales counts, median/mean sale prices, transaction values, and sale-to-list ratios.
Affordability Metrics – Estimates the income needed to afford renting or buying, affordable home prices, and years to save for a down payment.
Market Heat Index – Gauges supply-demand balance in a housing market based on engagement and listing performance.
New Construction – Tracks new construction sales count, median sale prices, and price per square foot.
Time on Market & Price Cuts – Tracks days to pending/closing and price reductions over time.
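
Index files like these are typically wide, with one row per region and one column per month, so a common first step is reshaping them and computing year-over-year change. The sketch below assumes that layout and uses invented region names and values; the real downloads include extra identifier columns that would need to be added to `ID_COLUMNS`.

```python
import csv
import io

# Invented values. The layout (an identifier column followed by one column
# per month-end date) is an assumption about the downloadable CSVs.
SAMPLE = """RegionName,2023-01-31,2023-02-28,2024-01-31,2024-02-29
Metro A,300000,303000,315000,318150
Metro B,200000,,190000,192000
"""
ID_COLUMNS = {"RegionName"}


def to_long(text):
    """Reshape wide rows into {region: {date: value}}, skipping blank cells."""
    series = {}
    for row in csv.DictReader(io.StringIO(text)):
        series[row["RegionName"]] = {
            col: float(val)
            for col, val in row.items()
            if col not in ID_COLUMNS and val
        }
    return series


def yoy_change(values, date, prior_date):
    """Year-over-year percent change, or None if either month is missing."""
    if date not in values or prior_date not in values:
        return None
    return (values[date] / values[prior_date] - 1) * 100


series = to_long(SAMPLE)
growth_a = yoy_change(series["Metro A"], "2024-01-31", "2023-01-31")
growth_b = yoy_change(series["Metro B"], "2024-02-29", "2023-02-28")
```

Returning None for gaps, rather than treating blanks as zero, keeps missing months from showing up as dramatic price swings.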

Explore more datasets at: zillow.com/research/data

Realtor.com Housing Data

Realtor.com offers one of the most granular residential real estate data sources in the U.S., pulling directly from the nation’s largest MLS-backed listing database.

Their Economic Research Data Library provides extensive historical and current housing metrics across national, state, metro, county, and ZIP code levels. The data is curated for comparability and reliability, although volatility may occur in smaller or less-complete markets.

This dataset is ideal for housing market analysts, real estate professionals, policymakers, and researchers looking for timely, ZIP-level market activity, pricing, inventory trends, and consumer demand insights.

Featured Datasets for Projects:

Active Listing Count – Monthly snapshot of the number of properties actively listed for sale (excluding pending sales).
Median & Average Listing Prices – Tracks both median and average listing prices at various geographic levels, with M/M and Y/Y changes.
Days on Market (Median DOM) – Measures how long homes stay on the market before selling or being removed.
Hotness Index & Rank – Proprietary “Hotness Score” combining supply and demand metrics to rank ZIPs, counties, and metros.
New & Pending Listings – Weekly-to-monthly metrics showing fresh inventory and contract activity.
Price Adjustments – Counts of listings with price decreases or increases during a month.
Listing Demand Metrics – Tracks listing page views per property, including relative measures vs. national average.
Median Price per Square Foot – Helps normalize price comparisons across regions and home sizes.
Total Listings – Combined count of active and pending listings, useful for gauging market volume.
Pending Ratio – Ratio of pending to active listings, indicating buyer competition intensity.
Supply Score – Ranks areas by speed of market turnover, based on median DOM.
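
Two of these metrics are simple enough to recompute yourself from the listing counts, which is a handy way to validate a download. The sketch below follows the definitions given above (pending divided by active, and active plus pending); the ZIP labels and counts are hypothetical.

```python
def pending_ratio(pending, active):
    """Pending listings divided by active listings; None when nothing is active."""
    if active == 0:
        return None
    return pending / active


def total_listings(pending, active):
    """Combined count of active and pending listings."""
    return pending + active


# Hypothetical ZIP-level counts for one month.
MARKETS = {
    "ZIP A": {"active": 120, "pending": 90},
    "ZIP B": {"active": 200, "pending": 50},
}


def ratio_or_zero(counts):
    """Sort key that treats an undefined ratio as zero."""
    ratio = pending_ratio(counts["pending"], counts["active"])
    return ratio if ratio is not None else 0.0


# Rank markets from most to least competitive by pending ratio.
ranked = sorted(MARKETS, key=lambda name: ratio_or_zero(MARKETS[name]), reverse=True)
```

A higher ratio suggests more homes under contract relative to what is still available, which is why it reads as a signal of buyer competition.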

Explore more datasets at: realtor.com/research/data

Population & Demographics Datasets

Government-collected census and education data to power insights into population trends.

U.S. Census Bureau Data

The U.S. Census Bureau is the nation’s authoritative source for population and demographic data, offering massive datasets across geographies ranging from national to neighborhood (block group) level.

Its surveys—such as the Decennial Census and American Community Survey (ACS)—serve as the foundation for nearly all public and private sector population studies, policymaking, urban planning, and academic research in the United States.

Users can access data through tables, maps, charts, and downloadable formats, filtered by geography, time, and topic.

Featured Datasets for Projects:

Decennial Census (Total Population) – Official count of all people living in the U.S., collected every 10 years, providing foundational population totals.
American Community Survey (ACS) Age and Sex – Annual estimates of age and gender distributions at detailed geographic levels.
Demographic and Housing Estimates – A summary table offering key demographic variables such as race, ethnicity, housing tenure, and more.
Income in the Past 12 Months – Inflation-adjusted income distributions and medians across household types.
Poverty Status – Estimates of individuals and families living below the federal poverty line, segmented by age, race, and household type.
Educational Attainment – Tracks the highest level of education completed among U.S. residents aged 25 and older.
Economic Characteristics – Provides employment status, occupation, industry, and commuting behavior.
Social Characteristics – Includes marital status, language spoken at home, disability status, and nativity.
Housing Characteristics – Covers homeownership rates, housing age, cost burdens, and housing types.

Explore more datasets at: data.census.gov

NCES (National Center for Education Statistics) Data

The National Center for Education Statistics (NCES), a division of the U.S. Department of Education, is the primary federal entity for collecting and analyzing data related to education in the United States.

It provides extensive datasets across early childhood, K–12, and postsecondary education levels, supporting education policy, research, and program development.

Users can access raw microdata, pre-built tables, online dashboards, and analysis tools through the NCES DataLab, which allows for custom data queries using PowerStats and the Online Codebook.

Featured Datasets for Projects:

Integrated Postsecondary Education Data System (IPEDS) – A comprehensive dataset covering institutional characteristics, enrollment, graduation rates, faculty, finances, and student aid across U.S. colleges and universities.
National Assessment of Educational Progress (NAEP) – Known as the “Nation’s Report Card,” NAEP tracks academic performance in key subjects across states and demographic groups.
National Postsecondary Student Aid Study (NPSAS) – In-depth data on how students pay for college, including grants, loans, and work-study, for both undergraduate and graduate levels.
Baccalaureate and Beyond (B&B) – Tracks college graduates’ education and work experiences over time.
Early Childhood Longitudinal Studies (ECLS) – Follows young children from preschool through elementary years to assess early learning environments and development.
National Household Education Survey (NHES) – Collects data on early childhood participation, adult education, homeschooling, and parental involvement in K–12 education.
National Teacher and Principal Survey (NTPS) – Offers detailed insights into teacher and administrator demographics, experiences, working conditions, and school environments.
High School Longitudinal Study (HSLS) – Examines high school students’ trajectories into postsecondary education and careers in science and technology.
School Survey on Crime and Safety (SSOCS) – Focuses on school climate and incidents of crime and discipline in public schools.

Explore more datasets and tools at: nces.ed.gov

Bureau of Justice Statistics (BJS) Summary

The U.S. Department of Justice’s Bureau of Justice Statistics (BJS) offers a rich collection of criminal justice datasets that cover law enforcement, corrections, victimization, and judicial processes.

BJS manages a wide array of statistical programs, including national surveys, incident-based crime reporting, correctional institution metrics, recidivism rates, and firearm background checks.

Many of the datasets are updated annually and are available through partnerships like the National Archive of Criminal Justice Data (NACJD), offering downloadable files suitable for in-depth quantitative research.

Featured Datasets for Projects:

National Survey of Crime and Safety (NSCS) – A household-level dataset focused on public perceptions of safety, police interactions, and victimization trends.
National Corrections Reporting Program (NCRP) – Tracks offender-level data on prison admissions, releases, parole entries, and exits across participating states.
National Incident-Based Reporting System (NIBRS) – Detailed, incident-level data on crimes reported to law enforcement, including victim, offender, and circumstance variables.
Mortality in Correctional Institutions (MCI) – Tracks inmate deaths in custody with demographics, cause of death, and facility-level information.
Federal Justice Statistics Program (FJSP) – Provides offender tracking across the entire federal criminal process, including prosecution, sentencing, and corrections.
National Crime Victimization Survey (NCVS) – Measures crime both reported and not reported to police, along with victim demographics, giving a broader view of public safety beyond police records.
National Instant Criminal Background Check System (NICS) – Aggregated data on background check results for firearm purchases across the U.S.
Prison Rape Statistics Program – Annual reviews of prison sexual violence in compliance with the Prison Rape Elimination Act (PREA).

Explore all datasets and tools at: bjs.ojp.gov or via the National Archive of Criminal Justice Data (NACJD).

FBI Crime Data Explorer (CDE) Summary

The FBI’s Crime Data Explorer (CDE) is a public portal designed to increase transparency and accessibility of law enforcement data in the U.S. It compiles crime statistics submitted by thousands of agencies nationwide, offering datasets on violent crime, property crime, hate crimes, arrests, and law enforcement employment.

CDE data is sourced from the Uniform Crime Reporting (UCR) program and the more detailed National Incident-Based Reporting System (NIBRS), making it especially valuable for time-series analyses, policy evaluation, and geographic crime comparisons.

All data is available in .CSV format and can be explored through interactive visualizations or downloaded in bulk.

Featured Datasets for Projects:

NIBRS Estimations – Modeled national-level estimates based on detailed incident-level data, including victim, offender, and circumstance details.
Hate Crime Reports – Annual statistics on incidents and motivations, categorized by bias type, location, and offender demographics.
Expanded Homicide Data – Includes data on victim/offender relationships, weapons used, and circumstances.
Property & Violent Crime Trends – Year-over-year national and state-level statistics on offenses such as burglary, robbery, assault, and motor vehicle theft.
Law Enforcement Employment Data – Tracks staffing levels across agencies, useful for workforce planning and public policy analysis.
Arrest Data – Breakdowns by offense type, age, gender, and race, available by state and local agency.

Explore more datasets at: fbi.gov/cde

Internet & Web Analytics Data

Track online behavior, web technology trends, and site-level performance.

BuiltWith Datasets

BuiltWith offers a massive collection of structured datasets that track the technologies used across millions of websites worldwide. These datasets are ideal for professionals in competitive intelligence, digital marketing, sales, investment research, and web analytics.

Dataset categories include analytics tools, CMS platforms, eCommerce providers, hosting services, advertising networks, and many more.

Each entry includes detailed metadata like first/last detection dates, monthly tech spend, sales estimates, social profiles, and traffic rank. Datasets are available by country, domain extension, tech category, or public company.

Featured Datasets for Projects:

Entire Internet Tech Stack – A full listing of technologies used by websites globally, including timestamps, category, and usage history.
Premium Spend Dataset – A breakdown of websites by estimated monthly tech spend, from $0 to $10,000+.
Traffic-Ranked Websites – Dataset of websites ranked by Tranco and PageRank, covering up to 1 million domains.
Shopify Plus Usage – A dataset of high-volume eCommerce sites using Shopify Plus, including tech spend and sales estimates.
Web Hosting Providers by Country – Analyze which providers dominate hosting in markets like the U.S., UK, India, and Australia.

Explore more datasets at: builtwith.com/datasets

ORCAS Dataset

The ORCAS dataset (Open Resource for Click Analysis in Search) is a large-scale, click-based dataset released as part of the TREC Deep Learning Track.

Designed for academic research, ORCAS connects over 10 million real user queries to 1.4 million TREC documents, totaling nearly 19 million query-document click pairs. The dataset enables advanced research in information retrieval, web mining, search ranking, and query understanding.

ORCAS is far larger than standard TREC datasets: it offers 28 times more queries, 49 times more query-document pairs, and broader document coverage.

It’s particularly useful for training and evaluating search relevance models, studying query clustering, autocomplete systems, and mining synonym relationships from real search behavior.

Featured Files

orcas.tsv.gz – Main dataset file with 18.8M records connecting query IDs, queries, document IDs, and URLs.
orcas-doctrain-queries.tsv.gz – Over 10M unique queries used in the dataset.
orcas-doctrain-qrels.tsv.gz – TREC-style relevance judgments used for supervised learning.
orcas-doctrain-top100.gz – Top 100 document candidates per query, suitable for reranking models.
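
If you want a feel for the click-pair layout before downloading anything, here is a minimal sketch that parses tab-separated rows in the four-column order described above (query ID, query, document ID, URL). The sample rows are made up, and the assumption that the file has no header row should be checked against the real download.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed four-column layout:
# query ID, query text, document ID, URL. Real IDs and URLs will differ.
SAMPLE = (
    "101\thow to bake bread\tD1\thttps://example.com/bread\n"
    "102\tbread recipe\tD1\thttps://example.com/bread\n"
    "103\tpython csv\tD2\thttps://example.com/csv\n"
)


def read_click_pairs(handle):
    """Yield one dict per query-document click pair from a TSV stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for qid, query, doc_id, url in reader:
        yield {"qid": qid, "query": query, "doc_id": doc_id, "url": url}


def clicks_per_document(pairs):
    """Count how many click pairs point at each document ID."""
    return Counter(pair["doc_id"] for pair in pairs)


pairs = list(read_click_pairs(io.StringIO(SAMPLE)))
doc_counts = clicks_per_document(pairs)
```

For the real file, you could pass gzip.open(path, "rt", encoding="utf-8") to read_click_pairs instead of the in-memory sample, which streams rows without unpacking the archive first.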

All files are provided in TSV or TREC-compatible formats and are available for non-commercial research only. Dataset access is governed by Microsoft’s research use terms.

Explore more datasets at: microsoft.github.io/msmarco

Ookla Open Data

Ookla, the company behind Speedtest, provides open datasets through its Ookla for Good initiative to support research, policy, infrastructure planning, and public service projects around internet connectivity. These datasets are powered by billions of real-world network performance tests conducted via the Speedtest app.

The datasets are especially valuable for geospatial analysis, broadband coverage mapping, telecom benchmarking, and public policy development.

They are available through the AWS Registry of Open Data and Ookla’s own website.

Featured Datasets for Projects:

Global Fixed & Mobile Network Performance Maps – Crowdsourced speed and latency data from GPS-verified Speedtest results.
Speedtest Global Index – A monthly ranking of internet speeds by country and region, based on aggregated Speedtest Intelligence data.
Interactive Coverage Map – A web-based tool to visually explore Ookla’s dataset at the tile level using filters for speed ranges and network types.

Explore more datasets at: speedtest.net or registry.opendata.aws

IEEE DataPort RF Signal Dataset

IEEE DataPort hosts a real-world wireless communication dataset featuring RF signals from three major technologies: Wi-Fi (IEEE 802.11ax), LTE, and 5G-NR.

Captured under diverse channel and modulation conditions, the dataset serves as a valuable resource for developers and researchers working on machine learning, wireless communication optimization, and signal processing.

The files are stored in .data and .json formats (zipped), with reading scripts provided in Python.

This dataset is designed to support research in areas like signal classification, interference detection, and spectrum analysis. However, access to the dataset may require an IEEE DataPort subscription or academic credentials.

Featured Dataset

Real-World RF Signals (Wi-Fi, LTE, 5G) – A 784 MB dataset with labeled raw RF signals ideal for machine learning and spectrum analysis. Includes varied data rates and channel conditions. Useful for IoT device fingerprinting, 5G signal classification, spectrum intelligence models, and deep learning on wireless signals.

Explore more datasets at: ieee-dataport.org/datasets

Social Media & Digital Platform Datasets

Use these datasets to analyze trends, sentiment, virality, or content dynamics.

YouTube-8M Dataset

YouTube-8M is a massive, open-source video dataset developed by Google Research to support large-scale video understanding and machine learning.

It includes millions of YouTube video IDs with associated high-quality, machine-generated labels across thousands of visual categories. Precomputed visual and audio features make it easy to train models on standard hardware.

The dataset supports research in video classification, representation learning, temporal localization, and multi-label prediction. With billions of frame-level features and a wide array of topics, it’s designed to mirror the diversity of real-world video content on YouTube.

The main dataset has been expanded with the YouTube-8M Segments Dataset, which includes human-verified, time-localized labels on 237K segments, enhancing its value for temporal modeling tasks.

Featured Datasets for Projects:

YouTube-8M Core Dataset – 6.1 million video IDs annotated with over 3,800 entities. Includes 2.6 billion precomputed audio-visual features and 350,000 hours of video.
YouTube-8M Segments – A segment-level dataset with 237,000 human-verified labeled segments across 1,000 classes. Useful for fine-grained temporal localization.
MediaPipe Feature Extractor – A lightweight tool for extracting visual and audio features, enabling custom processing from new videos.
Vocabulary CSV – A metadata file listing class IDs, names, and their knowledge graph links for semantic context and label mapping.

Explore more datasets at: research.google.com/youtube8m

Wikimedia Datasets

Wikimedia offers a wide variety of open datasets derived from Wikipedia, Wikidata, and other Wikimedia projects. These resources support research in natural language processing, knowledge graph construction, web traffic analysis, and more.

The datasets range from full Wikipedia database dumps and traffic statistics to taxonomic infoboxes and edit histories. A large ecosystem of tools also exists for parsing, analyzing, and visualizing this data—making it useful for developers, academics, and data scientists alike.

The data is available in multiple formats, including XML, JSON, RDF, and CSV, and covers both structured metadata and raw article content.

Featured Datasets for Projects:

Wikipedia Database Dumps – Complete dumps of Wikipedia articles and metadata across languages, updated regularly. Useful for offline analysis or custom NLP tasks.
Wikipedia Page Traffic Statistics – Historical and current article view data, ideal for web analytics, trend detection, and attention modeling.
DBpedia – A structured RDF dataset extracted from Wikipedia infoboxes and link structures, widely used in semantic web and knowledge graph applications.
Cultural Context Content (CCC) – A dataset identifying culturally relevant articles in each Wikipedia language edition, useful for diversity and multilingual research.
Wikipedia Edit History – Full revision history of pages up to 2008, allowing time-series and contributor behavior analysis.
Commons Datasets – Shared table and map data used within Wikipedia, accessible via Lua and graphs.

Explore more datasets at: meta.wikimedia.org/wiki/Datasets

Meta Research Tools & Datasets

Meta’s Transparency Center provides a suite of research tools and datasets designed to support independent study of the political, economic, and social impacts of Facebook and Instagram. These resources include APIs, data archives, and interactive maps to help researchers analyze content trends, ad spending, misinformation, civic engagement, and more.

The tools focus on transparency, enabling insight into how content and advertising perform across Meta platforms and how policies are enforced globally.

Featured Datasets & Tools

Meta Content Library & API – Provides full access to the archive of public Facebook and Instagram content. Researchers can search and retrieve content metadata, including post type, engagement metrics, and timestamps.
Meta Ad Library Tools – Includes the Ad Library Report, Ad Library API, and Ad Targeting Dataset. These allow users to track political ad spending, examine ad content, and analyze demographic targeting data.
URL Sharing Dataset – Offers data on how URLs are shared across Facebook, useful for studying virality, misinformation, or media ecosystem dynamics.
2020 US Election Study – A dataset created in collaboration with academics to analyze the role of Meta platforms in the 2020 U.S. election cycle.
Data for Good – A set of maps and survey data designed for public interest research in areas like health, crisis response, and mobility (e.g., COVID-19 trends, social connectedness index, displacement maps).

Explore more datasets at: transparency.meta.com/researchtools

Meta Ad Library

The Meta Ad Library is part of the Meta Research Tools & Datasets, but it's valuable enough to deserve its own section.

It’s a searchable archive of all ads currently running across Meta platforms (Facebook, Instagram, Messenger) and select historical political or issue-based ads. It provides transparency into ad content, sponsorship, and spend, helping researchers, journalists, marketers, and the public analyze advertising trends and behaviors.

The library covers both commercial and political advertising, with special reports and API tools available to enable deep exploration of ad activity by geography, advertiser, audience targeting, and more.

Data is updated regularly and accessible without a Facebook account.

Featured Tools & Datasets

Active Ad Listings – Search ads currently running on Meta platforms using filters such as keyword, country, and ad category (e.g., politics, branded content, or social issues).
Political & Issue Ad Archive – Includes all political or issue-based ads from the past 7 years with transparency around funding entity, spend, reach, and targeting.
Ad Library Report – Downloadable reports summarizing political ad spending by advertiser, region, and time frame. Ideal for campaign analysis and watchdog efforts.
Ad Library API – Programmatic access to ad metadata including estimated reach, country, platform, language, and targeting insights. Designed for developers and researchers.
Branded Content Search – Discover influencer marketing posts across Facebook and Instagram, including stories, reels, and video content involving paid brand partnerships.

Explore more datasets at: facebook.com/ads/library

X (formerly Twitter) Academic Research

X offers specialized tools and data access for academic researchers interested in analyzing public conversations on the platform. The Academic Research program supports a wide range of disciplines from social science to computer science by providing curated datasets, robust APIs, and community support tailored for scholarly work.

Researchers can leverage historical and real-time data to study topics such as misinformation, elections, health crises, public opinion, and more.

Featured Tools & Datasets

Academic Research Access to X API – Offers enhanced API capabilities for researchers, including full-archive search, higher tweet cap limits, and access to enriched metadata for academic projects.
Curated Datasets – Free, ready-to-use datasets on major topics (e.g., COVID-19, elections, climate) containing all public posts relevant to each theme. No coding required to access or use.
Developer Documentation & Tutorials – In-depth guides, sample code, and use cases to help academics extract, process, and analyze data using the X API.
Community Forum – A dedicated academic research forum where users can share insights, seek technical support, and collaborate with fellow researchers.
Research Blog – Highlights how other scholars have used X data in studies, including insights from projects on responsible machine learning and global crises.

Explore more datasets and tools at: developer.x.com/en/solutions/academic-research

Yelp Open Dataset

The Yelp Open Dataset is a well-known public dataset released by Yelp for educational and research purposes. It offers real-world data on local businesses, user reviews, check-ins, and photos—making it a valuable resource for projects in natural language processing, recommendation systems, sentiment analysis, and geographic data modeling.

The dataset is especially popular among students and researchers building models related to local search, social graphs, user behavior, and customer feedback analysis.

The data spans 11 U.S. metropolitan areas and includes business attributes like hours, categories, amenities (e.g., parking or Wi-Fi), as well as over 6 million user reviews.

Featured Datasets for Projects:

business.json – Metadata on 150,000+ businesses including name, location, hours, categories, and attributes.
review.json – 6.9M reviews with star ratings, timestamps, text, user ID, and business ID.
user.json – Information on 1.5M users including number of reviews, average stars, friends, and elite status.
checkin.json – Aggregated check-in data over time for each business.
tip.json – Short user tips for businesses.
photos.json – 200K user-uploaded photos with labels and links to businesses.
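
A common first project with these files is an average star rating per business. Here is a minimal sketch, assuming each file stores one JSON object per line and that reviews carry business_id and stars fields; both are assumptions to verify against the dataset's own documentation, and the sample records are made up.

```python
import io
import json
from collections import defaultdict

# Made-up review records. The field names (business_id, stars, text) are
# assumptions based on the file descriptions above.
SAMPLE = "\n".join([
    json.dumps({"business_id": "b1", "stars": 5, "text": "Great coffee"}),
    json.dumps({"business_id": "b1", "stars": 3, "text": "Slow service"}),
    json.dumps({"business_id": "b2", "stars": 4, "text": "Solid tacos"}),
])


def average_stars(lines):
    """Return {business_id: mean star rating} from line-delimited JSON reviews."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [star sum, review count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {biz: star_sum / count for biz, (star_sum, count) in totals.items()}


averages = average_stars(io.StringIO(SAMPLE))
```

Reading line by line keeps memory use flat, which matters when the review file holds millions of records.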

Explore more cool datasets at: yelp.com/dataset

Sports & Fitness Data Sets

Great for predictive modeling, performance analysis, or just exploring your favorite sport.

SURE Sports Data Sources

The SURE 2022 project offers a curated list of public datasets and R packages across a wide range of sports. This collection is valuable for analysts, data scientists, and researchers looking to explore historical or play-by-play data for various professional and collegiate sports.

The list includes official APIs, scraped datasets, and open repositories for traditional sports like baseball, football, and basketball, as well as emerging categories like esports and women’s sports leagues.

Many resources also include R packages for easier analysis and visualization.

Featured Datasets and Tools by Sport

Baseball – Historical player and team data from 1871 to the present; API access to FanGraphs, Baseball-Reference, Statcast, and PITCHf/x data; and Retrosheet and Statcast tools for deeper analysis.
American Football – Play-by-play data from 2000 onward, including expected points (EP) and win probability (WP) models; college football data with advanced metrics; and older NFL data from 2009–2019.
Basketball – NBA, NCAA, and WNBA data (live play-by-play, box scores, shot locations), stats pulled from basketball-reference.com, and CSVs with detailed play-level NBA data.
Hockey – NHL and NWHL APIs and scrapers, shot-level and player/team stats since 2007 from MoneyPuck, and tracked data from junior, Olympic, and NCAA hockey games.
Soccer – Women’s league data, English/European match history, event-level data including player positions from StatsBomb, and downloadable player and team-level data for MLS and NWSL.
Other Sports & Tools – deuce (tennis), cricketdata, Cricsheet (cricket), ROpenDota (DOTA2), World Rowing (rowing).
SportyR – R functions to plot fields and courts across multiple sports.

Explore more cool datasets at: SURE 2022 Sports Data Overview

Sports-Statistics.com Datasets

Sports-Statistics.com is a hub for curated, downloadable datasets across a broad range of sports. It’s designed for data scientists, analysts, and enthusiasts interested in modeling, visualization, machine learning, and predictive analytics in sports.

The datasets include both player-level and team-level data across football, soccer, basketball, racing, baseball, hockey, and more. Many files are available in CSV format and span multiple seasons or decades.

Featured Datasets for Projects:

NFL Play-by-Play (2009–2018) – Includes win probabilities, play results, and player data.
College Football Stats – Covers major conferences with kicking, passing, rushing, and scoring data.
NBA Shot Logs (2014–2015) – Shot location, clock time, defender distance, and shooter identity.
WNBA & NCAA Stats – Player/game stats including box scores and play-by-play data.
FIFA Player Dataset (2015–2022) – 19,000+ players and 100+ attributes for machine learning applications.
World Cup History – Match-level data for all World Cups.
Formula 1 Race Results (1950–2017) – Driver, constructor, lap time, and pit stop data.
MLB Historical Odds & Scores (2010–2020) – Betting lines, final scores, and run totals.
NHL Player Stats (1940–2018) – Yearly offensive stats for every NHL player.
Olympic History (1896–2016) – Data from all modern Olympic Games.
Cricket & SPORTS-1M – Ball-by-ball cricket data and video-labeled dataset for 487 sports classes.
NHL Game Coordinate Data – x,y location tracking for team, player, and puck actions.

Explore more cool datasets at: sports-statistics.com

Ohio State University Sports Data Sets

The Sports and Society Initiative at Ohio State University offers an extensive collection of sports datasets spanning professional, collegiate, and youth sports.

These datasets are curated for academic research and public analysis, covering player stats, play-by-play data, betting odds, injuries, NIL earnings, and participation trends.

Data is organized by sport (football, baseball, basketball, hockey, soccer, MMA, tennis, golf, motorsports) and theme (injuries, salaries, participation, stadiums, academics).

Most public data sets are available in CSV format and drawn from trusted sources like Pro-Football-Reference, Basketball-Reference, Baseball Savant, and GitHub.

Featured Datasets for Projects:

NFL Play-by-Play (2009–2022) – Includes advanced metrics like expected points, QB hits, and air yards.
College Football Data – Historical and predictive stats including EPA/WPA; accessible via APIs and R packages (cfbscrapR).
Baseball Savant & Fangraphs – Rich analytics including xwOBA, barrel%, Statcast metrics, and sabermetrics.
NBA Play-by-Play & Player Stats (1991–2020) – Detailed play logs, combine results, and advanced metrics (USG%, PER).
World Cup & FIFA Data (1930–2022) – Historical tournament statistics, goals, assists, and match results.
MoneyPuck NHL Data (2008–2023) – CSVs include xG, player tracking, playoff odds, and game-by-game detail.
ATP & WTA Match Data (2000–2017) – Tournament-level breakdowns including serve stats, match winners, and surfaces.
UFC Fight Stats & Predictions – Includes betting odds, fighter nicknames, and advanced stat modeling.
PGA Tour Stats (2015–2022) – Hole-level data including strokes, rankings, and fairway percentage.
Sports Injuries & Participation – Injury data by age and sport, high school participation since 1969, and COVID-era effects.
NIL Revenue & Rankings – Breakdown by sport, compensation method, and endorsement value.

Explore more datasets at: Ohio State Sports and Society Initiative

SCORE Sports Data Repository

The SCORE Sports Data Repository is an academic initiative aimed at advancing statistics and data science education through curated sports datasets.

Developed and maintained by the SCORE Network and supported by the National Science Foundation, this resource is especially designed for instructional use, offering real-world datasets tied to pedagogical goals.

Each dataset is accompanied by a sports-related research question, a relevant statistical concept, and sample classroom activities—making it ideal for educators building lesson plans, assignments, or labs.

The repository spans dozens of sports categories including basketball, football, hockey, golf, cricket, esports, and more. Datasets are organized both by sport and by teaching objective (e.g., regression, probability, data wrangling).

Featured Datasets for Projects:

Football – NFL Play-by-Play Data – Explore probability modeling, game outcomes, and strategy using granular play-by-play data.
Basketball – NCAA and NBA Stats – Use player performance and team stats to teach regression, hypothesis testing, or clustering.
Esports – DOTA2 Match Data – Leverage API-accessible game stats to model outcomes or player performance metrics.
Olympics – Performance Trends by Year and Country – Perfect for time-series analysis and data visualization exercises.
Swimming & Track – Split Times & Athlete Comparisons – Great for understanding pacing, variability, and distribution concepts.

Explore more datasets at: scorewithdata.org

Ticket Sales & Event Datasets

Useful for demand forecasting, dynamic pricing, and event trend analysis.

SeatData.io

SeatData.io is a subscription-based analytics platform offering real-time and historical sales data from the secondary ticket market.

Designed for ticket brokers, event professionals, and developers, SeatData provides access to over 600,000 events and more than 50 million ticket sales since 2021. Its data includes concerts, theater, and sports events worldwide. Users can explore trends such as get-in prices, sales velocity, price trajectories, and venue capacities—all crucial for optimizing pricing strategies, buying decisions, and event forecasting.

Advanced analytics and API access are available through SeatData Pro for high-volume users and developers.

Featured Datasets & Tools

Real-Time Event Sales Data – Live updates on 120,000+ active events with insights like median prices, get-in costs, and ticket volume.
Historical Ticket Sales – Detailed event-level history across thousands of shows since early 2021, ideal for performance benchmarking and trend analysis.
Event Analytics Dashboard – Quick visual tools to analyze pricing shifts, active listings, and venue details at a glance.
Developer API Access – High-volume, low-cost data endpoints for integrating SeatData into internal systems or applications.
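
To make these pricing terms concrete, here is a minimal sketch that computes a get-in price, a median price, and a simple sales velocity from made-up numbers. The definitions used (get-in price as the lowest listed price, velocity as average tickets sold per day) are assumptions for illustration, not SeatData's documented formulas.

```python
from statistics import median

# Made-up listing prices and daily sales counts for a single event.
listing_prices = [85.0, 120.0, 99.5, 240.0, 150.0]
sales_by_day = {"2024-05-01": 12, "2024-05-02": 18, "2024-05-03": 30}


def get_in_price(prices):
    """Cheapest currently listed ticket (assumed definition of get-in price)."""
    return min(prices)


def sales_velocity(daily_sales):
    """Average tickets sold per day over the observed days (assumed definition)."""
    return sum(daily_sales.values()) / len(daily_sales)


event_summary = {
    "get_in": get_in_price(listing_prices),
    "median": median(listing_prices),
    "velocity": sales_velocity(sales_by_day),
}
```

Tracking how these three numbers move as an event approaches is one way to frame a demand forecasting or dynamic pricing project.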

Explore more datasets at: seatdata.io

StubHub API

StubHub offers an official API that provides developers with access to the world’s largest ticket marketplace. It’s primarily designed for ticket sellers, resellers, aggregators, and developers building apps in the event and live entertainment space.

All data is served over HTTPS in HAL+JSON (hal+json) format, and every request requires OAuth2 authentication.

The API allows applications to search for and view events, list tickets for sale, or purchase tickets directly through the StubHub platform. The API supports features such as pagination, localization, and sparse fieldsets, making it flexible for a variety of applications and user interfaces.

Key Features & Capabilities
• Event Discovery – Search for events by name, location, category, date, and more.
• Ticket Listings – Retrieve listings for resale, including pricing, seating, and quantity.
• Purchasing & Selling – Programmatically purchase tickets or list them for sale on StubHub.
• Real-Time Updates – Access live inventory data and pricing updates.
• OAuth2 Authentication – Secure access with token-based authorization for all API endpoints.
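
Because responses use the HAL convention, paging through results usually means reading embedded items and following a "next" link. Here is a minimal sketch of that pattern on a made-up response. The "items" key, field names, and example.com URLs are hypothetical and not taken from StubHub's documentation; only the _links and _embedded keys and the Bearer header format come from the general HAL and OAuth2 conventions.

```python
# A made-up HAL-style page. The "items" relation, the fields, and the URLs are
# hypothetical placeholders, not StubHub's real response shape.
page = {
    "_links": {
        "self": {"href": "https://api.example.com/events?page=1"},
        "next": {"href": "https://api.example.com/events?page=2"},
    },
    "_embedded": {
        "items": [
            {"id": 1, "name": "Concert A"},
            {"id": 2, "name": "Game B"},
        ]
    },
}


def embedded_items(resource, rel="items"):
    """Return the embedded resources under a relation name, or an empty list."""
    return resource.get("_embedded", {}).get(rel, [])


def next_href(resource):
    """Return the URL of the next page, or None when there is no next link."""
    link = resource.get("_links", {}).get("next")
    return link["href"] if link else None


def bearer_headers(token):
    """Build request headers carrying an OAuth2 bearer token."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/hal+json",
    }


event_names = [item["name"] for item in embedded_items(page)]
following_page = next_href(page)
```

In a real client you would request following_page with the same headers and repeat until next_href returns None.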

Explore more at: developer.stubhub.com

Food & Hospitality Data Sets

Perfect for analyzing nutrition, menus, or hospitality trends.

FoodData Central

FoodData Central is the U.S. Department of Agriculture’s (USDA) authoritative platform for food composition data.

Managed by the Agricultural Research Service, it provides open-access datasets covering the nutritional makeup of foods commonly consumed in the United States. It integrates multiple data types, including raw analytical data, brand-name food labels, and research-based datasets, making it a valuable resource for dietitians, nutritionists, researchers, and app developers.

FoodData Central updates its datasets on a regular schedule: some components are updated monthly, others twice a year or every two years.

Featured Data Types
• Foundation Foods – Detailed analytical data on minimally processed foods; updated twice per year.
• SR Legacy – Historical data compiled by USDA; final update in 2018.
• FNDDS – Used to analyze foods in NHANES; updated every two years.
• Branded Foods – Nutrition label data from commercial food brands; updated monthly.
• Experimental & Peer-Reviewed Data – Nutrient data from academic publications; updated as available.
• Child Nutrition Database (CNDB) – Food data relevant to USDA’s school and child nutrition programs.
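
If you'd rather pull these data types programmatically than download files, here's a rough Python sketch that builds a search URL limited to specific data types. The endpoint, the parameter names (`query`, `dataType`, `pageSize`, `api_key`), and the shared `DEMO_KEY` are my understanding of the public FoodData Central API, not something covered in this article, so treat them as assumptions and confirm them against the official API guide. The sketch only builds the URL and doesn't send a request.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

# Assumed endpoint; verify against the official FoodData Central API guide.
FDC_SEARCH_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"


def build_search_url(
    query: str,
    data_types: Sequence[str],
    api_key: str = "DEMO_KEY",  # assumed shared demo key; request your own for real use
    page_size: int = 5,
) -> str:
    """Build a food search URL restricted to the given data types."""
    params = {
        "query": query,
        # Assumption: multiple data types are passed as one comma-separated value.
        "dataType": ",".join(data_types),
        "pageSize": page_size,
        "api_key": api_key,
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{FDC_SEARCH_URL}?{urlencode(params, quote_via=quote)}"


url = build_search_url("cheddar cheese", ["Foundation", "SR Legacy"])
print(url)
```

From here, you could pass `url` to any HTTP client and load the JSON response into a DataFrame.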

Explore more datasets at: fdc.nal.usda.gov

University of Tennessee Hospitality & Tourism Datasets

The University of Tennessee’s Libraries guide on Hospitality & Tourism offers a curated list of public and academic datasets for research in tourism, consumer behavior, and travel patterns. It includes government reports, consumer datasets, and international market insights, making it a strong resource for students and analysts focused on hospitality, travel, and related sectors.

These sources provide access to travel statistics, social trends, consumer spending behavior, and international tourism performance.

Many datasets also include analytical tools or visualization options.

Featured Datasets for Projects:
• National Household Travel Survey – Free U.S. government data on commuting, business and leisure travel, demographics, and travel modes.
• International Trade Administration: Travel & Tourism Stats – Inbound and outbound tourism data from the U.S. Department of Commerce.
• ProQuest Statistical Insight – Indexed and abstracted statistical publications from U.S. federal, state, and international sources, with partial full-text access.
• Catalyst (MRI-Simmons) – U.S. consumer data on product usage, travel spending, and media habits.
• Mintel Academic – U.S. market research reports with trends, consumer behavior, and category-specific data.
• Passport by Euromonitor – International market and consumer data across industries with visualization tools.
• ICPSR – Extensive archive of political and social science datasets for academic research.
• Sage Data – Broad repository of curated social science datasets with filtering and export options.
• IMD World Competitiveness Online – Global rankings and economic performance data by country.

Explore more datasets at: libguides.utk.edu/hospitality

Where to Find Datasets: Communities & Forums

Not all great datasets live on government portals or corporate repositories.

Online communities are often the fastest way to find real-world, obscure, or crowdsourced datasets—and they come with the added bonus of discussion, curation, and advice from others who work with data every day.

These platforms are especially useful if you’re looking for something specific or unusual, or if you want to see how others are using publicly available data in creative ways.

Reddit r/datasets

The r/datasets subreddit is a dynamic, user-driven hub for discovering, sharing, and discussing datasets across virtually every domain imaginable. With over 200,000 members, it’s a go-to resource for finding obscure, niche, or cutting-edge datasets that may not be indexed by major repositories.

Posts range from requests for specific data (e.g., sports odds, environmental stats, movie budgets) to links to new releases (e.g., AI benchmark datasets, Reddit comment dumps, and WHO immunization data). Researchers and students often use the subreddit to crowdsource project datasets or advice, making it a useful starting point for exploratory data projects.

Featured Dataset Threads

• Reddit Comments Post-2020 – Community discussions around finding alternatives to Pushshift for archived Reddit comment data post-2020.
• AI Conversational Dataset: Time Waster Retreat – Annotated dataset designed for AI churn prediction in conversational agents.
• D.B. Cooper FBI Files Dataset – Raw textual dataset hosted on Hugging Face for NLP experimentation.
• IMDb Movie Budget Dataset – Ongoing discussion on locating public datasets with movie revenue and budget information.
• Sports Betting Datasets – Multiple threads about finding odds and results data for soccer, UFC, and golf tournaments.
• Beginner-Friendly Dataset List – Community-curated thread listing clean, small, and fun datasets ideal for machine learning learners.

Explore more datasets and discussions at: reddit.com/r/datasets

Wrapping It Up

Whether you’re building your first portfolio project or leveling up your skills, the right dataset can make all the difference. The sources in this list span every industry and format imaginable — from clean, beginner-friendly CSVs to massive, real-world datasets that mimic the messiness of the job.

Don’t wait for the “perfect” dataset to show up. Pick something that sparks your curiosity, start asking questions, and let the project evolve from there. The goal isn’t just to showcase technical skills — it’s to show how you think, solve problems, and tell a compelling story with data.

Bookmark this list, dig in, and start building.
