Collecting Data with R and Python

Outline

  • Modern Data Repositories
  • World Bank Indicators
  • R/Python Data Collection Tools
  • Web Scraping Techniques
  • API Integration
  • COVID-19 Data Sources

Modern Data Repositories

Major Public Data Repositories

  • Data.gov - The U.S. Government’s open data site
    • Over 308,000 datasets available
    • Tools and resources for research, visualization, and application development
    • https://data.gov
  • GitHub Public APIs - Community-curated list of free APIs
  • Kaggle - Platform for data science competitions and datasets
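The Data.gov catalog is served by CKAN, so its datasets can be searched programmatically. The `package_search` endpoint below follows the standard CKAN API (an assumption; check the docs at catalog.data.gov); this sketch only builds the query URL and sends nothing over the network:

```python
from urllib.parse import urlencode

# CKAN-style search endpoint for the Data.gov catalog (assumed; see the
# API documentation at https://catalog.data.gov for current details)
BASE = "https://catalog.data.gov/api/3/action/package_search"

def build_search_url(query, rows=10, start=0):
    """Assemble a package_search URL with free-text query and paging."""
    return f"{BASE}?{urlencode({'q': query, 'rows': rows, 'start': start})}"

url = build_search_url("air quality", rows=5)
print(url)
```

Passing the resulting URL to `requests.get()` would return a JSON envelope with the matching dataset records.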

Government and Institutional Data Sources

World Bank Indicators

World Bank Development Indicators

  • A comprehensive collection of development data available through a web API
  • Covers over 200 economies with 1,400+ time series indicators
  • Categories include:
    • Economic Policy & Debt
    • Education
    • Environment
    • Financial Sector
    • Health
    • Infrastructure
    • Social Development
  • Original Website: https://databank.worldbank.org

Data from the World Bank Website

  • Sign up at https://databank.worldbank.org
  • Search by keywords or indicator name
  • Filter by country, year, and indicator
  • Download data in various formats (CSV, Excel, XML, JSON)
  • API access available for programmatic data retrieval
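The API returns JSON as a two-element array: page metadata first, then the list of observations. The sample below is hand-made in that shape (the numbers are illustrative, not real figures), so the parsing logic can be shown offline:

```python
import json

# Hand-made sample in the shape returned by
# https://api.worldbank.org/v2/country/.../indicator/...?format=json
sample = json.loads("""
[
  {"page": 1, "pages": 1, "per_page": 50, "total": 2},
  [
    {"country": {"value": "United States"}, "date": "2023", "value": 81000.0},
    {"country": {"value": "United States"}, "date": "2022", "value": 77000.0}
  ]
]
""")

metadata, observations = sample
rows = [(obs["country"]["value"], obs["date"], obs["value"])
        for obs in observations]
for country, year, value in rows:
    print(country, year, value)
```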

WDI Data with R

# Install the WDI package if needed
# install.packages("WDI")
library(WDI)

# Search for GDP-related indicators
indicators <- WDI::WDIsearch('gdp')
head(indicators)
#       indicator            name
# [1,] "5.51.01.10.gdp"     "Per capita GDP growth"                       
# [2,] "6.0.GDP_current"    "GDP (current $)"                               
# [3,] "6.0.GDP_growth"     "GDP growth (annual %)" 

# Retrieve GDP per capita data for selected countries
gdp_data <- WDI(indicator='NY.GDP.PCAP.KD', 
                country=c('MX','CA','US','DE','JP'), 
                start=2010, end=2023)

# View the first few rows
head(gdp_data)

WDI Data with Python

# Install the wbdata module if needed
# pip install wbdata pandas matplotlib

import wbdata
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# Check available data sources
# (wbdata >= 0.3 renamed get_source() to get_sources() and the data_date
# argument to date; adjust if you are on an older version)
sources = wbdata.get_sources()
print(f"Number of data sources: {len(sources)}")

# List indicators from the World Development Indicators (source ID 2)
# wbdata.get_indicators(source=2)

# Get GDP per capita for selected countries
countries = ['USA', 'CHN', 'DEU', 'JPN', 'GBR']
indicator = {'NY.GDP.PCAP.CD': 'GDP per capita (current US$)'}

# Set date range
date_range = (datetime(2010, 1, 1), datetime(2023, 1, 1))

# Retrieve the data
df = wbdata.get_dataframe(indicator, country=countries,
                          date=date_range)

# Reshape data for plotting
df = df.reset_index()
data_pivoted = df.pivot(index='date', 
                       columns='country', 
                       values='GDP per capita (current US$)')

# Plot the data
data_pivoted.plot(figsize=(10, 6))
plt.title('GDP per Capita Comparison')
plt.ylabel('Current US$')
plt.grid(True)
plt.show()

R/Python Data Collection Tools

The Tidyverse Ecosystem in R

  • Tidyverse - Collection of R packages designed for data science
    • Consistent design philosophy and API
    • Install with: install.packages("tidyverse")
    • Load with: library(tidyverse)
  • Core Tidyverse Packages for Data Collection:
    • readr - Fast and friendly way to read rectangular data
    • httr - Tools for working with HTTP
    • rvest - Web scraping simplified
    • jsonlite - JSON parsing and generation
    • xml2 - XML parsing and generation
  • Specialized Data Import Packages:
    • readxl - Excel files (.xls and .xlsx)
    • haven - SPSS, Stata, and SAS data
    • googlesheets4 - Google Sheets via API
    • DBI - Database connections

Python Data Collection Libraries

  • Data Reading and Manipulation:
    • pandas - Data analysis and manipulation
    • numpy - Numerical computing
    • polars - Fast DataFrame library (alternative to pandas)
  • API Interaction:
    • requests - HTTP library for API calls
    • httpx - Next generation HTTP client
    • aiohttp - Asynchronous HTTP client/server
  • Data Formats:
    • json - JSON encoding and decoding
    • csv - CSV file reading and writing
    • openpyxl - Excel file manipulation
    • PyYAML - YAML parsing and emission
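For simple cases the standard-library `json` and `csv` modules are enough on their own. A minimal offline sketch (the records are made up for illustration):

```python
import csv
import io
import json

# Decode a JSON string into a dict
raw_json = '{"city": "Paris", "population": 2102650}'
record = json.loads(raw_json)

# Read CSV text into a list of dicts, one per row
raw_csv = "city,population\nParis,2102650\nLyon,522969\n"
reader = csv.DictReader(io.StringIO(raw_csv))
rows = list(reader)

print(record["city"], rows[1]["population"])
```

Note that `csv` leaves every field as a string; numeric conversion is up to you, which is one reason pandas is preferred for anything non-trivial.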

Example: Data Collection with Python Pandas

import pandas as pd
import requests
from io import StringIO

# Example 1: Reading from CSV URL
url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
df = pd.read_csv(url)
print(f"Data shape: {df.shape}")
print(df.head())

# Example 2: Reading from API with JSON response
api_url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(api_url)
data = response.json()
rates = data['rates']

# Convert to DataFrame
currencies_df = pd.DataFrame(list(rates.items()), 
                            columns=['Currency', 'Rate'])
print(currencies_df.head())

# Example 3: Reading Excel files
# df_excel = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Example 4: Reading from databases
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///database.db')
# df_sql = pd.read_sql("SELECT * FROM table_name", engine)

FiveThirtyEight and Other Data Packages

  • FiveThirtyEight - Data journalism and statistical analysis

  • Other Specialized Data Packages:

    • gapminder - Data about world development
    • nycflights13 - Airline on-time data for NYC
    • palmerpenguins - Size measurements for penguins
    • Lahman - Baseball database
    • babynames - US baby names from the SSA
  • Finding Available Datasets in R:

    > data(package = .packages(all.available = TRUE))
    

Web Scraping Techniques

Modern Web Scraping Libraries in Python

  • BeautifulSoup - HTML/XML parsing library
    • Easy to use for beginners
    • Works with various parsers
    • 10M+ weekly downloads
    • pip install beautifulsoup4
  • Scrapy - Full-featured web crawling framework
    • High performance
    • Built-in support for extraction
    • Handles JavaScript with Splash
    • pip install scrapy
  • Selenium - Browser automation tool
    • Handles JavaScript-rendered content
    • Simulates user interactions
    • Works with multiple browsers
    • pip install selenium
  • Playwright - Modern browser automation
    • Supports Chrome, Firefox, Safari
    • Faster than Selenium
    • Handles modern web features
    • pip install playwright

Web Scraping Libraries Comparison

Library         Ease of Use   Performance   JS Support   Anti-Detection
BeautifulSoup   ✓✓✓           ✓✓            –            –
Scrapy          –             ✓✓✓           –            –
Selenium        ✓✓            –             ✓✓✓          –
Playwright      ✓✓            ✓✓            ✓✓✓          ✓✓
Requests-HTML   ✓✓✓           –             ✓✓           –
LXML            –             ✓✓✓           –            –

Key Considerations:

  • JavaScript rendering requirements
  • Anti-bot detection measures on target sites
  • Performance needs (speed vs. resource usage)
  • Complexity of the target website

BeautifulSoup Example

import requests
from bs4 import BeautifulSoup

# URL to scrape
url = "https://quotes.toscrape.com/"
response = requests.get(url)

# Create BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all quotes and authors
quotes = []
for quote in soup.select('.quote'):
    text = quote.select_one('.text').get_text()
    author = quote.select_one('.author').get_text()
    tags = [tag.get_text() for tag in quote.select('.tag')]
    
    quotes.append({
        'text': text,
        'author': author,
        'tags': tags
    })

# Print the first 3 quotes
for i, quote in enumerate(quotes[:3]):
    print(f"Quote {i+1}: {quote['text']}")
    print(f"Author: {quote['author']}")
    print(f"Tags: {', '.join(quote['tags'])}")
    print('-' * 50)

Selenium Example for Dynamic Content

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

# Initialize the Chrome driver (Selenium >= 4.6 can locate a driver itself
# via Selenium Manager; webdriver_manager is used here for older setups)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options
)

# Navigate to a dynamic website
driver.get("https://www.nasdaq.com/market-activity/stocks/aapl")

# Wait for JavaScript to load content (a fixed sleep is simple but fragile;
# WebDriverWait with an expected condition is more robust in practice)
time.sleep(5)

# Extract stock price
try:
    price_element = driver.find_element(By.XPATH, 
        "//span[contains(@class, 'last-price')]")
    price = price_element.text
    print(f"Current AAPL stock price: {price}")
except Exception as e:
    print(f"Error extracting price: {e}")

# Clean up
driver.quit()

Web Scraping in R with rvest

  • rvest - Web scraping package inspired by Python’s BeautifulSoup
    • Part of the tidyverse ecosystem
    • Simple and consistent syntax
    • Works well with the pipe operator (%>%)
    • Install with: install.packages("rvest")
  • Key Functions:
    • read_html() - Read HTML from a URL or file
    • html_elements() - Select nodes using CSS selectors (html_nodes() in older rvest code)
    • html_text() - Extract text from nodes
    • html_attr() - Extract attributes from nodes
    • html_table() - Extract tables from HTML
  • For Dynamic Content:
    • RSelenium package for browser automation
    • chromote package for Chrome DevTools Protocol

API Integration

API Integration Fundamentals

  • What is an API?
    • Application Programming Interface
    • Allows systems to communicate with each other
    • Provides structured data access
  • Common API Types:
    • REST (Representational State Transfer)
    • GraphQL
    • SOAP (Simple Object Access Protocol)
    • WebSockets
  • API Authentication Methods:
    • API Keys
    • OAuth 2.0
    • JWT (JSON Web Tokens)
    • Basic Authentication
  • Common Response Formats:
    • JSON (JavaScript Object Notation)
    • XML (eXtensible Markup Language)
    • CSV (Comma-Separated Values)
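Two of the authentication methods above can be sketched offline: an API key passed as a query parameter, and HTTP Basic Authentication, which is just base64 of `user:password` in an `Authorization` header. The endpoint and credentials below are placeholders, not a real service:

```python
import base64
from urllib.parse import urlencode

# Placeholder endpoint for illustration only
BASE = "https://api.example.com/v1/data"

def api_key_url(key):
    """API key passed as a query parameter."""
    return f"{BASE}?{urlencode({'api_key': key})}"

def basic_auth_header(user, password):
    """HTTP Basic Authentication: base64-encode 'user:password'."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(api_key_url("DEMO_KEY"))
print(basic_auth_header("alice", "s3cret"))
```

With `requests`, the same header is produced automatically by passing `auth=(user, password)` to `requests.get()`.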

API Integration with Python

import requests
import json
import pandas as pd

# Example 1: Simple GET request
api_url = "https://api.openweathermap.org/data/2.5/weather"
params = {
    "q": "Paris",
    "appid": "YOUR_API_KEY",  # Replace with actual API key
    "units": "metric"
}

response = requests.get(api_url, params=params)

if response.status_code == 200:
    data = response.json()
    print(f"Temperature in Paris: {data['main']['temp']}°C")
    print(f"Weather: {data['weather'][0]['description']}")
else:
    print(f"Error: {response.status_code}")
    print(response.text)

# Example 2: Handling pagination
def get_all_pages(base_url, params, max_pages=5):
    all_results = []
    current_page = 1
    
    while current_page <= max_pages:
        params['page'] = current_page
        response = requests.get(base_url, params=params)
        
        if response.status_code != 200:
            break
            
        page_data = response.json()
        if not page_data.get('results', []):
            break
            
        all_results.extend(page_data['results'])
        current_page += 1
        
    return all_results
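Alongside pagination, production API clients usually retry failed requests with exponential backoff. A hedged sketch of that pattern; `flaky_fetch` is a stub standing in for `requests.get` so the example runs offline:

```python
import time

def with_retries(fetch, url, max_attempts=4, base_delay=0.01):
    """Call fetch(url); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Stub that fails twice, then succeeds, to exercise the retry loop
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return {"status": 200, "url": url}

result = with_retries(flaky_fetch, "https://api.example.com/items")
print(result["status"], "after", calls["n"], "attempts")
```

Real clients should also honor the `Retry-After` header when an API returns HTTP 429.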

API Integration with R

# Install required packages if needed
# install.packages(c("httr", "jsonlite", "dplyr"))

library(httr)
library(jsonlite)
library(dplyr)

# Example 1: Basic API request
api_url <- "https://api.openweathermap.org/data/2.5/weather"
params <- list(
  q = "London",
  appid = "YOUR_API_KEY",  # Replace with actual API key
  units = "metric"
)

response <- GET(api_url, query = params)

if (http_status(response)$category == "Success") {
  data <- content(response, "text") %>% fromJSON()
  cat("Temperature in London:", data$main$temp, "°C\n")
  cat("Weather:", data$weather[[1]]$description, "\n")
} else {
  cat("Error:", http_status(response)$message, "\n")
}

# Example 2: Working with paginated API
get_all_pages <- function(base_url, params, max_pages = 5) {
  all_results <- list()
  current_page <- 1
  
  while (current_page <= max_pages) {
    params$page <- current_page
    response <- GET(base_url, query = params)
    
    if (http_status(response)$category != "Success") {
      break
    }
    
    page_data <- content(response, "text") %>% fromJSON()
    
    if (length(page_data$results) == 0) {
      break
    }
    
    all_results[[current_page]] <- page_data$results
    current_page <- current_page + 1
  }
  
  return(bind_rows(all_results))
}

COVID-19 Data Sources

Modern COVID-19 Data Sources

COVID-19 Data Analysis Tools

  • R Packages:
    • coronavirus - R interface to Johns Hopkins data
    • covid19 - R interface to the COVID-19 Data Hub
    • covdata - COVID-19 case and mortality data
    • covidcast - Access to COVIDcast Epidata API
  • Python Packages:
    • covid19dh - COVID-19 Data Hub
    • covid - Coronavirus disease 2019 (COVID-19) data
    • covid19py - Python wrapper for tracking COVID-19
    • coronavirus - Coronavirus (COVID-19) data analysis
  • Dashboards and Visualization Tools:
    • WHO COVID-19 Dashboard
    • Johns Hopkins COVID-19 Dashboard
    • Our World in Data COVID-19 Explorer
    • COVID-19 Data Explorer by Google

COVID-19 Data Analysis Example

# Install required packages if needed
# install.packages(c("coronavirus", "dplyr", "ggplot2"))

library(coronavirus)
library(dplyr)
library(ggplot2)

# Load the latest data
data(coronavirus)

# Get the most recent date in the dataset
max_date <- max(coronavirus$date)
cat("Data updated to:", format(max_date, "%B %d, %Y"), "\n")

# Summarize global cases and deaths
global_summary <- coronavirus %>%
  filter(date == max_date) %>%
  group_by(type) %>%
  summarise(total = sum(cases, na.rm = TRUE))

print(global_summary)

# Get top 10 countries by confirmed cases
top_countries <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases, na.rm = TRUE)) %>%
  arrange(desc(total_cases)) %>%
  head(10)

# Plot top 10 countries
ggplot(top_countries, aes(x = reorder(country, total_cases), 
                          y = total_cases)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Countries by Total COVID-19 Cases",
       x = "Country",
       y = "Total Confirmed Cases") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)
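The same summarise / top-N pattern can be sketched in pandas, here on a small hand-made frame (country names and case counts are invented, not real COVID-19 data):

```python
import pandas as pd

# Invented long-format data mirroring the coronavirus dataset's shape
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "type":    ["confirmed", "death", "confirmed", "death",
                "confirmed", "death"],
    "cases":   [100, 5, 300, 12, 200, 8],
})

# Filter to confirmed cases, total by country, sort descending
totals = (df[df["type"] == "confirmed"]
          .groupby("country", as_index=False)["cases"].sum()
          .sort_values("cases", ascending=False))

print(totals)
```

The chained `groupby`/`sum`/`sort_values` calls play the role of dplyr's `group_by() %>% summarise() %>% arrange()` pipeline.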

Additional COVID-19 Data Resources

  • Government Health Agencies:
    • CDC COVID Data Tracker (US)
    • ECDC (European Centre for Disease Prevention and Control)
    • UK Coronavirus Dashboard
  • Specialized COVID-19 Datasets:
    • COVID-19 Open Research Dataset (CORD-19)
    • COVID-19 Symptom Data Challenge
    • Global COVID-19 Treatment Trial Registry
  • Wastewater Surveillance:
    • CDC National Wastewater Surveillance System
    • Biobot COVID-19 Wastewater Dashboard
    • Global Water Pathogen Project
  • Long COVID and Post-Pandemic Research:
    • NIH RECOVER Initiative
    • Long COVID Registry
    • COVID-19 Longitudinal Health and Wellbeing National Core Study

Summary: Modern Data Collection Approaches

  • Diverse Data Sources
    • Public repositories (Data.gov, Kaggle)
    • Government and institutional databases
    • Specialized data packages and APIs
  • Powerful Collection Tools
    • R: Tidyverse ecosystem (readr, httr, rvest)
    • Python: Web scraping libraries, API clients
    • Specialized packages for specific data sources
  • Collection Methods
    • Direct downloads and imports
    • API integration for real-time data
    • Web scraping for unstructured data
    • Database connections for large datasets
  • Best Practices
    • Check data licenses and terms of use
    • Respect rate limits and robots.txt
    • Document data sources and collection methods
    • Validate and clean collected data
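Respecting robots.txt can be automated with the standard library's `urllib.robotparser`. The rules below are a made-up example parsed from a string rather than fetched from a live site:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

print(parser.can_fetch("*", "https://example.com/data"))       # allowed path
print(parser.can_fetch("*", "https://example.com/private/x"))  # disallowed path
```

Against a real site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`.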

Resources and Further Learning

Thank You!

Dhafer Malouche
Professor of Statistics