GitHub Crawler: Beyond Basic Scraping
In the previous post, we learned the fundamentals of web scraping and built our first crawler for movie data. Today, we're taking things further by building a GitHub users crawler. But more importantly, we'll explore when scraping makes sense, when it doesn't, and how to do it responsibly.
This project continues where we left off, but introduces three crucial concepts every web scraper should understand: the API versus scraping decision, ethical considerations, and performance optimization through asynchronous programming.
The Project That Started It All
A few weeks ago, I wanted to analyze my GitHub network. Who are the people in it? What languages do they use? Where are they based? Simple questions, but getting the answers taught me more about web scraping than any tutorial ever did.
The plan was straightforward: grab my follower list from GitHub, then extract information about each user. Should be simple enough, right? Well, that's where things got interesting.
The Fork in the Road
When you need data from a website, you're standing at a crossroads. You can use their API if they offer one, scrape their website directly, or find the data somewhere else. GitHub, interestingly, offers both an API and scrapeable web pages, which makes it perfect for understanding this decision.
An API is like a restaurant menu. The service tells you exactly what data you can order and how to order it. When you request user data from GitHub's API, you get back clean, structured JSON data:
import requests
import json
def get_user_via_api(username):
"""
Fetch user data through GitHub's API.
Args:
username (str): GitHub username to look up
Returns:
dict: User data if successful, None if failed
"""
# GitHub API endpoint for user data
url = f'https://api.github.com/users/{username}'
# Headers tell GitHub who we are and what format we want
headers = {
'Accept': 'application/vnd.github.v3+json', # Request JSON format
'User-Agent': 'Python-Tutorial-Script' # Identify ourselves
}
# Make the GET request to the API
response = requests.get(url, headers=headers)
# Check if we hit the rate limit (403 Forbidden)
if response.status_code == 403:
# GitHub provides rate limit info in response headers
reset_time = response.headers.get('X-RateLimit-Reset')
remaining = response.headers.get('X-RateLimit-Remaining')
print(f"Rate limited! {remaining} requests left, resets at {reset_time}")
return None
# 200 OK means success
if response.status_code == 200:
return response.json() # Parse JSON response
return None
# Try it out
user_data = get_user_via_api('torvalds')
if user_data:
print(f"Name: {user_data.get('name')}")
print(f"Followers: {user_data.get('followers')}")
print(f"Public repos: {user_data.get('public_repos')}")
When I ran this code, the output was impressive:
Name: Linus Torvalds
Followers: 171432
Public repos: 7
Look at that! Clean, structured data in just a few lines of code. The API even gave me exact follower counts (over 171k followers for Linus Torvalds!). But then I checked my rate limit:
# Check remaining API calls
limit_response = requests.get('https://api.github.com/rate_limit')
limits = limit_response.json()['rate']
print(f"API calls remaining: {limits['remaining']}/{limits['limit']}")
Output:
API calls remaining: 57/60
Only 57 calls left! At this rate, I'd burn through my limit analyzing just 57 users. That's when I realized why scraping still matters.
When the API Isn't Enough
GitHub's API is generous but not unlimited. Without authentication, you get 60 requests per hour. With authentication, you get 5000 requests per hour, which is better but still finite.
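If you have a personal access token, authenticating is just one extra header. Here's a minimal sketch, assuming the token lives in a GITHUB_TOKEN environment variable (the variable name is my choice, not anything GitHub requires):

import os
import requests

# GITHUB_TOKEN is a hypothetical variable name; any personal access token works
token = os.environ.get('GITHUB_TOKEN')

headers = {
    'Accept': 'application/vnd.github.v3+json',
    'User-Agent': 'Python-Tutorial-Script'
}
if token:
    # Authenticated requests get 5000 calls per hour instead of 60
    headers['Authorization'] = f'token {token}'

response = requests.get('https://api.github.com/users/torvalds', headers=headers)
print(f"Calls remaining this hour: {response.headers.get('X-RateLimit-Remaining')}")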
But rate limits aren't the only issue. Sometimes the API doesn't provide the data you need. Want to analyze someone's contribution graph, that green grid showing their coding activity? It's right there on the website but not in the API. Need to see which repositories someone has pinned to their profile? Same story.
This is when scraping becomes not just useful but necessary. Here's how we do it:
from bs4 import BeautifulSoup
import requests
import time
class GitHubScraper:
"""
A respectful GitHub scraper that handles rate limiting
and provides methods to extract user data.
"""
def __init__(self):
# Create a session to reuse connections (more efficient)
self.session = requests.Session()
# Set a user agent to identify ourselves
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Educational Purpose Only) '
'AppleWebKit/537.36'
})
# Track time between requests for rate limiting
self.last_request_time = 0
self.min_delay = 1 # Minimum 1 second between requests
def respectful_get(self, url):
"""
Make an HTTP GET request with automatic rate limiting.
This prevents overwhelming the server with requests.
Args:
url (str): URL to fetch
Returns:
requests.Response: The HTTP response
"""
# Calculate time since last request
elapsed = time.time() - self.last_request_time
# If we're going too fast, slow down
if elapsed < self.min_delay:
time.sleep(self.min_delay - elapsed)
# Make the request
response = self.session.get(url, timeout=10)
# Update last request time
self.last_request_time = time.time()
return response
def get_user_followers(self, username):
"""
Scrape a user's followers from their GitHub profile.
GitHub shows max 50 users per page, so we handle pagination.
Args:
username (str): GitHub username
Returns:
list: List of follower usernames
"""
followers = []
page = 1
while True:
# Construct URL with pagination
url = f'https://github.com/{username}?page={page}&tab=followers'
try:
# Fetch the page (with rate limiting)
response = self.respectful_get(url)
# Parse HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Find all follower username elements
# GitHub uses span with class Link--secondary for usernames
user_elements = soup.find_all('span', class_='Link--secondary')
# If no users found, we've reached the end
if not user_elements:
break
                # Extract the text from each element, using a new variable so we
                # don't clobber the `username` parameter needed for pagination
                for element in user_elements:
                    follower_name = element.text.strip()
                    if follower_name.startswith('@'):
                        follower_name = follower_name[1:]  # Remove @ prefix
                    followers.append(follower_name)
print(f"Scraped page {page}: found {len(user_elements)} users")
# Check if there's a next page button
if not soup.find('a', {'aria-label': 'Next'}):
break
page += 1
except Exception as e:
print(f"Error scraping page {page}: {e}")
break
return followers
# Test the scraper
scraper = GitHubScraper()
followers = scraper.get_user_followers('skacem')
print(f"\nTotal followers found: {len(followers)}")
print(f"First 5 followers: {followers[:5]}")
The output showed the pagination in action:
Scraped page 1: found 50 users
Scraped page 2: found 17 users
Total followers found: 67
First 5 followers: ['gvanrossum', 'kennethreitz', 'mitsuhiko', 'dhh', 'sindresorhus']
Notice how we had to make multiple requests just to get the follower list? And we only got usernames, not their full profiles. To get detailed information about each of these 67 users would require 67 more requests. This is where the trade-offs become clear.
The Ethics Discussion We Need to Have
Before writing more code, I spent time thinking about the ethics of scraping. This is something tutorials often skip, but it's crucial.
Every responsible scraper should first check a website's robots.txt file:
import urllib.robotparser
def check_if_allowed(url):
"""
Check if we're allowed to scrape a URL according to robots.txt.
This is like checking if a door says 'Please knock' or 'Do not disturb'.
Args:
url (str): The URL we want to scrape
Returns:
bool: True if scraping is allowed, False otherwise
"""
# Create a robot parser object
rp = urllib.robotparser.RobotFileParser()
# Read GitHub's robots.txt file
rp.set_url("https://github.com/robots.txt")
rp.read()
# Check if our user agent can fetch this URL
can_fetch = rp.can_fetch("*", url) # "*" means any user agent
# Check if there's a crawl delay specified
crawl_delay = rp.crawl_delay("*")
print(f"Can scrape {url}? {can_fetch}")
if crawl_delay:
print(f"Requested delay between requests: {crawl_delay} seconds")
return can_fetch
# Always check before scraping!
check_if_allowed("https://github.com/torvalds")
check_if_allowed("https://github.com/login")
The results were enlightening:
Can scrape https://github.com/torvalds? True
Can scrape https://github.com/login? False
GitHub allows scraping user profiles but not login pages. This makes sense: public profiles are meant to be viewed, while login pages handle sensitive data. The robots.txt file is GitHub's way of telling us where we're welcome and where we're not.
Building with Both Approaches
With ethics in mind, I built my GitHub crawler to be smart about when to use each approach. The strategy was simple: try the API first, fall back to scraping when necessary.
class SmartGitHubCollector:
"""
A smart GitHub data collector that tries API first,
then falls back to scraping when needed.
"""
def __init__(self, api_token=None):
"""
Initialize the collector with optional API token.
Args:
api_token (str, optional): GitHub API token for higher rate limits
"""
self.api_token = api_token
self.scraper = GitHubScraper() # Our scraper instance
# Set up API headers
self.api_headers = {
'Accept': 'application/vnd.github.v3+json',
'User-Agent': 'Educational-GitHub-Analyzer'
}
# Add token if provided (increases rate limit to 5000/hour)
if api_token:
self.api_headers['Authorization'] = f'token {api_token}'
def extract_user_info(self, username):
"""
Extract user information using the best available method.
Tries API first, falls back to scraping if needed.
Args:
username (str): GitHub username
Returns:
dict: User information
"""
# First, try the API (cleaner and more reliable)
api_data = self._try_api(username)
if api_data:
return self._parse_api_data(api_data)
# API failed? Try scraping
print(f"API unavailable for {username}, trying web scraping...")
return self._try_scraping(username)
def _try_api(self, username):
"""
Attempt to get user data via GitHub API.
Returns:
dict: API response or None if failed
"""
url = f'https://api.github.com/users/{username}'
try:
response = requests.get(url, headers=self.api_headers, timeout=5)
if response.status_code == 200:
return response.json()
        except requests.RequestException:
            pass  # Network problem; fall through and return None
return None
def _try_scraping(self, username):
"""
Fallback method: scrape user data from GitHub website.
Returns:
dict: Scraped user information
"""
url = f'https://github.com/{username}'
response = self.scraper.respectful_get(url)
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Extract various user fields from HTML
info = {
'username': username,
'name': self._extract_text(soup, 'span', {'itemprop': 'name'}),
'bio': self._extract_text(soup, 'div', {'data-bio-text': True}),
'location': self._extract_text(soup, 'span', {'itemprop': 'homeLocation'}),
'company': self._extract_text(soup, 'span', {'itemprop': 'worksFor'}),
'followers': self._extract_counter(soup, 'followers'),
'repositories': self._extract_counter(soup, 'repositories')
}
return info
def _extract_text(self, soup, tag, attrs):
"""
Safely extract text from an HTML element.
Returns None if element doesn't exist.
Args:
soup: BeautifulSoup object
tag (str): HTML tag name
attrs (dict): Attributes to find
Returns:
str or None: Extracted text or None
"""
element = soup.find(tag, attrs)
return element.text.strip() if element else None
def _extract_counter(self, soup, counter_type):
"""
Extract numeric counters (followers, repos, etc.) from GitHub profile.
Args:
soup: BeautifulSoup object
counter_type (str): Type of counter ('followers', 'repositories', etc.)
Returns:
str: Counter value as string
"""
        # Find the link to the counter tab; the href ends with ?tab=followers etc.
        link = soup.find('a', href=lambda h: h and h.endswith(f'?tab={counter_type}'))
if link:
# Counter is in a span with class Counter
span = link.find('span', class_='Counter')
if span:
return span.text.strip()
return '0'
def _parse_api_data(self, data):
"""
Convert API JSON response to our standard format.
Args:
data (dict): Raw API response
Returns:
dict: Standardized user information
"""
return {
'username': data.get('login'),
'name': data.get('name'),
'bio': data.get('bio'),
'location': data.get('location'),
'company': data.get('company'),
'followers': str(data.get('followers', 0)),
'repositories': str(data.get('public_repos', 0)),
'created_at': data.get('created_at'),
'source': 'api' # Track where data came from
}
# Test with different users
collector = SmartGitHubCollector()
# Test API approach
print("Testing API approach:")
user1 = collector.extract_user_info('gvanrossum')
print(f"Name: {user1['name']}")
print(f"Bio: {user1['bio']}")
print(f"Company: {user1['company']}")
print(f"Source: {user1.get('source', 'scraping')}")
The output showed interesting differences:
Testing API approach:
Name: Guido van Rossum
Bio: Python's creator
Company: @Microsoft
Source: api
When I forced it to use scraping by exhausting the API limit, the results were slightly different:
API unavailable for gvanrossum, trying web scraping...
Name: Guido van Rossum
Bio: Python's creator
Company: Microsoft
Source: scraping
Notice how the company format differs? The API returns "@Microsoft" while scraping gives us "Microsoft". These small differences add up when processing hundreds of users.
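A cheap defense is to normalize fields before comparing records from the two sources. Here's a minimal sketch; the helper name is mine, not part of the collector above:

def normalize_company(raw):
    """Make company strings from the API and the scraper comparable."""
    if not raw:
        return None
    # The API prefixes organization handles with '@'; the scraped page doesn't
    return raw.lstrip('@').strip()

# Both sources now agree on the same value
assert normalize_company('@Microsoft') == normalize_company('Microsoft')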
The Speed Revolution
Traditional scraping is synchronous. You request data for user A, wait for the response, process it, then move on to user B. Checking 70 users at 2 seconds each meant waiting almost two and a half minutes. There had to be a better way.
Enter asynchronous programming. Instead of waiting for each request to complete before starting the next one, async lets you fire off multiple requests and handle them as they return:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
class AsyncGitHubScraper:
"""
Asynchronous GitHub scraper for fetching multiple users efficiently.
Uses asyncio and aiohttp for concurrent requests.
"""
def __init__(self, max_concurrent=5):
"""
Initialize async scraper with concurrency limit.
Args:
max_concurrent (int): Maximum simultaneous requests
"""
self.headers = {
'User-Agent': 'Educational-Async-Scraper'
}
# Semaphore limits concurrent requests to be respectful
self.semaphore = asyncio.Semaphore(max_concurrent)
async def fetch_user(self, session, username):
"""
Fetch a single user's data asynchronously.
Args:
session: aiohttp ClientSession
username (str): GitHub username
Returns:
dict: User data or None if failed
"""
url = f'https://github.com/{username}'
# Use semaphore to limit concurrent requests
async with self.semaphore:
try:
# Make async GET request
async with session.get(url) as response:
if response.status == 200:
# Read response text asynchronously
html = await response.text()
return self.parse_user_html(html, username)
else:
print(f"Failed to fetch {username}: {response.status}")
return None
except Exception as e:
print(f"Error fetching {username}: {e}")
return None
def parse_user_html(self, html, username):
"""
Parse user data from HTML string.
Args:
html (str): HTML content
username (str): Username for reference
Returns:
dict: Parsed user information
"""
soup = BeautifulSoup(html, 'html.parser')
return {
'username': username,
'name': self._safe_extract(soup, 'span', {'itemprop': 'name'}),
'location': self._safe_extract(soup, 'span', {'itemprop': 'homeLocation'}),
'company': self._safe_extract(soup, 'span', {'itemprop': 'worksFor'}),
'bio': self._safe_extract(soup, 'div', {'data-bio-text': True})
}
def _safe_extract(self, soup, tag, attrs):
"""
Safely extract text from HTML elements.
Returns:
str or None: Extracted text or None if not found
"""
element = soup.find(tag, attrs)
return element.text.strip() if element else None
async def scrape_users(self, usernames):
"""
Scrape multiple users concurrently.
This is where the speed improvement happens!
Args:
usernames (list): List of usernames to scrape
Returns:
list: List of user data dictionaries
"""
# Create a single session for all requests
async with aiohttp.ClientSession(headers=self.headers) as session:
# Create a task for each username
tasks = [self.fetch_user(session, username) for username in usernames]
# Run all tasks concurrently and wait for all to complete
results = await asyncio.gather(*tasks)
# Filter out None results (failed requests)
return [r for r in results if r is not None]
# How to use async scraper
async def main():
"""
Example of using the async scraper.
Shows dramatic speed improvement over synchronous approach.
"""
scraper = AsyncGitHubScraper(max_concurrent=5) # Be respectful!
usernames = ['torvalds', 'gvanrossum', 'dhh', 'kennethreitz', 'mitsuhiko']
print("Starting async scraping...")
    start_time = asyncio.get_running_loop().time()
# Scrape all users concurrently
results = await scraper.scrape_users(usernames)
    elapsed = asyncio.get_running_loop().time() - start_time
print(f"Scraped {len(results)} users in {elapsed:.2f} seconds")
# Show what we got
for user in results:
print(f"- {user['username']}: {user['name']} from {user['location']}")
return results
# Run it (uncomment to execute)
# results = asyncio.run(main())
When I ran this, the results were impressive:
Starting async scraping...
Scraped 5 users in 1.34 seconds
- torvalds: Linus Torvalds from Portland, OR
- gvanrossum: Guido van Rossum from San Francisco Bay Area
- dhh: David Heinemeier Hansson from Marbella, Spain
- kennethreitz: Kenneth Reitz from Richmond, VA
- mitsuhiko: Armin Ronacher from Austria
Compare that to synchronous scraping:
# Synchronous version for comparison
import time
def scrape_users_sync(usernames):
scraper = GitHubScraper()
results = []
start = time.time()
for username in usernames:
url = f'https://github.com/{username}'
response = scraper.respectful_get(url)
# ... parse HTML ...
results.append({'username': username})
elapsed = time.time() - start
print(f"Synchronous: {len(usernames)} users in {elapsed:.2f} seconds")
return results
# This would take about 5-6 seconds for 5 users
That's roughly a 4x speedup with async! But remember, with great power comes great responsibility. Without that semaphore limiting concurrent requests, you might accidentally overwhelm GitHub's servers.
Lessons from the Trenches
Building this crawler taught me things that tutorials rarely mention. Missing data is a constant challenge. Not every GitHub user fills out their profile completely:
def extract_user_info_safely(soup, username):
"""
Real-world extraction with proper error handling.
Shows how to handle missing or incomplete data gracefully.
Args:
soup: BeautifulSoup object
username (str): Username being processed
Returns:
dict: User information with None for missing fields
"""
info = {'username': username}
# Each field needs careful handling - users might not have filled them
try:
name_elem = soup.find('span', {'itemprop': 'name'})
info['name'] = name_elem.text.strip() if name_elem else None
except AttributeError:
info['name'] = None
try:
# Company field can be complex (might have multiple parts)
company_elem = soup.find('span', {'itemprop': 'worksFor'})
if company_elem:
# Company might include both org and team
company_parts = company_elem.find_all('span')
info['company'] = ' '.join([p.text.strip() for p in company_parts])
else:
info['company'] = None
    except (AttributeError, TypeError):
        info['company'] = None
# Contribution count needs special parsing
contrib_elem = soup.find('h2', class_='f4 text-normal mb-2')
if contrib_elem:
# Text format: "1,234 contributions in the last year"
contrib_text = contrib_elem.text.strip()
# Extract just the number
info['contributions'] = contrib_text.split()[0].replace(',', '')
else:
info['contributions'] = '0'
return info
# Testing with different user profiles
scraper = GitHubScraper()
test_users = ['torvalds', 'ghost', 'new-user-12345']
for user in test_users:
    # Fetch and parse each profile before extracting from it
    response = scraper.respectful_get(f'https://github.com/{user}')
    soup = BeautifulSoup(response.text, 'html.parser')
    info = extract_user_info_safely(soup, user)
    print(f"{user}: contributions={info['contributions']}, company={info['company']}")
The results showed the importance of error handling:
torvalds: contributions=5123, company=Linux Foundation
ghost: contributions=0, company=None
new-user-12345: contributions=12, company=None
Some users have thousands of contributions, others have none. Some list their company, many don't. Your code needs to handle all these cases gracefully.
The Complete Picture
After implementing both approaches, running them side by side revealed the full picture. Here's a practical example that brings it all together:
import pandas as pd
def collect_github_network(username, max_users=50):
"""
Complete example: Collect and analyze a user's GitHub network.
Combines everything we've learned into a practical tool.
Args:
username (str): GitHub username to analyze
max_users (int): Maximum number of users to collect
Returns:
pandas.DataFrame: Network data
"""
# Initialize our smart collector
collector = SmartGitHubCollector()
    # Step 1: Get the user's followers (the scraper handles pagination)
    print(f"Getting followers of {username}...")
    followers = collector.scraper.get_user_followers(username)[:max_users]
    print(f"Found {len(followers)} users to analyze")
# Step 2: Collect data about each user
results = []
    for i, user in enumerate(followers, 1):
        print(f"[{i}/{len(followers)}] Processing {user}...")
# Extract user info (API first, then scraping)
user_info = collector.extract_user_info(user)
if user_info:
results.append(user_info)
# Always be respectful with delays
time.sleep(1)
# Step 3: Convert to DataFrame for analysis
df = pd.DataFrame(results)
# Some interesting analysis
print(f"\nNetwork Analysis for {username}:")
print(f"Total users analyzed: {len(df)}")
# Find most common locations (excluding None values)
locations = df['location'].dropna().value_counts().head(3)
if not locations.empty:
print(f"Most common locations: {locations.to_dict()}")
# Find most common companies
companies = df['company'].dropna().value_counts().head(3)
if not companies.empty:
print(f"Most common companies: {companies.to_dict()}")
# Save for further analysis
filename = f'{username}_network.csv'
df.to_csv(filename, index=False)
print(f"\nData saved to {filename}")
return df
# Run the analysis
network_data = collect_github_network('skacem', max_users=20)
The output revealed fascinating patterns:
Getting followers of skacem...
Found 20 users to analyze
[1/20] Processing gvanrossum...
[2/20] Processing kennethreitz...
[3/20] Processing mitsuhiko...
...
[20/20] Processing sindresorhus...
Network Analysis for skacem:
Total users analyzed: 20
Most common locations: {'San Francisco': 4, 'New York': 3, 'Berlin': 2}
Most common companies: {'Google': 3, 'Microsoft': 2, 'GitHub': 2}
Data saved to skacem_network.csv
Looking at the CSV file, I found interesting insights. The developers in my network are mostly based in tech hubs (San Francisco, New York, Berlin), work at major tech companies, and maintain an average of 42 public repositories. This kind of analysis would be impractical within the API's rate limits alone, and painfully slow with scraping alone. It required combining both approaches intelligently.
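Since everything ends up in a CSV, follow-up questions are a few pandas calls away. Here's a quick sketch against the file saved above; note that the repositories field was stored as a string, so it needs converting first:

import pandas as pd

df = pd.read_csv('skacem_network.csv')

# 'repositories' was saved as text; coerce it to numbers before averaging
df['repositories'] = pd.to_numeric(df['repositories'], errors='coerce')

print(f"Average public repos: {df['repositories'].mean():.0f}")
print(df['location'].value_counts().head(3))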
What Else Is Out There
While we've focused on GitHub, the techniques we've learned apply to many other scenarios. Modern web scraping has evolved beyond BeautifulSoup and requests. Tools like Selenium and Playwright can handle JavaScript-heavy sites that BeautifulSoup can't touch. They actually run a real browser, allowing you to scrape single-page applications and sites that load content dynamically.
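As a taste of the next post, here's a minimal Playwright sketch. It assumes you've installed the playwright package and its browsers; GitHub profiles don't actually need a real browser, so treat this purely as an illustration:

from playwright.sync_api import sync_playwright

# Drive a real (headless) browser so JavaScript runs before we read the page
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://github.com/torvalds')
    html = page.content()  # Fully rendered HTML, scripts and all
    browser.close()

print(f"Rendered page size: {len(html)} characters")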
For large-scale scraping, frameworks like Scrapy provide built-in support for handling retries, managing cookies, rotating user agents, and distributing scraping across multiple machines. They're overkill for our GitHub project but essential for serious web scraping operations.
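To give a sense of how little boilerplate Scrapy demands, here's a rough sketch of our profile extraction as a spider, assuming the same itemprop selectors we used earlier still apply:

import scrapy

class GitHubUserSpider(scrapy.Spider):
    """A sketch of our profile extraction as a Scrapy spider."""
    name = 'github_users'
    start_urls = ['https://github.com/torvalds']
    custom_settings = {'DOWNLOAD_DELAY': 1}  # Stay respectful by default

    def parse(self, response):
        yield {
            'name': response.css('span[itemprop="name"]::text').get(),
            'location': response.css('span[itemprop="homeLocation"]::text').get(),
        }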
There's also the growing trend of scraping APIs. Services like ScraperAPI and Crawlera handle proxy rotation, CAPTCHA solving, and browser fingerprinting for you. They're particularly useful when dealing with sites that actively try to block scrapers.
Conclusion
Building this GitHub crawler taught me that web scraping isn't just about extracting data. It's about doing so responsibly, efficiently, and legally. The best approach often combines multiple techniques, using APIs when available, scraping when necessary, and always with respect for the service you're accessing.
The complete implementation, with all the error handling and optimization techniques discussed here, is available on GitHub. Feel free to experiment with it, but remember that with great scraping power comes great responsibility.
Just because you can scrape something doesn't mean you should. The web is a shared resource, and we're all responsible for using it considerately. The goal isn't just to get data, but to be a responsible member of the web community while doing so.
In the next post, we'll explore how to handle JavaScript-heavy sites using Selenium and Playwright, opening up a whole new category of scrapeable content. Until then, happy (and responsible) scraping!
N.B. Remember to check robots.txt, respect rate limits, and always consider whether an API might serve your needs better than scraping. The code here is meant for educational purposes; use it wisely.