# Contributing to Kiosque

Thank you for your interest in contributing to Kiosque! This guide will help you add support for new websites.
## Table of Contents

- Getting Started
- Adding a New Website
- Website Scraper Implementation Guide
- Testing Your Implementation
- Code Style Guidelines
- Submitting a Pull Request
## Getting Started

### Prerequisites

- Python 3.12 or higher
- uv package manager
- Active subscription to the website you want to add (for testing authentication)
- Basic understanding of HTML/CSS selectors
### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/kiosque.git
cd kiosque

# Install dependencies with development tools
uv sync --dev

# Run the application
uv run kiosque

# Run tests (excluding login tests that require credentials)
uv run pytest -m "not login"

# Run all tests including login tests (requires credentials in kiosque.conf)
uv run pytest

# Run only login tests
uv run pytest -m "login"

# Format and lint code
uv run ruff format .
uv run ruff check .
```
### Test Configuration

**Login Tests:** Tests in `tests/test_login.py` require real credentials and are marked with `@pytest.mark.login`. These tests:

- Make real HTTP requests to websites
- Require credentials configured in `~/.config/kiosque/kiosque.conf`
- Are automatically excluded in CI/CD to protect credentials
- Should be run locally before submitting login-related changes

**Running Tests:**

```bash
# CI-safe tests only (no credentials required)
uv run pytest -m "not login"

# All tests including login (requires credentials)
uv run pytest

# Specific login test
uv run pytest tests/test_login.py::test_website_login -k "lemonde"
```
## Adding a New Website

### Step 1: Analyze the Website Structure

Before writing code, inspect the target website:

1. **Find the article container:**
    - Open an article in your browser
    - Right-click the main article text → "Inspect Element"
    - Identify the HTML element containing the article (usually `<article>`, `<div class="article">`, etc.)
    - Note the CSS class or ID (you can sanity-check it with the sketch after this list)

2. **Identify elements to remove:**
    - Look for elements to exclude: ads, "Read more" buttons, related articles, social media buttons
    - Note their HTML tags and classes

3. **Check authentication (if paywalled):**
    - Open browser DevTools → Network tab
    - Log in to the website
    - Inspect the login request:
        - URL (usually ends in `/login`, `/signin`, `/auth`)
        - Request method (POST)
        - Form data (username field, password field, CSRF tokens, etc.)
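Before writing the module, it can save time to confirm that the selector you found actually matches. The following standalone sketch is illustrative only: the URL and selector are placeholders, and it uses `requests` directly rather than Kiosque's own HTTP client.

```python
# Standalone selector probe: verify the article container selector
# before writing a website module. URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(response.content, features="lxml")

article = soup.find("article", class_="main-content")
if article is None:
    print("Selector did not match; inspect the page structure again")
else:
    # Show the first 300 characters of extracted text as a sanity check
    print(article.get_text(strip=True)[:300])
```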
### Step 2: Create the Website Module

Create a new file in `kiosque/website/` named after the publication.

File naming convention:

- Lowercase, no spaces
- Use the publication name: `lemonde.py`, `nytimes.py`, `washingtonpost.py` (see the example below)
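For instance, a module for a hypothetical "Example News" publication would be created like this (path relative to the repository root):

```bash
# Create the new module (hypothetical publication "Example News")
touch kiosque/website/examplenews.py
```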
### Step 3: Implement the Website Class

Here's a complete template with explanations:

```python
from typing import ClassVar

from ..core.website import Website


class YourWebsiteName(Website):
    # REQUIRED: Base URL of the website (must end with /)
    base_url = "https://example.com/"

    # OPTIONAL: Login URL (only if authentication required)
    login_url = "https://example.com/login"

    # OPTIONAL: Short aliases for CLI usage
    alias: ClassVar = ["example", "ex"]

    # OPTIONAL: HTML elements to remove (list of tags or (tag, attributes) tuples)
    clean_nodes: ClassVar = [
        "figure",  # Remove all <figure> tags
        "aside",  # Remove all <aside> tags
        ("div", {"class": "advertisement"}),  # Remove specific divs
        ("section", {"class": ["social", "related"]}),  # Multiple classes
    ]

    # OPTIONAL: Elements to strip all attributes from
    clean_attributes: ClassVar = ["h2", "blockquote"]

    # REQUIRED: Extract article content from page
    def article(self, url):
        """Return the BeautifulSoup element containing article text."""
        soup = self.bs4(url)  # Fetch and parse HTML

        # Find the main article container
        # Option 1: Single element
        article = soup.find("article", class_="main-content")

        # Option 2: Multiple possible selectors
        if article is None:
            article = soup.find("div", class_="article-body")

        # Option 3: Complex selection
        # article = soup.find("div", {"id": "content"}).find("article")

        return article

    # OPTIONAL: Custom login logic (only if authentication required)
    @property
    def login_dict(self):
        """Return form data for login POST request."""
        credentials = self.credentials
        assert credentials is not None

        # Simple case: just username and password
        return {
            "username": credentials["username"],
            "password": credentials["password"],
        }

        # Complex case: CSRF token from login page
        # response = get_with_retry(self.login_url)
        # soup = BeautifulSoup(response.content, features="lxml")
        # token = soup.find("input", {"name": "csrf_token"})["value"]
        # return {
        #     "email": credentials["username"],
        #     "password": credentials["password"],
        #     "csrf_token": token,
        # }

    # OPTIONAL: Additional cleanup (if declarative cleanup isn't enough)
    def clean(self, article):
        """Perform custom cleanup transformations."""
        # Always call parent implementation first
        article = super().clean(article)

        # Example: Convert <h3> tags to <blockquote>
        for elem in article.find_all("h3"):
            elem.name = "blockquote"
            elem.attrs.clear()

        # Example: Remove empty paragraphs
        for p in article.find_all("p"):
            if not p.get_text(strip=True):
                p.decompose()

        return article
```
## Website Scraper Implementation Guide

### Minimal Implementation (No Authentication)

For public websites like The Guardian:

```python
from typing import ClassVar

from ..core.website import Website


class TheGuardian(Website):
    base_url = "https://www.theguardian.com/"
    clean_nodes: ClassVar = ["figure", "aside"]

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("div", class_="article-body-viewer-selector")
```
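Once the module is in `kiosque/website/`, a quick smoke test from the CLI (the article URL below is a placeholder) confirms extraction works before moving on:

```bash
uv run kiosque https://www.theguardian.com/world/example-article guardian.md
```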
### Simple Authentication

For websites with straightforward login forms:

```python
from ..core.website import Website


class SimpleNews(Website):
    base_url = "https://simplenews.com/"
    login_url = "https://simplenews.com/auth/login"

    @property
    def login_dict(self):
        return {
            "email": self.credentials["username"],
            "password": self.credentials["password"],
        }

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("article", class_="content")
```
### Complex Authentication (CSRF Token)

For websites that require CSRF tokens or multi-step login:

```python
from bs4 import BeautifulSoup

from ..core.client import get_with_retry
from ..core.website import Website


class ComplexNews(Website):
    base_url = "https://complexnews.com/"
    login_url = "https://secure.complexnews.com/login"

    @property
    def login_dict(self):
        credentials = self.credentials
        assert credentials is not None

        # Fetch login page to get CSRF token
        response = get_with_retry(self.login_url)
        soup = BeautifulSoup(response.content, features="lxml")

        # Extract token from hidden input
        token_input = soup.find("input", {"name": "_token"})
        token = token_input["value"]

        return {
            "username": credentials["username"],
            "password": credentials["password"],
            "_token": token,
            "remember_me": 1,
        }

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("div", class_="article-content")
```
### Advanced Cleanup

For websites with complex HTML that needs transformation:

```python
def clean(self, article):
    """Example: Transform Guardian-style article HTML."""
    article = super().clean(article)

    # Convert styled headings to proper semantic headings
    for elem in article.find_all("p", class_="heading"):
        elem.name = "h3"
        elem.attrs.clear()

    # Remove empty paragraphs
    for p in article.find_all("p"):
        if not p.get_text(strip=True):
            p.decompose()

    # Unwrap unnecessary div wrappers
    for div in article.find_all("div", class_="paragraph-wrapper"):
        div.unwrap()

    return article
```
## Testing Your Implementation

### 1. Manual Testing

```bash
# Test article download (public website)
uv run kiosque https://example.com/article test.md

# Test with verbose logging (to see login flow)
uv run kiosque -v https://paywalled.com/article test.md
```
### 2. Configure Credentials (for paywalled sites)

Edit `~/.config/kiosque/kiosque.conf` and add credentials for your website.
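The snippet below is a sketch of the expected shape, not the authoritative format: the INI-style section keyed by the website's base URL is an assumption, while the `username` and `password` keys match what the website classes read via `self.credentials`.

```ini
# Sketch only; check the configuration docs for the exact section naming.
# The credentials below are placeholders.
[https://example.com/]
username = you@example.com
password = your-secret-password
```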
### 3. Write Tests

Create or update `tests/test_login.py` to add your website:

```python
# Add your website credentials to kiosque.conf, then:
# the parameterized test will automatically test your login flow

# If login is broken or website structure changed:
KNOWN_BROKEN_LOGINS = {
    # ... existing entries ...
    "https://yourwebsite.com/",  # Brief reason why broken
}
```
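For orientation, the parameterized test has roughly the following shape. This is a hypothetical sketch: the website list, skip logic, and assertions shown here stand in for whatever `tests/test_login.py` actually does.

```python
# Hypothetical sketch, not the real test: illustrates how one website is
# selected with -k and how KNOWN_BROKEN_LOGINS entries are skipped.
import pytest

KNOWN_BROKEN_LOGINS = {"https://brokennews.example/"}  # see the set above
WEBSITES = ["https://www.lemonde.fr/", "https://yourwebsite.com/"]  # illustrative


@pytest.mark.login
@pytest.mark.parametrize("base_url", WEBSITES)
def test_website_login(base_url):
    if base_url in KNOWN_BROKEN_LOGINS:
        pytest.skip("login known to be broken for this website")
    # ... log in with the credentials from kiosque.conf and assert success ...
```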
### 4. Run Tests

```bash
# Run all tests
uv run pytest

# Run specific test for your website (if configured)
uv run pytest tests/test_login.py -k "yourwebsite"

# Run with output to debug
uv run pytest tests/test_login.py -v -s
```
## Code Style Guidelines

### Formatting and Linting

Always run before committing:

```bash
# Format code
uv run ruff format .

# Check for issues
uv run ruff check .

# Fix auto-fixable issues
uv run ruff check --fix .
```
### Type Hints

Use modern type syntax (Python 3.12+):

```python
# Good ✓
from typing import ClassVar

clean_nodes: ClassVar = ["figure", "aside"]

def article(self, url: str) -> BeautifulSoup | None:
    ...

# Bad ✗
from typing import Optional, List, Type

clean_nodes = ["figure", "aside"]  # Missing ClassVar

def article(self, url):  # Missing type hints
    ...
```
### Import Organization

```python
# Standard library
from typing import ClassVar

# Third-party
from bs4 import BeautifulSoup

# Local imports (relative)
from ..core.client import get_with_retry
from ..core.website import Website
```
### Class Variable Annotations

All mutable class attributes must be annotated with `ClassVar`:

```python
# Good ✓
clean_nodes: ClassVar = ["figure"]
clean_attributes: ClassVar = ["h2"]
alias: ClassVar = ["example"]

# Bad ✗
clean_nodes = ["figure"]  # Missing ClassVar annotation
```
### Docstrings

Add docstrings for complex methods:

```python
def article(self, url):
    """Extract main article content from the page.

    Args:
        url: Full URL of the article

    Returns:
        BeautifulSoup element containing article text, or None if not found
    """
    ...
```
## Submitting a Pull Request

### Before Submitting

1. **Test your implementation:**

    ```bash
    # Download multiple articles
    uv run kiosque https://example.com/article1 test1.md
    uv run kiosque https://example.com/article2 test2.md

    # Verify Markdown output is clean and complete
    bat test1.md  # or open in your editor
    ```

2. **Run all checks:** `ruff format`, `ruff check`, and the test suite (see Code Style Guidelines and Run Tests above)

3. **Update documentation:**
    - Add the website to Supported Sites in the appropriate language section
    - Indicate if authentication is supported (☑️)
### PR Guidelines

1. **Branch naming:** `add-website-<name>` (e.g., `add-website-nytimes`)

2. **Commit message format:**

    ```
    Add support for Example News website

    - Implements article extraction for example.com
    - Adds authentication support with CSRF token handling
    - Includes test for login flow
    ```

3. **PR description should include:**
    - Brief description of the website
    - Whether authentication is supported
    - Any special considerations (e.g., "Login requires solving captcha manually")
    - Example article URLs you tested

4. **Don't commit credentials:**
    - `kiosque.conf` is in `.gitignore`; never commit it
    - Don't include passwords or API tokens in code or PR
### PR Template

```markdown
## Description

Adds support for Example News (https://example.com/)

## Authentication

- [x] Requires subscription
- [x] Login flow implemented
- [ ] Publicly accessible

## Testing

Tested with the following articles:

- https://example.com/article1
- https://example.com/article2

## Checklist

- [x] Code formatted with `ruff format`
- [x] No linting errors (`ruff check`)
- [x] Tests pass (`pytest`)
- [x] Added website to [Supported Sites](../websites/supported-sites.md)
- [x] Tested article extraction manually
```
## Common Patterns and Gotchas

### Finding Article Elements

```python
# Try multiple selectors
def article(self, url):
    soup = self.bs4(url)

    # Try primary selector
    article = soup.find("article", class_="main")

    # Fallback to alternative
    if article is None:
        article = soup.find("div", id="content")

    return article
```
### Handling CSRF Tokens

```python
from bs4 import BeautifulSoup

from ..core.client import get_with_retry


@property
def login_dict(self):
    # Fetch login page
    response = get_with_retry(self.login_url)
    soup = BeautifulSoup(response.content, features="lxml")

    # Find token (adapt selector to website)
    token = soup.find("input", {"name": "csrf"})["value"]

    return {
        "username": self.credentials["username"],
        "password": self.credentials["password"],
        "csrf": token,
    }
```
### Removing Unwanted Elements

```python
# Declarative (preferred for simple cases)
clean_nodes: ClassVar = [
    "figure",
    ("div", {"class": "ad"}),
]

# Imperative (for complex logic)
def clean(self, article):
    article = super().clean(article)

    # Remove by complex criteria
    for elem in article.find_all("div"):
        if "ad" in elem.get("class", []):
            elem.decompose()

    return article
```
### Debugging Tips

```python
# Add verbose logging
import logging


def article(self, url):
    soup = self.bs4(url)
    article = soup.find("article")

    if article is None:
        logging.error(f"Could not find article in {url}")
        logging.debug(f"Page HTML: {soup.prettify()[:500]}")

    return article
```
## Getting Help

- **Issues:** Open a GitHub issue with the `website-support` label
- **Questions:** Start a discussion in GitHub Discussions
- **Documentation:** See Architecture for system design
- **Examples:** Check `kiosque/website/` for 30+ real implementations
## Thank You!

Your contributions help make journalism more accessible. Thank you for supporting Kiosque! 🎉