Contributing to Kiosque

Thank you for your interest in contributing to Kiosque! This guide will help you add support for new websites.

Table of Contents

  1. Getting Started
  2. Adding a New Website
  3. Website Scraper Implementation Guide
  4. Testing Your Implementation
  5. Code Style Guidelines
  6. Submitting a Pull Request

Getting Started

Prerequisites

  • Python 3.12 or higher
  • uv package manager
  • Active subscription to the website you want to add (for testing authentication)
  • Basic understanding of HTML/CSS selectors

Development Setup

# Clone the repository
git clone https://github.com/yourusername/kiosque.git
cd kiosque

# Install dependencies with development tools
uv sync --dev

# Run the application
uv run kiosque

# Run tests (excluding login tests that require credentials)
uv run pytest -m "not login"

# Run all tests including login tests (requires credentials in kiosque.conf)
uv run pytest

# Run only login tests
uv run pytest -m "login"

# Format and lint code
uv run ruff format .
uv run ruff check .

Test Configuration

Login Tests: Tests in tests/test_login.py require real credentials and are marked with @pytest.mark.login. These tests:

  • Make real HTTP requests to websites
  • Require credentials configured in ~/.config/kiosque/kiosque.conf
  • Are automatically excluded in CI/CD to protect credentials
  • Should be run locally before submitting login-related changes
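
For reference, login tests are marked with the login marker, along the lines of this minimal sketch (the real tests live in tests/test_login.py; the body here is a placeholder):

import pytest

@pytest.mark.login
def test_example_login():
    # Placeholder body: the real tests perform an actual login using
    # credentials read from kiosque.conf and assert that it succeeds.
    ...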

Running Tests:

# CI-safe tests only (no credentials required)
uv run pytest -m "not login"

# All tests including login (requires credentials)
uv run pytest

# Specific login test
uv run pytest tests/test_login.py::test_website_login -k "lemonde"

Adding a New Website

Step 1: Analyze the Website Structure

Before writing code, inspect the target website:

  1. Find the article container (you can verify your selector with the Python sketch after this list):
    • Open an article in your browser
    • Right-click the main article text → "Inspect Element"
    • Identify the HTML element containing the article (usually <article>, <div class="article">, etc.)
    • Note the CSS class or ID

  2. Identify elements to remove:
    • Look for elements to exclude: ads, "Read more" buttons, related articles, social media buttons
    • Note their HTML tags and classes

  3. Check authentication (if paywalled):
    • Open browser DevTools → Network tab
    • Log in to the website
    • Inspect the login request:
      • URL (usually ends in /login, /signin, /auth)
      • Request method (POST)
      • Form data (username field, password field, CSRF tokens, etc.)
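
Before writing any module code, you can sanity-check a candidate selector from a Python REPL. This standalone sketch uses requests and BeautifulSoup directly; the URL and selector are placeholders to adapt:

import requests
from bs4 import BeautifulSoup

# Fetch a sample article (placeholder URL - use a real article you can access)
response = requests.get("https://example.com/article")
soup = BeautifulSoup(response.content, features="lxml")

# Try the selector you identified in DevTools
article = soup.find("article", class_="main-content")
print(article is not None)  # did the selector match?
if article is not None:
    print(article.get_text(strip=True)[:200])  # preview the extracted text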

Step 2: Create the Website Module

Create a new file in kiosque/website/ named after the publication:

File naming convention:

  • Lowercase, no spaces
  • Use publication name: lemonde.py, nytimes.py, washingtonpost.py
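
For example, a module for a hypothetical publication "Example News" would live at kiosque/website/examplenews.py and start with the skeleton below, which Step 3 fleshes out:

from ..core.website import Website

class ExampleNews(Website):
    base_url = "https://examplenews.com/"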

Step 3: Implement the Website Class

Here's a complete template with explanations:

from typing import ClassVar
from ..core.website import Website

class YourWebsiteName(Website):
    # REQUIRED: Base URL of the website (must end with /)
    base_url = "https://example.com/"

    # OPTIONAL: Login URL (only if authentication required)
    login_url = "https://example.com/login"

    # OPTIONAL: Short aliases for CLI usage
    alias: ClassVar = ["example", "ex"]

    # OPTIONAL: HTML elements to remove (list of tags or (tag, attributes) tuples)
    clean_nodes: ClassVar = [
        "figure",  # Remove all <figure> tags
        "aside",   # Remove all <aside> tags
        ("div", {"class": "advertisement"}),  # Remove specific divs
        ("section", {"class": ["social", "related"]}),  # Multiple classes
    ]

    # OPTIONAL: Elements to strip all attributes from
    clean_attributes: ClassVar = ["h2", "blockquote"]

    # REQUIRED: Extract article content from page
    def article(self, url):
        """Return the BeautifulSoup element containing article text."""
        soup = self.bs4(url)  # Fetch and parse HTML

        # Find the main article container
        # Option 1: Single element
        article = soup.find("article", class_="main-content")

        # Option 2: Multiple possible selectors
        if article is None:
            article = soup.find("div", class_="article-body")

        # Option 3: Complex selection
        # article = soup.find("div", {"id": "content"}).find("article")

        return article

    # OPTIONAL: Custom login logic (only if authentication required)
    @property
    def login_dict(self):
        """Return form data for login POST request."""
        credentials = self.credentials
        assert credentials is not None

        # Simple case: just username and password
        return {
            "username": credentials["username"],
            "password": credentials["password"],
        }

        # Complex case: CSRF token from login page
        # response = get_with_retry(self.login_url)
        # soup = BeautifulSoup(response.content, features="lxml")
        # token = soup.find("input", {"name": "csrf_token"})["value"]
        # return {
        #     "email": credentials["username"],
        #     "password": credentials["password"],
        #     "csrf_token": token,
        # }

    # OPTIONAL: Additional cleanup (if declarative cleanup isn't enough)
    def clean(self, article):
        """Perform custom cleanup transformations."""
        # Always call parent implementation first
        article = super().clean(article)

        # Example: Convert <h3> tags to <blockquote>
        for elem in article.find_all("h3"):
            elem.name = "blockquote"
            elem.attrs.clear()

        # Example: Remove empty paragraphs
        for p in article.find_all("p"):
            if not p.get_text(strip=True):
                p.decompose()

        return article

Website Scraper Implementation Guide

Minimal Implementation (No Authentication)

For public websites like The Guardian:

from typing import ClassVar
from ..core.website import Website

class TheGuardian(Website):
    base_url = "https://www.theguardian.com/"

    clean_nodes: ClassVar = ["figure", "aside"]

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("div", class_="article-body-viewer-selector")

Simple Authentication

For websites with straightforward login forms:

from typing import ClassVar
from ..core.website import Website

class SimpleNews(Website):
    base_url = "https://simplenews.com/"
    login_url = "https://simplenews.com/auth/login"

    @property
    def login_dict(self):
        return {
            "email": self.credentials["username"],
            "password": self.credentials["password"],
        }

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("article", class_="content")

Complex Authentication (CSRF Token)

For websites that require CSRF tokens or multi-step login:

from typing import ClassVar
from bs4 import BeautifulSoup
from ..core.client import get_with_retry
from ..core.website import Website

class ComplexNews(Website):
    base_url = "https://complexnews.com/"
    login_url = "https://secure.complexnews.com/login"

    @property
    def login_dict(self):
        credentials = self.credentials
        assert credentials is not None

        # Fetch login page to get CSRF token
        response = get_with_retry(self.login_url)
        soup = BeautifulSoup(response.content, features="lxml")

        # Extract token from hidden input (the input name varies by site;
        # find() returns None if the selector doesn't match)
        token_input = soup.find("input", {"name": "_token"})
        assert token_input is not None
        token = token_input["value"]

        return {
            "username": credentials["username"],
            "password": credentials["password"],
            "_token": token,
            "remember_me": 1,
        }

    def article(self, url):
        soup = self.bs4(url)
        return soup.find("div", class_="article-content")

Advanced Cleanup

For websites with complex HTML that needs transformation:

def clean(self, article):
    """Example: Transform Guardian-style article HTML."""
    article = super().clean(article)

    # Convert styled headings to proper semantic headings
    for elem in article.find_all("p", class_="heading"):
        elem.name = "h3"
        elem.attrs.clear()

    # Remove empty paragraphs
    for p in article.find_all("p"):
        if not p.get_text(strip=True):
            p.decompose()

    # Unwrap unnecessary div wrappers
    for div in article.find_all("div", class_="paragraph-wrapper"):
        div.unwrap()

    return article

Testing Your Implementation

1. Manual Testing

# Test article download (public website)
uv run kiosque https://example.com/article test.md

# Test with verbose logging (to see login flow)
uv run kiosque -v https://paywalled.com/article test.md

2. Configure Credentials (for paywalled sites)

Edit ~/.config/kiosque/kiosque.conf:

[https://example.com/]
username = your.email@example.com
password = your_password

3. Write Tests

Create or update tests/test_login.py to add your website:

# Add your website credentials to kiosque.conf, then:
# The parameterized test will automatically test your login flow

# If login is broken or website structure changed:
KNOWN_BROKEN_LOGINS = {
    # ... existing entries ...
    "https://yourwebsite.com/",  # Brief reason why broken
}
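
For context, the parameterized login test looks roughly like the sketch below; the CONFIGURED_WEBSITES list is hypothetical, as the real test in tests/test_login.py builds its parameters from the registered websites:

import pytest

# Hypothetical parameter list - the actual test derives it automatically
CONFIGURED_WEBSITES = ["https://example.com/", "https://yourwebsite.com/"]

@pytest.mark.login
@pytest.mark.parametrize("base_url", CONFIGURED_WEBSITES)
def test_website_login(base_url):
    if base_url in KNOWN_BROKEN_LOGINS:
        pytest.skip("login known to be broken")
    ...  # log in with credentials from kiosque.conf and assert success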

4. Run Tests

# Run all tests
uv run pytest

# Run specific test for your website (if configured)
uv run pytest tests/test_login.py -k "yourwebsite"

# Run with output to debug
uv run pytest tests/test_login.py -v -s

Code Style Guidelines

Formatting and Linting

Always run before committing:

# Format code
uv run ruff format .

# Check for issues
uv run ruff check .

# Fix auto-fixable issues
uv run ruff check --fix .

Type Hints

Use modern type syntax (Python 3.12+):

# Good ✓
from typing import ClassVar

clean_nodes: ClassVar = ["figure", "aside"]

def article(self, url: str) -> BeautifulSoup | None:
    ...

# Bad ✗
from typing import Optional, List, Type

clean_nodes = ["figure", "aside"]  # Missing ClassVar

def article(self, url):  # Missing type hints
    ...

Import Organization

# Standard library
from typing import ClassVar

# Third-party
from bs4 import BeautifulSoup

# Local imports (relative)
from ..core.client import get_with_retry
from ..core.website import Website

Class Variable Annotations

All mutable class attributes must be annotated with ClassVar:

# Good ✓
clean_nodes: ClassVar = ["figure"]
clean_attributes: ClassVar = ["h2"]
alias: ClassVar = ["example"]

# Bad ✗
clean_nodes = ["figure"]  # Missing ClassVar annotation

Docstrings

Add docstrings for complex methods:

def article(self, url):
    """Extract main article content from the page.

    Args:
        url: Full URL of the article

    Returns:
        BeautifulSoup element containing article text, or None if not found
    """
    ...

Submitting a Pull Request

Before Submitting

  1. Test your implementation:

# Download multiple articles
uv run kiosque https://example.com/article1 test1.md
uv run kiosque https://example.com/article2 test2.md

# Verify Markdown output is clean and complete
bat test1.md  # or open in your editor

  2. Run all checks:

uv run ruff format .
uv run ruff check .
uv run pytest

  3. Update documentation:
    • Add website to Supported Sites in the appropriate language section
    • Indicate if authentication is supported (☑️)

PR Guidelines

  1. Branch naming: add-website-<name> (e.g., add-website-nytimes)

  2. Commit message format:

Add support for Example News website

- Implements article extraction for example.com
- Adds authentication support with CSRF token handling
- Includes test for login flow

  3. PR description should include:
    • Brief description of the website
    • Whether authentication is supported
    • Any special considerations (e.g., "Login requires solving captcha manually")
    • Example article URLs you tested

  4. Don't commit credentials:
    • kiosque.conf is in .gitignore - never commit it
    • Don't include passwords or API tokens in code or PR

PR Template

## Description

Adds support for Example News (https://example.com/)

## Authentication

- [x] Requires subscription
- [x] Login flow implemented
- [ ] Publicly accessible

## Testing

Tested with the following articles:

- https://example.com/article1
- https://example.com/article2

## Checklist

- [x] Code formatted with `ruff format`
- [x] No linting errors (`ruff check`)
- [x] Tests pass (`pytest`)
- [x] Added website to [Supported Sites](../websites/supported-sites.md)
- [x] Tested article extraction manually

Common Patterns and Gotchas

Finding Article Elements

# Try multiple selectors
def article(self, url):
    soup = self.bs4(url)

    # Try primary selector
    article = soup.find("article", class_="main")

    # Fallback to alternative
    if article is None:
        article = soup.find("div", id="content")

    return article

Handling CSRF Tokens

from bs4 import BeautifulSoup
from ..core.client import get_with_retry

@property
def login_dict(self):
    # Fetch login page
    response = get_with_retry(self.login_url)
    soup = BeautifulSoup(response.content, features="lxml")

    # Find token (adapt selector to website)
    token = soup.find("input", {"name": "csrf"})["value"]

    return {
        "username": self.credentials["username"],
        "password": self.credentials["password"],
        "csrf": token,
    }

Removing Unwanted Elements

# Declarative (preferred for simple cases)
clean_nodes: ClassVar = [
    "figure",
    ("div", {"class": "ad"}),
]

# Imperative (for complex logic)
def clean(self, article):
    article = super().clean(article)

    # Remove by complex criteria
    for elem in article.find_all("div"):
        if "ad" in elem.get("class", []):
            elem.decompose()

    return article

Debugging Tips

# Add verbose logging
import logging

def article(self, url):
    soup = self.bs4(url)
    article = soup.find("article")

    if article is None:
        logging.error(f"Could not find article in {url}")
        logging.debug(f"Page HTML: {soup.prettify()[:500]}")

    return article

Getting Help

  • Issues: Open a GitHub issue with the website-support label
  • Questions: Start a discussion in GitHub Discussions
  • Documentation: See Architecture for system design
  • Examples: Check kiosque/website/ for 30+ real implementations

Thank You!

Your contributions help make journalism more accessible. Thank you for supporting Kiosque! 🎉