# Architecture Overview

## Project Structure
```text
kiosque/
├── kiosque/
│   ├── core/                # Core functionality
│   │   ├── client.py        # HTTP client with retry logic
│   │   ├── config.py        # Configuration loading & validation
│   │   └── website.py       # Base Website class
│   ├── website/             # Individual website scrapers (30+ files)
│   │   ├── lemonde.py
│   │   ├── nytimes.py
│   │   └── ...
│   ├── api/                 # External API integrations
│   │   ├── raindrop.py      # Raindrop.io bookmarks
│   │   └── pocket.py        # Pocket (deprecated)
│   ├── tui/                 # Terminal UI
│   │   ├── tui.py           # Textual TUI application
│   │   └── kiosque.tcss     # TUI styles
│   └── __init__.py          # CLI entry point
├── tests/                   # Test suite
└── pyproject.toml           # Dependencies & metadata
```
## Component Overview

### Core Layer (`kiosque/core/`)

#### `client.py` - HTTP Client
Provides HTTP request wrappers with automatic retry logic:
- `get_with_retry(url, **kwargs)` - GET request with exponential backoff
- `post_with_retry(url, **kwargs)` - POST request with exponential backoff
- Retry strategy: 3 attempts, exponential backoff (stamina library)
- Timeout: 30 seconds per request
- Shared client: single `httpx.Client` instance for connection pooling
```python
from kiosque.core.client import get_with_retry

response = get_with_retry("https://example.com/article")
```
#### `config.py` - Configuration Management
Handles loading and validating user configuration:
- Configuration file: `~/.config/kiosque/kiosque.conf` (or under `$XDG_CONFIG_HOME`)
- Format: INI file with website credentials and API tokens
- Validation: Pydantic models ensure data integrity
  - `WebsiteCredentials` - Username/password pairs
  - `RaindropConfig` - API token validation
- Access: the `config_dict` dictionary maps URLs to credentials
```python
from kiosque.core.config import config_dict, configuration_file

credentials = config_dict.get("https://www.lemonde.fr/")
# Returns: {"username": "...", "password": "..."}
```
#### `website.py` - Base Website Class
Abstract base class for all website scrapers:
Key Components:

- Class attributes:
  - `base_url`: Website home URL (required)
  - `login_url`: Authentication endpoint (optional)
  - `alias`: List of short names for the CLI (optional)
  - `clean_nodes`: HTML elements to remove (optional)
  - `clean_attributes`: Elements to strip attributes from (optional)
  - `header_entries`: Custom HTTP headers (optional)
- Core methods:
  - `instance(url)` - Factory method, returns the appropriate Website subclass
  - `bs4(url)` - Fetch a URL and return a BeautifulSoup object
  - `login()` - Authenticate with the website
  - `article(url)` - Extract the article HTML element
  - `clean(article)` - Remove unwanted elements
  - `full_text(url)` - Convert the article to Markdown
  - `save(url, filename)` - Download the article to a file
- Performance features:
  - Module caching: website implementations are discovered once, then cached
  - Connection pooling: shared HTTP client across requests
  - Retry logic: automatic retry with backoff on network failures
```python
from kiosque.core.website import Website

# Get website-specific scraper
website = Website.instance("https://www.lemonde.fr/article")

# Extract article as Markdown
markdown = website.full_text("https://www.lemonde.fr/article")
```
### Website Layer (`kiosque/website/`)

Each file implements a website-specific scraper by subclassing `Website`:
Implementation Pattern:
```python
from typing import ClassVar

from ..core.website import Website


class ExampleSite(Website):
    base_url = "https://example.com/"
    login_url = "https://example.com/login"  # If auth required

    # Optional: CSS selectors to remove
    clean_nodes: ClassVar = [
        "figure",
        ("div", {"class": "advertisement"}),
    ]

    # Optional: Elements to strip attributes from
    clean_attributes: ClassVar = ["h2", "blockquote"]

    # Optional: Custom login flow
    @property
    def login_dict(self):
        return {
            "username": self.credentials["username"],
            "password": self.credentials["password"],
        }

    # Required: Extract article content
    def article(self, url):
        soup = self.bs4(url)
        return soup.find("article", class_="main-content")

    # Optional: Additional cleanup
    def clean(self, article):
        article = super().clean(article)
        # Custom transformations
        return article
```
Key Customization Points:
- `login_dict` property: Custom authentication payload
- `article(url)` method: Locate the article content in the page
- `clean(article)` method: Website-specific HTML cleanup
- Class attributes: Declarative cleanup rules
### API Layer (`kiosque/api/`)

#### `raindrop.py` - Raindrop.io Integration
Fetches and manages bookmarks from Raindrop.io:
- Authentication: API token from the configuration file
- Data models: Pydantic models for type safety
  - `RaindropTag` - Bookmark tags
  - `RaindropItem` - Individual bookmark
- Operations:
  - `retrieve()` - Fetch all unsorted bookmarks
  - `async_retrieve()` - Async bookmark fetching
  - `action(item_id, action)` - Archive or delete bookmarks
```python
from kiosque.api.raindrop import Raindrop, RaindropItem

raindrop = Raindrop(token="your_token")
items = raindrop.retrieve()  # List[RaindropItem]

for item in items:
    print(f"{item.title}: {item.link}")
```
### TUI Layer (`kiosque/tui/`)

#### `tui.py` - Terminal User Interface
Interactive bookmark browser built with Textual:
Components:

- `RaindropTUI` (main app):
  - Displays bookmarks from Raindrop.io
  - Auto-refresh every 5 minutes
  - Keyboard-driven interface
- `MarkdownModalScreen` (preview modal):
  - Shows article content in a scrollable modal
  - Syntax highlighting for Markdown
  - Escape to close
Key Features:
- Async bookmark fetching (non-blocking)
- Error handling with user-friendly messages
- Persistent state (selected item across refreshes)
- Browser integration (open URLs externally)
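As a rough illustration of this structure, here is a simplified, hypothetical Textual sketch with a bookmark list, periodic refresh, and a Markdown preview modal. The class names and the `fetch_bookmarks()` helper are placeholders; the real `RaindropTUI` and `MarkdownModalScreen` differ in their details.

```python
from textual.app import App, ComposeResult
from textual.binding import Binding
from textual.screen import ModalScreen
from textual.widgets import Footer, Label, ListItem, ListView, Markdown


async def fetch_bookmarks() -> list:
    """Placeholder for Raindrop.async_retrieve()."""
    return []


class PreviewModal(ModalScreen):
    """Scrollable Markdown preview, closed with Escape (stand-in for MarkdownModalScreen)."""

    BINDINGS = [Binding("escape", "close", "Close")]

    def __init__(self, markdown: str) -> None:
        super().__init__()
        self._markdown = markdown

    def compose(self) -> ComposeResult:
        yield Markdown(self._markdown)

    def action_close(self) -> None:
        self.dismiss()


class BookmarkApp(App):
    """Stand-in for RaindropTUI: keyboard-driven bookmark list with auto-refresh."""

    def compose(self) -> ComposeResult:
        yield ListView()
        yield Footer()

    async def on_mount(self) -> None:
        await self.refresh_bookmarks()
        self.set_interval(300, self.refresh_bookmarks)  # auto-refresh every 5 minutes

    async def refresh_bookmarks(self) -> None:
        items = await fetch_bookmarks()
        list_view = self.query_one(ListView)
        await list_view.clear()
        for item in items:
            await list_view.append(ListItem(Label(item.title)))

    def on_list_view_selected(self, event: ListView.Selected) -> None:
        # In the real app this would run Website.full_text() on the bookmark URL.
        self.push_screen(PreviewModal("# Preview\n\n(placeholder content)"))
```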
### CLI Layer (`kiosque/__init__.py`)
Command-line interface built with Click:
Commands:
```python
import click


@click.group()
def cli():
    """Main entry point"""


@cli.command()
@click.argument("url")
@click.argument("output", required=False, default=None)
def download(url, output):
    """Download article to file"""


@cli.command()
def tui():
    """Launch TUI"""
```
Features:
- Logging configuration (verbose mode with `-v`; see the sketch below)
- Error handling with user-friendly messages
- Configuration file auto-creation
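For example, the verbose flag might be wired to the logging level roughly like this (a sketch only; the option name and levels in kiosque may differ):

```python
import logging

import click


@click.group()
@click.option("-v", "--verbose", is_flag=True, help="Enable debug logging.")
def cli(verbose):
    """Main entry point"""
    logging.basicConfig(level=logging.DEBUG if verbose else logging.WARNING)
```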
## Data Flow

### Article Download Flow
```text
User URL
   ↓
Website.instance(url)      # Get scraper for URL
   ↓
website.login()            # Authenticate if credentials exist
   ↓
website.bs4(url)           # Fetch HTML with retry logic
   ↓
website.article(url)       # Extract article element
   ↓
website.clean(article)     # Remove unwanted elements
   ↓
pypandoc.convert_text()    # HTML → Markdown
   ↓
Save to file or return string
```
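The same pipeline, expressed as a hedged Python sketch. The real `full_text()`/`save()` methods wrap these steps internally and may differ in details such as error handling and pandoc options:

```python
import pypandoc

from kiosque.core.website import Website


def download_article(url: str, filename: str) -> None:
    """Illustrative end-to-end flow, not the verbatim kiosque implementation."""
    website = Website.instance(url)   # pick the scraper matching the URL
    website.login()                   # skipped when no credentials are configured
    article = website.article(url)    # fetch HTML (with retries) and locate the article element
    article = website.clean(article)  # strip figures, ads, unwanted attributes
    markdown = pypandoc.convert_text(str(article), "markdown", format="html")
    with open(filename, "w", encoding="utf-8") as fh:
        fh.write(markdown)
```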
### TUI Bookmark Flow
```text
Launch TUI
   ↓
Raindrop.async_retrieve()        # Fetch bookmarks
   ↓
Display in ListView
   ↓
User selects item → Enter
   ↓
Website.instance(item.link)      # Get scraper
   ↓
website.full_text(item.link)     # Extract & convert
   ↓
MarkdownModalScreen(markdown)    # Show in modal
```
## Key Design Patterns

### Factory Pattern

`Website.instance(url)` returns the appropriate subclass based on the URL:
```python
# Discovers and caches website modules
website = Website.instance("https://www.lemonde.fr/article")
# Returns: LeMonde instance
```
### Template Method Pattern

The `Website` class defines the article extraction workflow:

1. `bs4()` - Fetch HTML (implemented in the base class)
2. `article()` - Extract content (overridden by subclasses)
3. `clean()` - Remove unwanted elements (optionally overridden)
4. Convert to Markdown (implemented in the base class)
### Configuration as Code
Website cleanup rules declared as class attributes:
```python
clean_nodes: ClassVar = [
    "figure",                  # Remove all <figure> tags
    ("div", {"class": "ad"}),  # Remove specific divs
]
```
### Retry Decorator Pattern
Network requests wrapped with automatic retry:
```python
import httpx
import stamina

client = httpx.Client(timeout=30)  # shared instance, 30 s timeout

@stamina.retry(on=httpx.HTTPError, attempts=3)
def get_with_retry(url, **kwargs):
    return client.get(url, **kwargs)
```
## Error Handling Strategy

### Network Errors
- Strategy: Automatic retry with exponential backoff
- Implementation: `stamina` library (3 attempts)
- Timeout: 30 seconds per request
- User feedback: Logging warnings on retry
### Authentication Errors
- Missing credentials: Skip login, attempt public access
- Invalid credentials: Raise exception with helpful message
- Changed login flow: Log error, suggest updating scraper
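A hedged sketch of the first rule (illustration only; the real `Website.login()` may be structured differently):

```python
from kiosque.core.client import post_with_retry


class SketchWebsite:
    """Illustration only, not the real base class."""

    login_url = "https://example.com/login"
    credentials = None  # populated from config_dict when the site is configured
    login_dict = {}     # subclass-provided authentication payload

    def login(self):
        if not self.credentials:
            return  # missing credentials: skip login, attempt public access
        response = post_with_retry(self.login_url, data=self.login_dict)
        response.raise_for_status()  # invalid credentials surface as an HTTP error
```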
### Parsing Errors
- Missing article element: `NotImplementedError` with context
- Unsupported URL: `ValueError` with the list of supported websites
- Invalid HTML: BeautifulSoup handles it gracefully
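As an illustration of the first rule (hypothetical code and message, not the exact strings kiosque raises):

```python
from kiosque.core.website import Website


class ExampleSite(Website):
    def article(self, url):
        node = self.bs4(url).find("article", class_="main-content")
        if node is None:
            # Missing article element: fail with enough context to fix the scraper
            raise NotImplementedError(f"No article element found at {url}")
        return node
```

`Website.instance()` covers the second rule by raising `ValueError` when no scraper's `base_url` matches the given URL.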
### Configuration Errors
- Missing config file: Auto-create template on first run
- Invalid config format: Pydantic validation errors
- Missing API token: Skip Raindrop.io features
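For instance, an invalid credentials entry surfaces as a Pydantic `ValidationError` (field names here are illustrative, not necessarily the exact kiosque models):

```python
from pydantic import BaseModel, ValidationError


class WebsiteCredentials(BaseModel):
    username: str
    password: str


try:
    WebsiteCredentials(username="alice")  # password missing from the config file
except ValidationError as exc:
    print(exc)  # reports exactly which field is missing or malformed
```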
## Performance Considerations

### HTTP Performance
- Connection pooling: single shared `httpx.Client`
- Timeout: 30 s prevents hanging on slow servers
- Retry logic: max 3 attempts to avoid excessive delays
### Module Loading
- Lazy imports: Website modules imported only when needed
- Module cache: Discovery happens once, results cached
- Dict lookup: O(1) website selection by URL
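A hedged sketch of this discover-once-then-look-up idea (the real kiosque code may organize discovery differently):

```python
import importlib
import pkgutil
from functools import cache

import kiosque.website
from kiosque.core.website import Website


@cache
def website_by_base_url() -> dict[str, type[Website]]:
    """Import every scraper module once and index the classes by base_url."""
    mapping: dict[str, type[Website]] = {}
    for info in pkgutil.iter_modules(kiosque.website.__path__):
        module = importlib.import_module(f"kiosque.website.{info.name}")
        for obj in vars(module).values():
            if isinstance(obj, type) and issubclass(obj, Website) and obj is not Website:
                mapping[obj.base_url] = obj
    return mapping
```

Subsequent calls hit the cached dictionary, so selecting a scraper is a plain dict lookup rather than repeated imports.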
### TUI Performance
- Async operations: Non-blocking bookmark fetching
- Background refresh: Auto-refresh doesn't block UI
- Efficient rendering: Textual framework handles DOM diffing
## Testing Strategy

See the `tests/` directory for the implementation:
- Unit tests:
  - Configuration validation (`test_config.py`)
  - Website base class (`test_website.py`)
  - API models (`test_raindrop.py`)
- Integration tests:
  - Real login flows (`test_login.py`)
  - HTTP request/response handling
  - End-to-end article extraction (manual testing)
- Mocking strategy:
  - Mock the HTTP client for unit tests
  - Real HTTP for integration tests
  - Mark known-broken logins as `xfail`
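A hedged illustration of this strategy: `httpx.MockTransport` keeps a unit test offline, and an `xfail` marker documents a login known to be broken. The test names and bodies are hypothetical, not copied from the kiosque suite.

```python
import httpx
import pytest


def test_article_parsing_offline():
    """Unit-test style: serve canned HTML through a mock transport, no network."""
    html = "<html><body><article><p>Hello</p></article></body></html>"
    transport = httpx.MockTransport(lambda request: httpx.Response(200, text=html))
    client = httpx.Client(transport=transport)
    assert "Hello" in client.get("https://example.com/article").text


@pytest.mark.xfail(reason="upstream login flow changed", strict=False)
def test_example_login():
    """Integration-test style: real HTTP login, marked xfail while it is broken."""
    raise AssertionError("placeholder for a real login check")
```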
## Dependencies

### Core Dependencies
- httpx - Modern HTTP client with connection pooling
- stamina - Retry logic with exponential backoff
- beautifulsoup4 - HTML parsing
- lxml - Fast HTML parser backend
- pypandoc - HTML to Markdown conversion
- pydantic - Configuration validation
### TUI Dependencies
- textual - Modern TUI framework
### Development Dependencies
- pytest - Testing framework
- ruff - Fast linting and formatting
- ty - Fast type checking (alternative to mypy)
## Future Architecture Considerations
See plan.md for detailed future work, including:
- Plugin Architecture: Modular source system (Raindrop, GitHub, Firefox)
- Action Registry: Context-aware actions based on URL patterns
- Async/Await: Parallel article downloads
- Caching: Local cache for downloaded articles
- Extensibility: User-defined website scrapers
## Related Documentation
- Quick Start Guide - Getting started with Kiosque
- Contributing Guide - How to add new websites
- Troubleshooting - Common issues and solutions
- Supported Websites - Complete list of supported sites