Article Extraction
Extract full-text articles from paywalled news websites.
Overview
Kiosque can download complete articles from major news publications, converting them to clean Markdown format with metadata preserved.
Basic Usage
Command Line
# Extract to file
kiosque https://www.lemonde.fr/article.html output.md
# Print to stdout
kiosque https://www.nytimes.com/article.html -
# Use with pipes
kiosque https://www.lemonde.fr/article.html - | grep "keyword"
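To drive the command-line tool from a script, one option is to wrap it with `subprocess`. The sketch below assumes `kiosque` is installed and on the `PATH`; `build_command` and `extract_to_stdout` are hypothetical helper names and are not part of Kiosque.

```python
import subprocess


def build_command(url: str, output: str = "-") -> list[str]:
    """Build the argument list for `kiosque URL OUTPUT` ("-" means stdout)."""
    return ["kiosque", url, output]


def extract_to_stdout(url: str) -> str:
    """Run kiosque and return the Markdown it prints.

    Requires the kiosque executable; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_command(url), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.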
Python API
from kiosque import Website
# Extract article
url = "https://www.lemonde.fr/article.html"
website = Website.instance(url)
markdown = website.full_text(url)
# Get metadata
title = website.title(url)
author = website.author(url)
date = website.date(url)
Output Format
Articles are extracted as Markdown with YAML frontmatter:
---
title: Article Title
author: Author Name
date: 2025-12-31
url: https://www.example.com/article
description: Article summary
---
# Article Title
**By Author Name** · 2025-12-31
_Article summary_
---
Article content in clean Markdown format...
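Downstream tools often need the metadata separately from the article body. The following is a minimal sketch of splitting the frontmatter from the rest of the document, assuming flat `key: value` pairs as in the example above. `split_frontmatter` is a hypothetical helper, not a Kiosque function; values that are quoted, nested, or multi-line would need a real YAML parser.

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body) at the leading '---' fences."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no frontmatter at all
    # Find the first closing fence after the opening one.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return {}, markdown  # unterminated frontmatter: leave untouched
    metadata = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs keep their "https://" part.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata, "\n".join(lines[end + 1:])


sample = """---
title: Article Title
url: https://www.example.com/article
---
# Article Title
Article content..."""

metadata, body = split_frontmatter(sample)
print(metadata["title"])  # Article Title
print(metadata["url"])    # https://www.example.com/article
```

Only the first closing `---` ends the frontmatter, so the horizontal rule that appears later in the body is left intact.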
Supported Sites
See Supported Sites for the complete list.
Major publications include:
- English: New York Times, The Guardian, Financial Times, The Atlantic
- French: Le Monde, Le Figaro, Les Échos, Mediapart
Authentication
Many sites require a paid subscription. Configure credentials in ~/.config/kiosque/kiosque.conf:
[https://www.lemonde.fr/]
username = your.email@example.com
password = your_password
[https://www.nytimes.com/]
cookie_nyt_s = your_nyt_cookie_value
See Authentication Guide for site-specific setup.
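The configuration above is INI-style, with one section per site URL. As an illustration only, the sketch below reads such a file with the standard-library `configparser` and looks up the section whose name is a prefix of an article URL. Both the prefix-matching rule and the helper name `credentials_for` are assumptions; Kiosque's own lookup may work differently.

```python
import configparser

SAMPLE = """\
[https://www.lemonde.fr/]
username = your.email@example.com
password = your_password

[https://www.nytimes.com/]
cookie_nyt_s = your_nyt_cookie_value
"""


def credentials_for(config_text: str, url: str) -> dict[str, str]:
    """Return the options of the first section whose name is a prefix of url."""
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    for section in parser.sections():
        if url.startswith(section):
            return dict(parser[section])
    return {}  # no matching site configured


creds = credentials_for(SAMPLE, "https://www.lemonde.fr/article.html")
print(creds["username"])  # your.email@example.com
```

Because the file stores passwords and session cookies in plain text, it is worth restricting its permissions to the owning user.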
Proxy Support
Bypass geo-blocking with SOCKS or HTTP proxies:
# Command line
kiosque --proxy socks5://localhost:8765 https://www.nytimes.com/article.html output.md
# Configuration file
[proxy]
url = socks5://localhost:8765
Supported proxy URL schemes: socks5://, socks4://, http://, https://
See Troubleshooting for detailed proxy setup instructions.
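A malformed proxy URL is an easy mistake to make, so it can help to validate the value before using it. The sketch below checks a URL against the four schemes listed above using `urllib.parse`. `check_proxy_url` is a hypothetical helper and does not reflect how Kiosque validates its `--proxy` option.

```python
from urllib.parse import urlsplit

# Schemes listed in the documentation above.
SUPPORTED_SCHEMES = {"socks5", "socks4", "http", "https"}


def check_proxy_url(proxy_url: str) -> str:
    """Return proxy_url unchanged, or raise ValueError if it looks unusable."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme in {proxy_url!r}")
    if not parts.hostname:
        raise ValueError(f"missing proxy host in {proxy_url!r}")
    return proxy_url


check_proxy_url("socks5://localhost:8765")  # passes
```

A URL written without a scheme, such as `localhost:8765`, is rejected by the first check.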
Advanced Options
Retry Logic
Kiosque automatically retries failed requests with exponential backoff:
- 3 attempts by default
- Delay doubles after each failed attempt: 2s, 4s, 8s
- Handles temporary network issues
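The retry behaviour described above follows the general exponential backoff pattern, which can be sketched as below. This is not Kiosque's implementation: `retry_with_backoff` and its parameters are hypothetical, and the sketch catches every `Exception` for brevity where a real client would retry only on network errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, then 4s, then 8s, ...
    raise ValueError("attempts must be at least 1")
```

With the defaults shown here, three attempts use only the 2s and 4s waits; a fourth attempt would wait 8s first. The `sleep` parameter exists so that tests can substitute a recorder for the real delay.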
PDF Downloads
Some publications provide downloadable PDF editions (front pages or complete magazine issues):
# Download New York Times front page (daily)
kiosque nyt
# Download Le Monde Diplomatique issue (monthly)
kiosque lmd
# Download Courrier International issue (weekly)
kiosque courrier
# Download Pour la Science issue (monthly)
kiosque pls
PDFs are saved to the current directory with timestamped filenames (e.g., nyt-frontpage-2025-12-31.pdf).
Publications with PDF support:
| Publication | Alias(es) | Type | Frequency | Auth Required |
|---|---|---|---|---|
| New York Times | nyt, nytimes | Front page | Daily | Cookie-based |
| Le Monde Diplomatique | lmd, diplomatique | Full issue | Monthly | Yes |
| Courrier International | courrier | Full issue | Weekly | Yes |
| Pour la Science | pls | Full issue | Monthly | Yes |
Note: Courrier International may be geo-blocked outside France/Europe and require a proxy.
Python API:
from kiosque.website.nytimes import NewYorkTimes
# Download today's front page PDF
nyt = NewYorkTimes()
nyt.save_latest_issue() # Saves to current directory
# Get PDF URL
pdf_url = nyt.latest_issue_url() # https://static01.nyt.com/images/2025/12/31/nytfrontpage/scan.pdf
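The example URL and filename above both embed the issue date. As a rough illustration, the sketch below rebuilds them for an arbitrary date. The patterns are inferred only from the single example URL and the filename `nyt-frontpage-2025-12-31.pdf` in this section, so treat them as assumptions; `front_page_url` and `front_page_filename` are hypothetical helpers, and the supported route remains `latest_issue_url()`.

```python
from datetime import date


def front_page_url(day: date) -> str:
    """Build a front-page PDF URL following the pattern of the example above."""
    return f"https://static01.nyt.com/images/{day:%Y/%m/%d}/nytfrontpage/scan.pdf"


def front_page_filename(day: date) -> str:
    """Build a timestamped filename such as nyt-frontpage-2025-12-31.pdf."""
    return f"nyt-frontpage-{day.isoformat()}.pdf"


print(front_page_url(date(2025, 12, 31)))
# https://static01.nyt.com/images/2025/12/31/nytfrontpage/scan.pdf
```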
Custom Headers
Override HTTP headers for specific sites:
from kiosque.core.client import client
client.headers.update({
    "User-Agent": "Custom User Agent",
    "Referer": "https://example.com",
})
Troubleshooting
Common issues:
- 403 Forbidden: Check authentication credentials
- Geo-blocked: Use a proxy server
- Incomplete content: The site may have changed its HTML structure
- Rate limiting: Wait and retry later
See Troubleshooting Guide for more help.
Next Steps
- Supported Sites - Full list of websites
- Authentication - Set up site logins
- Contributing - Add your favorite site