
Article Extraction

Extract full-text articles from paywalled news websites.


Overview

Kiosque downloads complete articles from major news publications and converts them to clean Markdown, preserving metadata such as title, author, and date.

Basic Usage

Command Line

# Extract to file
kiosque https://www.lemonde.fr/article.html output.md

# Print to stdout
kiosque https://www.nytimes.com/article.html -

# Use with pipes
kiosque https://www.lemonde.fr/article.html - | grep "keyword"

Python API

from kiosque import Website

# Extract article
url = "https://www.lemonde.fr/article.html"
website = Website.instance(url)
markdown = website.full_text(url)

# Get metadata
title = website.title(url)
author = website.author(url)
date = website.date(url)

Output Format

Articles are extracted as Markdown with YAML frontmatter:

---
title: Article Title
author: Author Name
date: 2025-12-31
url: https://www.example.com/article
description: Article summary
---

# Article Title

**By Author Name** · 2025-12-31

_Article summary_

---

Article content in clean Markdown format...
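
When post-processing extracted files, it can help to separate the frontmatter from the article body. The following is a minimal sketch using only the standard library; the function name `split_frontmatter` is hypothetical and not part of Kiosque. It assumes flat `key: value` lines as in the sample above. Quoted or nested YAML values would need a real YAML parser.

```python
def split_frontmatter(document: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (metadata, body).

    Hypothetical helper: handles only flat `key: value` frontmatter.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter present

    # Locate the first closing delimiter after the opening one.
    end = None
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            end = index
            break
    if end is None:
        return {}, document  # unterminated frontmatter, leave untouched

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body


SAMPLE = """---
title: Article Title
author: Author Name
date: 2025-12-31
url: https://www.example.com/article
description: Article summary
---

# Article Title

Article content...
"""

metadata, body = split_frontmatter(SAMPLE)
print(metadata["title"], metadata["date"])
```

Only the first closing `---` is treated as the end of the frontmatter, so the horizontal rule further down in the article body is left alone.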

Supported Sites

See Supported Sites for the complete list.

Major publications include:

  • English: New York Times, The Guardian, Financial Times, The Atlantic
  • French: Le Monde, Le Figaro, Les Échos, Mediapart

Authentication

Many sites require a paid subscription. Configure credentials in ~/.config/kiosque/kiosque.conf:

[https://www.lemonde.fr/]
username = your.email@example.com
password = your_password

[https://www.nytimes.com/]
cookie_nyt_s = your_nyt_cookie_value

See Authentication Guide for site-specific setup.
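
The configuration shown above uses an INI-style layout with one section per site URL. As an illustration of how such a file can be read, here is a sketch using Python's standard `configparser`. The helper `credentials_for` and its prefix-matching rule are assumptions made for this example; how Kiosque itself matches URLs to sections is not described here.

```python
import configparser

SAMPLE_CONF = """
[https://www.lemonde.fr/]
username = your.email@example.com
password = your_password

[https://www.nytimes.com/]
cookie_nyt_s = your_nyt_cookie_value
"""


def credentials_for(url: str, conf_text: str) -> dict[str, str]:
    """Return the options of the first section whose name is a prefix of `url`.

    Hypothetical helper; prefix matching is an assumption of this sketch.
    """
    # Disable interpolation so a literal '%' in a password is not
    # interpreted as a configparser substitution.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    for section in parser.sections():
        if url.startswith(section):
            return dict(parser[section])
    return {}


print(credentials_for("https://www.lemonde.fr/article.html", SAMPLE_CONF))
```

Because the file holds passwords and session cookies in plain text, restricting its permissions to your own user is a sensible precaution.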


Proxy Support

Bypass geo-blocking with SOCKS or HTTP proxies:

# Command line
kiosque --proxy socks5://localhost:8765 https://www.nytimes.com/article.html output.md

# Configuration file (~/.config/kiosque/kiosque.conf)
[proxy]
url = socks5://localhost:8765

Supported formats: socks5://, socks4://, http://, https://
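
A proxy URL can be checked against the schemes listed above before it is passed to Kiosque. This is a small sketch using the standard library; `is_supported_proxy` is a hypothetical helper, not a Kiosque function.

```python
from urllib.parse import urlsplit

# Schemes taken from the list of supported formats above.
SUPPORTED_PROXY_SCHEMES = {"socks5", "socks4", "http", "https"}


def is_supported_proxy(proxy_url: str) -> bool:
    """Return True if the URL has a supported scheme and a host name."""
    parts = urlsplit(proxy_url)
    return parts.scheme in SUPPORTED_PROXY_SCHEMES and bool(parts.hostname)


for candidate in ("socks5://localhost:8765", "ftp://localhost:21", "localhost:8765"):
    print(candidate, is_supported_proxy(candidate))
```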

See Troubleshooting for detailed proxy setup instructions.


Advanced Options

Retry Logic

Kiosque automatically retries failed requests with exponential backoff:

  • 3 attempts by default
  • 2s, 4s, 8s delays between retries
  • Handles temporary network issues
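
The delay schedule above follows the usual doubling pattern. The sketch below illustrates the general technique and is not Kiosque's actual implementation. It assumes one initial try followed by up to three retries, so that the waits match the 2s, 4s, 8s listed, and it assumes only `OSError` subclasses (such as `ConnectionError`) are retried. The names `backoff_delays` and `call_with_retries` are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 3, base: float = 2.0) -> list[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** index for index in range(retries)]


def call_with_retries(
    func: Callable[[], T],
    retries: int = 3,
    base: float = 2.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call `func`, retrying on OSError with exponential backoff.

    Assumption of this sketch: one initial call plus up to `retries` retries.
    """
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return func()
        except OSError:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")


print(backoff_delays())
```

Passing `sleep` as a parameter keeps the function easy to test, since a recording function can stand in for the real wait.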

PDF Downloads

Some publications provide downloadable PDF editions (front pages or complete magazine issues):

# Download New York Times front page (daily)
kiosque nyt

# Download Le Monde Diplomatique issue (monthly)
kiosque lmd

# Download Courrier International issue (weekly)
kiosque courrier

# Download Pour la Science issue (monthly)
kiosque pls

PDFs are saved to the current directory with timestamped filenames (e.g., nyt-frontpage-2025-12-31.pdf).

Publications with PDF support:

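| Publication            | Alias(es)         | Type       | Frequency | Auth Required |
|------------------------|-------------------|------------|-----------|---------------|
| New York Times         | nyt, nytimes      | Front page | Daily     | Cookie-based  |
| Le Monde Diplomatique  | lmd, diplomatique | Full issue | Monthly   | Yes           |
| Courrier International | courrier          | Full issue | Weekly    | Yes           |
| Pour la Science        | pls               | Full issue | Monthly   | Yes           |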

Note: Courrier International may be geo-blocked outside France/Europe and may require a proxy.

Python API:

from kiosque.website.nytimes import NewYorkTimes

# Download today's front page PDF
nyt = NewYorkTimes()
nyt.save_latest_issue()  # Saves to current directory

# Get PDF URL
pdf_url = nyt.latest_issue_url()  # https://static01.nyt.com/images/2025/12/31/nytfrontpage/scan.pdf
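
The example URL in the comment above embeds the issue date in its path. Assuming that pattern holds for other days (an inference from that single example, not something Kiosque guarantees), a dated URL could be built as in the sketch below. `front_page_url` is a hypothetical helper and does not check that a PDF exists for the given day.

```python
from datetime import date


def front_page_url(day: date) -> str:
    """Build a front page PDF URL for `day`.

    Assumption: the path layout matches the single example URL shown above.
    """
    return f"https://static01.nyt.com/images/{day:%Y/%m/%d}/nytfrontpage/scan.pdf"


print(front_page_url(date(2025, 12, 31)))
```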

Custom Headers

Override HTTP headers for specific sites:

from kiosque.core.client import client

client.headers.update({
    "User-Agent": "Custom User Agent",
    "Referer": "https://example.com"
})

Troubleshooting

Common issues:

  • 403 Forbidden: Check authentication credentials
  • Geo-blocked: Use a proxy server
  • Incomplete content: The site may have changed its HTML structure
  • Rate limiting: Wait and retry later

See Troubleshooting Guide for more help.


Next Steps