Web Scraping Financial Statements with Python and Excel
Analysts and founders live in Excel, but most financial statements still arrive as PDFs or clunky web tables. Copy-paste does not scale when you track dozens of tickers.
With Python, you can automate downloading, parsing, and reshaping balance sheets, income statements, and cash-flow data. With pandas and Excel, you can deliver this as a clean workbook that matches your existing models.
In this guide, we show practical patterns for scraping financial statements and moving the data into Excel, so your team spends time on decisions instead of data janitor work.
Workflow Overview
- Locate a consistent data source for statements and confirm usage terms.
- Fetch HTML or PDF pages with Python.
- Parse tables into pandas DataFrames.
- Normalize tickers, periods, and line-item names.
- Export to Excel using a layout compatible with your models.
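The steps above can be sketched as small single-purpose functions, which keeps each stage testable on its own. The helper names and sample rows here are illustrative, not from a real source:

```python
import pandas as pd

def parse_rows(rows):
    # rows: list of lists, first row is the header
    return pd.DataFrame(rows[1:], columns=rows[0])

def normalize(df):
    # standardize line-item names before export (mapping is illustrative)
    return df.rename(columns={"Net Sales": "Revenue"})

raw = [["Period", "Net Sales"], ["2023", "100"], ["2024", "120"]]
df = normalize(parse_rows(raw))
```

Keeping fetch, parse, normalize, and export separate means a site redesign only touches one stage.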
Scraping HTML-Based Financial Tables
Many investor relations pages publish annual and quarterly statements as HTML tables. These are ideal for table parsing with requests, BeautifulSoup, and pandas.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://example.com/company/financials/income-statement"
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
table = soup.select_one("table.financials-table")
if table is None:
    raise ValueError("financials table not found; the page layout may have changed")
rows = []
for tr in table.select("tr"):
    cells = [td.get_text(strip=True) for td in tr.select("th, td")]
    if cells:
        rows.append(cells)
header, data = rows[0], rows[1:]
df = pd.DataFrame(data, columns=header)
df.to_excel("income_statement.xlsx", index=False)
This pattern works when the HTML structure is stable. For more complex layouts, you may build a mapping layer that renames raw labels to standardized line items used in your models.
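A minimal sketch of such a mapping layer, assuming statements arrive with a `LineItem` column; the label map itself is illustrative and would be maintained per issuer:

```python
import pandas as pd

# Illustrative label map; raw names vary by issuer.
LINE_ITEM_MAP = {
    "Net sales": "Revenue",
    "Total net revenues": "Revenue",
    "Cost of sales": "CostOfRevenue",
}

def standardize_line_items(df, label_col="LineItem"):
    # Map known raw labels to canonical names; keep unknown labels unchanged.
    df = df.copy()
    df[label_col] = df[label_col].map(LINE_ITEM_MAP).fillna(df[label_col])
    return df

raw = pd.DataFrame({"LineItem": ["Net sales", "Cost of sales", "Other"],
                    "FY2024": [120, 70, 5]})
clean = standardize_line_items(raw)
```

Unknown labels pass through untouched, so new line items surface in the output instead of silently disappearing.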
Using pandas.read_html for Faster Extraction
When the tables are clean, pandas can parse them directly from the URL. This is effective for financial portals that expose statement tables without heavy JavaScript.
import pandas as pd
url = "https://example.com/company/financials/balance-sheet"
tables = pd.read_html(url)

balance_sheet = tables[0]
balance_sheet.columns = [str(c).strip() for c in balance_sheet.columns]
balance_sheet.to_excel("balance_sheet.xlsx", sheet_name="BalanceSheet", index=False)
For robust pipelines, you can validate that expected line items exist and that key totals match control sums before accepting the data into your main models.
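One way to sketch those checks, assuming a long-format table with `LineItem` and period columns (the required items and tolerance are illustrative):

```python
import pandas as pd

REQUIRED_ITEMS = {"Revenue", "CostOfRevenue", "GrossProfit"}  # illustrative

def validate_statement(df, label_col="LineItem", value_col="FY2024", tol=1.0):
    # Check that every expected line item is present.
    missing = REQUIRED_ITEMS - set(df[label_col])
    if missing:
        raise ValueError(f"missing line items: {sorted(missing)}")
    # Control sum: GrossProfit should equal Revenue minus CostOfRevenue.
    v = df.set_index(label_col)[value_col]
    if abs(v["Revenue"] - v["CostOfRevenue"] - v["GrossProfit"]) > tol:
        raise ValueError("gross profit does not reconcile")
    return True

df = pd.DataFrame({
    "LineItem": ["Revenue", "CostOfRevenue", "GrossProfit"],
    "FY2024": [120.0, 70.0, 50.0],
})
ok = validate_statement(df)
```

Failing loudly at this stage is cheaper than debugging a model fed with a mis-parsed table.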
Combining Multiple Statements into One Excel File
Most teams want a single workbook per ticker with multiple sheets for income statement, balance sheet, and cash flow. pandas supports writing multiple DataFrames into one .xlsx using ExcelWriter.
import pandas as pd
income = pd.read_excel("income_statement.xlsx")
balance = pd.read_excel("balance_sheet.xlsx")
cashflow = pd.read_excel("cashflow_statement.xlsx")
with pd.ExcelWriter("financial_statements.xlsx", engine="openpyxl") as writer:
    income.to_excel(writer, sheet_name="IncomeStatement", index=False)
    balance.to_excel(writer, sheet_name="BalanceSheet", index=False)
    cashflow.to_excel(writer, sheet_name="CashFlow", index=False)
You can extend this pattern to handle multiple tickers by looping over symbols and building one workbook per company or one workbook per portfolio.
Handling PDFs and Regulatory Filings
Annual reports and regulatory filings often arrive as PDFs. These are harder to parse reliably but can still be automated with the right tools and quality checks.
For production-grade extraction, we typically combine a PDF parser with validation rules and manual review for low-confidence data. The structured result is then normalized into the same Excel layout as HTML-based pipelines.
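Assuming an extractor such as pdfplumber's `page.extract_table` returns each table as a list of rows with string cells and `None` for empty cells, the cleanup into a DataFrame might look like this (the sample rows are illustrative):

```python
import pandas as pd

# Rows as a PDF table extractor typically returns them:
# lists of strings, with None for empty cells.
raw_rows = [
    ["Line item", "2023", "2024"],
    ["Revenue", "1,180", "1,320"],
    ["Operating profit", None, "210"],
]

def clean_pdf_rows(rows):
    header, body = rows[0], rows[1:]
    df = pd.DataFrame(body, columns=header)
    for col in header[1:]:
        # Strip thousands separators; keep missing cells as NaN.
        df[col] = pd.to_numeric(df[col].str.replace(",", ""), errors="coerce")
    return df

df = clean_pdf_rows(raw_rows)
```

Cells that fail numeric conversion become NaN rather than raising, which is what routes low-confidence values to manual review.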
Best Practices for Reliable Financial Scraping
- Respect robots.txt, site terms, and licensing requirements.
- Cache responses to avoid hitting the same pages repeatedly.
- Log every extraction run with source URL and timestamp.
- Validate totals and key ratios before feeding models.
- Separate scraping, transformation, and Excel export into distinct steps.
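The caching point above can be sketched as a small file-based wrapper. The fetch callable is injected so the example runs offline; in practice it would be something like `lambda u: requests.get(u, timeout=30).text`:

```python
import hashlib
import tempfile
from pathlib import Path

def cached_get(url, fetch, cache_dir):
    # Serve the page from disk if we have already fetched it.
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body

calls = []
def fake_fetch(url):
    # Stand-in for a real HTTP fetch so the sketch runs offline.
    calls.append(url)
    return "<html>statement</html>"

cache = tempfile.mkdtemp()
first = cached_get("https://example.com/financials", fake_fetch, cache)
second = cached_get("https://example.com/financials", fake_fetch, cache)  # cache hit
```

For production use, a library such as requests-cache adds expiry and conditional requests on top of the same idea.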
Key Takeaways
- Python and Excel form a strong pipeline for financial statement automation.
- HTML tables and pandas.read_html enable fast wins for many issuers.
- ExcelWriter lets you package multiple statements into one workbook.
- Validation and governance are as important as scraping itself.
Related Reading
You can combine this approach with broader automation projects: Python Web Scraping for Financial Data, Automated Data Reporting Systems, Business Process Automation Trends 2026.
Data Normalization and Mapping
Raw statement tables vary by issuer. Normalize naming and units before Excel export to keep models stable across companies and periods.
- Standardize line-item names (e.g., “Revenue” vs “Net Sales”).
- Align fiscal periods and convert quarterly figures to trailing-twelve-month (TTM) values if needed.
- Unify currencies and scale (thousands vs millions).
- Resolve subtotals and ensure totals equal sum of components.
import pandas as pd
df = pd.read_excel("income_statement.xlsx")
rename_map = {"Net Sales": "Revenue", "Operating Income": "OperatingProfit"}
df = df.rename(columns=rename_map)
for col in df.columns:
    if col.endswith("(USD, millions)"):
        df[col] = df[col].astype(float) * 1_000_000
df.to_excel("income_statement_normalized.xlsx", index=False)
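The quarterly-to-TTM conversion mentioned above is a rolling four-quarter sum for flow items like revenue. A sketch with illustrative data:

```python
import pandas as pd

q = pd.DataFrame({
    "Period": ["2023Q1", "2023Q2", "2023Q3", "2023Q4", "2024Q1"],
    "Revenue": [100.0, 110.0, 120.0, 130.0, 115.0],
})
# Trailing-twelve-month revenue: sum of the latest four quarters.
# The first three rows have no full window and stay NaN.
q["Revenue_TTM"] = q["Revenue"].rolling(window=4).sum()
```

Note this applies to flow items only; balance-sheet stocks are point-in-time and should not be summed.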
Excel Formatting and Templates
Analysts expect clean, readable sheets. Apply basic formatting and column widths so the workbook is usable out of the box.
import pandas as pd
from openpyxl.styles import Font
from openpyxl.utils import get_column_letter
df = pd.read_excel("income_statement_normalized.xlsx")
with pd.ExcelWriter("financials_formatted.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="IncomeStatement", index=False)
    ws = writer.sheets["IncomeStatement"]
    for i, col in enumerate(df.columns, start=1):
        ws.cell(row=1, column=i).font = Font(bold=True)
        ws.column_dimensions[get_column_letter(i)].width = max(14, len(str(col)) + 2)
Dynamic Pages with Playwright
Some IR sites render tables with JavaScript. Use a headless browser to retrieve the fully rendered HTML before parsing.
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/financials", wait_until="networkidle")
    html = page.content()
    browser.close()
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.financials")
rows = [[c.get_text(strip=True) for c in r.select("th, td")] for r in table.select("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_excel("financials_dynamic.xlsx", index=False)
Scheduling and Monitoring
Production pipelines need scheduling, alerting, and audit trails. Monitor expected totals and publish health metrics for each run.
- Schedule extraction with cron or an orchestrator.
- Log source URLs, timestamps, and row counts.
- Alert on missing tables or failed validations.
- Version output files and archive per run.
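A minimal audit trail following the list above is one JSON line per run, appended to a log file. The field names and log path here are illustrative:

```python
import json
import time
from pathlib import Path

def log_run(source_url, row_count, ok, log_path=Path("runs.jsonl")):
    # Append one JSON record per extraction run for a simple audit trail.
    record = {
        "source_url": source_url,
        "row_count": row_count,
        "ok": ok,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_run("https://example.com/financials", row_count=42, ok=True)
```

A JSON-lines file is easy to grep, easy to load back into pandas, and survives partial failures because each record is self-contained.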
Performance and Politeness
Balance throughput and respect for data sources. Concurrency must include backoff, rate limits, and caching.
- Cache responses per ticker and period.
- Use exponential backoff on transient errors.
- Limit parallel requests and honor robots.txt.
- Prefer official APIs or filings when available.
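Exponential backoff can be sketched as a retry wrapper. The fetch callable and the injectable sleep are illustrative; injecting sleep keeps the example instant to run:

```python
import time

def fetch_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    # Retry transient failures with exponentially growing delays.
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    # Fails twice, then succeeds, simulating a transient network error.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = fetch_with_backoff(flaky, sleep=delays.append)
```

In production you would also add jitter to the delay so many workers do not retry in lockstep.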
Batch Processing Multiple Tickers
Generate one workbook per ticker or a portfolio workbook with separate sheets. Keep a registry of mappings per issuer.
import pandas as pd
tickers = ["AAA", "BBB", "CCC"]
for t in tickers:
    income = pd.read_excel(f"{t}_income.xlsx")
    balance = pd.read_excel(f"{t}_balance.xlsx")
    cashflow = pd.read_excel(f"{t}_cashflow.xlsx")
    with pd.ExcelWriter(f"{t}_financials.xlsx", engine="openpyxl") as writer:
        income.to_excel(writer, sheet_name="IncomeStatement", index=False)
        balance.to_excel(writer, sheet_name="BalanceSheet", index=False)
        cashflow.to_excel(writer, sheet_name="CashFlow", index=False)
Validation Checks Before Export
Simple control rules prevent bad data from entering models. Validate totals and ratios per period.
import pandas as pd
df = pd.read_excel("income_statement_normalized.xlsx")
df = df[df["Revenue"] > 0].copy()  # drop non-positive revenue before dividing
df["GrossMargin"] = (df["GrossProfit"] / df["Revenue"]).round(4)
df.to_excel("income_statement_validated.xlsx", index=False)
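Beyond positivity filters, simple range checks catch implausible ratios. A sketch with inline sample data; the 0 to 1 bounds are an illustrative rule, not universal:

```python
import pandas as pd

df = pd.DataFrame({"Revenue": [120.0, 150.0], "GrossProfit": [50.0, 170.0]})
df["GrossMargin"] = (df["GrossProfit"] / df["Revenue"]).round(4)

# Flag periods with out-of-range margins instead of silently dropping them.
suspect = df[(df["GrossMargin"] < 0) | (df["GrossMargin"] > 1)]
```

Flagged rows go to review rather than into the model, so a scraping glitch cannot masquerade as a 113% margin.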
FAQ: Web Scraping Financial Statements
Can I use this with my existing Excel models?
Yes. Design the DataFrames to match your model’s expected layout, then write them into specific sheets and ranges using pandas and ExcelWriter.
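Placement within a sheet is controlled by `startrow` and `startcol` on `DataFrame.to_excel` (both zero-indexed). A sketch with illustrative data and file name:

```python
import pandas as pd
from openpyxl import load_workbook

df = pd.DataFrame({"Period": ["2023", "2024"], "Revenue": [120.0, 150.0]})

# Place the table at a specific cell so it lines up with an existing model layout:
# startrow=4, startcol=1 puts the header in cell B5.
with pd.ExcelWriter("model_input.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Inputs", startrow=4, startcol=1, index=False)

# Read back to confirm the header landed where the model expects it.
header_cell = load_workbook("model_input.xlsx")["Inputs"].cell(row=5, column=2).value
```

This lets linked formulas in the model keep pointing at fixed cells across refreshes.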
What if the website layout changes?
Wrap selectors and parsing logic in tests and monitoring. For critical workflows, implement alerts when structure or totals change unexpectedly.
Can you build this for our team?
BohD Solutions designs, implements, and maintains end-to-end scraping and reporting pipelines tailored to your stack and compliance needs.