SEC EDGAR: Not Really "Scraping" Anymore
The term "web scraping" implies parsing HTML pages, but the SEC has invested heavily in structured data APIs that make traditional scraping largely unnecessary. The EDGAR system (Electronic Data Gathering, Analysis, and Retrieval) is the SEC's filing database, and its modern data.sec.gov REST API provides clean JSON responses for most common use cases.
That said, some data -- particularly the content of individual filings like Form 4 XML documents -- still requires fetching and parsing individual files. This tutorial covers both the structured API endpoints and the techniques for parsing filing documents directly.
Required: User-Agent header. The SEC requires all programmatic requests to include a User-Agent header with your company name (or your name) and email address. Requests without a proper User-Agent will be blocked. The SEC uses this to contact you if your requests are causing problems.
Setting Up: Headers and Rate Limiting
Before making any requests, set up a session with proper headers and rate limiting:
import requests
import time
# Required: identify yourself to the SEC
HEADERS = {
"User-Agent": "YourCompany [email protected]"
}
# Create a session for connection reuse
session = requests.Session()
session.headers.update(HEADERS)
def rate_limited_get(url):
"""GET request with rate limiting for SEC fair access."""
time.sleep(0.1) # 10 requests/second max
response = session.get(url)
response.raise_for_status()
return response
The SEC's rate limit is 10 requests per second. The time.sleep(0.1) ensures you stay within this limit. In practice, being slightly more conservative (such as 0.12 seconds between requests) is wise to account for timing imprecision. If you exceed the rate limit, the SEC will temporarily block your IP address.
SEC EDGAR Key Endpoints
- Company submissions: data.sec.gov/submissions/CIK{cik}.json
- XBRL company facts: data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json
- Full-text search: efts.sec.gov/LATEST/search-index
- Filing documents: sec.gov/Archives/edgar/data/{cik}/{accession}/
- Recent filings feed: sec.gov/cgi-bin/browse-edgar
Finding a Company's CIK Number
Every company in EDGAR is identified by a CIK (Central Index Key) number. You need the CIK to use most API endpoints. There are several ways to look up a CIK from a ticker symbol.
The simplest method uses the SEC's company tickers JSON file, which maps tickers to CIKs:
def get_cik_from_ticker(ticker):
"""Look up CIK number from a stock ticker symbol."""
url = "https://www.sec.gov/files/company_tickers.json"
response = rate_limited_get(url)
data = response.json()
# The response is a dict with numeric keys
for entry in data.values():
if entry["ticker"].upper() == ticker.upper():
# CIK must be zero-padded to 10 digits for API URLs
return str(entry["cik_str"]).zfill(10)
return None
cik = get_cik_from_ticker("AAPL")
print(f"Apple's CIK: {cik}") # 0000320193
Apple's CIK is 320193. When constructing API URLs, the CIK must be zero-padded to 10 digits: 0000320193.
You can also look up CIK numbers through the EDGAR company search page at https://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=AAPL&type=&dateb=&owner=include&count=40&search_text=&action=getcompany. This returns an HTML page with the company's filings, which includes the CIK in the URL and page content.
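The browse-edgar endpoint can also be queried programmatically. The sketch below uses the output=atom parameter to request an XML response instead of an HTML page; the assumption that the response contains a <CIK> element (and the regex used to pull it out) should be verified against a live response before relying on it:
import re
def get_cik_from_browse_edgar(ticker):
    """Look up a CIK via the EDGAR company search endpoint."""
    url = "https://www.sec.gov/cgi-bin/browse-edgar"
    params = {
        "action": "getcompany",
        "CIK": ticker,        # browse-edgar accepts ticker symbols here
        "type": "10-K",
        "count": "1",
        "output": "atom",     # request XML instead of an HTML page
    }
    time.sleep(0.1)
    response = session.get(url, params=params)
    response.raise_for_status()
    # Assumed layout: the Atom response includes a <CIK> element
    match = re.search(r"<CIK>(\d+)</CIK>", response.text, re.IGNORECASE)
    return match.group(1).zfill(10) if match else None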
Getting Company Filing History
The company submissions endpoint returns a comprehensive JSON object with the company's filing history:
def get_company_filings(cik):
"""Fetch all filing metadata for a company."""
url = f"https://data.sec.gov/submissions/CIK{cik}.json"
response = rate_limited_get(url)
data = response.json()
# Company info
print(f"Company: {data['name']}")
print(f"Ticker(s): {data.get('tickers', [])}")
print(f"SIC: {data.get('sic', 'N/A')} - {data.get('sicDescription', '')}")
# Recent filings are in data['filings']['recent']
recent = data['filings']['recent']
# Convert to DataFrame for easier manipulation
import pandas as pd
filings_df = pd.DataFrame({
'accessionNumber': recent['accessionNumber'],
'filingDate': recent['filingDate'],
'form': recent['form'],
'primaryDocument': recent['primaryDocument'],
})
return filings_df, data
filings_df, company_data = get_company_filings("0000320193")
print(f"\nTotal recent filings: {len(filings_df)}")
print(filings_df.head(10))
The filings['recent'] object contains parallel arrays for each field (accession number, filing date, form type, etc.) covering at least the company's most recent 1,000 filings (or one year of filings, whichever is more). For companies with longer histories, the remaining filings are listed in supplementary JSON files referenced by the filings['files'] array.
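A sketch of following that pagination, assuming each entry in filings['files'] carries a name field pointing to a file served from data.sec.gov/submissions/ with the same parallel-array layout as filings['recent']:
import pandas as pd
def get_all_filings(cik):
    """Fetch the complete filing history, following pagination files."""
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    data = rate_limited_get(url).json()
    pages = [data['filings']['recent']]
    # Older filings are split across supplementary JSON files
    for extra in data['filings'].get('files', []):
        extra_url = f"https://data.sec.gov/submissions/{extra['name']}"
        pages.append(rate_limited_get(extra_url).json())
    # Each page holds parallel arrays; concatenate them into one DataFrame
    frames = [
        pd.DataFrame({
            'accessionNumber': p['accessionNumber'],
            'filingDate': p['filingDate'],
            'form': p['form'],
            'primaryDocument': p['primaryDocument'],
        })
        for p in pages
    ]
    return pd.concat(frames, ignore_index=True)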
Filtering by Form Type
You can filter the filings DataFrame to find specific form types. The most common form types for insider trading analysis are:
- Form 4: Statement of changes in beneficial ownership (the primary insider trading disclosure)
- Form 3: Initial statement of beneficial ownership
- Form 5: Annual statement of changes in beneficial ownership
- Form 144: Notice of proposed sale of securities
# Filter for Form 4 filings only
form4s = filings_df[filings_df['form'] == '4'].copy()
print(f"Form 4 filings: {len(form4s)}")
# Filter for filings in the last 30 days
from datetime import datetime, timedelta
cutoff = (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d')
recent_form4s = form4s[form4s['filingDate'] >= cutoff]
print(f"Form 4s in last 30 days: {len(recent_form4s)}")
Parsing Form 4 XML Documents
Form 4 filings are submitted in XML format. Each Form 4 reports insider transactions -- purchases, sales, grants, and other changes in beneficial ownership. To extract the actual transaction data, you need to download and parse the XML document.
Downloading a Form 4 Filing
import xml.etree.ElementTree as ET
def fetch_form4_xml(cik, accession_number):
    """Download the Form 4 XML document for a filing."""
    # Format accession number for the URL (remove dashes)
    acc_no_dashes = accession_number.replace('-', '')
    base_url = (
        f"https://www.sec.gov/Archives/edgar/data/"
        f"{cik.lstrip('0')}/{acc_no_dashes}/"
    )
    # The filing's directory listing (index.json) names every document
    # in the filing. Form 4 data files are XML, but the filename varies
    # from filing to filing, so look it up rather than guessing.
    index = rate_limited_get(base_url + "index.json").json()
    xml_name = next(
        item["name"]
        for item in index["directory"]["item"]
        if item["name"].lower().endswith(".xml")
    )
    response = rate_limited_get(base_url + xml_name)
    return response.text
Extracting Transaction Data
The Form 4 XML schema contains several important sections: the issuer (the company), the reporting owner (the insider), and the transaction details. Here is a comprehensive parser:
def parse_form4(xml_text):
"""
Parse Form 4 XML and extract transaction details.
Returns a dict with issuer info, owner info, and list of transactions.
"""
root = ET.fromstring(xml_text)
# Namespace handling: Form 4 XML may or may not use namespaces
# Try without namespace first
ns = ''
if root.tag.startswith('{'):
ns = root.tag.split('}')[0] + '}'
def find(element, tag):
"""Find a child element, handling optional namespace."""
result = element.find(f"{ns}{tag}")
if result is None:
result = element.find(tag)
return result
def find_text(element, tag, default=''):
"""Get text content of a child element."""
el = find(element, tag)
return el.text.strip() if el is not None and el.text else default
result = {
'issuer': {},
'owner': {},
'transactions': []
}
# Issuer information
issuer = find(root, 'issuer')
if issuer is not None:
result['issuer'] = {
'cik': find_text(issuer, 'issuerCik'),
'name': find_text(issuer, 'issuerName'),
'ticker': find_text(issuer, 'issuerTradingSymbol'),
}
# Reporting owner
owner = find(root, 'reportingOwner')
if owner is not None:
owner_id = find(owner, 'reportingOwnerId')
owner_rel = find(owner, 'reportingOwnerRelationship')
if owner_id is not None:
result['owner']['name'] = find_text(owner_id, 'rptOwnerName')
result['owner']['cik'] = find_text(owner_id, 'rptOwnerCik')
if owner_rel is not None:
result['owner']['isDirector'] = find_text(owner_rel, 'isDirector') == '1'
result['owner']['isOfficer'] = find_text(owner_rel, 'isOfficer') == '1'
result['owner']['officerTitle'] = find_text(owner_rel, 'officerTitle')
# Non-derivative transactions (common stock buys/sells)
nd_table = find(root, 'nonDerivativeTable')
if nd_table is not None:
for txn in nd_table:
if 'Transaction' not in txn.tag:
continue
coding = find(txn, 'transactionCoding')
amounts = find(txn, 'transactionAmounts')
if coding is None or amounts is None:
continue
# Transaction code: P = Purchase, S = Sale
txn_code = find_text(coding, 'transactionCode')
shares_el = find(amounts, 'transactionShares')
price_el = find(amounts, 'transactionPricePerShare')
shares = float(find_text(shares_el, 'value', '0')) if shares_el is not None else 0
price = float(find_text(price_el, 'value', '0')) if price_el is not None else 0
            # transactionDate wraps the date in a nested <value> element
            date_el = find(txn, 'transactionDate')
            txn_date = find_text(date_el, 'value') if date_el is not None else ''
            result['transactions'].append({
                'date': txn_date,
                'code': txn_code,
                'shares': shares,
                'price': price,
                'dollar_value': shares * price,
            })
return result
The transaction code field is the most important piece of data for insider trading analysis. The key codes are:
- P -- Open market or private purchase of securities
- S -- Open market or private sale of securities
- A -- Grant or award (stock compensation)
- M -- Exercise or conversion of derivative security (e.g., stock options)
- F -- Payment of exercise price or tax liability by delivering securities
- G -- Gift of securities
For insider trading signal analysis, the most informative codes are P (open market purchases, where the insider is spending their own money) and S (open market sales). Grants (A) and option exercises (M) are part of compensation and generally carry less informational signal.
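For downstream analysis it helps to collapse a parsed filing into dollar totals keyed by these codes. A small illustrative helper (the function name and output shape are arbitrary), building on parse_form4:
def summarize_form4(parsed):
    """Aggregate parsed Form 4 transactions into dollar totals by type."""
    summary = {'buy_value': 0.0, 'sell_value': 0.0, 'other_value': 0.0}
    for txn in parsed['transactions']:
        if txn['code'] == 'P':       # open market purchase
            summary['buy_value'] += txn['dollar_value']
        elif txn['code'] == 'S':     # open market sale
            summary['sell_value'] += txn['dollar_value']
        else:                        # grants, exercises, gifts, etc.
            summary['other_value'] += txn['dollar_value']
    return summary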
Complete Example: Recent Insider Transactions
Here is a complete working example that fetches all Form 4 filings for a company in the last 30 days and extracts the buy/sell transactions:
import requests
import time
import xml.etree.ElementTree as ET
import pandas as pd
from datetime import datetime, timedelta
HEADERS = {"User-Agent": "YourCompany [email protected]"}
session = requests.Session()
session.headers.update(HEADERS)
def get_recent_insider_trades(ticker, days=30):
"""
Fetch insider buy/sell transactions for a company
from the last N days.
"""
# Step 1: Get CIK
tickers_url = "https://www.sec.gov/files/company_tickers.json"
time.sleep(0.1)
tickers_data = session.get(tickers_url).json()
cik = None
for entry in tickers_data.values():
if entry["ticker"].upper() == ticker.upper():
cik = str(entry["cik_str"]).zfill(10)
break
if cik is None:
raise ValueError(f"Ticker {ticker} not found")
# Step 2: Get filings
time.sleep(0.1)
sub_url = f"https://data.sec.gov/submissions/CIK{cik}.json"
sub_data = session.get(sub_url).json()
recent = sub_data['filings']['recent']
# Step 3: Filter Form 4s from last N days
cutoff = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
trades = []
for i, form in enumerate(recent['form']):
if form != '4':
continue
if recent['filingDate'][i] < cutoff:
continue
acc = recent['accessionNumber'][i]
acc_no_dash = acc.replace('-', '')
        # Step 4: Locate and fetch each Form 4 XML via the filing index
        base_url = (
            f"https://www.sec.gov/Archives/edgar/data/"
            f"{cik.lstrip('0')}/{acc_no_dash}/"
        )
        try:
            time.sleep(0.1)
            index = session.get(base_url + "index.json").json()
            # The Form 4 XML filename varies, so find it in the index
            xml_name = next(
                (item["name"] for item in index["directory"]["item"]
                 if item["name"].lower().endswith(".xml")),
                None,
            )
            if xml_name is None:
                continue
            time.sleep(0.1)
            resp = session.get(base_url + xml_name)
            if resp.status_code != 200:
                continue
            parsed = parse_form4(resp.text)
for txn in parsed['transactions']:
if txn['code'] in ('P', 'S'):
trades.append({
'filing_date': recent['filingDate'][i],
'owner': parsed['owner'].get('name', ''),
'title': parsed['owner'].get('officerTitle', ''),
'type': 'BUY' if txn['code'] == 'P' else 'SELL',
'shares': txn['shares'],
'price': txn['price'],
'value': txn['dollar_value'],
})
except Exception as e:
print(f"Error parsing {acc}: {e}")
continue
return pd.DataFrame(trades)
# Example usage
df = get_recent_insider_trades("AAPL", days=30)
print(df.to_string(index=False))
Note on XML parsing: Not all Form 4 filings follow the exact same XML structure. Some older filings may have different element names or missing fields. The parser above handles common variations, but production code should include more robust error handling for edge cases.
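If you also need the derivative side (options, RSUs, warrants), the derivative table mirrors the non-derivative structure and can be handled with the same find/find_text helpers. Below is a sketch of a block that could sit inside parse_form4 just before return result; the element names follow the standard ownership XML schema, but as noted above, verify them against the filings you actually encounter:
    # Derivative transactions (options, RSUs, warrants): the structure
    # mirrors the non-derivative table.
    d_table = find(root, 'derivativeTable')
    if d_table is not None:
        result['derivative_transactions'] = []
        for txn in d_table:
            if 'Transaction' not in txn.tag:
                continue
            coding = find(txn, 'transactionCoding')
            amounts = find(txn, 'transactionAmounts')
            security = find(txn, 'securityTitle')
            if coding is None or amounts is None:
                continue
            shares_el = find(amounts, 'transactionShares')
            price_el = find(amounts, 'transactionPricePerShare')
            result['derivative_transactions'].append({
                'security': find_text(security, 'value') if security is not None else '',
                'code': find_text(coding, 'transactionCode'),
                'shares': float(find_text(shares_el, 'value', '0')) if shares_el is not None else 0.0,
                'price': float(find_text(price_el, 'value', '0')) if price_el is not None else 0.0,
            })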
XBRL Structured Financial Data
For fundamental financial data (revenue, earnings, assets, etc.), the SEC provides structured XBRL data through a dedicated API. This is far more efficient than parsing 10-K/10-Q HTML documents:
def get_company_facts(cik):
"""
Fetch structured XBRL financial data for a company.
Returns all reported financial facts (revenue, net income, etc.)
"""
url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
time.sleep(0.1)
response = session.get(url)
data = response.json()
print(f"Company: {data['entityName']}")
# Facts are organized by taxonomy (us-gaap, dei, etc.)
# and then by concept (Revenue, NetIncomeLoss, etc.)
us_gaap = data['facts'].get('us-gaap', {})
print(f"Available US-GAAP concepts: {len(us_gaap)}")
    # Example: get quarterly revenue.
    # Note: filers tag revenue under different concepts -- Apple, for
    # instance, uses RevenueFromContractWithCustomerExcludingAssessedTax
    # rather than Revenues -- so check the common variants.
    revenue_concepts = (
        'Revenues',
        'RevenueFromContractWithCustomerExcludingAssessedTax',
    )
    for concept in revenue_concepts:
        if concept not in us_gaap:
            continue
        revenue_data = us_gaap[concept]['units']['USD']
        # Filter for values reported in 10-Q filings (quarterly)
        quarterly = [r for r in revenue_data if r.get('form') == '10-Q']
        for item in quarterly[-4:]:  # last 4 reported values
            print(f"  {item['end']}: ${item['val']:,.0f}")
        break
return data
facts = get_company_facts("0000320193")
The company facts endpoint returns every XBRL-tagged data point the company has ever filed. Common concepts include Revenues, NetIncomeLoss, Assets, StockholdersEquity, and hundreds of others. The data is organized by taxonomy (primarily us-gaap) and unit of measure (USD, shares, etc.).
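When you only need a single concept, the companyconcept endpoint returns that one tag's full history without downloading the entire facts file. A sketch follows; the 10-K/FY filter is an illustrative choice, not the only way to slice the data:
def get_concept_history(cik, concept, taxonomy="us-gaap", unit="USD"):
    """Fetch the reported history of a single XBRL concept."""
    url = (
        f"https://data.sec.gov/api/xbrl/companyconcept/"
        f"CIK{cik}/{taxonomy}/{concept}.json"
    )
    time.sleep(0.1)
    data = session.get(url).json()
    rows = data["units"][unit]
    # Keep annual figures as reported in 10-K filings
    annual = [r for r in rows if r.get("form") == "10-K" and r.get("fp") == "FY"]
    return pd.DataFrame(annual)
# Example: Apple's annual net income history
print(get_concept_history("0000320193", "NetIncomeLoss").tail())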
Full-Text Search
The EDGAR full-text search system (EFTS) allows you to search across the content of all filings. This is useful for finding specific disclosures, risk factors, or mentions of particular topics:
def search_filings(query, forms=None, start_date=None, end_date=None):
"""
Search EDGAR filings by text content.
Parameters:
query: search text (supports quotes for exact phrases)
forms: comma-separated form types (e.g., "4,10-K")
start_date: YYYY-MM-DD format
end_date: YYYY-MM-DD format
"""
base_url = "https://efts.sec.gov/LATEST/search-index"
params = {"q": query}
if forms:
params["forms"] = forms
if start_date and end_date:
params["dateRange"] = "custom"
params["startdt"] = start_date
params["enddt"] = end_date
time.sleep(0.1)
response = session.get(base_url, params=params)
data = response.json()
print(f"Total hits: {data.get('hits', {}).get('total', {}).get('value', 0)}")
for hit in data.get('hits', {}).get('hits', [])[:5]:
source = hit['_source']
print(f" {source.get('file_date', '')} | "
f"{source.get('display_names', [''])[0]} | "
f"{source.get('form_type', '')}")
return data
# Search for insider purchases in Form 4 filings
results = search_filings(
query='"open market purchase"',
forms="4",
start_date="2024-01-01",
end_date="2024-12-31"
)
Rate Limiting and Fair Access Policy
The SEC publishes a fair access policy for EDGAR that all programmatic users must follow. The key rules are:
- Identify yourself: Include a User-Agent header with your name/company and email address. The SEC will use this to contact you if there are issues.
- Limit request rate: No more than 10 requests per second. The SEC will block your IP if you exceed this limit.
- Cache responses: If you need the same data repeatedly, cache it locally rather than re-fetching from EDGAR.
- Off-peak access: For large bulk downloads, prefer off-peak hours (nights and weekends ET).
- No automated downloading of all filings: If you need bulk data, use the SEC's bulk data files (such as the submissions and company facts archives listed on the SEC's developer page) rather than crawling the filing tree.
# Robust rate limiter with retry logic
class EdgarClient:
def __init__(self, user_agent, max_retries=3):
self.session = requests.Session()
self.session.headers.update({"User-Agent": user_agent})
self.last_request = 0
self.min_interval = 0.1 # 10 req/sec
self.max_retries = max_retries
def get(self, url, params=None):
for attempt in range(self.max_retries):
# Enforce rate limit
elapsed = time.time() - self.last_request
if elapsed < self.min_interval:
time.sleep(self.min_interval - elapsed)
self.last_request = time.time()
response = self.session.get(url, params=params)
if response.status_code == 200:
return response
elif response.status_code == 429:
# Rate limited: back off exponentially
wait = 2 ** attempt * 5
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
else:
response.raise_for_status()
raise Exception(f"Failed after {self.max_retries} retries: {url}")
client = EdgarClient("YourCompany [email protected]")
Ethics and Legality
SEC EDGAR data is public information. Every filing is a public record, and accessing it programmatically is entirely legal. The SEC actively encourages developers to build tools that make this data more accessible -- that is the entire purpose of the EDGAR system.
The only legal obligations are to follow the fair access policy (rate limits, identification) and to not misrepresent the data. You cannot, for example, present fabricated SEC filings as real. But downloading, analyzing, and building products around genuine EDGAR data is explicitly permitted.
The SEC has also released several bulk data products specifically for programmatic access, including the company tickers JSON file, the XBRL API, and the full-text search system. These were designed for exactly the type of analysis described in this tutorial.
One important nuance: while the filings themselves are public, trading on material non-public information is illegal. The distinction is between information that has been filed with the SEC (public) and information that has not yet been disclosed (non-public). Everything in EDGAR has already been disclosed and is therefore public information that anyone can legally trade on.
Beyond the Basics: Building a Filing Monitor
For real-time insider trading analysis, you need a system that continuously monitors for new Form 4 filings. The SEC updates EDGAR throughout the day, with most filings appearing within minutes of submission. The recent filings feed at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=4&dateb=&owner=include&count=40&search_text=&start=0&output=atom provides an Atom feed of the most recent filings by form type.
A production monitoring system typically polls the feed every few minutes, compares against previously seen accession numbers, and parses any new Form 4 filings. The latency from filing to detection is usually under 5 minutes, which is fast enough for most trading strategies based on insider activity (insiders have up to two business days to file Form 4, so the information is already somewhat delayed).
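A minimal polling loop along these lines is sketched below. The feed URL is the one shown above and the Atom namespace is standard, but the regex for pulling accession numbers out of entry links is an assumption about the feed layout that should be checked against a live response:
import re
import time
import xml.etree.ElementTree as ET
FEED_URL = (
    "https://www.sec.gov/cgi-bin/browse-edgar"
    "?action=getcurrent&type=4&owner=include&count=40&output=atom"
)
ATOM_NS = "{http://www.w3.org/2005/Atom}"
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")
def poll_form4_feed(seen):
    """Return accession numbers from the feed that have not been seen yet."""
    time.sleep(0.1)
    resp = session.get(FEED_URL)  # session with SEC headers, defined earlier
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    new = []
    for entry in root.findall(f"{ATOM_NS}entry"):
        link = entry.find(f"{ATOM_NS}link")
        href = link.get("href", "") if link is not None else ""
        match = ACCESSION_RE.search(href)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))
            new.append(match.group(0))
    return new
seen = set()
while True:
    for accession in poll_form4_feed(seen):
        print(f"New Form 4 filing: {accession}")
        # fetch_form4_xml / parse_form4 from earlier would slot in here
    time.sleep(120)  # poll every two minutes, well within fair access limits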
Building a robust SEC data pipeline involves handling edge cases: filings that are amended (Form 4/A), filings that are withdrawn, XML parsing errors, network timeouts, and the occasional change in EDGAR's response format. The key principle is defensive coding -- every request can fail, every XML document can have unexpected structure, and the system should log and continue rather than crash.
Automated SEC Filing Analysis
Alpha Suite monitors SEC EDGAR Form 4 filings in real time, automatically extracting and scoring insider buy/sell transactions to generate quantitative trading signals.