April 4, 2026 · 16 min read · Python Tutorial

How to Build an Insider Trading Screener with Python

A complete, working tutorial that takes you from fetching raw SEC EDGAR filings to outputting scored insider trading signals. We will parse Form 4 XML, filter for meaningful transactions, detect cluster buying, enrich with market data, and score every signal by conviction.

Prerequisites

You will need Python 3.8 or later and the following packages:

pip install requests beautifulsoup4 yfinance pandas

We will also use xml.etree.ElementTree from the Python standard library for XML parsing, and datetime and time for date handling and rate limiting.

SEC Rate Limits

The SEC requires all programmatic access to EDGAR to include a User-Agent header identifying your name and email. The SEC allows up to 10 requests per second. To be a good citizen, add time.sleep(0.1) between requests. Violating rate limits can result in your IP being temporarily blocked.
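The header and throttle can be wrapped in one small helper so every request goes through the same rate limiter. This is a sketch; the `Throttle` class and `sec_get` names are our own, not part of any SEC API:

```python
import time
import requests

# Replace with your real name and email -- the SEC requires this.
HEADERS = {"User-Agent": "YourName [email protected]"}

class Throttle:
    """Enforce a minimum interval between calls (10 req/s -> 0.1 s)."""
    def __init__(self, min_interval=0.1):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle()

def sec_get(url, **kwargs):
    """GET a SEC URL with the required User-Agent, respecting the rate limit."""
    throttle.wait()
    return requests.get(url, headers=HEADERS, **kwargs)
```

The examples below call requests.get directly with a manual time.sleep for clarity; in your own code, routing everything through a helper like sec_get keeps you safely under the 10 requests/second ceiling.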

Step 1: Fetch Recent Form 4 Filings from EDGAR

The SEC provides a full-text search API at https://efts.sec.gov/LATEST/search-index that allows you to query for specific form types with date filters. We will use this to find recently filed Form 4s.

import requests
import time
from datetime import datetime, timedelta

HEADERS = {
    "User-Agent": "YourName [email protected]"
}

def fetch_recent_form4_filings(days_back=7, max_results=100):
    """Fetch recent Form 4 filing URLs from EDGAR full-text search."""
    start_date = (datetime.now() - timedelta(days=days_back)).strftime("%Y-%m-%d")
    end_date = datetime.now().strftime("%Y-%m-%d")

    url = "https://efts.sec.gov/LATEST/search-index"
    params = {
        "q": '"Form 4"',
        "dateRange": "custom",
        "startdt": start_date,
        "enddt": end_date,
        "forms": "4",
        "from": 0,
        "size": max_results,
    }

    resp = requests.get(url, params=params, headers=HEADERS)
    resp.raise_for_status()
    data = resp.json()

    filings = []
    for hit in data.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        filings.append({
            "accession": source.get("accession_no"),
            "filed": source.get("file_date"),
            "form_type": source.get("form_type"),
            "entity": source.get("entity_name"),
            "url": f"https://www.sec.gov/Archives/edgar/data/"
                   f"{source.get('ciks', [''])[0]}/"
                   f"{source.get('accession_no', '').replace('-', '')}/"
                   f"{source.get('accession_no')}-index.htm",
        })

    return filings

Alternatively, the SEC provides a recent filings RSS feed that you can poll. The EDGAR company search API at https://www.sec.gov/cgi-bin/browse-edgar also lets you query by company CIK and form type. For a production screener, you would typically use the EDGAR full-index files, which are updated nightly and cover every filing.
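As a sketch of the full-index approach: the `master.idx` index files are pipe-delimited, one filing per line (`CIK|Company Name|Form Type|Date Filed|Filename`), so filtering them down to Form 4 entries takes only a few lines. The function name and returned field names here are our own:

```python
def parse_master_index(text, form_type="4"):
    """Parse an EDGAR master index file and return entries for one form type.

    Data lines look like: CIK|Company Name|Form Type|Date Filed|Filename
    Header lines are skipped automatically because their form-type column
    never equals the requested form type.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) == 5 and parts[2].strip() == form_type:
            cik, name, form, date, path = (p.strip() for p in parts)
            entries.append({
                "cik": cik,
                "company": name,
                "form": form,
                "date": date,
                "url": "https://www.sec.gov/Archives/" + path,
            })
    return entries
```

The daily index files live under https://www.sec.gov/Archives/edgar/daily-index/, organized by year and quarter; fetch the text with your rate-limited client and feed it to this function.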

Step 2: Fetch and Parse Form 4 XML

Each Form 4 filing on EDGAR includes an XML document that follows a schema defined by the SEC. The XML contains structured data for the issuer, the reporting owner, and every transaction. Here is how to parse it:

import xml.etree.ElementTree as ET

def parse_form4_xml(xml_url):
    """Parse a Form 4 XML filing and extract transaction data."""
    time.sleep(0.1)  # respect rate limit
    resp = requests.get(xml_url, headers=HEADERS)
    resp.raise_for_status()

    root = ET.fromstring(resp.content)

    # Issuer (company) info
    issuer = root.find(".//issuer")
    issuer_name = issuer.findtext("issuerName", "") if issuer is not None else ""
    issuer_ticker = issuer.findtext("issuerTradingSymbol", "") if issuer is not None else ""
    issuer_cik = issuer.findtext("issuerCik", "") if issuer is not None else ""

    # Reporting owner (insider) info
    owner = root.find(".//reportingOwner")
    owner_name = ""
    owner_relationship = []
    if owner is not None:
        owner_id = owner.find("reportingOwnerId")
        if owner_id is not None:
            owner_name = owner_id.findtext("rptOwnerName", "")

        rel = owner.find("reportingOwnerRelationship")
        if rel is not None:
            if rel.findtext("isDirector", "0") == "1":
                owner_relationship.append("Director")
            if rel.findtext("isOfficer", "0") == "1":
                title = rel.findtext("officerTitle", "Officer")
                owner_relationship.append(title)
            if rel.findtext("isTenPercentOwner", "0") == "1":
                owner_relationship.append("10% Owner")

    # Non-derivative transactions (Table I)
    transactions = []
    for txn in root.findall(".//nonDerivativeTransaction"):
        coding = txn.find(".//transactionCoding")
        code = coding.findtext("transactionCode", "") if coding is not None else ""

        amounts = txn.find(".//transactionAmounts")
        shares_elem = amounts.find("transactionShares/value") if amounts is not None else None
        price_elem = amounts.find("transactionPricePerShare/value") if amounts is not None else None
        acq_disp = amounts.findtext("transactionAcquiredDisposedCode/value", "") if amounts is not None else ""

        shares = float(shares_elem.text) if shares_elem is not None and shares_elem.text else 0
        price = float(price_elem.text) if price_elem is not None and price_elem.text else 0

        date_elem = txn.find(".//transactionDate/value")
        txn_date = date_elem.text if date_elem is not None else ""

        # Post-transaction holdings
        post_elem = txn.find(".//postTransactionAmounts/sharesOwnedFollowingTransaction/value")
        post_shares = float(post_elem.text) if post_elem is not None and post_elem.text else 0

        transactions.append({
            "date": txn_date,
            "code": code,
            "shares": shares,
            "price": price,
            "acquired_disposed": acq_disp,
            "post_shares": post_shares,
            "dollar_value": shares * price,
        })

    return {
        "issuer_name": issuer_name,
        "ticker": issuer_ticker.upper(),
        "issuer_cik": issuer_cik,
        "owner_name": owner_name,
        "relationship": ", ".join(owner_relationship),
        "transactions": transactions,
    }

The XML handling above assumes the filing uses bare tag names. Some Form 4 XML files declare an XML namespace, in which case ElementTree records each tag as {namespace}tagName and the plain .find() paths above will not match; you may need to strip or handle namespaces explicitly. The beautifulsoup4 library can also be more forgiving with malformed XML.
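One approach is to strip namespaces immediately after parsing, so the bare-tag .find() paths in parse_form4_xml work either way. A minimal helper (our own, not an ElementTree built-in):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove the {namespace} prefix from every tag in-place, so that
    plain .find()/.findall() paths work regardless of the filing's schema."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root
```

Call strip_namespaces(root) right after ET.fromstring(resp.content) and the rest of the parser is unchanged.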

Step 3: Filter for Meaningful Signals

Not all Form 4 transactions are informative. The most valuable signal comes from open-market purchases (transaction code P), where insiders spend their own money to buy shares. We also want to filter by minimum dollar amount to exclude trivially small transactions.

def filter_meaningful_transactions(parsed_filings, min_dollar=10000):
    """Filter for open-market purchases above a minimum dollar threshold."""
    signals = []

    for filing in parsed_filings:
        for txn in filing["transactions"]:
            # Focus on open-market purchases
            if txn["code"] != "P":
                continue

            # Minimum dollar filter
            if txn["dollar_value"] < min_dollar:
                continue

            signals.append({
                "ticker": filing["ticker"],
                "issuer": filing["issuer_name"],
                "insider": filing["owner_name"],
                "relationship": filing["relationship"],
                "date": txn["date"],
                "shares": txn["shares"],
                "price": txn["price"],
                "dollar_value": txn["dollar_value"],
                "post_shares": txn["post_shares"],
            })

    return signals

You may also want to track open-market sales (code S) for a complete picture of insider sentiment. However, academic research consistently shows that purchases are more informative than sales. Insiders sell for many reasons (diversification, tax obligations, personal expenses), but they generally buy for only one reason: they believe the stock is undervalued.

Step 4: Detect Cluster Buying

One of the strongest insider trading signals is cluster buying: multiple distinct insiders purchasing shares of the same company within a short time window. The logic is intuitive — if three different executives independently decide to buy stock in the same two-week period, they likely share a positive view of the company’s near-term prospects.

from collections import defaultdict
from datetime import datetime

def detect_clusters(signals, window_days=10, min_insiders=2):
    """Group transactions by ticker and detect cluster buying."""
    by_ticker = defaultdict(list)
    for s in signals:
        by_ticker[s["ticker"]].append(s)

    clusters = []
    for ticker, txns in by_ticker.items():
        # Sort by date
        txns.sort(key=lambda x: x["date"])

        # Sliding window: count unique insiders within window_days
        for anchor in txns:
            anchor_date = datetime.strptime(anchor["date"], "%Y-%m-%d")
            window_insiders = set()
            window_txns = []

            for t in txns:
                t_date = datetime.strptime(t["date"], "%Y-%m-%d")
                if 0 <= (t_date - anchor_date).days <= window_days:
                    window_insiders.add(t["insider"])
                    window_txns.append(t)

            if len(window_insiders) >= min_insiders:
                total_dollar = sum(t["dollar_value"] for t in window_txns)
                clusters.append({
                    "ticker": ticker,
                    "issuer": anchor["issuer"],
                    "insider_count": len(window_insiders),
                    "insiders": list(window_insiders),
                    "total_dollar": total_dollar,
                    "start_date": anchor["date"],
                    "transactions": window_txns,
                })
                break  # one cluster per ticker

    # Sort by insider count descending, then by dollar amount
    clusters.sort(key=lambda x: (-x["insider_count"], -x["total_dollar"]))
    return clusters

A 10-day window with a minimum of 2 unique insiders is a reasonable starting point. You can tighten the criteria (3+ insiders, 7-day window) for higher-conviction signals, or loosen them for more coverage. Lakonishok and Lee (2001) found that the predictive power of insider buying increases with the number of insiders buying.

Step 5: Enrich with Market Data

Raw insider transaction data is more useful when enriched with market context. We will use the yfinance library to get current price, market capitalization, sector, and 52-week price range for each ticker.

import yfinance as yf

def enrich_with_market_data(signals):
    """Add market data from yfinance to each signal."""
    tickers = list(set(s["ticker"] for s in signals))
    enriched = []

    for ticker in tickers:
        time.sleep(0.2)  # be gentle with Yahoo Finance
        try:
            stock = yf.Ticker(ticker)
            info = stock.info

            market_data = {
                "current_price": info.get("currentPrice") or info.get("regularMarketPrice"),
                "market_cap": info.get("marketCap"),
                "sector": info.get("sector", "Unknown"),
                "industry": info.get("industry", "Unknown"),
                "fifty_two_week_low": info.get("fiftyTwoWeekLow"),
                "fifty_two_week_high": info.get("fiftyTwoWeekHigh"),
                "average_volume": info.get("averageVolume"),
            }

            # Attach market data to all signals for this ticker
            for s in signals:
                if s["ticker"] == ticker:
                    enriched.append({**s, **market_data})

        except Exception as e:
            print(f"Warning: could not fetch data for {ticker}: {e}")
            for s in signals:
                if s["ticker"] == ticker:
                    enriched.append(s)

    return enriched

Market cap is particularly important for signal quality. Academic research shows that insider purchases at small-cap and mid-cap companies tend to be more informative than purchases at mega-cap companies. This makes intuitive sense: insiders at smaller companies are more likely to have a material information advantage, and their purchases represent a larger fraction of their personal wealth.
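If you want to fold company size into the scoring step, one possibility is a size-based weight. The thresholds below (roughly the conventional small/mid/large/mega-cap cutoffs) are illustrative assumptions, not calibrated values:

```python
def market_cap_weight(market_cap):
    """Hypothetical size adjustment: favor small and mid caps, where
    insider purchases tend to be more informative (see text)."""
    if market_cap is None:
        return 0.7  # size unknown: a neutral-ish default
    if market_cap < 2e9:        # small cap (< $2B)
        return 1.0
    if market_cap < 10e9:       # mid cap ($2B - $10B)
        return 0.85
    if market_cap < 200e9:      # large cap ($10B - $200B)
        return 0.6
    return 0.4                  # mega cap (>= $200B)
```

You could multiply this into the composite score in Step 6, or use it as an additional weighted term after re-normalizing the weights.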

Step 6: Score Signals by Conviction

Now we combine all the data into a conviction score. The scoring function weights multiple factors:

import math

# Role weights
ROLE_WEIGHTS = {
    "CEO": 1.0,
    "CFO": 0.95,
    "COO": 0.90,
    "President": 0.90,
    "VP": 0.75,
    "Director": 0.65,
    "10% Owner": 0.55,
}

def get_role_weight(relationship_str):
    """Extract the highest-weighted role from the relationship string."""
    best = 0.5  # default for unknown roles
    for role, weight in ROLE_WEIGHTS.items():
        if role.lower() in relationship_str.lower():
            best = max(best, weight)
    return best

def score_signal(signal, cluster_map, decay_half_life=35):
    """Compute a conviction score for an insider purchase signal."""
    # Role weight (0-1)
    role_w = get_role_weight(signal.get("relationship", ""))

    # Dollar conviction (log scale, normalized)
    dollar = signal.get("dollar_value", 0)
    dollar_w = min(math.log10(max(dollar, 1)) / 7, 1.0)  # $10M = 1.0

    # Cluster bonus
    ticker = signal["ticker"]
    cluster_info = cluster_map.get(ticker, {})
    cluster_count = cluster_info.get("insider_count", 1)
    cluster_w = min(cluster_count / 5, 1.0)  # 5+ insiders = max

    # Time decay (exponential)
    try:
        txn_date = datetime.strptime(signal["date"], "%Y-%m-%d")
        days_ago = (datetime.now() - txn_date).days
        decay = math.exp(-0.693 * days_ago / decay_half_life)
    except (ValueError, KeyError):
        decay = 0.5

    # Weighted composite score
    raw_score = (
        0.25 * role_w +
        0.30 * dollar_w +
        0.30 * cluster_w +
        0.15 * decay
    )

    return round(raw_score * 100, 1)

The weights above (25% role, 30% dollar, 30% cluster, 15% recency) are a reasonable starting point based on the academic literature. You can tune these weights through backtesting on historical data. The exponential decay with a 35-day half-life means that a signal loses half its time-weighted value every 35 days.
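To see what the 35-day half-life means in practice, the decay term can be computed standalone; this mirrors the expression inside score_signal:

```python
import math

def time_decay(days_ago, half_life=35):
    """Exponential recency decay; 0.693 approximates ln(2), so the
    value halves every `half_life` days."""
    return math.exp(-0.693 * days_ago / half_life)

# time_decay(0) ~= 1.0, time_decay(35) ~= 0.5, time_decay(70) ~= 0.25
```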

Step 7: Output to DataFrame

Finally, we assemble everything into a pandas DataFrame and output the results:

import pandas as pd
import json

def build_screener_output(enriched_signals, clusters):
    """Build the final scored output."""
    # Create cluster lookup
    cluster_map = {}
    for c in clusters:
        cluster_map[c["ticker"]] = c

    # Score each signal
    for s in enriched_signals:
        s["score"] = score_signal(s, cluster_map)
        ci = cluster_map.get(s["ticker"], {})
        s["cluster_count"] = ci.get("insider_count", 1)

    # Sort by score descending
    enriched_signals.sort(key=lambda x: -x["score"])

    # Convert to DataFrame
    df = pd.DataFrame(enriched_signals)
    columns = [
        "ticker", "issuer", "insider", "relationship", "date",
        "shares", "price", "dollar_value", "cluster_count",
        "market_cap", "sector", "score",
    ]
    # Only include columns that exist
    columns = [c for c in columns if c in df.columns]
    df = df[columns]

    return df

# --- Main execution ---
if __name__ == "__main__":
    print("Fetching recent Form 4 filings...")
    filings_meta = fetch_recent_form4_filings(days_back=14)
    print(f"Found {len(filings_meta)} filings")

    # In a real implementation, you would resolve each filing
    # to its XML URL and parse it. Simplified here:
    # parsed = [parse_form4_xml(url) for url in xml_urls]
    # signals = filter_meaningful_transactions(parsed)
    # clusters = detect_clusters(signals)
    # enriched = enrich_with_market_data(signals)
    # df = build_screener_output(enriched, clusters)

    # print(df.to_string(index=False))
    # df.to_json("signal_cache.json", orient="records", indent=2)

Step 8: Resolving Filing URLs to XML

One practical challenge is that the EDGAR search API returns filing index pages, not the XML documents directly. To get the XML, you need to fetch the filing index page and find the link to the primary XML document. Here is a helper function:

from bs4 import BeautifulSoup

def resolve_xml_url(index_url):
    """Given a filing index URL, find the primary XML document URL."""
    time.sleep(0.1)
    resp = requests.get(index_url, headers=HEADERS)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Look for the XML file in the filing documents table
    for row in soup.select("table.tableFile tr"):
        cells = row.find_all("td")
        if len(cells) >= 4:
            doc_type = cells[3].get_text(strip=True)
            if doc_type == "4":
                link = cells[2].find("a")
                if link and link.get("href", "").endswith(".xml"):
                    return "https://www.sec.gov" + link["href"]

    return None

Putting It All Together

Here is the complete pipeline in sequence:

  1. Fetch recent Form 4 filing metadata from EDGAR search
  2. Resolve each filing’s index page to its XML document URL
  3. Parse each XML document to extract issuer, insider, and transaction data
  4. Filter for open-market purchases (code P) above $10,000
  5. Detect cluster buying (2+ unique insiders within a 10-day window)
  6. Enrich with market data from yfinance (price, market cap, sector)
  7. Score each signal by insider role, dollar amount, cluster count, and recency
  8. Output to DataFrame, JSON, or your preferred format
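The sequence above can be wired into one driver function. Passing the step functions in as arguments rather than hard-coding them is a deliberate choice: it makes the pipeline trivial to test with fakes. This is a sketch assuming the function signatures defined in the earlier steps:

```python
def run_screener(fetch_filings, resolve_xml, parse_xml, filter_txns,
                 detect, enrich, score_output):
    """Wire the tutorial's steps together. Each argument is one of the
    functions defined earlier (fetch_recent_form4_filings, resolve_xml_url,
    parse_form4_xml, filter_meaningful_transactions, detect_clusters,
    enrich_with_market_data, build_screener_output)."""
    meta = fetch_filings()
    # Resolve index pages to XML URLs, dropping filings with no XML document.
    xml_urls = [u for u in (resolve_xml(m["url"]) for m in meta) if u]
    parsed = [parse_xml(u) for u in xml_urls]
    signals = filter_txns(parsed)
    clusters = detect(signals)
    enriched = enrich(signals)
    return score_output(enriched, clusters)
```

Because each stage is injected, you can unit-test the wiring with stub functions before pointing it at live EDGAR data.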

Production Considerations

The screener above is a working prototype. To run it in production, you would want to address several additional concerns:

Data Storage

Instead of processing filings each time the script runs, store parsed transactions in a database (SQLite, PostgreSQL, or DuckDB all work well). This lets you build a historical record of insider activity, compute rolling statistics, and avoid re-fetching and re-parsing filings you have already processed.
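A minimal sketch of that pattern with SQLite from the standard library. The table name, columns, and primary key here are illustrative choices, and INSERT OR IGNORE makes re-runs idempotent:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS insider_txns (
    accession TEXT,
    ticker    TEXT,
    insider   TEXT,
    date      TEXT,
    code      TEXT,
    shares    REAL,
    price     REAL,
    PRIMARY KEY (accession, insider, date, code, shares)
)
"""

def store_transactions(conn, rows):
    """Insert parsed transactions, silently skipping duplicates on re-runs."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO insider_txns VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(r["accession"], r["ticker"], r["insider"], r["date"],
          r["code"], r["shares"], r["price"]) for r in rows],
    )
    conn.commit()
```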

Incremental Updates

Rather than searching for filings over a date range each time, maintain a high-water mark (the most recent filing date or accession number you have processed) and only fetch new filings since that point. The EDGAR full-index files, updated daily, are ideal for this pattern.
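One simple way to keep a high-water mark is a small JSON state file; the file name and schema here are hypothetical:

```python
import json
import os

STATE_FILE = "screener_state.json"  # hypothetical state-file name

def load_high_water_mark(path=STATE_FILE):
    """Return the last processed filing date, or None on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f).get("last_filed")
    return None

def save_high_water_mark(last_filed, path=STATE_FILE):
    """Persist the most recent filing date we have processed."""
    with open(path, "w") as f:
        json.dump({"last_filed": last_filed}, f)
```

On each run, fetch only filings dated after the stored mark, then save the newest date you processed.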

Error Handling

EDGAR XML files are not always perfectly formed. Some filings have missing fields, unusual encodings, or namespace variations. Wrap your XML parsing in try/except blocks and log failures so you can investigate and handle edge cases. Process each filing independently so that one malformed filing does not block the entire batch.
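That advice can be captured in a small wrapper that isolates failures per filing. The wrapper itself is our own helper; parse_fn would be parse_form4_xml from Step 2:

```python
import logging

logger = logging.getLogger("screener")

def parse_all(xml_urls, parse_fn):
    """Parse each filing independently; log and skip failures so one
    malformed filing cannot block the batch."""
    parsed, failures = [], []
    for url in xml_urls:
        try:
            parsed.append(parse_fn(url))
        except Exception as exc:  # malformed XML, HTTP errors, etc.
            logger.warning("failed to parse %s: %s", url, exc)
            failures.append(url)
    return parsed, failures
```

The returned failures list gives you a worklist of filings to investigate by hand.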

Scheduling

Run the screener on a schedule (daily after market close is typical). Use APScheduler, cron, or a task queue like Celery. The SEC publishes most Form 4 filings during business hours Eastern Time, with a lag of a few hours to a day from the actual transaction date.

Sell Signal Analysis

While we focused on purchases (code P), a complete screener should also track sales (code S). You can compute a “net flow” metric for each company: total dollar value of purchases minus total dollar value of sales over a rolling window. A strongly positive net flow (many buyers, few sellers) is more informative than purchases alone.
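A minimal sketch of that net-flow metric, assuming you keep both purchase and sale transactions tagged with ticker, code, and dollar_value (as in the dicts built in Steps 2 and 3):

```python
from collections import defaultdict

def net_flow(transactions):
    """Dollar value of open-market purchases (P) minus sales (S), per ticker."""
    flow = defaultdict(float)
    for t in transactions:
        if t["code"] == "P":
            flow[t["ticker"]] += t["dollar_value"]
        elif t["code"] == "S":
            flow[t["ticker"]] -= t["dollar_value"]
    return dict(flow)
```

Computed over a rolling window (say, 90 days), a strongly positive value flags companies where insider buying clearly outweighs selling.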

Why This Works

Insider trading analysis works because corporate insiders have an asymmetric information advantage about their own companies. When they put their personal capital at risk by buying shares on the open market, they are expressing a view with real financial consequences. Aggregating these signals across thousands of companies and scoring them by conviction produces a systematic edge that has been documented in decades of academic research.

Skip the Build — Use Alpha Suite

Alpha Suite runs this entire pipeline automatically: continuous EDGAR monitoring, XML parsing, cluster detection, market data enrichment, conviction scoring, and risk-managed position sizing.
