How to Build an Insider Trading Screener with Python
A complete, working tutorial that takes you from fetching raw SEC EDGAR filings to outputting scored insider trading signals. We will parse Form 4 XML, filter for meaningful transactions, detect cluster buying, enrich with market data, and score every signal by conviction.
Prerequisites
You will need Python 3.8 or later and the following packages:
pip install requests beautifulsoup4 yfinance pandas
We will also use xml.etree.ElementTree from the Python standard library for XML parsing, and datetime and time for date handling and rate limiting.
The SEC requires all programmatic access to EDGAR to include a User-Agent header identifying your name and email. The SEC allows up to 10 requests per second. To be a good citizen, add time.sleep(0.1) between requests. Violating rate limits can result in your IP being temporarily blocked.
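Rather than sprinkling User-Agent headers and sleep calls throughout the code, you can centralize both in one small wrapper. This is an illustrative sketch; the `EdgarClient` name and the 0.1-second spacing are our own choices (the SEC's only hard limit is the 10 requests/second cap):

```python
import time
import requests

class EdgarClient:
    """Wraps requests with the required User-Agent and simple throttling."""

    def __init__(self, user_agent, min_interval=0.1):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = user_agent
        self.min_interval = min_interval  # 0.1s keeps us under 10 req/s

        self._last = 0.0

    def get(self, url, **kwargs):
        # Sleep just long enough to maintain the minimum spacing between calls
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
        return self.session.get(url, **kwargs)
```

Using a `requests.Session` also reuses the underlying TCP connection, which is friendlier to EDGAR than opening a new connection per request.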
Step 1: Fetch Recent Form 4 Filings from EDGAR
The SEC provides a full-text search API at https://efts.sec.gov/LATEST/search-index that allows you to query for specific form types with date filters. We will use this to find recently filed Form 4s.
import requests
import time
from datetime import datetime, timedelta
HEADERS = {
"User-Agent": "YourName [email protected]"
}
def fetch_recent_form4_filings(days_back=7, max_results=100):
"""Fetch recent Form 4 filing URLs from EDGAR full-text search."""
start_date = (datetime.now() - timedelta(days=days_back)).strftime("%Y-%m-%d")
end_date = datetime.now().strftime("%Y-%m-%d")
url = "https://efts.sec.gov/LATEST/search-index"
params = {
"q": '"Form 4"',
"dateRange": "custom",
"startdt": start_date,
"enddt": end_date,
"forms": "4",
"from": 0,
"size": max_results,
}
resp = requests.get(url, params=params, headers=HEADERS)
resp.raise_for_status()
data = resp.json()
filings = []
for hit in data.get("hits", {}).get("hits", []):
source = hit.get("_source", {})
filings.append({
"accession": source.get("file_num"),
"filed": source.get("file_date"),
"form_type": source.get("form_type"),
"entity": source.get("entity_name"),
"url": f"https://www.sec.gov/Archives/edgar/data/"
f"{source.get('ciks', [''])[0]}/"
f"{source.get('accession_no', '').replace('-', '')}/"
f"{source.get('accession_no')}-index.htm",
})
return filings
Alternatively, the SEC provides recent-filings RSS feeds that you can poll, and the EDGAR browse interface at https://www.sec.gov/cgi-bin/browse-edgar lets you query by company CIK and form type. For a production screener, you would typically rely on the EDGAR full-index files, which are updated nightly and list every filing for the period.
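If you go the full-index route, the master index files are pipe-delimited text with one row per filing. A minimal parser might look like the sketch below; the `parse_master_index` name is ours, and the five-column layout (CIK, company name, form type, date filed, filename) should be verified against the files you actually download:

```python
def parse_master_index(text):
    """Keep Form 4 rows from an EDGAR master index file.

    Each data row is pipe-delimited:
    CIK|Company Name|Form Type|Date Filed|Filename
    """
    rows = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) == 5 and parts[2].strip() == "4":
            cik, name, _form, filed, path = (p.strip() for p in parts)
            rows.append({
                "cik": cik,
                "company": name,
                "filed": filed,
                "url": "https://www.sec.gov/Archives/" + path,
            })
    return rows
```

Header and separator lines are skipped automatically because their form-type column is never exactly "4".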
Step 2: Fetch and Parse Form 4 XML
Each Form 4 filing on EDGAR includes an XML document that follows a schema defined by the SEC. The XML contains structured data for the issuer, the reporting owner, and every transaction. Here is how to parse it:
import xml.etree.ElementTree as ET
def parse_form4_xml(xml_url):
"""Parse a Form 4 XML filing and extract transaction data."""
time.sleep(0.1) # respect rate limit
resp = requests.get(xml_url, headers=HEADERS)
resp.raise_for_status()
root = ET.fromstring(resp.content)
# Issuer (company) info
issuer = root.find(".//issuer")
issuer_name = issuer.findtext("issuerName", "") if issuer is not None else ""
issuer_ticker = issuer.findtext("issuerTradingSymbol", "") if issuer is not None else ""
issuer_cik = issuer.findtext("issuerCik", "") if issuer is not None else ""
# Reporting owner (insider) info
owner = root.find(".//reportingOwner")
owner_name = ""
owner_relationship = []
if owner is not None:
owner_id = owner.find("reportingOwnerId")
if owner_id is not None:
owner_name = owner_id.findtext("rptOwnerName", "")
rel = owner.find("reportingOwnerRelationship")
if rel is not None:
if rel.findtext("isDirector", "0") == "1":
owner_relationship.append("Director")
if rel.findtext("isOfficer", "0") == "1":
title = rel.findtext("officerTitle", "Officer")
owner_relationship.append(title)
if rel.findtext("isTenPercentOwner", "0") == "1":
owner_relationship.append("10% Owner")
# Non-derivative transactions (Table I)
transactions = []
for txn in root.findall(".//nonDerivativeTransaction"):
coding = txn.find(".//transactionCoding")
code = coding.findtext("transactionCode", "") if coding is not None else ""
amounts = txn.find(".//transactionAmounts")
shares_elem = amounts.find("transactionShares/value") if amounts is not None else None
price_elem = amounts.find("transactionPricePerShare/value") if amounts is not None else None
acq_disp = amounts.findtext("transactionAcquiredDisposedCode/value", "") if amounts is not None else ""
shares = float(shares_elem.text) if shares_elem is not None and shares_elem.text else 0
price = float(price_elem.text) if price_elem is not None and price_elem.text else 0
date_elem = txn.find(".//transactionDate/value")
txn_date = date_elem.text if date_elem is not None else ""
# Post-transaction holdings
post_elem = txn.find(".//postTransactionAmounts/sharesOwnedFollowingTransaction/value")
post_shares = float(post_elem.text) if post_elem is not None and post_elem.text else 0
transactions.append({
"date": txn_date,
"code": code,
"shares": shares,
"price": price,
"acquired_disposed": acq_disp,
"post_shares": post_shares,
"dollar_value": shares * price,
})
return {
"issuer_name": issuer_name,
"ticker": issuer_ticker.upper(),
"issuer_cik": issuer_cik,
"owner_name": owner_name,
"relationship": ", ".join(owner_relationship),
"transactions": transactions,
}
The XML handling above is simplified. Some Form 4 XML files declare an XML namespace, in which case ElementTree stores tags as {namespace}tagName and the plain path queries above will not match; you may need to strip or handle namespaces explicitly. The beautifulsoup4 library can also be more forgiving with malformed XML.
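One way to sidestep the namespace issue is to strip namespace prefixes after parsing, so the `.//issuer`-style queries above keep working unchanged. A small helper along these lines (our own utility, not part of the standard library):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Drop '{uri}' prefixes in place so queries like './/issuer' match."""
    for elem in root.iter():
        # Comments/PIs have non-string tags, so guard before checking
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root
```

Call it once on the parsed root (`root = strip_namespaces(ET.fromstring(resp.content))`) before running any find queries.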
Step 3: Filter for Meaningful Signals
Not all Form 4 transactions are informative. The most valuable signal comes from open-market purchases (transaction code P), where insiders spend their own money to buy shares. We also want to filter by minimum dollar amount to exclude trivially small transactions.
def filter_meaningful_transactions(parsed_filings, min_dollar=10000):
"""Filter for open-market purchases above a minimum dollar threshold."""
signals = []
for filing in parsed_filings:
for txn in filing["transactions"]:
# Focus on open-market purchases
if txn["code"] != "P":
continue
# Minimum dollar filter
if txn["dollar_value"] < min_dollar:
continue
signals.append({
"ticker": filing["ticker"],
"issuer": filing["issuer_name"],
"insider": filing["owner_name"],
"relationship": filing["relationship"],
"date": txn["date"],
"shares": txn["shares"],
"price": txn["price"],
"dollar_value": txn["dollar_value"],
"post_shares": txn["post_shares"],
})
return signals
You may also want to track open-market sales (code S) for a complete picture of insider sentiment. However, academic research consistently shows that purchases are more informative than sales. Insiders sell for many reasons (diversification, tax obligations, personal expenses), but they generally buy for only one reason: they believe the stock is undervalued.
Step 4: Detect Cluster Buying
One of the strongest insider trading signals is cluster buying: multiple distinct insiders purchasing shares of the same company within a short time window. The logic is intuitive — if three different executives independently decide to buy stock in the same two-week period, they likely share a positive view of the company’s near-term prospects.
from collections import defaultdict
from datetime import datetime
def detect_clusters(signals, window_days=10, min_insiders=2):
"""Group transactions by ticker and detect cluster buying."""
by_ticker = defaultdict(list)
for s in signals:
by_ticker[s["ticker"]].append(s)
clusters = []
for ticker, txns in by_ticker.items():
# Sort by date
txns.sort(key=lambda x: x["date"])
# Sliding window: count unique insiders within window_days
for i, anchor in enumerate(txns):
anchor_date = datetime.strptime(anchor["date"], "%Y-%m-%d")
window_insiders = set()
window_txns = []
for t in txns:
t_date = datetime.strptime(t["date"], "%Y-%m-%d")
if 0 <= (t_date - anchor_date).days <= window_days:
window_insiders.add(t["insider"])
window_txns.append(t)
if len(window_insiders) >= min_insiders:
total_dollar = sum(t["dollar_value"] for t in window_txns)
clusters.append({
"ticker": ticker,
"issuer": anchor["issuer"],
"insider_count": len(window_insiders),
"insiders": list(window_insiders),
"total_dollar": total_dollar,
"start_date": anchor["date"],
"transactions": window_txns,
})
break # one cluster per ticker
# Sort by insider count descending, then by dollar amount
clusters.sort(key=lambda x: (-x["insider_count"], -x["total_dollar"]))
return clusters
A 10-day window with a minimum of 2 unique insiders is a reasonable starting point. You can tighten the criteria (3+ insiders, 7-day window) for higher-conviction signals, or loosen them for more coverage. Lakonishok and Lee (2001) found that the predictive power of insider buying increases with the number of insiders buying.
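To see the window logic in isolation before tuning it, here is a stripped-down version of the counting primitive that detect_clusters relies on, run against synthetic transactions (the `insiders_in_window` helper is ours, for illustration only):

```python
from datetime import datetime

def insiders_in_window(txns, anchor_date, window_days):
    """Count unique insiders trading within window_days on/after anchor_date."""
    anchor = datetime.strptime(anchor_date, "%Y-%m-%d")
    names = set()
    for t in txns:
        delta = (datetime.strptime(t["date"], "%Y-%m-%d") - anchor).days
        if 0 <= delta <= window_days:
            names.add(t["insider"])
    return len(names)

txns = [
    {"insider": "CEO A", "date": "2024-03-01"},
    {"insider": "CFO B", "date": "2024-03-06"},
    {"insider": "Director C", "date": "2024-03-20"},
]
```

With this data, a 10-day window anchored on 2024-03-01 captures two insiders, while a 30-day window captures all three: tightening the window trades coverage for conviction.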
Step 5: Enrich with Market Data
Raw insider transaction data is more useful when enriched with market context. We will use the yfinance library to get current price, market capitalization, sector, and 52-week price range for each ticker.
import yfinance as yf
def enrich_with_market_data(signals):
"""Add market data from yfinance to each signal."""
tickers = list(set(s["ticker"] for s in signals))
enriched = []
for ticker in tickers:
time.sleep(0.2) # be gentle with Yahoo Finance
try:
stock = yf.Ticker(ticker)
info = stock.info
market_data = {
"current_price": info.get("currentPrice") or info.get("regularMarketPrice"),
"market_cap": info.get("marketCap"),
"sector": info.get("sector", "Unknown"),
"industry": info.get("industry", "Unknown"),
"fifty_two_week_low": info.get("fiftyTwoWeekLow"),
"fifty_two_week_high": info.get("fiftyTwoWeekHigh"),
"average_volume": info.get("averageVolume"),
}
# Attach market data to all signals for this ticker
for s in signals:
if s["ticker"] == ticker:
enriched.append({**s, **market_data})
except Exception as e:
print(f"Warning: could not fetch data for {ticker}: {e}")
for s in signals:
if s["ticker"] == ticker:
enriched.append(s)
return enriched
Market cap is particularly important for signal quality. Academic research shows that insider purchases at small-cap and mid-cap companies tend to be more informative than purchases at mega-cap companies. This makes intuitive sense: insiders at smaller companies are more likely to have a material information advantage, and their purchases represent a larger fraction of their personal wealth.
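If you want to fold this observation into the scoring step, a simple market-cap weighting function is one option. The bands below (roughly small under $2B, mid under $10B, mega above $200B) are illustrative assumptions, not thresholds taken from the research:

```python
def market_cap_weight(market_cap):
    """Weight insider buys more heavily at smaller companies (illustrative bands)."""
    if market_cap is None:
        return 0.5   # unknown: stay neutral
    if market_cap < 2e9:
        return 1.0   # small cap: strongest information advantage
    if market_cap < 10e9:
        return 0.8   # mid cap
    if market_cap < 200e9:
        return 0.6   # large cap
    return 0.4       # mega cap: insider buys least informative
```

This slots naturally into the composite score in Step 6 as an additional weighted term.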
Step 6: Score Signals by Conviction
Now we combine all the data into a conviction score. The scoring function weights multiple factors:
- Insider role — CEO and CFO purchases carry more weight than director or 10% owner purchases, because C-suite executives typically have the deepest knowledge of the company’s operations and financial outlook
- Dollar amount — larger purchases signal stronger conviction
- Cluster count — multiple insiders buying is more informative than a single insider
- Recency — more recent transactions are more relevant (we apply exponential time decay)
import math
# Role weights
ROLE_WEIGHTS = {
"CEO": 1.0,
"CFO": 0.95,
"COO": 0.90,
"President": 0.90,
"VP": 0.75,
"Director": 0.65,
"10% Owner": 0.55,
}
def get_role_weight(relationship_str):
"""Extract the highest-weighted role from the relationship string."""
best = 0.5 # default for unknown roles
for role, weight in ROLE_WEIGHTS.items():
if role.lower() in relationship_str.lower():
best = max(best, weight)
return best
def score_signal(signal, cluster_map, decay_half_life=35):
"""Compute a conviction score for an insider purchase signal."""
# Role weight (0-1)
role_w = get_role_weight(signal.get("relationship", ""))
# Dollar conviction (log scale, normalized)
dollar = signal.get("dollar_value", 0)
dollar_w = min(math.log10(max(dollar, 1)) / 7, 1.0) # $10M = 1.0
# Cluster bonus
ticker = signal["ticker"]
cluster_info = cluster_map.get(ticker, {})
cluster_count = cluster_info.get("insider_count", 1)
cluster_w = min(cluster_count / 5, 1.0) # 5+ insiders = max
# Time decay (exponential)
try:
txn_date = datetime.strptime(signal["date"], "%Y-%m-%d")
days_ago = (datetime.now() - txn_date).days
decay = math.exp(-0.693 * days_ago / decay_half_life)
except (ValueError, KeyError):
decay = 0.5
# Weighted composite score
raw_score = (
0.25 * role_w +
0.30 * dollar_w +
0.30 * cluster_w +
0.15 * decay
)
return round(raw_score * 100, 1)
The weights above (25% role, 30% dollar, 30% cluster, 15% recency) are a reasonable starting point based on the academic literature. You can tune these weights through backtesting on historical data. The exponential decay with a 35-day half-life means that a signal loses half its time-weighted value every 35 days.
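To sanity-check the decay behavior, you can isolate the time-decay term and evaluate it at a few horizons: the weight should be 1.0 today, about 0.5 at one half-life, and about 0.25 at two half-lives:

```python
import math

def time_decay(days_ago, half_life=35):
    """exp(-ln(2) * t / half_life): the weight halves every half_life days."""
    return math.exp(-0.693 * days_ago / half_life)

# time_decay(0)  is 1.0
# time_decay(35) is ~0.5 (one half-life)
# time_decay(70) is ~0.25 (two half-lives)
```

The constant 0.693 is an approximation of ln(2); using `math.log(2)` directly avoids the (tiny) rounding error.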
Step 7: Output to DataFrame
Finally, we assemble everything into a pandas DataFrame and output the results:
import pandas as pd
import json
def build_screener_output(enriched_signals, clusters):
"""Build the final scored output."""
# Create cluster lookup
cluster_map = {}
for c in clusters:
cluster_map[c["ticker"]] = c
# Score each signal
for s in enriched_signals:
s["score"] = score_signal(s, cluster_map)
ci = cluster_map.get(s["ticker"], {})
s["cluster_count"] = ci.get("insider_count", 1)
# Sort by score descending
enriched_signals.sort(key=lambda x: -x["score"])
# Convert to DataFrame
df = pd.DataFrame(enriched_signals)
columns = [
"ticker", "issuer", "insider", "relationship", "date",
"shares", "price", "dollar_value", "cluster_count",
"market_cap", "sector", "score",
]
# Only include columns that exist
columns = [c for c in columns if c in df.columns]
df = df[columns]
return df
# --- Main execution ---
if __name__ == "__main__":
print("Fetching recent Form 4 filings...")
filings_meta = fetch_recent_form4_filings(days_back=14)
print(f"Found {len(filings_meta)} filings")
# In a real implementation, you would resolve each filing
# to its XML URL and parse it. Simplified here:
# parsed = [parse_form4_xml(url) for url in xml_urls]
# signals = filter_meaningful_transactions(parsed)
# clusters = detect_clusters(signals)
# enriched = enrich_with_market_data(signals)
# df = build_screener_output(enriched, clusters)
# print(df.to_string(index=False))
# df.to_json("signal_cache.json", orient="records", indent=2)
Step 8: Resolving Filing URLs to XML
One practical challenge is that the EDGAR search API returns filing index pages, not the XML documents directly. To get the XML, you need to fetch the filing index page and find the link to the primary XML document. Here is a helper function:
from bs4 import BeautifulSoup
def resolve_xml_url(index_url):
"""Given a filing index URL, find the primary XML document URL."""
time.sleep(0.1)
resp = requests.get(index_url, headers=HEADERS)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Look for the XML file in the filing documents table
for row in soup.select("table.tableFile tr"):
cells = row.find_all("td")
if len(cells) >= 4:
doc_type = cells[3].get_text(strip=True)
if doc_type == "4":
link = cells[2].find("a")
if link and link.get("href", "").endswith(".xml"):
return "https://www.sec.gov" + link["href"]
return None
Putting It All Together
Here is the complete pipeline in sequence:
- Fetch recent Form 4 filing metadata from EDGAR search
- Resolve each filing’s index page to its XML document URL
- Parse each XML document to extract issuer, insider, and transaction data
- Filter for open-market purchases (code P) above $10,000
- Detect cluster buying (2+ unique insiders within a 10-day window)
- Enrich with market data from yfinance (price, market cap, sector)
- Score each signal by insider role, dollar amount, cluster count, and recency
- Output to DataFrame, JSON, or your preferred format
Production Considerations
The screener above is a working prototype. To run it in production, you would want to address several additional concerns:
Data Storage
Instead of processing filings each time the script runs, store parsed transactions in a database (SQLite, PostgreSQL, or DuckDB all work well). This lets you build a historical record of insider activity, compute rolling statistics, and avoid re-fetching and re-parsing filings you have already processed.
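As a sketch of what the storage layer might look like with SQLite, here is a minimal table plus an idempotent insert. The schema, the `insider_txns` table name, and the choice of primary key are our own assumptions; adapt the columns to whatever your parser produces:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS insider_txns (
    accession TEXT,
    ticker    TEXT,
    insider   TEXT,
    txn_date  TEXT,
    shares    REAL,
    price     REAL,
    PRIMARY KEY (accession, insider, txn_date)
)
"""

def store_transactions(conn, rows):
    """Insert parsed transactions; re-running on the same data is a no-op."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO insider_txns VALUES (?, ?, ?, ?, ?, ?)",
        [
            (r["accession"], r["ticker"], r["insider"],
             r["date"], r["shares"], r["price"])
            for r in rows
        ],
    )
    conn.commit()
```

The `INSERT OR IGNORE` plus composite primary key is what makes re-processing safe: a filing you have already stored is silently skipped.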
Incremental Updates
Rather than searching for filings over a date range each time, maintain a high-water mark (the most recent filing date or accession number you have processed) and only fetch new filings since that point. The EDGAR full-index files, updated daily, are ideal for this pattern.
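A high-water mark can be as simple as a small JSON state file read at startup and written after each successful run. A possible sketch, where the `screener_state.json` filename and the `last_filed` key are our own assumptions:

```python
import json
import os

STATE_FILE = "screener_state.json"  # assumed name for the state file

def load_high_water_mark(default="1970-01-01"):
    """Return the most recent filing date we have already processed."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("last_filed", default)
    return default

def save_high_water_mark(last_filed):
    """Record progress so the next run only fetches newer filings."""
    with open(STATE_FILE, "w") as f:
        json.dump({"last_filed": last_filed}, f)
```

On each run, fetch filings dated after `load_high_water_mark()`, process them, then call `save_high_water_mark()` with the newest filing date you saw.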
Error Handling
EDGAR XML files are not always perfectly formed. Some filings have missing fields, unusual encodings, or namespace variations. Wrap your XML parsing in try/except blocks and log failures so you can investigate and handle edge cases. Process each filing independently so that one malformed filing does not block the entire batch.
Scheduling
Run the screener on a schedule (daily after market close is typical). Use APScheduler, cron, or a task queue like Celery. Under Section 16, insiders must file Form 4 within two business days of the transaction, so expect a lag of up to two business days between the trade date and the filing appearing on EDGAR; most filings arrive during business hours Eastern Time.
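If you prefer to avoid a scheduler dependency, the core computation is just "next occurrence of HH:MM". A stdlib-only sketch (naive datetimes; it assumes you are already working in Eastern time, and the `next_run` helper is our own):

```python
from datetime import datetime, timedelta

def next_run(now, hour=18, minute=30):
    """Next daily run at hour:minute (naive datetimes, assumed Eastern)."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate
```

A simple loop can then `time.sleep()` until `next_run(datetime.now())`, run the pipeline, and repeat.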
Sell Signal Analysis
While we focused on purchases (code P), a complete screener should also track sales (code S). You can compute a “net flow” metric for each company: total dollar value of purchases minus total dollar value of sales over a rolling window. A strongly positive net flow (many buyers, few sellers) is more informative than purchases alone.
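Given a table of parsed transactions, the net-flow metric reduces to a single group-by with pandas. A sketch, using the same `code` and `dollar_value` keys as the transaction dictionaries built earlier in this tutorial:

```python
import pandas as pd

def net_flow(txns):
    """Per-ticker net insider flow: purchase dollars minus sale dollars."""
    df = pd.DataFrame(txns)
    # Sign each transaction: purchases positive, sales negative
    df["signed"] = df.apply(
        lambda r: r["dollar_value"] if r["code"] == "P" else -r["dollar_value"],
        axis=1,
    )
    return df.groupby("ticker")["signed"].sum().sort_values(ascending=False)
```

To make this a rolling metric, filter the input to transactions within your lookback window (say, 90 days) before calling it.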
Insider trading analysis works because corporate insiders have an asymmetric information advantage about their own companies. When they put personal capital at risk by buying shares on the open market, they are expressing a view with real financial consequences. Aggregating these signals across thousands of companies and scoring them by conviction yields a systematic signal whose predictive power has been documented in decades of academic research.