SEC EDGAR: Not Really "Scraping" Anymore
The term "web scraping" implies parsing HTML pages, but the SEC has invested heavily in structured data APIs that make traditional scraping largely unnecessary. The EDGAR system (Electronic Data Gathering, Analysis, and Retrieval) is the SEC's filing database, and its modern data.sec.gov REST API provides clean JSON responses for most common use cases.
That said, some data -- particularly the content of individual filings like Form 4 XML documents -- still requires fetching and parsing individual files. This tutorial covers both the structured API endpoints and the techniques for parsing filing documents directly.
Required: User-Agent header. The SEC requires all programmatic requests to include a User-Agent header with your company name (or your name) and email address. Requests without a proper User-Agent will be blocked. The SEC uses this to contact you if your requests are causing problems.
Setting Up: Headers and Rate Limiting
Before making any requests, set up a session with proper headers and rate limiting:
import requests
import time
# Required: identify yourself to the SEC
HEADERS = {
"User-Agent": "YourCompany [email protected]"
}
# Create a session for connection reuse
session = requests.Session()
session.headers.update(HEADERS)
def rate_limited_get(url):
"""GET request with rate limiting for SEC fair access."""
time.sleep(0.1) # 10 requests/second max
response = session.get(url)
response.raise_for_status()
return response
The SEC's rate limit is 10 requests per second. The time.sleep(0.1) ensures you stay within this limit. In practice, being slightly more conservative (such as 0.12 seconds between requests) is wise to account for timing imprecision. If you exceed the rate limit, the SEC will temporarily block your IP address.
SEC EDGAR Key Endpoints
- Company submissions: data.sec.gov/submissions/CIK{cik}.json
- XBRL company facts: data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json
- Full-text search: efts.sec.gov/LATEST/search-index
- Filing documents: sec.gov/Archives/edgar/data/{cik}/{accession}/
- Recent filings feed: sec.gov/cgi-bin/browse-edgar
Finding a Company's CIK Number
Every company in EDGAR is identified by a CIK (Central Index Key) number. You need the CIK to use most API endpoints. There are several ways to look up a CIK from a ticker symbol.
The simplest method uses the SEC's company tickers JSON file, which maps tickers to CIKs:
def get_cik_from_ticker(ticker):
"""Look up CIK number from a stock ticker symbol."""
url = "https://www.sec.gov/files/company_tickers.json"
response = rate_limited_get(url)
data = response.json()
# The response is a dict with numeric keys
for entry in data.values():
if entry["ticker"].upper() == ticker.upper():
# CIK must be zero-padded to 10 digits for API URLs
return str(entry["cik_str"]).zfill(10)
return None
cik = get_cik_from_ticker("AAPL")
print(f"Apple's CIK: {cik}") # 0000320193
Apple's CIK is 320193. When constructing API URLs, the CIK must be zero-padded to 10 digits: 0000320193.
You can also look up CIK numbers through the EDGAR company search page at https://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=AAPL&type=&dateb=&owner=include&count=40&search_text=&action=getcompany. This returns an HTML page with the company's filings, which includes the CIK in the URL and page content.
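The browse-edgar endpoint can also be queried programmatically. The sketch below uses the output=atom parameter to request an XML response instead of an HTML page; the assumption that the response contains a <CIK> element (and the regex used to pull it out) should be verified against a live response before relying on it:
import re
def get_cik_from_browse_edgar(ticker):
    """Look up a CIK via the EDGAR company search endpoint."""
    url = "https://www.sec.gov/cgi-bin/browse-edgar"
    params = {
        "action": "getcompany",
        "CIK": ticker,        # browse-edgar accepts ticker symbols here
        "type": "10-K",
        "count": "1",
        "output": "atom",     # request XML instead of an HTML page
    }
    time.sleep(0.1)
    response = session.get(url, params=params)
    response.raise_for_status()
    # Assumed layout: the Atom response includes a <CIK> element
    match = re.search(r"<CIK>(\d+)</CIK>", response.text, re.IGNORECASE)
    return match.group(1).zfill(10) if match else None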
Getting Company Filing History
The company submissions endpoint returns a comprehensive JSON object with the company's filing history:
def get_company_filings(cik):
"""Fetch all filing metadata for a company."""
url = f"https://data.sec.gov/submissions/CIK{cik}.json"
response = rate_limited_get(url)
data = response.json()
# Company info
print(f"Company: {data['name']}")
print(f"Ticker(s): {data.get('tickers', [])}")
print(f"SIC: {data.get('sic', 'N/A')} - {data.get('sicDescription', '')}")
# Recent filings are in data['filings']['recent']
recent = data['filings']['recent']
# Convert to DataFrame for easier manipulation
import pandas as pd
filings_df = pd.DataFrame({
'accessionNumber': recent['accessionNumber'],
'filingDate': recent['filingDate'],
'form': recent['form'],
'primaryDocument': recent['primaryDocument'],
})
return filings_df, data
filings_df, company_data = get_company_filings("0000320193")
print(f"\nTotal recent filings: {len(filings_df)}")
print(filings_df.head(10))
The filings['recent'] object contains parallel arrays for each field (accession number, filing date, form type, etc.) covering at least the company's most recent 1,000 filings (or one year of filings, whichever is more). For companies with longer histories, the remaining filings are listed in supplementary JSON files referenced by the filings['files'] array.
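A sketch of following that pagination, assuming each entry in filings['files'] carries a name field pointing to a file served from data.sec.gov/submissions/ with the same parallel-array layout as filings['recent']:
import pandas as pd
def get_all_filings(cik):
    """Fetch the complete filing history, following pagination files."""
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    data = rate_limited_get(url).json()
    pages = [data['filings']['recent']]
    # Older filings are split across supplementary JSON files
    for extra in data['filings'].get('files', []):
        extra_url = f"https://data.sec.gov/submissions/{extra['name']}"
        pages.append(rate_limited_get(extra_url).json())
    # Each page holds parallel arrays; concatenate them into one DataFrame
    frames = [
        pd.DataFrame({
            'accessionNumber': p['accessionNumber'],
            'filingDate': p['filingDate'],
            'form': p['form'],
            'primaryDocument': p['primaryDocument'],
        })
        for p in pages
    ]
    return pd.concat(frames, ignore_index=True)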
Filtering by Form Type
You can filter the filings DataFrame to find specific form types. The most common form types for insider trading analysis are:
- Form 4: Statement of changes in beneficial ownership (the primary insider trading disclosure)
- Form 3: Initial statement of beneficial ownership
- Form 5: Annual statement of changes in beneficial ownership
- Form 144: Notice of proposed sale of securities
# Filter for Form 4 filings only
form4s = filings_df[filings_df['form'] == '4'].copy()
print(f"Form 4 filings: {len(form4s)}")
# Filter for filings in the last 30 days
from datetime import datetime, timedelta
cutoff = (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d')
recent_form4s = form4s[form4s['filingDate'] >= cutoff]
print(f"Form 4s in last 30 days: {len(recent_form4s)}")
Parsing Form 4 XML Documents
Form 4 filings are submitted in XML format. Each Form 4 reports insider transactions -- purchases, sales, grants, and other changes in beneficial ownership. To extract the actual transaction data, you need to download and parse the XML document.
Downloading a Form 4 Filing
import xml.etree.ElementTree as ET
def fetch_form4_xml(cik, accession_number):
    """Download the Form 4 XML document for a filing."""
    # Format accession number for the URL (remove dashes)
    acc_no_dashes = accession_number.replace('-', '')
    base_url = (
        f"https://www.sec.gov/Archives/edgar/data/"
        f"{cik.lstrip('0')}/{acc_no_dashes}/"
    )
    # The filing's directory listing (index.json) names every document
    # in the filing. Form 4 data files are XML, but the filename varies
    # from filing to filing, so look it up rather than guessing.
    index = rate_limited_get(base_url + "index.json").json()
    xml_name = next(
        item["name"]
        for item in index["directory"]["item"]
        if item["name"].lower().endswith(".xml")
    )
    response = rate_limited_get(base_url + xml_name)
    return response.text
Extracting Transaction Data
The Form 4 XML schema contains several important sections: the issuer (the company), the reporting owner (the insider), and the transaction details. Here is a comprehensive parser:
def parse_form4(xml_text):
"""
Parse Form 4 XML and extract transaction details.
Returns a dict with issuer info, owner info, and list of transactions.
"""
root = ET.fromstring(xml_text)
# Namespace handling: Form 4 XML may or may not use namespaces
# Try without namespace first
ns = ''
if root.tag.startswith('{'):
ns = root.tag.split('}')[0] + '}'
def find(element, tag):
"""Find a child element, handling optional namespace."""
result = element.find(f"{ns}{tag}")
if result is None:
result = element.find(tag)
return result
def find_text(element, tag, default=''):
"""Get text content of a child element."""
el = find(element, tag)
return el.text.strip() if el is not None and el.text else default
result = {
'issuer': {},
'owner': {},
'transactions': []
}
# Issuer information
issuer = find(root, 'issuer')
if issuer is not None:
result['issuer'] = {
'cik': find_text(issuer, 'issuerCik'),
'name': find_text(issuer, 'issuerName'),
'ticker': find_text(issuer, 'issuerTradingSymbol'),
}
# Reporting owner
owner = find(root, 'reportingOwner')
if owner is not None:
owner_id = find(owner, 'reportingOwnerId')
owner_rel = find(owner, 'reportingOwnerRelationship')
if owner_id is not None:
result['owner']['name'] = find_text(owner_id, 'rptOwnerName')
result['owner']['cik'] = find_text(owner_id, 'rptOwnerCik')
if owner_rel is not None:
result['owner']['isDirector'] = find_text(owner_rel, 'isDirector') == '1'
result['owner']['isOfficer'] = find_text(owner_rel, 'isOfficer') == '1'
result['owner']['officerTitle'] = find_text(owner_rel, 'officerTitle')
# Non-derivative transactions (common stock buys/sells)
nd_table = find(root, 'nonDerivativeTable')
if nd_table is not None:
for txn in nd_table:
if 'Transaction' not in txn.tag:
continue
coding = find(txn, 'transactionCoding')
amounts = find(txn, 'transactionAmounts')
if coding is None or amounts is None:
continue
# Transaction code: P = Purchase, S = Sale
txn_code = find_text(coding, 'transactionCode')
shares_el = find(amounts, 'transactionShares')
price_el = find(amounts, 'transactionPricePerShare')
shares = float(find_text(shares_el, 'value', '0')) if shares_el is not None else 0
price = float(find_text(price_el, 'value', '0')) if price_el is not None else 0
            # transactionDate wraps the date in a nested <value> element
            date_el = find(txn, 'transactionDate')
            txn_date = find_text(date_el, 'value') if date_el is not None else ''
            result['transactions'].append({
                'date': txn_date,
                'code': txn_code,
                'shares': shares,
                'price': price,
                'dollar_value': shares * price,
            })
return result
The transaction code field is the most important piece of data for insider trading analysis. The key codes are:
- P -- Open market or private purchase of securities
- S -- Open market or private sale of securities
- A -- Grant or award (stock compensation)
- M -- Exercise or conversion of derivative security (e.g., stock options)
- F -- Payment of exercise price or tax liability by delivering securities
- G -- Gift of securities
For insider trading signal analysis, the most informative codes are P (open market purchases, where the insider is spending their own money) and S (open market sales). Grants (A) and option exercises (M) are part of compensation and generally carry less informational signal.
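For downstream analysis it helps to collapse a parsed filing into dollar totals keyed by these codes. A small illustrative helper (the function name and output shape are arbitrary), building on parse_form4:
def summarize_form4(parsed):
    """Aggregate parsed Form 4 transactions into dollar totals by type."""
    summary = {'buy_value': 0.0, 'sell_value': 0.0, 'other_value': 0.0}
    for txn in parsed['transactions']:
        if txn['code'] == 'P':       # open market purchase
            summary['buy_value'] += txn['dollar_value']
        elif txn['code'] == 'S':     # open market sale
            summary['sell_value'] += txn['dollar_value']
        else:                        # grants, exercises, gifts, etc.
            summary['other_value'] += txn['dollar_value']
    return summary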
Complete Example: Recent Insider Transactions
Here is a complete working example that fetches all Form 4 filings for a company in the last 30 days and extracts the buy/sell transactions:
import requests
import time
import xml.etree.ElementTree as ET
import pandas as pd
from datetime import datetime, timedelta
HEADERS = {"User-Agent": "YourCompany [email protected]"}
session = requests.Session()
session.headers.update(HEADERS)
def get_recent_insider_trades(ticker, days=30):
"""
Fetch insider buy/sell transactions for a company
from the last N days.
"""
# Step 1: Get CIK
tickers_url = "https://www.sec.gov/files/company_tickers.json"
time.sleep(0.1)
tickers_data = session.get(tickers_url).json()
cik = None
for entry in tickers_data.values():
if entry["ticker"].upper() == ticker.upper():
cik = str(entry["cik_str"]).zfill(10)
break
if cik is None:
raise ValueError(f"Ticker {ticker} not found")
# Step 2: Get filings
time.sleep(0.1)
sub_url = f"https://data.sec.gov/submissions/CIK{cik}.json"
sub_data = session.get(sub_url).json()
recent = sub_data['filings']['recent']
# Step 3: Filter Form 4s from last N days
cutoff = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
trades = []
for i, form in enumerate(recent['form']):
if form != '4':
continue
if recent['filingDate'][i] < cutoff:
continue
acc = recent['accessionNumber'][i]
acc_no_dash = acc.replace('-', '')
        # Step 4: Locate and fetch each Form 4 XML via the filing index
        base_url = (
            f"https://www.sec.gov/Archives/edgar/data/"
            f"{cik.lstrip('0')}/{acc_no_dash}/"
        )
        try:
            time.sleep(0.1)
            index = session.get(base_url + "index.json").json()
            # The Form 4 XML filename varies, so find it in the index
            xml_name = next(
                (item["name"] for item in index["directory"]["item"]
                 if item["name"].lower().endswith(".xml")),
                None,
            )
            if xml_name is None:
                continue
            time.sleep(0.1)
            resp = session.get(base_url + xml_name)
            if resp.status_code != 200:
                continue
            parsed = parse_form4(resp.text)
for txn in parsed['transactions']:
if txn['code'] in ('P', 'S'):
trades.append({
'filing_date': recent['filingDate'][i],
'owner': parsed['owner'].get('name', ''),
'title': parsed['owner'].get('officerTitle', ''),
'type': 'BUY' if txn['code'] == 'P' else 'SELL',
'shares': txn['shares'],
'price': txn['price'],
'value': txn['dollar_value'],
})
except Exception as e:
print(f"Error parsing {acc}: {e}")
continue
return pd.DataFrame(trades)
# Example usage
df = get_recent_insider_trades("AAPL", days=30)
print(df.to_string(index=False))
Note on XML parsing: Not all Form 4 filings follow the exact same XML structure. Some older filings may have different element names or missing fields. The parser above handles common variations, but production code should include more robust error handling for edge cases.
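If you also need the derivative side (options, RSUs, warrants), the derivative table mirrors the non-derivative structure and can be handled with the same find/find_text helpers. Below is a sketch of a block that could sit inside parse_form4 just before return result; the element names follow the standard ownership XML schema, but as noted above, verify them against the filings you actually encounter:
    # Derivative transactions (options, RSUs, warrants): the structure
    # mirrors the non-derivative table.
    d_table = find(root, 'derivativeTable')
    if d_table is not None:
        result['derivative_transactions'] = []
        for txn in d_table:
            if 'Transaction' not in txn.tag:
                continue
            coding = find(txn, 'transactionCoding')
            amounts = find(txn, 'transactionAmounts')
            security = find(txn, 'securityTitle')
            if coding is None or amounts is None:
                continue
            shares_el = find(amounts, 'transactionShares')
            price_el = find(amounts, 'transactionPricePerShare')
            result['derivative_transactions'].append({
                'security': find_text(security, 'value') if security is not None else '',
                'code': find_text(coding, 'transactionCode'),
                'shares': float(find_text(shares_el, 'value', '0')) if shares_el is not None else 0.0,
                'price': float(find_text(price_el, 'value', '0')) if price_el is not None else 0.0,
            })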
XBRL Structured Financial Data
For fundamental financial data (revenue, earnings, assets, etc.), the SEC provides structured XBRL data through a dedicated API. This is far more efficient than parsing 10-K/10-Q HTML documents:
def get_company_facts(cik):
"""
Fetch structured XBRL financial data for a company.
Returns all reported financial facts (revenue, net income, etc.)
"""
url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json"
time.sleep(0.1)
response = session.get(url)
data = response.json()
print(f"Company: {data['entityName']}")
# Facts are organized by taxonomy (us-gaap, dei, etc.)
# and then by concept (Revenue, NetIncomeLoss, etc.)
us_gaap = data['facts'].get('us-gaap', {})
print(f"Available US-GAAP concepts: {len(us_gaap)}")
    # Example: get quarterly revenue.
    # Note: filers tag revenue under different concepts -- Apple, for
    # instance, uses RevenueFromContractWithCustomerExcludingAssessedTax
    # rather than Revenues -- so check the common variants.
    revenue_concepts = (
        'Revenues',
        'RevenueFromContractWithCustomerExcludingAssessedTax',
    )
    for concept in revenue_concepts:
        if concept not in us_gaap:
            continue
        revenue_data = us_gaap[concept]['units']['USD']
        # Filter for values reported in 10-Q filings (quarterly)
        quarterly = [r for r in revenue_data if r.get('form') == '10-Q']
        for item in quarterly[-4:]:  # last 4 reported values
            print(f"  {item['end']}: ${item['val']:,.0f}")
        break
return data
facts = get_company_facts("0000320193")
The company facts endpoint returns every XBRL-tagged data point the company has ever filed. Common concepts include Revenues, NetIncomeLoss, Assets, StockholdersEquity, and hundreds of others. The data is organized by taxonomy (primarily us-gaap) and unit of measure (USD, shares, etc.).
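When you only need a single concept, the companyconcept endpoint returns that one tag's full history without downloading the entire facts file. A sketch follows; the 10-K/FY filter is an illustrative choice, not the only way to slice the data:
def get_concept_history(cik, concept, taxonomy="us-gaap", unit="USD"):
    """Fetch the reported history of a single XBRL concept."""
    url = (
        f"https://data.sec.gov/api/xbrl/companyconcept/"
        f"CIK{cik}/{taxonomy}/{concept}.json"
    )
    time.sleep(0.1)
    data = session.get(url).json()
    rows = data["units"][unit]
    # Keep annual figures as reported in 10-K filings
    annual = [r for r in rows if r.get("form") == "10-K" and r.get("fp") == "FY"]
    return pd.DataFrame(annual)
# Example: Apple's annual net income history
print(get_concept_history("0000320193", "NetIncomeLoss").tail())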
Full-Text Search
The EDGAR full-text search system (EFTS) allows you to search across the content of all filings. This is useful for finding specific disclosures, risk factors, or mentions of particular topics:
def search_filings(query, forms=None, start_date=None, end_date=None):
"""
Search EDGAR filings by text content.
Parameters:
query: search text (supports quotes for exact phrases)
forms: comma-separated form types (e.g., "4,10-K")
start_date: YYYY-MM-DD format
end_date: YYYY-MM-DD format
"""
base_url = "https://efts.sec.gov/LATEST/search-index"
params = {"q": query}
if forms:
params["forms"] = forms
if start_date and end_date:
params["dateRange"] = "custom"
params["startdt"] = start_date
params["enddt"] = end_date
time.sleep(0.1)
response = session.get(base_url, params=params)
data = response.json()
print(f"Total hits: {data.get('hits', {}).get('total', {}).get('value', 0)}")
for hit in data.get('hits', {}).get('hits', [])[:5]:
source = hit['_source']
print(f" {source.get('file_date', '')} | "
f"{source.get('display_names', [''])[0]} | "
f"{source.get('form_type', '')}")
return data
# Search for insider purchases in Form 4 filings
results = search_filings(
query='"open market purchase"',
forms="4",
start_date="2024-01-01",
end_date="2024-12-31"
)
Rate Limiting and Fair Access Policy
The SEC publishes a fair access policy for EDGAR that all programmatic users must follow. The key rules are:
- Identify yourself: Include a User-Agent header with your name/company and email address. The SEC will use this to contact you if there are issues.
- Limit request rate: No more than 10 requests per second. The SEC will block your IP if you exceed this limit.
- Cache responses: If you need the same data repeatedly, cache it locally rather than re-fetching from EDGAR.
- Off-peak access: For large bulk downloads, prefer off-peak hours (nights and weekends ET).
- No automated downloading of all filings: If you need bulk data, use the SEC's bulk data files (such as the submissions and company facts archives listed on the SEC's developer page) rather than crawling the filing tree.
# Robust rate limiter with retry logic
class EdgarClient:
def __init__(self, user_agent, max_retries=3):
self.session = requests.Session()
self.session.headers.update({"User-Agent": user_agent})
self.last_request = 0
self.min_interval = 0.1 # 10 req/sec
self.max_retries = max_retries
def get(self, url, params=None):
for attempt in range(self.max_retries):
# Enforce rate limit
elapsed = time.time() - self.last_request
if elapsed < self.min_interval:
time.sleep(self.min_interval - elapsed)
self.last_request = time.time()
response = self.session.get(url, params=params)
if response.status_code == 200:
return response
elif response.status_code == 429:
# Rate limited: back off exponentially
wait = 2 ** attempt * 5
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
else:
response.raise_for_status()
raise Exception(f"Failed after {self.max_retries} retries: {url}")
client = EdgarClient("YourCompany [email protected]")
Ethics and Legality
SEC EDGAR data is public information. Every filing is a public record, and accessing it programmatically is entirely legal. The SEC actively encourages developers to build tools that make this data more accessible -- that is the entire purpose of the EDGAR system.
The only legal obligations are to follow the fair access policy (rate limits, identification) and to not misrepresent the data. You cannot, for example, present fabricated SEC filings as real. But downloading, analyzing, and building products around genuine EDGAR data is explicitly permitted.
The SEC has also released several bulk data products specifically for programmatic access, including the company tickers JSON file, the XBRL API, and the full-text search system. These were designed for exactly the type of analysis described in this tutorial.
One important nuance: while the filings themselves are public, trading on material non-public information is illegal. The distinction is between information that has been filed with the SEC (public) and information that has not yet been disclosed (non-public). Everything in EDGAR has already been disclosed and is therefore public information that anyone can legally trade on.
Beyond the Basics: Building a Filing Monitor
For real-time insider trading analysis, you need a system that continuously monitors for new Form 4 filings. The SEC updates EDGAR throughout the day, with most filings appearing within minutes of submission. The recent filings feed at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&type=4&dateb=&owner=include&count=40&search_text=&start=0&output=atom provides an Atom feed of the most recent filings by form type.
A production monitoring system typically polls the feed every few minutes, compares against previously seen accession numbers, and parses any new Form 4 filings. The latency from filing to detection is usually under 5 minutes, which is fast enough for most trading strategies based on insider activity (insiders have up to two business days to file Form 4, so the information is already somewhat delayed).
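A minimal polling loop along these lines is sketched below. The feed URL is the one shown above and the Atom namespace is standard, but the regex for pulling accession numbers out of entry links is an assumption about the feed layout that should be checked against a live response:
import re
import time
import xml.etree.ElementTree as ET
FEED_URL = (
    "https://www.sec.gov/cgi-bin/browse-edgar"
    "?action=getcurrent&type=4&owner=include&count=40&output=atom"
)
ATOM_NS = "{http://www.w3.org/2005/Atom}"
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")
def poll_form4_feed(seen):
    """Return accession numbers from the feed that have not been seen yet."""
    time.sleep(0.1)
    resp = session.get(FEED_URL)  # session with SEC headers, defined earlier
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    new = []
    for entry in root.findall(f"{ATOM_NS}entry"):
        link = entry.find(f"{ATOM_NS}link")
        href = link.get("href", "") if link is not None else ""
        match = ACCESSION_RE.search(href)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))
            new.append(match.group(0))
    return new
seen = set()
while True:
    for accession in poll_form4_feed(seen):
        print(f"New Form 4 filing: {accession}")
        # fetch_form4_xml / parse_form4 from earlier would slot in here
    time.sleep(120)  # poll every two minutes, well within fair access limits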
Building a robust SEC data pipeline involves handling edge cases: filings that are amended (Form 4/A), filings that are withdrawn, XML parsing errors, network timeouts, and the occasional change in EDGAR's response format. The key principle is defensive coding -- every request can fail, every XML document can have unexpected structure, and the system should log and continue rather than crash.
Automated SEC Filing Analysis
Alpha Suite monitors SEC EDGAR Form 4 filings in real time, automatically extracting and scoring insider buy/sell transactions to generate quantitative trading signals.