Peppercorn EDGAR Data Loader

No extraction results to review. Use the Extract tab to search and extract data; results will then appear here.

Review Queue


Extraction Log

Data Browser

🧠 AI Suggestions

Claude analyzes your reviewer feedback and correction patterns to suggest improvements to extraction prompts, smart chunk patterns, relevance keywords, and configuration.

Click Generate Suggestions to analyze your feedback and correction patterns.
Requires at least one reviewer comment (add comments on the Review page).

📋 Applied Suggestions

History of accepted ML suggestions. Revert any change to restore the previous configuration.

No suggestions applied yet.

💬 Reviewer Feedback

Comments from domain extraction reviews. These feed into the AI suggestion engine.

No feedback yet. Add comments on the Review page during extraction review.

📊 Correction Patterns

Recurring corrections from approved extractions. High-frequency patterns are auto-applied to future extractions. Configure thresholds in Settings → ML.

Domain	Field	Extracted Value	Corrected To	Freq	Last Seen	Status
Click Refresh to load correction patterns

Data Chat

Ask questions about extracted data or trigger new extractions. Examples: "Show NAV for CIK 1920145", "What are the latest distributions for Blue Owl?", "Compare leverage across all BDCs"

Ask a question about your extracted EDGAR data.

Company Search

XBRL Screener — Cross-Company Metrics

Concepts

Period

Peppercorn EDGAR Data Loader — Help

📖 Reference: Fund Types & Filing Types

SEC Filing Types

User Guide

1. Manual Search Mode

1. Search: Enter a CIK number. Optionally filter by Filing Group (Event, Periodic, Registration, RIC) or a specific Filing Type. The system returns filings with the detected fund type.

2. Select Filing: Click a filing row to select it. If the filing was already extracted, a warning dialog appears. The filing content is fetched from EDGAR and domain suggestions appear based on fund type.

3. Select Domains: Domain groups are shown with auto-suggested tables highlighted. Click a group header to select/deselect all, or click individual tables. Suggested tables are based on fund type + filing type.

4. Extract: Click Extract Selected. Each domain is sent to the LLM, except for N-PORT filings, which are parsed directly from XML without an LLM. Results appear on the Review tab.

5. Review & Push: Edit any cell in the review table. Purple auto fields are read-only. Select which domains to push with checkboxes. Click Approve & Push to insert into the database. Empty domains are auto-skipped.

2. Smart Fetch Mode

Smart Fetch automates the filing selection process. Enter a CIK and choose a period — the system determines which filings to retrieve based on fund type and the filing waterfall strategy.

Filing Group filter: Narrow to Event, Periodic, Registration, or RIC filings. Only domains that map to those filing types are suggested.

Period: Most Recent (latest filing per type), Quarter (all filings in Q1-Q4 of a year), or Annual (full year).
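As a rough illustration of the Quarter and Annual options, the sketch below maps a year (and optional quarter) to a date range. The function name period_bounds is hypothetical, and the use of calendar quarters is an assumption; the loader may resolve periods differently (for example by fiscal period or filing date).

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def period_bounds(year: int, quarter: Optional[int] = None) -> Tuple[date, date]:
    """Return (start, end) for a calendar quarter, or for the full year if quarter is None."""
    if quarter is None:
        # Annual: the whole calendar year
        return date(year, 1, 1), date(year, 12, 31)
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)
```

Filings whose date falls inside the returned range would then be candidates for the selected period.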

Deep Scan: Re-fetches the full filing from EDGAR and uses smart chunking to find relevant sections. It activates automatically when a filing appears truncated (content length ≤ Auto Deep Scan Threshold in Settings). Checking the Deep Scan box forces a deep scan regardless of content length. Configure the deep scan domains and the auto threshold in Settings.

Batch Mode (50% off): Submits all extractions via the Anthropic Batch API. Takes 1-5 minutes but costs 50% less. Results are polled automatically.

Direct Ingest: Requires Batch Mode. Bypasses the review step — data is pushed directly to the database when the batch completes. Use with caution.

Result Consolidation: If multiple filings are fetched for the same domain, only the extraction with the most rows is kept. You won't see duplicate tabs in the review.
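The consolidation rule can be sketched as follows. The result shape (a dict with "domain" and "rows" keys) and the function name consolidate are assumptions for illustration, as is the tie-breaking rule (the first extraction seen wins when row counts are equal).

```python
from typing import Dict, List


def consolidate(results: List[dict]) -> Dict[str, dict]:
    """Keep, for each domain, only the extraction with the most rows."""
    best: Dict[str, dict] = {}
    for result in results:
        current = best.get(result["domain"])
        # Replace only when strictly larger, so ties keep the earlier result
        if current is None or len(result["rows"]) > len(current["rows"]):
            best[result["domain"]] = result
    return best
```

With two filings that both produce rows for the same domain, only the larger extraction survives, which is why the review shows a single tab per domain.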

3. Review & Approve

Domain Tabs: Each extracted domain appears as a tab. The tab shows the row count. Click a tab to switch. Empty domains show "No records found" and are auto-skipped on push.

Editing: Click any white cell to edit. Changed cells turn amber. Purple auto fields (e.g. RegulatoryFilingID, timestamps) are computed automatically and cannot be edited.

Source Location: The green "Source" column shows where in the filing each row's data was found (e.g. "Item 1 - Financial Statements, NAV table"). This is stored in the extraction log for audit.

Selective Push: Use checkboxes on domain tabs to select which domains to push. Buttons: [Select All] [Select Non-Empty] [Deselect All]. The Approve button shows the count: "Approve 8/29 Domains & Push".

Fund Type Display: The review meta bar shows detected fund type and sub-type as badges (e.g. [BDC] [BDC-TOF]). These are stored only in the Filing Master table.

4. Database Architecture

Dependency Chain: When you approve, the system executes 4 steps in order to ensure referential integrity:

1. T_PORT_PORTFOLIO — Lookup by CIK → name → CREATE with MAX(ID)+1, sets FundTypeCode
2. T_PORT_SHARE_CLASS — Per unique _ShareClassName, lookup → CREATE
3. T_PE_FUND_REGULATORY_FILING_MASTER — Upsert by RegulatoryFilingID
4. Target domain table — Upsert with injected PortfolioID, ShareClassID, AccessionID

Insert Resilience: Each row is wrapped in a PostgreSQL SAVEPOINT. If one row fails, subsequent rows still insert. Rows with null primary keys are skipped. Empty/garbage rows (fewer than 2 non-null fields) are filtered out before insert.
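A minimal sketch of the per-row SAVEPOINT pattern follows. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the loader itself targets PostgreSQL, where a failed statement aborts the whole transaction unless it is wrapped in a savepoint. The table nav, its columns, and the function names are hypothetical.

```python
import sqlite3


def is_meaningful(row: dict) -> bool:
    """Rows with fewer than 2 non-null fields are treated as empty/garbage."""
    return sum(value is not None for value in row.values()) >= 2


def insert_rows(conn: sqlite3.Connection, rows: list) -> tuple:
    """Insert rows one by one; return (inserted, skipped, failed) counts."""
    inserted = skipped = failed = 0
    cur = conn.cursor()
    cur.execute("BEGIN")
    for row in rows:
        # Skip rows with a null primary key or too few populated fields
        if row.get("id") is None or not is_meaningful(row):
            skipped += 1
            continue
        cur.execute("SAVEPOINT row_sp")
        try:
            cur.execute(
                "INSERT INTO nav (id, fund, nav) VALUES (:id, :fund, :nav)", row
            )
            cur.execute("RELEASE SAVEPOINT row_sp")
            inserted += 1
        except sqlite3.Error:
            # Undo only this row; earlier and later rows are unaffected
            cur.execute("ROLLBACK TO SAVEPOINT row_sp")
            cur.execute("RELEASE SAVEPOINT row_sp")
            failed += 1
    cur.execute("COMMIT")
    return inserted, skipped, failed


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE nav (id INTEGER PRIMARY KEY, fund TEXT NOT NULL, nav REAL)")
rows = [
    {"id": 1, "fund": "A", "nav": 10.0},
    {"id": 1, "fund": "B", "nav": 11.0},     # duplicate key: fails, is rolled back
    {"id": None, "fund": "C", "nav": 12.0},  # null primary key: skipped
    {"id": 2, "fund": None, "nav": None},    # only one non-null field: skipped
    {"id": 3, "fund": "D", "nav": 13.0},
]
counts = insert_rows(conn, rows)
```

The design point is that a bad row costs only that row: the savepoint is rolled back and the loop continues inside the same transaction.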

Schemas: PE data goes to newdev_private_equity (34 tables). Reference data (Portfolio, ShareClass) lives in newdev_public_equity. Both are configurable in Settings.

5. Cost & Token Tracking

Cost Tracker Bar: Below the title bar, shows last extraction cost and session totals in real time. Resets when you start a new extraction.

Activity Log: Each extraction shows tokens and cost inline (e.g. "12,450 in / 8,200 out · $0.1278").

Extraction Log: The Log tab shows a persistent Cost $ column stored in the database. The header shows total cost for the filtered view. Export to CSV includes cost data.

Pricing: Sonnet 4: $3/$15 per 1M tokens (in/out). Opus 4: $15/$75. Haiku 4.5: $1/$5. Batch mode applies 50% discount automatically.
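The cost arithmetic implied by these prices can be sketched as below. The dictionary keys and the function name extraction_cost are hypothetical; the prices and the 50% batch discount are the ones listed above.

```python
# (input, output) price in USD per 1M tokens, as listed in the Pricing note
PRICING = {
    "sonnet-4": (3.00, 15.00),
    "opus-4": (15.00, 75.00),
    "haiku-4.5": (1.00, 5.00),
}


def extraction_cost(model: str, tokens_in: int, tokens_out: int, batch: bool = False) -> float:
    """Return the USD cost of one extraction; batch mode halves the cost."""
    price_in, price_out = PRICING[model]
    cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return cost * 0.5 if batch else cost
```

For example, 10,000 input tokens and 2,000 output tokens on Sonnet 4 cost 0.03 + 0.03 = $0.06, or $0.03 in batch mode.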

6. Settings & Configuration

Database: Host, port, database name, user, password, PE schema, reference schema, SSL mode. Changes take effect immediately.

Anthropic API Key: Required for LLM extraction. Not needed for N-PORT XML parsing.

Extraction Settings: Domain throttle (seconds between extractions), max tokens (standard and composition). Higher token limits reduce truncation risk for large tables.

Fund Type Keywords: Configurable keyword lists matched against company names. One keyword per line. Used by the 9-signal detection system.

Filing Matrix: JSON mapping of {fund_type → {domain → [filing_types]}}. First filing type has highest precedence. Controls Smart Fetch plan building.

Signal Priorities: Number fields to reorder the 9 detection signals. Lower number = higher priority. Default: SIC(1) → Filing History(2) → EntityType(3).

7. Troubleshooting

Flask won't start / password prompt hangs: Set debug=False in edgar_loader.py. Flask's reloader spawns a child process that re-triggers password prompts.

Port blocked by browser: Avoid ports 5060/5061 (SIP). The default port 5070 is safe. Change in edgar_config.json → app.port.

EDGAR 403 errors: The SEC rate limiter requires a descriptive User-Agent header. The system sets this automatically. If you still get 403s, increase domain_throttle_seconds in Settings.
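For reference, a request carrying a descriptive User-Agent might be built as in the sketch below. The helper names and the contact string are placeholders, and the use of the data.sec.gov submissions endpoint is an assumption about which EDGAR URL is being called; the loader sets its own header automatically.

```python
from urllib.request import Request


def submissions_url(cik: int) -> str:
    """EDGAR submissions endpoint; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def edgar_request(url: str, contact: str) -> Request:
    """Build a request whose User-Agent identifies the caller (name plus contact email)."""
    return Request(url, headers={"User-Agent": contact})


# Placeholder identity: replace with your own organisation and email
req = edgar_request(submissions_url(1920145), "Example Fund Research admin@example.com")
```

Passing req to urllib.request.urlopen would perform the fetch; spacing requests out (the Domain Throttle setting) helps stay under the SEC rate limit.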

LLM returns truncated JSON: Increase max_tokens in Settings. Default: 16,384 (standard), 32,768 (composition/returns). The system auto-repairs truncated JSON arrays.
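One way such a repair could work is sketched below: decode array elements one at a time and keep everything up to the first element that fails to parse. This is an illustrative approach, not necessarily the loader's own; the function name is hypothetical, and it is intended for arrays of objects (a bare number cut off mid-digit would still parse as a shorter number).

```python
import json


def repair_truncated_array(text: str) -> list:
    """Return the complete elements of a JSON array that may be cut off mid-element."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass  # fall through to element-by-element recovery
    decoder = json.JSONDecoder()
    items = []
    pos = text.find("[")
    if pos == -1:
        return items
    pos += 1
    while pos < len(text):
        # Skip separators between elements
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the truncated tail: drop it
        items.append(item)
    return items
```

Recovered rows are only the ones that arrived whole, so raising max_tokens remains the real fix when tables are consistently cut short.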

Column does not exist errors: Run all DDL migration scripts in the sql/ folder. Also delete Python cache: find . -name '*.pyc' -delete && rm -rf __pycache__

Safari fetch errors: The system uses a safeFetch() wrapper with an XMLHttpRequest fallback for Safari compatibility. If Safari reports "string did not match expected pattern", the wrapper should recover automatically.

Stale config after update: Delete __pycache__/ and restart Flask. Python may cache old .pyc files.

About

Peppercorn EDGAR Data Loader — 34 extraction domains, 508+ field definitions, 6 fund types.
Built by EXF Financial Data Solutions.

Database Connection

PostgreSQL connection settings. Changes take effect immediately.

Host

Port

Database

Schema (PE tables)

Reference Schema (Portfolio/ShareClass)

User

Password

SSL Mode

Anthropic API

API key for LLM-powered data extraction. Stored in edgar_config.json.

API Key

Models & Pricing

LLM models available for extraction. Pricing is per 1M tokens (USD). Update when Anthropic changes pricing or new models are released.

Version

Current version: . Patch auto-increments on every settings save. Edit major/minor manually for releases.

Major

Minor

Patch

Domain Model Overrides

Assign a specific model to individual domains. When "Force Model" is unchecked during extraction, domains listed here use their assigned model instead of the default. Leave blank to use the default model.

ML / Learning from Corrections

When you edit extracted data during review and approve, the corrections are logged. These corrections improve future extractions in two ways: the LLM receives past correction patterns as context (prompt enrichment), and frequent corrections can be auto-applied post-extraction.

Prompt Enrichment: Include past corrections in the LLM prompt
Auto-Apply: Automatically apply frequent corrections post-extraction

Min Frequency (prompt)

Auto-Apply Threshold
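The auto-apply step might look like the following sketch. The pattern store keyed by (domain, field, extracted value), the "corrected" and "freq" keys, and the function name are all assumptions for illustration; extracted values are assumed to be scalars (strings or numbers).

```python
from typing import Dict, List, Tuple


def auto_apply(rows: List[dict], domain: str,
               patterns: Dict[Tuple[str, str, object], dict],
               threshold: int) -> List[dict]:
    """Return copies of rows with frequent known corrections applied."""
    fixed_rows = []
    for row in rows:
        fixed = dict(row)  # copy so the original extraction is left untouched
        for field, value in row.items():
            hit = patterns.get((domain, field, value))
            # Apply only corrections seen at least `threshold` times
            if hit is not None and hit["freq"] >= threshold:
                fixed[field] = hit["corrected"]
        fixed_rows.append(fixed)
    return fixed_rows
```

Corrections below the threshold are left alone here; under Prompt Enrichment they could still be passed to the LLM as context.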

Smart Chunk Patterns

Regex patterns used by Deep Scan to find relevant sections in large filings. For each domain, patterns are tried in order. Each pattern has: pattern (regex), before (chars of context before match), after (chars after match). Edit as JSON.

Format: {"domain_id": [{"pattern": "(?i)regex", "before": 500, "after": 5000}, ...]}. Patterns are case-insensitive only when they start with (?i), so include it in every pattern. Larger after values capture more table content.
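A minimal sketch of how one domain's pattern list could be applied is shown below. It assumes the first matching pattern wins, that before is counted back from the start of the match, and that after is counted forward from the end of the match; the function name smart_chunk is hypothetical.

```python
import re
from typing import List, Optional


def smart_chunk(content: str, patterns: List[dict]) -> Optional[str]:
    """Try patterns in order; return the matched text plus surrounding context."""
    for spec in patterns:
        match = re.search(spec["pattern"], content)
        if match:
            # Clamp the window to the bounds of the filing text
            start = max(0, match.start() - spec["before"])
            end = min(len(content), match.end() + spec["after"])
            return content[start:end]
    return None  # no pattern matched
```

Ordering the list from most specific to most general means a broad fallback pattern is only used when the precise ones miss.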

Relevance Keywords

Keywords used by the Relevance Check to decide whether a filing contains data for a domain before sending it to the LLM. If none of a domain's keywords appear in the filing content (case-insensitive), the extraction is skipped. Edit as JSON — one keyword list per domain.

Format: {"domain_id": ["keyword1", "keyword2", ...]}. Keywords are matched case-insensitively against the first 500K chars of filing content. Add more specific keywords to reduce false skips.
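The relevance check described above amounts to a case-insensitive substring test over a bounded prefix of the filing, roughly as sketched here. The function name is hypothetical, and treating an empty keyword list as "no filter" is an assumption.

```python
from typing import List


def is_relevant(content: str, keywords: List[str], scan_chars: int = 500_000) -> bool:
    """True if any keyword appears (case-insensitively) in the first scan_chars characters."""
    if not keywords:
        return True  # assumption: a domain with no keywords is never skipped
    window = content[:scan_chars].lower()
    return any(keyword.lower() in window for keyword in keywords)
```

A keyword that first appears beyond the 500K-character window does not count, which is one way a long filing can be skipped incorrectly.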

Extraction Settings

Controls for LLM extraction behavior.

Domain Throttle (seconds)

Max Tokens (standard)

Max Tokens (large)

Auto Deep Scan Threshold

Content Limit — Large (chars)

Content Limit — Medium (chars)

Content Limit — Standard (chars)

Throttle: pause between extractions. Max Tokens: LLM output budget (standard vs large domains).
Content Limits: how much filing text to send to the LLM per domain tier. Large=200K, Medium=120K, Standard=60K.
Auto Deep Scan: auto-triggers when filing content ≤ threshold. Default 100000 (~20 pages). Set 0 to disable.
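The trigger rule can be sketched as a small predicate. The function and parameter names are hypothetical; the default of 100000 and the "0 disables" behavior come from the note above.

```python
def should_deep_scan(content_chars: int, threshold: int = 100_000, force: bool = False) -> bool:
    """Decide whether to re-fetch the full filing for deep scan."""
    if force:
        return True  # the Deep Scan checkbox overrides the size check
    if threshold <= 0:
        return False  # a threshold of 0 disables auto deep scan
    return content_chars <= threshold
```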

Domain Tier Assignments

Assign domains to token and content tiers. Domains in "Large Tokens" get Max Tokens (large). Domains in "Large/Medium Content" get expanded content limits for smart chunking. Unassigned domains use standard values.
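How tier membership could translate into limits is sketched below, using the default values stated in this help text (16,384 / 32,768 tokens; 60K / 120K / 200K characters). The function name, the argument shapes, and the domain id in the usage note are hypothetical, and giving Large precedence over Medium when a domain is in both is an assumption.

```python
from typing import Set, Tuple


def limits_for(domain: str, large_tokens: Set[str],
               large_content: Set[str], medium_content: Set[str]) -> Tuple[int, int]:
    """Return (max_tokens, content_limit_chars) for a domain based on its tier assignments."""
    max_tokens = 32_768 if domain in large_tokens else 16_384
    if domain in large_content:
        content_limit = 200_000
    elif domain in medium_content:
        content_limit = 120_000
    else:
        content_limit = 60_000  # unassigned domains use the standard limit
    return max_tokens, content_limit
```

A domain such as a hypothetical "schedule_of_investments" placed in both Large Tokens and Large Content would get (32768, 200000), while an unassigned domain gets (16384, 60000).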

Large Tokens (32K)

Large Content (200K)

Medium Content (120K)

Deep Scan Domains

Domains checked below will re-fetch the full filing from EDGAR. Deep scan activates automatically when content appears truncated (≤ Auto Deep Scan Threshold above), or manually via the Deep Scan checkbox on the Extract page. The checkbox forces deep scan regardless of content length.

Fund Type Detection Keywords

Keywords matched against company name to determine fund type. One keyword per line. Changes take effect immediately.

BDC

Tender Offer (REIT)

Interval Fund

ETF

SIC codes (comma-separated):

BDC SICs

Tender Offer SICs

Interval Fund SICs

Filing Waterfall Matrix

For each fund type and domain, which filing types to fetch in precedence order. Edit as JSON.

Format: {"Fund Type": {"domain_id": ["FilingType1", "FilingType2"]}} — first type has highest precedence.
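To make the precedence rule concrete, the sketch below picks the highest-precedence filing type that is actually available for a fund. The matrix contents (fund type, domain ids, and filing type lists) are illustrative values, not the shipped defaults, and the function name is hypothetical.

```python
import json
from typing import Iterable, Optional

# Illustrative matrix only; real domain ids and orderings live in Settings
FILING_MATRIX = json.loads("""
{
  "BDC": {
    "nav": ["10-Q", "10-K"],
    "distributions": ["8-K", "10-Q"]
  }
}
""")


def pick_filing_type(matrix: dict, fund_type: str, domain: str,
                     available: Iterable[str]) -> Optional[str]:
    """Return the first filing type in precedence order that the fund has filed."""
    available = set(available)
    for filing_type in matrix.get(fund_type, {}).get(domain, []):
        if filing_type in available:
            return filing_type
    return None  # no mapped filing type is available for this domain
```

If a fund has both a 10-Q and a 10-K available, the "nav" entry above resolves to 10-Q because it is listed first.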

N-PORT Class Mapping

Map positional class names (Class_1, Class_2) from N-PORT filings to actual share class names. Only needed for funds where N-PORT XML lacks class identifiers.

Format: {"CIK": {"1": "Class Name", "2": "Class Name"}} — key is the positional number (1, 2, 3...), value is the actual share class name.
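Resolving a positional class name through this mapping might look like the sketch below. The share class names are made-up examples, the function name is hypothetical, and falling back to the positional name when no mapping exists is an assumption.

```python
import re

# Example mapping in the documented format; class names are placeholders
CLASS_MAPPING = {"1920145": {"1": "Class I", "2": "Class S"}}


def resolve_class_name(mapping: dict, cik: int, positional: str) -> str:
    """Map 'Class_N' to the configured share class name, else return it unchanged."""
    match = re.fullmatch(r"Class_(\d+)", positional)
    if match is None:
        return positional  # already a real class name
    return mapping.get(str(cik), {}).get(match.group(1), positional)
```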

Fund Type Detection — Signal Priorities

Lower number = higher priority. The first signal to match wins. Edit and save to change detection behavior.
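The "first signal to match wins" rule can be sketched as follows. The signal names and the shape of the inputs (each signal maps to a detected fund type or None) are assumptions; only three of the nine signals are shown.

```python
from typing import Dict, Optional, Tuple


def detect_fund_type(signals: Dict[str, Optional[str]],
                     priorities: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Return (signal_name, fund_type) for the highest-priority signal that matched."""
    # Lower number = higher priority; signals without a priority are tried last
    ordered = sorted(signals, key=lambda name: priorities.get(name, float("inf")))
    for name in ordered:
        fund_type = signals[name]
        if fund_type is not None:
            return name, fund_type
    return None  # no signal matched
```

With priorities SIC(1), Filing History(2), EntityType(3), a fund whose SIC signal gives no match but whose filing history suggests a BDC is classified by the filing history signal, even if EntityType disagrees.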