Accuracy & Evidence-Completeness Update
A focused accuracy pass on top of 0.10.2
After extensive end-to-end validation against a real ~700K-record Windows case, this update closes every remaining "where did the evidence go?" gap. Every fix below is locked by the 96-test pytest suite and verified by a holistic validation harness that exercises all 7 default wings against both correlation engines.
Identity Engine: All the Evidence
- Tz-comparison bug fixed: the engine was iterating only the FIRST row of every feather when a time filter was active, because a timezone-aware vs naive datetime comparison raised TypeError and aborted the per-row loop. Records seen jumped from 3,558 → 745,615 on the validation case.
- Log records no longer collapse to one identity: every SecurityLogs record used to be extracted as "Microsoft-Windows-Security-Auditing" (the event PROVIDER). The per-artifact mapping now prioritises real entities (User, ComputerName, NewProcessName, TargetUserName) before channel metadata.
- Artifact-aware mapping now actually fires: parsers rarely stamp an artifact column on each row, so the artifact-specific field-priority list never engaged. The engine now falls back to feather_metadata.artifact_type.
- Placeholder strings filtered: "N/A", "Unknown", "-", nil-GUIDs, and 25+ similar variants are now rejected as identities (case-folded match). They were bundling tens of thousands of unrelated records into one bogus bucket.
Real "High vs Low" Labeling
- Identity engine used to tag every match High. Now matches with feather_count == 1 get "Low - single feather", the same convention as the time engine. The analyst's High view focuses on real cross-feather correlation.
- Path-aware composite key removed from primary grouping: each feather stores paths differently (BAM uses /device/harddiskvolume3/…, LNK uses c:/users/…, MFT uses 8.3 short paths). With path in the key, chrome had 10+ different composite keys and never correlated. Primary key is now name-only (canonical identity_grouping.identity_key), same as the time engine. Cross-feather correlation works again.
- Validation impact: Execution Proof surfaces 2,856 cross-feather (High) matches; Account Logon went from 8 → 24 cross-feather; Anti-Forensics from 5 → 40; Lateral Movement from 16 → 40.
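The effect of name-only keying and single-feather labeling can be illustrated with a minimal sketch. The helper names and the sample paths below are made up for illustration and are not the engine's API; only the label strings come from the notes above.

```python
import re
from collections import defaultdict

def name_only_key(path: str) -> str:
    """Reduce any path form to a lower-cased file name (hypothetical helper)."""
    return re.split(r"[\\/]", path.strip())[-1].lower()

def label_matches(records):
    """records: iterable of (feather_name, path). Returns {identity_key: label}."""
    feathers_by_key = defaultdict(set)
    for feather, path in records:
        feathers_by_key[name_only_key(path)].add(feather)
    return {
        key: "High" if len(feathers) > 1 else "Low - single feather"
        for key, feathers in feathers_by_key.items()
    }

# Illustrative sample data: the same binary seen under different path styles.
records = [
    ("BAM", "/device/harddiskvolume3/program files/google/chrome.exe"),
    ("LNK", "c:/users/alice/desktop/Chrome.exe"),
    ("Prefetch", r"C:\Windows\notepad.exe"),
]
labels = label_matches(records)
```

With the path dropped from the key, both chrome records land in one bucket and count as cross-feather, while the lone notepad record is labeled as single-feather.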
Impersonation Detection
- Path-based impersonation flag replaces what path-aware keying was supposed to provide. After a match is formed, the engine walks all records' path-like fields and classifies each as TRUSTED (Program Files, Windows\System32, WinSxS, WindowsApps, ProgramData\Microsoft, BAM/SRUM /device/harddiskvolumeN/program files/…) or SUSPICIOUS (Temp, Downloads, Public, AppData\Local\Temp, $Recycle.Bin, removable-drive roots, network shares).
- A match with records in BOTH classifications gets match.semantic_data['impersonation_alert']. On the validation case: ~50 alerts per wing (≈ 0.05% rate), each a real candidate (python.exe in venv + Program Files, Discord's official install + AppData copy, etc.).
- SUSPICIOUS wins on overlap: a path like c:\windows\temp\… won't double-count just because c:\windows is a trusted prefix.
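A rough sketch of the classification order described above, assuming simple substring matching on a lower-cased path. The marker tuples are abbreviated from the lists in the text, and the function names are hypothetical rather than the engine's own.

```python
# Abbreviated marker lists; substring matching is an assumption of this sketch.
SUSPICIOUS_MARKERS = ("\\temp\\", "\\downloads\\", "\\public\\", "\\$recycle.bin\\")
TRUSTED_MARKERS = ("\\program files\\", "\\windows\\", "\\windowsapps\\")

def classify_path(path: str) -> str:
    p = path.lower().replace("/", "\\")
    if any(m in p for m in SUSPICIOUS_MARKERS):  # checked first, so SUSPICIOUS wins
        return "SUSPICIOUS"
    if any(m in p for m in TRUSTED_MARKERS):
        return "TRUSTED"
    return "UNCLASSIFIED"

def has_impersonation_alert(paths) -> bool:
    """True when one match holds both TRUSTED and SUSPICIOUS locations."""
    classes = {classify_path(p) for p in paths}
    return {"TRUSTED", "SUSPICIOUS"} <= classes

# Illustrative paths for a single match (same file name, two locations).
match_paths = [
    r"C:\Program Files\Discord\Discord.exe",
    r"C:\Users\Public\Discord.exe",
]
alert = has_impersonation_alert(match_paths)
```

Checking the suspicious markers before the trusted ones is what keeps a path under c:\windows\temp from being counted as trusted.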
Engine Evidence Accounting
- Per-window drop ledger: named buckets (no_identity_field, normalize_failure, below_threshold_skipped, etc.) carry the first 3 sample identities so logs name what was lost, not just how many.
- Per-pipeline summary block: every correlate(…) call ends with records seen, high-conf emitted, low-conf emitted, no-identity drops, and timeless-feather records joined by identity. Every record either lands in a match or in a named drop bucket.
- low_confidence_review_mode is now ON by default. Below-threshold identity groups become Low-confidence matches instead of being silently dropped.
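One way such a drop ledger could be structured is sketched below. The class and method names are hypothetical; only the bucket names and the "first 3 samples" behaviour are taken from the notes above.

```python
from collections import defaultdict

class DropLedger:
    """Counts dropped records per named bucket and keeps the first few samples."""

    def __init__(self, max_samples: int = 3):
        self.max_samples = max_samples
        self.counts = defaultdict(int)
        self.samples = defaultdict(list)

    def drop(self, bucket: str, identity) -> None:
        self.counts[bucket] += 1
        if len(self.samples[bucket]) < self.max_samples:
            self.samples[bucket].append(identity)

    def summary(self) -> dict:
        """Return {bucket: (count, first_samples)} for logging."""
        return {b: (self.counts[b], list(self.samples[b])) for b in self.counts}

ledger = DropLedger()
for ident in ["a.exe", "b.exe", "c.exe", "d.exe"]:
    ledger.drop("below_threshold_skipped", ident)
ledger.drop("no_identity_field", None)
report = ledger.summary()
```

Keeping a bounded sample list per bucket lets the log name concrete lost identities without growing with the number of drops.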
Timeless-Feather Identity Enrichment
- Feathers without per-row timestamps (AutoStartPrograms, MUICache, SystemServices, TypedPaths) no longer get a synthetic feather-generation timestamp stamped onto every row (which used to lie about when the artifact was observed).
- Instead, after time-windowed matches are formed by the timed feathers, the engine walks each match's identity and pulls matching records from every timeless feather, attaching them as supplementary evidence. Identity drives the join; the time anchor comes from the timed feathers.
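The identity-driven join can be pictured with the following sketch. The record layout (dicts with an "identity" key) and the function name are assumptions made for illustration, not the engine's data model.

```python
def enrich_with_timeless(matches, timeless_feathers):
    """Attach timeless records whose identity equals a match's identity.

    matches: list of dicts with an "identity" key (built from timed feathers).
    timeless_feathers: {feather_name: [record dicts with an "identity" key]}.
    """
    index = {}
    for feather, rows in timeless_feathers.items():
        for row in rows:
            index.setdefault(row["identity"], []).append((feather, row))
    for match in matches:
        # No timestamp is invented for these rows; they ride along as extra evidence.
        match["supplementary"] = index.get(match["identity"], [])
    return matches

matches = [{"identity": "chrome.exe", "anchor_time": "2026-05-19T10:48:21"}]
timeless = {
    "MUICache": [{"identity": "chrome.exe", "friendly_name": "Google Chrome"}],
    "TypedPaths": [{"identity": "notes.txt"}],
}
enriched = enrich_with_timeless(matches, timeless)
```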
Consolidated Identity Registry
- New config/standard_fields/identities.json: single source of truth for every column the engines + EYE Agent should consult to extract or track an identity. 98 categories, 1,146 column synonyms covering app/process, file, hash, user, host/device, network, registry, service/task, event, email, browser, cloud (AWS / Azure / GCP), Windows internals (mutex, named pipe, COM CLSID, ETW provider, WMI consumer), certificate, container, and OS objects.
- Engine's timestamp-field and path-field lists now read directly from the JSON registries instead of hand-maintained Python lists. Adding a new parser column synonym is a JSON edit, not a code change.
- timestamps.json cleaned up: real MFT activity timestamps (si_creation_time, si_modification_time, si_mft_entry_change_time, etc.) moved out of the bookkeeping category where they were wrongly classified.
Semantic Mapping False-Positive Fixes
- default-data-exfiltration-pattern: was an OR rule across 3 wildcard conditions on different feathers, so it fired severity=high on any single LNK record. Now properly gated with _requires_multi_indicator=True, _min_indicators=2. The previously-dead multi-indicator fields are now actually honored.
- auth_failed_then_success: was logically impossible (AND on EventID == 4625 AND EventID == 4624; a single record can equal only one value, so the rule never fired). Rewritten as OR with clearer semantic value.
- Wiper / Remote tool rules described specific tools (sdelete.exe, cipher.exe etc.) but had wildcard conditions, so every Prefetch entry got tagged "Wiper Tool Run" at severity=high. Now use proper regex matching the actual tool names.
- Severity inflation fixes on exec_confirmed_run, pers_run_key, pers_service_install, pers_confirmed_runs: baseline-activity rules demoted from high/critical to info/low. The wing's weighted scoring layer escalates real threats via score thresholds.
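Two of the fixes above are easy to see in miniature. The sketch below is not the rule engine itself: rule_fires, the lambda conditions, and the feather names used in them are hypothetical stand-ins for a multi-indicator gate and for the impossible AND condition.

```python
def rule_fires(records, conditions, min_indicators=2):
    """True when at least min_indicators distinct conditions match some record."""
    hits = sum(1 for cond in conditions if any(cond(r) for r in records))
    return hits >= min_indicators

# Three indicator conditions on different feathers (feather names are illustrative).
conditions = [
    lambda r: r["feather"] == "LNK",
    lambda r: r["feather"] == "USB",
    lambda r: r["feather"] == "BrowserHistory",
]
single_lnk = [{"feather": "LNK"}]
lnk_plus_usb = single_lnk + [{"feather": "USB"}]
single_lnk_fires = rule_fires(single_lnk, conditions)   # gated: one indicator only
two_indicators_fire = rule_fires(lnk_plus_usb, conditions)

# The impossible AND: no single record has EventID equal to both values.
logons = [{"EventID": 4625}, {"EventID": 4624}]
and_hits = [r for r in logons if r["EventID"] == 4625 and r["EventID"] == 4624]
or_hits = [r for r in logons if r["EventID"] in (4625, 4624)]
```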
Time-Engine Accuracy Layer
- Method 0 schema-driven timestamp detection from feather_schemas.json. Feathers no longer rely on auto-detection that could pick integer cycle-counters (foreground_cycle_time) as Unix-epoch timestamps.
- Per-column WHERE-clause format binding. Heterogeneous-format feathers (amcache MM/DD/YYYY install_date + ISO parsed_at) used to silently return 0 rows because the engine used one global format. Now each column's WHERE parameters use that column's specific format.
- Multi-column timestamp fan-out for separate-column timestamp feathers (mft_usn: si_creation_time + si_modification_time + si_access_time + 6 more). A single row with timestamps in multiple windows becomes a virtual record in each. mft_usn went from 2 records returned to 913,809 virtual records.
- Year 1990–2100 plausibility filter on parsed timestamps drops NTFS / LNK placeholder dates (1601-01-01, 1980-01-01, 2000-01-01) before they spawn lonely virtual records.
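The multi-column fan-out combined with a year-range plausibility check might look roughly like this. The function name, the "_ts" / "_ts_column" keys, and the inclusive bounds are assumptions of the sketch; the column names come from the notes above.

```python
from datetime import datetime

MIN_YEAR, MAX_YEAR = 1990, 2100  # bounds from the notes; inclusive by assumption

def fan_out(row: dict, timestamp_columns):
    """Yield one virtual record per plausible timestamp column in the row."""
    for col in timestamp_columns:
        raw = row.get(col)
        if not raw:
            continue
        ts = datetime.fromisoformat(raw)
        if MIN_YEAR <= ts.year <= MAX_YEAR:
            yield {**row, "_ts": ts, "_ts_column": col}

row = {
    "name": "report.docx",
    "si_creation_time": "2024-03-01T09:00:00",
    "si_modification_time": "1601-01-01T00:00:00",  # NTFS placeholder, filtered out
    "si_access_time": "2024-03-02T10:30:00",
}
columns = ["si_creation_time", "si_modification_time", "si_access_time"]
virtual_records = list(fan_out(row, columns))
```

Each surviving timestamp produces its own virtual record, so one row can participate in several time windows.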
Validation Numbers (Real Case)
On a real ~700K-record Windows case, the identity engine processes 745,615 records (was 3,558 before the tz-comparison fix) and surfaces 78–99K matches per wing. Cross-feather High-confidence matches range 24–2,856 per wing depending on the wing's feather mix. Account Logon: 24 High; Anti-Forensics: 40 High; Execution Proof: 2,856 High; Lateral Movement: 40 High; Persistence: 118 High; USB: 51 High; User Activity: 42 High. Time engine baselines unchanged (6–643 High per wing). All sanity flags clear.
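The tz-comparison failure behind the 3,558 → 745,615 jump is straightforward to reproduce: Python raises TypeError when ordering a timezone-aware datetime against a naive one. The sketch below shows one possible guard, assuming naive timestamps should be read as UTC; to_utc_aware and in_time_filter are hypothetical names, not the engine's functions.

```python
from datetime import datetime, timezone

def to_utc_aware(ts: datetime) -> datetime:
    """Return a timezone-aware UTC datetime; naive inputs are ASSUMED to be UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)

def in_time_filter(ts: datetime, start: datetime, end: datetime) -> bool:
    """Normalize all three values first so mixed inputs never raise."""
    ts, start, end = (to_utc_aware(v) for v in (ts, start, end))
    return start <= ts <= end

naive = datetime(2026, 5, 19, 10, 48, 21)
aware_start = datetime(2026, 5, 19, 0, 0, tzinfo=timezone.utc)
aware_end = datetime(2026, 5, 20, 0, 0, tzinfo=timezone.utc)

try:
    naive < aware_start  # the failure mode: naive vs aware ordering
    raised = False
except TypeError:
    raised = True

inside = in_time_filter(naive, aware_start, aware_end)
```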
What's New in 0.10.2
Reliability & Extensibility Release
0.10.2 is a deep reliability + extensibility pass that closes the most common "where did my evidence go?" gaps and lays the groundwork for parallel correlation.
No Dropped Evidence
- Multi-timestamp fan-out: Prefetch's run_times JSON list (up to 8 historical executions per row) is expanded into one virtual correlation event per timestamp. Real cases see ~4× more execution events surfaced than the previous engine.
- Tolerant timestamp parser: Windows FILETIME, YYYYMMDD compact dates, trailing parser annotations like "2026-05-19 10:48:21 (Registry Key LastWrite)" all parse on the first try.
- Duplicates preserved as evidence: Dedup keys on the raw identity, not the normalized form. Chrome.exe and Chrome.dll no longer collapse into one row.
- 290+ identity-field synonyms recognized: MFT, BAM, USB, BrowserHistory, Shellbags, RecycleBin, Network_list, and UserProfiles records that previously had "no identity" now form matches.
Smarter Sub-Identity Grouping
- Chrome.exe / chrome.exe / Chrome.EXE / chrome.dll → one sub-identity (case + extension are noise).
- Chrome v1.0.exe ≠ Chrome v2.0.exe (versions are signal).
- Chrome (x86) ≠ Chrome (x64) (architectural qualifiers are signal).
- Each sub-identity exposes its full set of name variants in the UI.
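The grouping rules above can be approximated with a short normalization function. This is a sketch under assumptions: the extension list is a guess, and sub_identity_key / group_variants are hypothetical names rather than the shared normalization module.

```python
import re
from collections import defaultdict

# Only a known binary extension is stripped, so "v1.0" keeps its ".0".
_EXTENSION = re.compile(r"\.(exe|dll|sys|lnk)$", re.IGNORECASE)  # assumed list

def sub_identity_key(name: str) -> str:
    """Case and a trailing extension are noise; everything else is signal."""
    return _EXTENSION.sub("", name.strip()).lower()

def group_variants(names):
    """Map each sub-identity key to the set of raw name variants seen."""
    groups = defaultdict(set)
    for name in names:
        groups[sub_identity_key(name)].add(name)
    return dict(groups)

names = [
    "Chrome.exe", "chrome.exe", "Chrome.EXE", "chrome.dll",
    "Chrome v1.0.exe", "Chrome v2.0.exe", "Chrome (x86)", "Chrome (x64)",
]
groups = group_variants(names)
```

Stripping only listed extensions, instead of everything after the last dot, is what keeps version suffixes such as v1.0 intact.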
One Source of Truth
- Field synonyms live in config/standard_fields/*.json. Adding a new parser column means editing JSON, not three Python files.
- Per-table metadata (primary timestamp column, multi-timestamp JSON columns, identity priority) lives in feather_schemas.json.
- Identity normalization unified in one module: engine, viewers, and semantic phase share one implementation.
Performance Foundation
- Thread-safe feather query cache (RLock-protected). Path to true parallel correlation is unblocked.
- Streaming reads via query_time_range_iter(): yields records in 1000-row batches, constant memory across feathers of any size.
- parallel_strategy config field ready for opt-in process-pool parallelism ('none' | 'threads' | 'processes').
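Batch streaming of this kind is commonly built on sqlite3's fetchmany. The generator below borrows the name query_time_range_iter from the notes, but its signature, the table layout, and the integer timestamps are assumptions made for the sketch.

```python
import sqlite3

def query_time_range_iter(conn, table, ts_column, start, end, batch_size=1000):
    """Yield lists of rows inside [start, end], batch_size rows at a time."""
    cur = conn.execute(
        f'SELECT * FROM "{table}" WHERE "{ts_column}" BETWEEN ? AND ? '
        f'ORDER BY "{ts_column}"',
        (start, end),
    )
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch  # only one batch is held in memory at a time

# Demo feather: 2,500 rows with integer timestamps (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", ((f"rec{i}", i) for i in range(2500))
)
batch_sizes = [len(b) for b in query_time_range_iter(conn, "events", "ts", 0, 2499)]
```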
Better Feather Generation
- FeatherWriter: canonical write contract with WAL journal, explicit transactions, and executemany() batched inserts (5000-row batches). Expected 50–200× speedup on large imports.
- Schema metadata stamped into the feather DB itself; engine reads it instead of sniffing sample values.
- Multi-timestamp JSON columns declarable via one writer.declare_multi_timestamp_json(...) call, with zero engine-side changes.
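The write pattern described above (WAL journal, one transaction per batch, executemany) can be sketched with the standard sqlite3 module. BatchedWriter is a hypothetical stand-in with an invented two-column table; it is not the real FeatherWriter API.

```python
import os
import sqlite3
import tempfile

class BatchedWriter:
    """Hypothetical stand-in for a batched writer; not the real FeatherWriter."""

    def __init__(self, path: str, batch_size: int = 5000):
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, ts TEXT)")
        self.batch_size = batch_size
        self._pending = []

    def add(self, name: str, ts: str) -> None:
        self._pending.append((name, ts))
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            with self.conn:  # commits on success, rolls back on error
                self.conn.executemany(
                    "INSERT INTO records VALUES (?, ?)", self._pending
                )
            self._pending.clear()

with tempfile.TemporaryDirectory() as tmp:
    writer = BatchedWriter(os.path.join(tmp, "demo_feather.db"))
    for i in range(12000):
        writer.add(f"app{i}.exe", "2026-05-19T10:48:21")
    writer.flush()  # write the final partial batch
    row_count = writer.conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    writer.conn.close()
```

Grouping inserts into one transaction per batch avoids a commit per row, which is typically where most of the per-row cost goes.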
Diagnostics You Can Trust
- Per-window INFO line: records_in / no_identity / parse_cache_hits / below_threshold / matches_emitted / min_feathers. If matches drop, the log says exactly where.
- _last_window_correlation_stats exposes the same data programmatically for downstream tooling.
- low_confidence_review_mode + min_feathers_override let analysts dial correlation strictness per case.
Locked-In Quality
A 96-test pytest regression suite covers the timestamp parser, identity normalization, multi-timestamp fan-out, feather schemas, wing loader, the writer contract, Eye-Agent authoring (GEP rules 9/10/11), and the standard-fields registry. Every fix above is locked in — future refactors can't regress what 0.10.2 delivers.
Introduction
Built by Investigators, for Investigators
The Crow Eye Correlation Engine is designed to revolutionize the forensic investigation process by making it faster, more accurate, and more focused. Born from the forensic community's need for efficient artifact correlation, this open-source solution enables investigators to quickly identify the most critical information and answer the most pressing questions in their cases.
Speed & Efficiency
The dual-engine architecture is optimized for performance. The Identity-Based engine processes large datasets with O(N log N) complexity and a streaming mode, which allows analysis of millions of records with constant memory usage. Investigators get answers quickly when time is critical.
Accuracy & Focus
By automatically correlating artifacts across multiple sources (Prefetch, Registry, Event Logs, MFT, SRUM, and more), the engine helps investigators focus on the most needed details. Instead of manually searching through thousands of records, investigators can immediately see relationships and patterns that matter to their case.
Community-Driven
From the community, to the community. Crow Eye is built as an open-source solution by forensic investigators who understand the challenges of digital forensics. The correlation engine is actively developed with contributions from the forensic community, ensuring it evolves to meet real-world investigation needs.
Mission: Open-Source Forensic Excellence
Crow Eye aims to be the open-source solution for analyzing and correlating forensic data. Our goal is to provide investigators with engines that deliver the most critical information as quickly as possible, enabling them to:
- Answer key questions faster: What happened? When? Who was involved?
- Identify critical evidence: Focus on the artifacts that matter most to your case
- Correlate across artifacts: See relationships between Prefetch, Registry, Event Logs, and more
- Scale to any dataset: From small targeted investigations to enterprise-wide incidents
- Contribute and improve: Join the community in building better forensic engines
Purpose of This Document
This comprehensive documentation serves as the main entry point for developers and contributors who want to understand, modify, or extend the correlation engine system. Whether you're investigating a case or contributing code, this guide will help you leverage the full power of the Correlation Engine.
- Understand the System: Get a high-level view of how all components work together
- Navigate the Codebase: Find the right files to modify for specific tasks
- Visualize Architecture: See diagrams that illustrate system structure and data flow
- Learn Core Concepts: Master anchors, time windows, feathers, wings, and engines
- Contribute: Join the community in making forensics faster and more accessible
Who Should Read This
- Forensic Investigators: Using Crow Eye for case analysis and wanting to understand correlation capabilities
- Developers: New to the Crow Eye project and wanting to contribute
- Contributors: Adding new features, engines, or artifact support
- Maintainers: Debugging issues and optimizing performance
- Community Members: Anyone passionate about open-source forensic engines
Active Development: The Correlation Engine is continuously evolving with new features, optimizations, and community contributions. Check the GitHub repository for the latest updates and join us in building the future of open-source forensics.
What is the Correlation Engine?
The Correlation Engine is a forensic analysis system that finds temporal and semantic relationships between different types of forensic artifacts. It helps investigators discover connections between events that occurred on a system by correlating data from multiple sources.
You can also drive the Correlation Engine conversationally with Eye AI: Crow-Eye's forensic AI assistant queries correlation results by time and identity and can author new Wings and Semantic Mappings on your behalf.
The system implements a dual-engine architecture with two distinct correlation strategies:
1. Identity-Based Correlation Engine
Groups records by identity first, then creates temporal anchors. Optimized for large datasets (> 1,000 records) with O(N log N) performance and streaming support. Status: Production Ready
2. Time-Based Correlation Engine
Uses an O(N log N) time-window scanning approach for systematic temporal analysis. Ideal for large datasets (> 1,000 records) requiring performance-critical analysis. Status: Production Ready
Key Capabilities
- Dual-Engine Architecture: Choose between Identity-Based (groups by identity first) and Time-Based (systematically scans time windows) engines, both offering O(N log N) performance, based on analysis goals
- Engine Selection: Automatic or manual engine selection via EngineSelector factory
- Temporal Correlation: Find events that occurred within a specified time window
- Identity Tracking: Track applications and files across multiple artifacts (Identity-Based engine)
- Smart Sub-Identity Grouping (0.10.2): Case + extension variants collapse to one bucket; version numbers and architectural qualifiers stay distinct
- Multi-Artifact Support: Correlate data from Prefetch, ShimCache, AmCache, Event Logs, LNK files, Jumplists, MFT, USN, SRUM, Registry, RecycleBin, and more
- Multi-Timestamp Fan-Out (0.10.2): JSON timestamp lists (Prefetch run_times) expanded so every execution gets correlated, not just the latest
- Tolerant Timestamps (0.10.2): FILETIME, Unix s/ms/µs, ISO 8601 (with or without Z), YYYYMMDD, US slash dates, and annotated strings all classified on first try
- Standard Fields Registry (0.10.2): One JSON edit to add a new parser column, with no code changes in three places
- Feather Schemas (0.10.2): Per-table metadata (primary timestamp, multi-timestamp JSON columns, identity priority) declared in feather_schemas.json
- Flexible Rules: Define custom correlation rules (Wings) with configurable parameters
- Semantic Mapping: Map different column names to common semantic meanings
- Honest Diagnostics (0.10.2): Per-window stats line (records_in / no_identity / parse_cache_hits / below_threshold / matches_emitted) so you always know if evidence was dropped
- Tunable Strictness (0.10.2): min_feathers_override + low_confidence_review_mode let analysts dial correlation thresholds per case
- Weighted Scoring: Calculate confidence scores based on multiple factors
- Streaming Mode: Process millions of records with constant memory usage (Identity-Based engine + new query_time_range_iter generator)
- FeatherWriter (0.10.2): Transactional batched inserts (50–200× faster than per-row), WAL journal, schema metadata stamping
- Thread-Safe Caches (0.10.2): RLock-protected feather query cache; foundation for opt-in process-pool parallelism
- Pipeline Automation: Execute complete analysis workflows automatically
- Visual Interface: GUI for building pipelines, viewing results, and exploring timelines
- Test Coverage (0.10.2): 96-test pytest regression suite locks in every fix — future refactors can't silently regress
Core Concepts
Understanding these fundamental concepts is essential for working with the Correlation Engine:
Feather
A data normalization system that accepts various input formats (CSV, JSON, or any forensic engine output) and converts them into a standardized SQLite database. Feathers normalize diverse data sources into a consistent schema, making them ready to serve as input for the correlation engine. Each feather represents a single artifact type (e.g., Prefetch, Registry, Event Logs) with standardized column names and timestamp formats.
Wing
A comprehensive configuration that defines the complete correlation workflow. Wings specify: (1) which feathers to use as input for correlation, (2) the correlation rules and matching criteria, (3) the time window (anchor range) for temporal correlation, (4) semantic mappings to translate different field names to common meanings (e.g., mapping "ExecutableName", "ProcessName", "AppName" to a unified "application" concept), and (5) filters to narrow the dataset. Wings are the control center that orchestrates how the correlation engine processes data.
Engine
The correlation strategy (Time-Based or Identity-Based) used to find relationships between forensic artifacts. Each engine implements a different algorithmic approach optimized for specific use cases and dataset sizes.
Anchor
A record from one feather that serves as the starting point for finding correlations. In the Time-Based engine, each record with a valid timestamp becomes an anchor. In the Identity-Based engine, anchors are temporal clusters of evidence grouped by identity. Both engines use the same time window concept to find related records.
Time Window
The temporal range used to find related events around an anchor or within a scanning window. Default is 180 minutes, meaning events within this period are considered potentially related. This value can be customized in the Wing configuration to match investigation needs (e.g., 1 minute for precise correlations, 30 minutes for broader analysis).
Semantic Mapping
A translation layer defined in Wings that maps different column names from various artifacts to common semantic meanings. For example, "ExecutableName" (Prefetch), "ProcessName" (Event Logs), and "AppName" (Registry) can all be mapped to "application", enabling the engine to correlate records across different artifact types even when they use different field names.
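As a minimal illustration of the idea, the sketch below maps the three column names from the example onto one semantic field. The dictionary layout and the to_semantic helper are assumptions; the actual Wing format may differ.

```python
# Column names come from the example above; the mapping structure is assumed.
SEMANTIC_MAP = {
    "ExecutableName": "application",  # Prefetch
    "ProcessName": "application",     # Event Logs
    "AppName": "application",         # Registry
}

def to_semantic(record: dict) -> dict:
    """Rename mapped columns to their semantic name; leave other columns as-is."""
    return {SEMANTIC_MAP.get(col, col): value for col, value in record.items()}

prefetch_row = {"ExecutableName": "chrome.exe", "RunCount": 12}
eventlog_row = {"ProcessName": "chrome.exe", "EventID": 4688}

# After mapping, both rows expose the same "application" field and can be compared.
same_app = (
    to_semantic(prefetch_row)["application"]
    == to_semantic(eventlog_row)["application"]
)
```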
Identity
A normalized representation of an application, file, or entity across artifacts (Identity-Based engine). Identities enable tracking the same entity across different forensic sources, even when represented with different names or formats.
Match
A set of temporally-related records from different feathers that have been correlated by the engine. Each match includes a confidence score indicating the strength of the correlation based on temporal proximity, field matches, and semantic similarity.
Pipeline
An automated workflow that creates feathers and executes wings in a coordinated sequence. Pipelines orchestrate the entire correlation process from data ingestion to result generation, enabling repeatable and automated forensic analysis.
Streaming Mode
Memory-efficient processing that writes results directly to database (Identity-Based engine). Streaming mode enables processing of millions of records with constant memory usage, making it suitable for large-scale forensic investigations.
Architecture Diagrams
System Architecture
The Correlation Engine is organized into 7 major subsystems:
- Engine Core
- Feather System
- Wings System
- Configuration
- Pipeline
- GUI Components
- Integration
Color Legend
- Red: Engine Core (correlation logic)
- Blue: Feather System (data normalization)
- Green: Wings System (rule definitions)
- Orange: Configuration (settings management)
- Purple: Pipeline (workflow orchestration)
- Yellow: GUI (user interface)
- Magenta: Integration (Crow-Eye bridge)
Data Flow
This diagram shows how forensic data flows through the system from source to results:
Data Flow Through Correlation Engine
1. User Configuration: Define correlation rules, select feathers, set time window
2. Load Pipeline Config: Load feather configs and wing configs
3. Create Feathers: Transform source data to normalized format
4. Store Normalized Data: Create SQLite databases with metadata
5. Execute Wing: Pass wing config and feather paths to engine
6. Load Feather Data: Query records from each feather database
7. Correlate by Time or Identity: Depending on the engine used, the Time-Based engine collects anchors and finds temporal matches, while the Identity-Based engine groups by identity and creates temporal clusters
8. Return Results: Return matches with confidence scores
9. Display Results: Format and present correlation matches in GUI
Dependency Graph
This diagram shows how the major directories depend on each other:
Dependency Rules
- config/: No dependencies on other correlation_engine modules (base layer)
- wings/: Depends only on config/
- feather/: Depends only on config/
- engine/: Depends on feather/, wings/, config/
- pipeline/: Depends on engine/, config/, wings/
- gui/: Depends on engine/, pipeline/, config/, wings/
- integration/: Depends on all other modules (top layer)
Dual-Engine Architecture
Why Multiple Engines?
Forensic investigations vary dramatically in scope, data volume, and analytical priorities. A single correlation approach cannot efficiently handle all scenarios. The Crow Eye Correlation Engine implements a modular, multi-engine architecture where each engine is optimized for specific investigation priorities and dataset characteristics.
Design Philosophy
- Case-Driven Selection: Different investigations require different correlation strategies. A targeted malware analysis needs identity tracking, while a timeline reconstruction needs comprehensive temporal analysis.
- Performance Optimization: Both engines run at O(N log N) complexity: Identity-Based via per-identity bucketing, Time-Based via systematic time-window scanning with indexed timestamp queries. The choice between them is about what question you're asking (time-bucketed vs identity-bucketed), not about dataset size.
- Scalability: The architecture allows adding new engines without modifying existing code, following the Open/Closed Principle.
- Future Extensibility: Additional engines can focus on specific forensic aspects like network correlation, user behavior analysis, or file system relationships.
Current Engines
The system currently implements two engines, with more planned for future releases:
Identity-Based Engine
Status: Production Ready
Priority: Track specific applications, files, or entities across the entire forensic timeline
Best For: Malware analysis, application tracking, large datasets (>1,000 records)
Time-Based Engine
Status: Production Ready
Priority: Comprehensive temporal analysis with detailed field-level matching
Best For: Timeline reconstruction, detailed investigations, large datasets (>1,000 records)
The Theory: Why Two Separate Engines?
The decision to implement two distinct engines rather than a single unified approach stems from fundamental differences in correlation strategies and their computational trade-offs. Each engine represents a different theoretical approach to the correlation problem:
Identity-Based Engine Theory
Core Principle: "Group first, then correlate within groups"
This engine operates on the principle that forensic artifacts related to the same entity (application, file, user) should be grouped together before temporal analysis. By normalizing identities across different artifact types and creating identity-based clusters, the engine reduces the correlation search space dramatically.
Algorithmic Approach:
- Identity Extraction: Normalize all application names, file paths, and hashes across artifacts using semantic mappings
- Identity Grouping: Create buckets where each bucket contains all evidence for a single identity (e.g., all "chrome.exe" evidence)
- Temporal Clustering: Within each identity group, create temporal anchors representing time-based clusters of activity
- Cross-Artifact Linking: Link evidence from different sources (Prefetch, Registry, Event Logs) within the same identity group
Why O(N log N)? By grouping first, the engine only compares records within the same identity group. If N records are distributed across M identities, the engine performs M smaller sorts, each costing O((N/M) log(N/M)) when groups are evenly sized, instead of one O(N²) all-pairs comparison. Summed over all groups this is at most O(N log N), so the sorting and grouping operations dominate.
Memory Efficiency: The streaming mode processes one identity group at a time, writing results directly to the database. This maintains constant memory usage regardless of dataset size.
Trade-off: While highly efficient for identity tracking, this approach may naturally deemphasize or entirely miss direct temporal correlations between disparate artifacts that do not share a common identity.
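The "group first, then correlate within groups" steps can be condensed into a short sketch. It assumes records are (identity, feather, timestamp-in-seconds) tuples and that a cluster spans at most window_seconds from its first record; neither detail is confirmed by the engine's source.

```python
from collections import defaultdict

def correlate_by_identity(records, window_seconds):
    """records: iterable of (identity, feather, timestamp_seconds).

    Returns {identity: [cluster, ...]}, each cluster a list of (timestamp, feather).
    """
    groups = defaultdict(list)
    for identity, feather, ts in records:      # identity grouping
        groups[identity].append((ts, feather))

    clusters = {}
    for identity, items in groups.items():
        items.sort()                           # per-group sort dominates the cost
        current, out = [items[0]], []
        for item in items[1:]:
            if item[0] - current[0][0] <= window_seconds:
                current.append(item)           # still inside the anchor's window
            else:
                out.append(current)
                current = [item]               # start a new temporal anchor
        out.append(current)
        clusters[identity] = out
    return clusters

records = [
    ("chrome.exe", "Prefetch", 100),
    ("chrome.exe", "LNK", 160),
    ("chrome.exe", "BAM", 5000),
    ("notepad.exe", "Prefetch", 120),
]
clusters = correlate_by_identity(records, window_seconds=300)
```

Records for different identities are never compared with each other, which is both the source of the speed-up and the reason cross-identity temporal links can be missed.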
Time-Based Engine Theory
Core Principle: "Systematic scanning of temporal windows for efficient correlation"
This engine operates on the principle of systematically scanning through predefined time windows across the entire dataset. It leverages optimized database indexing and query techniques to achieve O(N log N) scalability, focusing on all records within a given temporal segment.
Algorithmic Approach:
- Automatic Time Range Detection: Dynamically determines the earliest and latest timestamps across all relevant feathers.
- Window Generation: Divides the overall time span into fixed-size windows (e.g., 5-minute intervals), typically starting from a predefined epoch (e.g., year 2000).
- Indexed Window Querying: For each window, queries all feathers for records falling within that specific time range. This step is highly optimized by pre-built timestamp indexes.
- In-Window Correlation: Performs comprehensive field-level matching and semantic rule evaluation on the records retrieved within the current window.
- Streaming Results: Correlated matches are processed and written directly to a database, ensuring memory efficiency and scalability.
Why O(N log N)? The engine systematically scans through the entire dataset in discrete time windows. Within each window, it efficiently queries and processes records. The "log N" factor arises primarily from optimized database indexing and sorting operations performed to gather and prepare data within these windows. This approach avoids the exhaustive pairwise comparison of O(N²) by focusing on localized, indexed operations, making it highly efficient for large datasets and achieving near-linear scalability.
Memory Usage: Leverages streaming mode where only records within the current time window are loaded into memory, resulting in constant memory usage regardless of the total dataset size.
Trade-off: This approach is optimized for comprehensive temporal coverage and is highly efficient for large datasets. However, it prioritizes systematic time-window analysis over explicit identity grouping, which means identity relationships are established within time windows rather than being the primary grouping mechanism.
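A compact sketch of the window-scanning steps, using in-memory SQLite tables with a timestamp index. For brevity it starts windows at the earliest timestamp instead of a fixed epoch, uses integer seconds, and reports only windows where more than one feather has records; the table layout and function names are assumptions.

```python
import sqlite3

def generate_windows(start, end, size):
    """Yield fixed-size [lo, hi) windows covering start..end."""
    lo = start
    while lo <= end:
        yield lo, lo + size
        lo += size

def scan_windows(conn, tables, window_size):
    """Yield ((lo, hi), {table: rows}) for windows with records in 2+ feathers."""
    bounds = [
        conn.execute(f'SELECT MIN(ts), MAX(ts) FROM "{t}"').fetchone() for t in tables
    ]
    start = min(lo for lo, _ in bounds)
    end = max(hi for _, hi in bounds)
    for lo, hi in generate_windows(start, end, window_size):
        present = {}
        for t in tables:
            rows = conn.execute(
                f'SELECT name FROM "{t}" WHERE ts >= ? AND ts < ?', (lo, hi)
            ).fetchall()  # served by the ts index
            if rows:
                present[t] = rows
        if len(present) > 1:
            yield (lo, hi), present

conn = sqlite3.connect(":memory:")
sample = {
    "prefetch": [("chrome.exe", 100), ("calc.exe", 900)],
    "lnk": [("chrome.exe", 130)],
}
for table, rows in sample.items():
    conn.execute(f'CREATE TABLE "{table}" (name TEXT, ts INTEGER)')
    conn.execute(f'CREATE INDEX "idx_{table}_ts" ON "{table}" (ts)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)

results = list(scan_windows(conn, ["prefetch", "lnk"], window_size=300))
```

Only the rows of the current window are fetched at any time, which is the property the memory-usage note above relies on.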
Why Not Combine Them?
Fundamental Incompatibility: The two engines represent fundamentally different correlation philosophies that cannot be merged without sacrificing the strengths of each:
- Primary Approach: Identity-Based first groups data by unique entities, then correlates events within those groups. Time-Based systematically scans the entire timeline in fixed windows, correlating all events within each window.
- Search Space Management: Identity-Based significantly reduces the search space by pre-filtering based on identity, making it efficient for tracking specific entities. Time-Based explores a localized temporal search space within each window for comprehensive coverage across all artifacts.
- Memory Model: Both engines leverage streaming for constant memory usage, but their internal data handling strategies differ based on their primary approach.
- Optimization Focus: Identity-Based optimizes for entity-centric tracking and scalability in large datasets. Time-Based optimizes for a comprehensive temporal overview and efficient processing of all events within a given time frame.
- Investigative Question: Identity-Based excels at answering "What did this specific application/file do over time?". Time-Based excels at answering "What activity occurred within this time period?".
Design Decision: Rather than creating a compromised hybrid that performs poorly in both scenarios, the system provides specialized engines that excel in their respective domains. The EngineSelector factory automatically chooses the appropriate engine based on dataset size and analysis goals, or allows manual selection for specific investigation needs.
Future Enhancements
The correlation system is continuously evolving to become more intelligent and comprehensive. Future development will focus on:
Enhanced Intelligence
New correlation engines will be added to link a wider variety of data types together, creating a more comprehensive understanding of system activity and relationships between artifacts.
Behavioral Detection
Advanced engines will analyze patterns to automatically detect system behavior and user activities, identifying anomalies and suspicious patterns without manual configuration.
Cross-Domain Correlation
Future engines will correlate data across multiple forensic domains (network, file system, registry, memory) to provide holistic insights into complex attack chains and system interactions.
Modular Architecture: The engine system is designed for extensibility. Each new engine enhances the correlation capabilities without disrupting existing functionality, allowing the system to grow smarter over time while maintaining stability and performance.
Visual Engine Comparison
Understanding the differences between engines helps select the right tool for your investigation:
Identity-Based Engine
O(N log N)
Normalize app names, paths, hashes
Cluster all evidence per entity
Build time-based clusters
Write directly to database
Optimized For:
- Large Datasets: > 1,000 records
- Identity Tracking: Follow specific apps/files
- Memory Efficiency: Constant memory usage
- Production Use: Stable & tested
Key Metrics:
- Complexity: O(N log N)
- Memory: O(1) streaming
- Speed: Fast for large data
Time-Based Engine
O(N log N)
Scan fixed time windows
Retrieve records via indexes
Field matching & rules
Write directly to database
Optimized For:
- Large Datasets: > 1,000 records
- Systematic Analysis: Comprehensive temporal view
- Memory Efficiency: Constant memory usage (streaming)
- Production Use: Stable & tested
Key Metrics:
- Complexity: O(N log N)
- Memory: O(1) streaming
- Speed: Fast for large data
Correlation Methodology
Both engines follow a structured methodology but differ in their approach to organizing and correlating data:
Identity-Based Methodology
- Identity Extraction: Normalizes application names, file paths, and hashes across all artifacts using semantic mappings
- Identity Grouping: Groups all records by their normalized identity (e.g., all evidence related to "chrome.exe")
- Temporal Clustering: Within each identity group, creates temporal anchors representing clusters of related events
- Cross-Artifact Correlation: Links evidence from different artifacts (Prefetch, Registry, Event Logs) for the same identity
- Streaming Output: Writes results directly to database to maintain constant memory usage
- Confidence Scoring: Calculates scores based on identity strength, temporal proximity, and artifact diversity
Key Advantage: Its identity-centric approach ensures highly efficient tracking of specific applications or files across the entire forensic timeline, regardless of their timestamps, without loading all data into memory.
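The first three steps of the identity-based methodology can be sketched as below. This is an illustration under stated assumptions, not the engine's implementation: the normalization rule (reduce to a lower-case file name), the 5-minute gap that separates temporal anchors, and the record layout (`name`, `ts`, `feather`) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def normalize_identity(value: str) -> str:
    """Reduce a path or name to a lower-case file name (illustrative rule)."""
    return value.replace("\\", "/").rsplit("/", 1)[-1].strip().lower()


def build_anchors(records, gap=timedelta(minutes=5)):
    """Group records by identity, then split each group into temporal anchors."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_identity(rec["name"])].append(rec)

    anchors = {}
    for identity, recs in groups.items():
        recs.sort(key=lambda r: r["ts"])  # sorting is the O(N log N) step
        clusters = [[recs[0]]]
        for rec in recs[1:]:
            # Start a new anchor when the gap to the previous event is too large.
            if rec["ts"] - clusters[-1][-1]["ts"] > gap:
                clusters.append([rec])
            else:
                clusters[-1].append(rec)
        anchors[identity] = clusters
    return anchors


records = [
    {"name": r"C:\Program Files\Chrome\chrome.exe",
     "ts": datetime(2024, 1, 1, 10, 0), "feather": "prefetch"},
    {"name": "CHROME.EXE", "ts": datetime(2024, 1, 1, 10, 2), "feather": "eventlog"},
    {"name": "chrome.exe", "ts": datetime(2024, 1, 1, 14, 0), "feather": "registry"},
]
anchors = build_anchors(records)
```

In this sample all three records collapse to the identity `chrome.exe`; the two morning events share one anchor and the afternoon event forms a second one.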
Time-Based Methodology
- Time Range Detection: Automatically determines the overall time span of all artifacts.
- Window Generation: Divides the entire time range into fixed, overlapping time windows.
- Indexed Window Querying: For each window, efficiently queries all relevant records from all feathers using optimized database indexes.
- In-Window Correlation: Correlates records within the current time window using comprehensive field-level matching and semantic rules.
- Streaming Output: Writes correlated results directly to the database for constant memory usage.
- Confidence Scoring: Calculates scores based on temporal proximity, field matches, and semantic similarity within each window.
Key Advantage: Offers a systematic and comprehensive temporal view of all activities within defined time windows, ideal for reconstructing events and analyzing activity patterns across the entire dataset.
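Window generation and indexed window querying can be sketched with the standard-library `sqlite3` module. The table layout, the 10-minute window with a 5-minute step, and the "same name seen in more than one feather" rule are assumptions chosen for illustration; the real engine's matching rules are richer.

```python
import sqlite3
from datetime import datetime, timedelta


def generate_windows(start, end, size, step):
    """Yield fixed-size windows; step < size makes consecutive windows overlap."""
    current = start
    while current < end:
        yield current, min(current + size, end)
        current += step


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (feather TEXT, name TEXT, ts TEXT)")
# The index lets each window query seek by timestamp instead of scanning.
conn.execute("CREATE INDEX idx_records_ts ON records (ts)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("prefetch", "chrome.exe", "2024-01-01T10:00:00"),
        ("eventlog", "chrome.exe", "2024-01-01T10:03:00"),
        ("registry", "notepad.exe", "2024-01-01T10:20:00"),
    ],
)

start = datetime(2024, 1, 1, 10, 0)
end = datetime(2024, 1, 1, 10, 30)
windows = list(generate_windows(start, end, timedelta(minutes=10), timedelta(minutes=5)))

matches = []
for w_start, w_end in windows:
    # ISO-8601 strings sort chronologically, so plain text comparison works.
    rows = conn.execute(
        "SELECT feather, name FROM records WHERE ts >= ? AND ts < ?",
        (w_start.isoformat(), w_end.isoformat()),
    ).fetchall()
    # Correlate only when more than one feather reports the same name.
    by_name = {}
    for feather, name in rows:
        by_name.setdefault(name, set()).add(feather)
    matches += [(w_start, n) for n, f in by_name.items() if len(f) > 1]
```

Because the windows overlap, the same pair of records can fall into two consecutive windows, so a real implementation would need to deduplicate matches; this sketch omits that step.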
Core Components Architecture
The correlation system is built from five core components that work together to enable forensic analysis:
Feather
Normalized Data Container
Purpose: Stores forensic artifact data in a standardized SQLite format
Contains:
- Normalized timestamps
- Semantic field mappings
- Artifact metadata
- Source information
Key Files:
- __init__.py
- feather_builder.py
- database.py
- transformer.py
Wings
Correlation Rule Definitions
Purpose: Defines which feathers to correlate and how
Contains:
- Feather specifications
- Time window settings
- Filter conditions
- Match requirements
Key Files:
- __init__.py
- wing_model.py
- artifact_detector.py
- wing_validator.py
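As an illustration only, a wing can be pictured as a small data model plus a validator. The field names (`feather_id`, `time_window_minutes`, `minimum_matches`, `filters`) and the validation rules below are assumptions made for this sketch; they are not the actual contents of `wing_model.py` or `wing_validator.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatherSpec:
    feather_id: str      # hypothetical field names throughout
    artifact_type: str


@dataclass
class Wing:
    name: str
    feathers: List[FeatherSpec]
    time_window_minutes: int = 5
    minimum_matches: int = 2
    filters: dict = field(default_factory=dict)


def validate_wing(wing: Wing) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # Assumed rule: correlation needs at least two feathers to compare.
    if len(wing.feathers) < 2:
        errors.append("a wing needs at least two feathers to correlate")
    if wing.time_window_minutes <= 0:
        errors.append("time window must be positive")
    if wing.minimum_matches > len(wing.feathers):
        errors.append("minimum_matches exceeds the number of feathers")
    return errors


good = Wing("Execution Proof",
            [FeatherSpec("prefetch", "Prefetch"), FeatherSpec("bam", "BAM")])
bad = Wing("Broken", [FeatherSpec("prefetch", "Prefetch")], time_window_minutes=0)
good_errors = validate_wing(good)
bad_errors = validate_wing(bad)
```

Returning a list of problems instead of raising on the first one lets a configuration editor show every issue at once.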
Engine
Correlation Processing Logic
Purpose: Executes correlation algorithms to find relationships
Contains:
- Identity-Based engine (O(N log N))
- Time-Window Scanning engine (O(N log N))
- Feather loader
- Scoring algorithms
Key Files:
- __init__.py
- correlation_engine.py
- engine_selector.py
- identity_correlation_engine.py
- time_based_engine.py
- weighted_scoring.py
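One way to picture a weighted confidence score is a weighted sum of the three factors named for the identity-based engine: identity strength, temporal proximity, and artifact diversity. The formula, the weights, and the function name below are assumptions for illustration and are not taken from `weighted_scoring.py`.

```python
def confidence_score(identity_strength, time_gap_seconds, window_seconds,
                     feather_count, total_feathers, weights=(0.4, 0.3, 0.3)):
    """Combine three factors, each in [0, 1], into a single score in [0, 1]."""
    if window_seconds <= 0 or total_feathers <= 0:
        raise ValueError("window_seconds and total_feathers must be positive")
    # Events closer together in time score higher; beyond the window scores 0.
    proximity = max(0.0, 1.0 - time_gap_seconds / window_seconds)
    # More distinct feathers agreeing on the match means higher diversity.
    diversity = feather_count / total_feathers
    w_identity, w_time, w_diversity = weights
    score = (w_identity * identity_strength
             + w_time * proximity
             + w_diversity * diversity)
    return round(score, 3)


# Exact identity match, events 60 s apart in a 300 s window, 3 of 4 feathers.
score = confidence_score(1.0, 60, 300, 3, 4)
```

With weights that sum to 1 and inputs in [0, 1], the result stays in [0, 1], which makes scores comparable across wings.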
Config
Configuration Management
Purpose: Manages all system configurations and mappings
Contains:
- Semantic field mappings
- Feather configurations
- Wing configurations
- Pipeline definitions
Key Files:
- __init__.py
- config_manager.py
- semantic_mapping.py
- pipeline_config.py
- integrated_configuration_manager.py
- case_specific_configuration_manager.py
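A semantic field mapping can be thought of as a lookup from raw, parser-specific column names to a shared vocabulary. The mapping entries and the `apply_semantic_mapping` helper below are hypothetical and only illustrate the idea; the project's actual mappings live in its configuration files.

```python
SEMANTIC_MAPPINGS = {
    # raw column name (lower-cased) -> semantic field; entries are illustrative
    "executable_name": "application",
    "imagepath": "application",
    "newprocessname": "application",
    "targetusername": "user",
    "last_run": "timestamp",
    "timecreated": "timestamp",
}


def apply_semantic_mapping(row: dict) -> dict:
    """Rename known raw columns to semantic fields; keep unknown columns as-is."""
    mapped = {}
    for column, value in row.items():
        key = SEMANTIC_MAPPINGS.get(column.lower(), column)
        # The first non-empty value wins when two raw columns map to one field.
        if key not in mapped or not mapped[key]:
            mapped[key] = value
    return mapped


raw = {"NewProcessName": "cmd.exe", "TimeCreated": "2024-01-01T10:00:00", "EventID": 4688}
normalized = apply_semantic_mapping(raw)
```

Once every feather exposes the same field names, an engine can compare an event-log record with a Prefetch record without knowing either parser's column layout.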
Pipeline
Workflow Orchestration
Purpose: Automates complete analysis workflows
Contains:
- Feather creation steps
- Wing execution sequence
- Result aggregation
- Report generation
Key Files:
- __init__.py
- pipeline_executor.py
- pipeline_loader.py
- feather_auto_registration.py
- discovery_service.py
- database_connection_manager.py
Component Interaction Flow
Here's how the five core components work together in a typical correlation workflow:
Complete Correlation Workflow
Config Loads Settings
Configuration manager loads semantic mappings, feather configs, and wing definitions from JSON files
Feather Normalizes Data
Feather Builder transforms raw forensic artifacts into normalized SQLite databases with standardized schemas
Wings Define Rules
Wing configuration specifies which feathers to correlate, time windows, filters, and match requirements
Pipeline Orchestrates
Pipeline executor coordinates feather creation and wing execution in the correct sequence
Engine Correlates
Correlation engine (Identity-Based or Time-Based) processes feathers according to wing rules and finds relationships
Results Generated
Correlation matches with confidence scores are stored in the database and displayed in the GUI timeline
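The ordering of this workflow (build all feathers first, then run each wing against them) can be sketched as a small orchestration function. `run_pipeline`, its arguments, and the config layout are hypothetical; the stand-in lambdas replace the real feather builder and correlation engine.

```python
from typing import Callable, Dict, List


def run_pipeline(config: dict,
                 build_feather: Callable[[dict], str],
                 run_wing: Callable[[dict, List[str]], List[dict]]) -> Dict[str, List[dict]]:
    """Create every feather first, then execute wings in their listed order."""
    feather_paths = [build_feather(spec) for spec in config["feathers"]]
    results = {}
    for wing in config["wings"]:
        # Each wing sees the full set of feathers produced above.
        results[wing["name"]] = run_wing(wing, feather_paths)
    return results


config = {
    "feathers": [{"name": "prefetch"}, {"name": "bam"}],
    "wings": [{"name": "Execution Proof"}],
}
results = run_pipeline(
    config,
    build_feather=lambda spec: f"{spec['name']}.db",           # stand-in builder
    run_wing=lambda wing, paths: [{"wing": wing["name"], "feathers": paths}],
)
```

Passing the builder and the engine in as callables keeps the orchestration independent of which engine a wing ends up using.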
Component Interaction Summary
- Config → Feather: Provides semantic mappings and transformation rules
- Config → Wings: Supplies validation rules and artifact type definitions
- Feather → Engine: Supplies normalized data for correlation processing
- Wings → Engine: Defines correlation parameters and match requirements
- Pipeline → All: Orchestrates the entire workflow from data ingestion to results
- Engine → Results: Produces correlation matches with confidence scores
Directory Structure
The correlation_engine is organized into 7 main directories, each with a specific responsibility:
engine/ - Core Correlation Engine
Purpose: Contains the core correlation logic, feather loading, scoring, and result management.
Key Files: 35 Python files
- correlation_engine.py - Main correlation engine
- identity_correlation_engine.py - Identity-based correlation
- time_based_engine.py - Time-based correlation
- engine_selector.py - Engine factory and selection
- weighted_scoring.py - Confidence score calculation
feather/ - Data Normalization
Purpose: Handles importing forensic artifact data from various sources and normalizing it into the feather format.
Key Files: 4 Python files + UI subdirectory
- __init__.py
- feather_builder.py - Main application entry point
- database.py - Database operations
- transformer.py - Data transformation pipeline
- ui/ - GUI components for Feather Builder
wings/ - Correlation Rules
Purpose: Defines the data models and validation logic for Wing configurations (correlation rules).
Key Files: 4 Python files in core/ + UI subdirectory
- core/__init__.py
- core/wing_model.py - Wing, FeatherSpec, CorrelationRules data models
- core/artifact_detector.py - Detect artifact types
- core/wing_validator.py - Validate wing configurations
- ui/ - GUI components for Wings Creator
config/ - Configuration Management
Purpose: Manages all configuration files (feathers, wings, pipelines) and semantic mappings.
Key Files: 20 Python files
- config_manager.py - Central configuration management
- semantic_mapping.py - Semantic field mappings
- pipeline_config.py - Pipeline configuration model
- integrated_configuration_manager.py - Integrated configuration
- case_specific_configuration_manager.py - Case-specific configuration
pipeline/ - Workflow Orchestration
Purpose: Orchestrates complete analysis workflows, including feather creation, wing execution, and report generation.
Key Files: 7 Python files
- pipeline_executor.py - Main pipeline execution
- pipeline_loader.py - Load pipeline configurations
- feather_auto_registration.py - Auto-register feathers
- discovery_service.py - Discover available configs
- database_connection_manager.py - Manage pipeline database connections
gui/ - User Interface
Purpose: Provides all GUI components for the correlation engine, including pipeline management, results visualization, and configuration editing.
Key Files: 35 Python files
- main_window.py - Main application window
- pipeline_management_tab.py - Pipeline creation/management
- correlation_results_view.py - Correlated results display
- timeline_widget.py - Timeline visualization
- settings_dialog.py - User settings and configurations
integration/ - Crow-Eye Bridge
Purpose: Integrates the correlation engine with the main Crow-Eye application, providing auto-generation features and default configurations.
Key Files: 21 Python files + default_wings/ subdirectory
- crow_eye_integration.py - Main integration bridge
- case_initializer.py - Initialize correlation engine for a case
- auto_feather_generator.py - Auto-generate feathers from Crow-Eye data
- feather_mappings.py - Define mappings for feather integration
- correlation_integration.py - Core correlation integration logic
Identity-Based Correlation Engine
The Identity-Based Engine groups records by identity first, then creates temporal anchors for efficient correlation. Status: Production Ready
How It Works
- Extract Identities: Normalizes application names, file paths, and other identifiers
- Group by Identity: Organizes records by their normalized identity
- Create Temporal Anchors: Builds temporal clusters of evidence for each identity
- Stream Results: Writes correlation results directly to database for memory efficiency
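The streaming step is what keeps memory bounded: matches are written in batches as they are produced instead of being collected in one list. The sketch below uses the standard-library `sqlite3` module; the `matches` table, its columns, and the batch size of 1,000 are assumptions for illustration.

```python
import sqlite3
from itertools import islice


def stream_matches(conn, matches, batch_size=1000):
    """Write matches in fixed-size batches so memory stays bounded by batch_size."""
    conn.execute("CREATE TABLE IF NOT EXISTS matches (identity TEXT, score REAL)")
    iterator = iter(matches)
    written = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO matches VALUES (?, ?)", batch)
        conn.commit()
        written += len(batch)
    return written


def fake_matches(n):
    # Generator: matches are produced lazily, never held in one big list.
    for i in range(n):
        yield (f"app{i}.exe", 0.5)


conn = sqlite3.connect(":memory:")
total = stream_matches(conn, fake_matches(2500), batch_size=1000)
stored = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Only one batch is held in memory at a time, so peak usage depends on `batch_size` rather than on the total number of matches.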
Performance Characteristics
- Complexity: O(N log N) - Efficient grouping and sorting
- Memory Usage: Constant with streaming mode enabled
- Best For: Large datasets (> 1,000 records)
- Strengths: Identity tracking across artifacts, scalable performance, streaming support
Use Case Scenarios
- Production forensic investigations with large datasets
- Tracking specific applications or files across multiple artifacts
- Enterprise-scale analysis requiring memory efficiency
- Automated correlation pipelines processing millions of records
Time-Based Correlation Engine
The Time-Based Engine implements an efficient O(N log N) time-window scanning approach for systematic temporal analysis. It is designed for large datasets and performance-critical environments. Status: Production Ready
How It Works
- Systematic Time Scan: Scans through forensic data systematically in fixed time intervals (windows).
- Indexed Queries: Utilizes optimized database indexes for fast retrieval of records within each time window.
- Field-Level Matching: Performs comprehensive field-level comparisons for records found in the current window.
- Streaming Results: Processes and writes correlated matches directly to the database for memory efficiency.
Performance Characteristics
- Complexity: O(N log N) - Achieves efficient scalability through time-window scanning and optimized indexing.
- Memory Usage: Low (streaming mode) - Processes data in chunks, maintaining constant memory usage.
- Best For: Large datasets (> 1,000 records) - Efficiently handles millions of records.
- Strengths: Systematic temporal analysis, high performance, memory-efficient.
Use Case Scenarios
- Large-scale forensic investigations with extensive datasets.
- Systematic temporal analysis across an entire timeline.
- Performance-critical environments where speed and memory efficiency are paramount.
- Automated correlation pipelines.
Engine Selection Guide
Choose the appropriate correlation engine based on your specific needs:
Choose Identity-Based Engine When:
- Dataset has > 1,000 records
- Performance is critical
- You need identity tracking across artifacts
- You want to filter by specific applications
- Memory constraints require streaming mode
- Production-ready solution needed
Choose Time-Based Engine When (Production Ready):
- Dataset has > 1,000 records
- Systematic temporal analysis across large datasets is needed
- Performance and memory efficiency are paramount
- Automated correlation pipelines are in use
- You require an O(N log N) scalable solution
Decision Criteria Table
| Factor | Time-Based | Identity-Based |
|---|---|---|
| Dataset Size | Large datasets (>1,000 records) | Very large datasets (10,000+ records) |
| Complexity | O(N log N) | O(N log N) |
| Memory Usage | Low (streaming mode) | Low (streaming mode) |
| Analysis Type | Systematic temporal analysis | Identity tracking |
| Use Case | Production, automation | Production, automation |