Introduction
Built by Investigators, for Investigators
The Crow Eye Correlation Engine is designed to revolutionize the forensic investigation process by making it faster, more accurate, and more focused. Born from the forensic community's need for efficient artifact correlation, this open-source solution enables investigators to quickly identify the most critical information and answer the most pressing questions in their cases.
⚡ Speed & Efficiency
The dual-engine architecture is optimized for performance. The Identity-Based engine processes large datasets with O(N log N) complexity and streaming mode, enabling analysis of millions of records without memory constraints. Investigators can correlate data quickly and get answers when time is critical.
🎯 Accuracy & Focus
By automatically correlating artifacts across multiple sources (Prefetch, Registry, Event Logs, MFT, SRUM, and more), the engine helps investigators focus on the details that matter most. Instead of manually searching through thousands of records, investigators can immediately see the relationships and patterns relevant to their case.
🤝 Community-Driven
From the community, to the community. Crow Eye is built as an open-source solution by forensic investigators who understand the challenges of digital forensics. The correlation engine is actively developed with contributions from the forensic community, ensuring it evolves to meet real-world investigation needs.
🌟 Mission: Open-Source Forensic Excellence
Crow Eye aims to be the open-source solution for analyzing and correlating forensic data. Our goal is to provide investigators with tools that deliver the most critical information as quickly as possible, enabling them to:
- Answer key questions faster: What happened? When? Who was involved?
- Identify critical evidence: Focus on the artifacts that matter most to your case
- Correlate across artifacts: See relationships between Prefetch, Registry, Event Logs, and more
- Scale to any dataset: From small targeted investigations to enterprise-wide incidents
- Contribute and improve: Join the community in building better forensic tools
Purpose of This Document
This comprehensive documentation serves as the main entry point for developers and contributors who want to understand, modify, or extend the correlation engine system. Whether you're investigating a case or contributing code, this guide will help you leverage the full power of the Correlation Engine.
- Understand the System: Get a high-level view of how all components work together
- Navigate the Codebase: Find the right files to modify for specific tasks
- Visualize Architecture: See diagrams that illustrate system structure and data flow
- Learn Core Concepts: Master anchors, time windows, feathers, wings, and engines
- Contribute: Join the community in making forensics faster and more accessible
Who Should Read This
- Forensic Investigators: Using Crow Eye for case analysis and wanting to understand correlation capabilities
- Developers: New to the Crow Eye project and wanting to contribute
- Contributors: Adding new features, engines, or artifact support
- Maintainers: Debugging issues and optimizing performance
- Community Members: Anyone passionate about open-source forensic tools
🚀 Active Development: The Correlation Engine is continuously evolving with new features, optimizations, and community contributions. Check the GitHub repository for the latest updates and join us in building the future of open-source forensics.
What is the Correlation Engine?
The Correlation Engine is a forensic analysis system that finds temporal and semantic relationships between different types of forensic artifacts. It helps investigators discover connections between events that occurred on a system by correlating data from multiple sources.
The system implements a dual-engine architecture with two distinct correlation strategies:
1. Identity-Based Correlation Engine
Groups records by identity first, then creates temporal anchors. Optimized for large datasets (> 1,000 records) with O(N log N) performance and streaming support. ✅ Production Ready
2. Time-Based Correlation Engine
Uses temporal proximity as the primary factor with comprehensive field matching. Ideal for small datasets (< 1,000 records) requiring detailed analysis. ⚠️ Under Development
Key Capabilities
- Dual-Engine Architecture: Choose between Time-Based (O(N²)) and Identity-Based (O(N log N)) engines based on dataset size and analysis goals
- Engine Selection: Automatic or manual engine selection via EngineSelector factory
- Temporal Correlation: Find events that occurred within a specified time window
- Identity Tracking: Track applications and files across multiple artifacts (Identity-Based engine)
- Multi-Artifact Support: Correlate data from Prefetch, ShimCache, AmCache, Event Logs, LNK files, Jumplists, MFT, SRUM, Registry, and more
- Flexible Rules: Define custom correlation rules (Wings) with configurable parameters
- Semantic Mapping: Map different column names to common semantic meanings
- Duplicate Prevention: Automatically detect and prevent duplicate matches
- Weighted Scoring: Calculate confidence scores based on multiple factors
- Streaming Mode: Process millions of records with constant memory usage (Identity-Based engine)
- Pipeline Automation: Execute complete analysis workflows automatically
- Visual Interface: GUI for building pipelines, viewing results, and exploring timelines
Core Concepts
Understanding these fundamental concepts is essential for working with the Correlation Engine:
🪶 Feather
A data normalization system that accepts various input formats (CSV, JSON, or any forensic tool output) and converts them into a standardized SQLite database. Feathers normalize diverse data sources into a consistent schema, making them ready to serve as input for the correlation engine. Each feather represents a single artifact type (e.g., Prefetch, Registry, Event Logs) with standardized column names and timestamp formats.
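As a rough sketch of what a feather build does, the snippet below normalizes a CSV export into a SQLite table. The column names and schema here are illustrative assumptions, not the project's actual config format, which is defined in JSON feather configs:

```python
import csv
import sqlite3

# Hypothetical mapping from a Prefetch CSV export to a feather schema.
COLUMN_MAP = {
    "ExecutableName": "name",
    "LastRunTime": "timestamp",
}

def build_feather(csv_path, db_path, column_map=COLUMN_MAP):
    """Normalize a CSV artifact export into a feather-style SQLite database.

    Returns the number of records written.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, timestamp TEXT)")
    count = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Keep only the mapped columns, in the mapping's order.
            values = tuple(row.get(src, "") for src in column_map)
            con.execute("INSERT INTO records (name, timestamp) VALUES (?, ?)", values)
            count += 1
    con.commit()
    con.close()
    return count
```

Because every feather exposes the same schema, the correlation engine can query any artifact type the same way.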
🪽 Wing
A comprehensive configuration that defines the complete correlation workflow. Wings specify: (1) which feathers to use as input for correlation, (2) the correlation rules and matching criteria, (3) the time window (anchor range) for temporal correlation, (4) semantic mappings to translate different field names to common meanings (e.g., mapping "ExecutableName", "ProcessName", "AppName" to a unified "application" concept), and (5) filters to narrow the dataset. Wings are the control center that orchestrates how the correlation engine processes data.
⚙️ Engine
The correlation strategy (Time-Based or Identity-Based) used to find relationships between forensic artifacts. Each engine implements a different algorithmic approach optimized for specific use cases and dataset sizes.
⚓ Anchor
A record from one feather that serves as the starting point for finding correlations. In the Time-Based engine, each record with a valid timestamp becomes an anchor. In the Identity-Based engine, anchors are temporal clusters of evidence grouped by identity. Both engines use the same time window concept to find related records.
⏱️ Time Window
The temporal range used to find related events around an anchor. The default is 5 minutes (±2.5 minutes from the anchor timestamp), meaning events up to 2.5 minutes before or after the anchor are considered potentially related. This value can be customized in the Wing configuration to match investigation needs (e.g., 1 minute for precise correlations, 30 minutes for broader analysis).
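The window arithmetic is simple to sketch in plain Python (the function names here are illustrative, not the engine's API); a 5-minute window spans 2.5 minutes on each side of the anchor:

```python
from datetime import datetime, timedelta

def window_bounds(anchor_ts, window_minutes=5.0):
    """Return (start, end) of the correlation window centered on an anchor."""
    half = timedelta(minutes=window_minutes / 2)
    return anchor_ts - half, anchor_ts + half

def in_window(candidate_ts, anchor_ts, window_minutes=5.0):
    """True if the candidate timestamp falls inside the anchor's window."""
    start, end = window_bounds(anchor_ts, window_minutes)
    return start <= candidate_ts <= end
```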
🔗 Semantic Mapping
A translation layer defined in Wings that maps different column names from various artifacts to common semantic meanings. For example, "ExecutableName" (Prefetch), "ProcessName" (Event Logs), and "AppName" (Registry) can all be mapped to "application", enabling the engine to correlate records across different artifact types even when they use different field names.
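A minimal sketch of how such a mapping can be resolved at correlation time (the mapping shown is illustrative; the real mappings live in the config system and cover many more concepts and column names):

```python
# Illustrative semantic map: several artifact-specific column names
# all resolve to the common concept "application".
SEMANTIC_MAP = {
    "application": ("ExecutableName", "ProcessName", "AppName"),
}

def resolve_semantic(record, concept, semantic_map=SEMANTIC_MAP):
    """Return the first non-empty value in `record` whose column maps to `concept`."""
    for column in semantic_map.get(concept, ()):
        value = record.get(column)
        if value:
            return value
    return None
```

With this layer in place, a Prefetch record and an Event Log record can be compared on "application" even though they name the field differently.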
🆔 Identity
A normalized representation of an application, file, or entity across artifacts (Identity-Based engine). Identities enable tracking the same entity across different forensic sources, even when represented with different names or formats.
🎯 Match
A set of temporally-related records from different feathers that have been correlated by the engine. Each match includes a confidence score indicating the strength of the correlation based on temporal proximity, field matches, and semantic similarity.
🔄 Pipeline
An automated workflow that creates feathers and executes wings in a coordinated sequence. Pipelines orchestrate the entire correlation process from data ingestion to result generation, enabling repeatable and automated forensic analysis.
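A pipeline's control flow can be sketched as below. The job structure and function names are hypothetical; the real pipeline executor also handles discovery, validation, and report generation:

```python
def run_pipeline(feather_jobs, wing_jobs, build_feather, execute_wing):
    """Build every feather first, then execute each wing against its feathers.

    feather_jobs: {name: config} for each feather to create.
    wing_jobs: list of wing configs, each naming the feathers it needs.
    build_feather / execute_wing: callables supplied by the caller.
    """
    # Step 1: data ingestion -- every feather is normalized up front.
    feather_paths = {name: build_feather(cfg) for name, cfg in feather_jobs.items()}
    # Step 2: correlation -- each wing runs against its declared feathers.
    results = []
    for wing in wing_jobs:
        inputs = [feather_paths[n] for n in wing["feathers"]]
        results.append(execute_wing(wing, inputs))
    return results
```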
💾 Streaming Mode
Memory-efficient processing that writes results directly to the database (Identity-Based engine). Streaming mode enables processing of millions of records with constant memory usage, making it suitable for large-scale forensic investigations.
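The constant-memory idea can be illustrated with a generator that yields one identity group at a time and flushes each result straight to SQLite. The schema and names are illustrative, not the engine's actual output format:

```python
import sqlite3

def stream_matches(groups, db_path):
    """Write one result row per identity group, holding only one group in memory.

    `groups` is any iterable (e.g. a generator) of (identity, records) pairs,
    so peak memory stays constant regardless of total dataset size.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS matches (identity TEXT, n_records INTEGER)")
    written = 0
    for identity, records in groups:
        con.execute("INSERT INTO matches VALUES (?, ?)", (identity, len(records)))
        con.commit()  # flush per group so nothing accumulates in memory
        written += 1
    con.close()
    return written
```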
Architecture Diagrams
System Architecture
The Correlation Engine is organized into 7 major subsystems:
🔴 Engine Core
🔵 Feather System
🟢 Wings System
🟠 Configuration
🟣 Pipeline
🟡 GUI Components
🟣 Integration
Color Legend
- 🔴 Red: Engine Core (correlation logic)
- 🔵 Blue: Feather System (data normalization)
- 🟢 Green: Wings System (rule definitions)
- 🟠 Orange: Configuration (settings management)
- 🟣 Purple: Pipeline (workflow orchestration)
- 🟡 Yellow: GUI (user interface)
- 🟣 Magenta: Integration (Crow-Eye bridge)
Data Flow
This diagram shows how forensic data flows through the system from source to results:
Data Flow Through Correlation Engine
User Configuration
Define correlation rules, select feathers, set time window
Load Pipeline Config
Load feather configs and wing configs
Create Feathers
Transform source data to normalized format
Store Normalized Data
Create SQLite databases with metadata
Execute Wing
Pass wing config and feather paths to engine
Load Feather Data
Query records from each feather database
Correlate by Time or Identity
Depending on the engine used: the Time-Based engine collects anchors and finds temporal matches, while the Identity-Based engine groups records by identity and creates temporal clusters
Return Results
Return matches with confidence scores
Display Results
Format and present correlation matches in GUI
Dependency Graph
This diagram shows how the major directories depend on each other:
Dependency Rules
- config/: No dependencies on other correlation_engine modules (base layer)
- wings/: Depends only on config/
- feather/: Depends only on config/
- engine/: Depends on feather/, wings/, config/
- pipeline/: Depends on engine/, config/, wings/
- gui/: Depends on engine/, pipeline/, config/, wings/
- integration/: Depends on all other modules (top layer)
Dual-Engine Architecture
Why Multiple Engines?
Forensic investigations vary dramatically in scope, data volume, and analytical priorities. A single correlation approach cannot efficiently handle all scenarios. The Crow Eye Correlation Engine implements a modular, multi-engine architecture where each engine is optimized for specific investigation priorities and dataset characteristics.
Design Philosophy
- Case-Driven Selection: Different investigations require different correlation strategies. A targeted malware analysis needs identity tracking, while a timeline reconstruction needs comprehensive temporal analysis.
- Performance Optimization: Each engine uses algorithms optimized for its use case - Identity-Based uses O(N log N) complexity for large datasets, while Time-Based provides detailed O(N²) analysis for smaller scopes.
- Scalability: The architecture allows adding new engines without modifying existing code, following the Open/Closed Principle.
- Future Extensibility: Additional engines can focus on specific forensic aspects like network correlation, user behavior analysis, or file system relationships.
Current Engines
The system currently implements two engines, with more planned for future releases:
✅ Identity-Based Engine
Status: Production Ready
Priority: Track specific applications, files, or entities across the entire forensic timeline
Best For: Malware analysis, application tracking, large datasets (>1,000 records)
⚠️ Time-Based Engine
Status: Under Development
Priority: Comprehensive temporal analysis with detailed field-level matching
Best For: Timeline reconstruction, detailed investigations, smaller datasets (<1,000 records)
The Theory: Why Two Separate Engines?
The decision to implement two distinct engines rather than a single unified approach stems from fundamental differences in correlation strategies and their computational trade-offs. Each engine represents a different theoretical approach to the correlation problem:
Identity-Based Engine Theory
Core Principle: "Group first, then correlate within groups"
This engine operates on the principle that forensic artifacts related to the same entity (application, file, user) should be grouped together before temporal analysis. By normalizing identities across different artifact types and creating identity-based clusters, the engine reduces the correlation search space dramatically.
Algorithmic Approach:
- Identity Extraction: Normalize all application names, file paths, and hashes across artifacts using semantic mappings
- Identity Grouping: Create buckets where each bucket contains all evidence for a single identity (e.g., all "chrome.exe" evidence)
- Temporal Clustering: Within each identity group, create temporal anchors representing time-based clusters of activity
- Cross-Artifact Linking: Link evidence from different sources (Prefetch, Registry, Event Logs) within the same identity group
Why O(N log N)? By grouping first, the engine only compares records within the same identity group. If you have N records distributed across M identities, you're performing M smaller sorts (O(N/M log N/M) each) instead of one large O(N²) comparison. The sorting and grouping operations dominate, giving O(N log N) complexity.
Memory Efficiency: The streaming mode processes one identity group at a time, writing results directly to the database. This maintains constant memory usage regardless of dataset size.
Trade-off: This approach is optimized for tracking specific entities but may miss correlations between unrelated artifacts that happen to occur at the same time.
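The "group first, then correlate within groups" strategy can be sketched in a few lines. The identity normalization and the clustering gap below are simplified assumptions for illustration, not the engine's exact rules:

```python
import ntpath
from collections import defaultdict
from datetime import timedelta

def normalize_identity(value):
    """Reduce 'C:\\Program Files\\CHROME.EXE' and 'chrome.exe' to one key."""
    return ntpath.basename(value).lower()

def identity_clusters(records, gap=timedelta(minutes=5)):
    """Group (name, timestamp) records by identity, then split each group
    into temporal clusters -- the identity-based notion of an anchor.
    """
    groups = defaultdict(list)
    for name, ts in records:                 # O(N) grouping pass
        groups[normalize_identity(name)].append(ts)
    clusters = {}
    for identity, stamps in groups.items():
        stamps.sort()                        # O(n log n) within each group only
        buckets = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - buckets[-1][-1] <= gap:  # close in time: same cluster
                buckets[-1].append(ts)
            else:                            # gap exceeded: start a new anchor
                buckets.append([ts])
        clusters[identity] = buckets
    return clusters
```

Note how no comparison ever crosses an identity boundary, which is exactly why the search space shrinks.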
Time-Based Engine Theory
Core Principle: "Correlate everything within temporal windows"
This engine operates on the principle that temporal proximity is the primary indicator of correlation. It performs comprehensive field-level matching for all records that fall within a specified time window, regardless of identity.
Algorithmic Approach:
- Anchor Collection: Gather all records with valid timestamps from all feathers (data sources)
- Chronological Sorting: Sort all anchors by timestamp to enable efficient window-based searching
- Window-Based Matching: For each anchor, search all other feathers for records within the time window
- Comprehensive Field Matching: Perform detailed semantic field comparisons (paths, hashes, names, users)
- Weighted Scoring: Calculate confidence based on temporal proximity, field matches, and semantic similarity
Why O(N²)? For each of N anchors, the engine must search through potentially all other records to find temporal matches. Even with optimizations like sorted timestamps and time window limits, the worst-case complexity remains O(N²) because every anchor could potentially match with every other record.
Memory Usage: All anchors must be loaded into memory for sorting and searching, resulting in O(N) memory usage.
Trade-off: This approach finds all temporal relationships and provides detailed field-level analysis, but becomes computationally expensive with large datasets.
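The sorted-anchor window search can be sketched with binary search over timestamps only; the real engine additionally performs the field-level matching described above:

```python
import bisect
from datetime import datetime, timedelta

def temporal_matches(anchors, candidates, window=timedelta(minutes=5)):
    """Map each anchor timestamp to the candidate timestamps within +/- window/2.

    Candidates are sorted once; each anchor then does two binary searches,
    which bounds the per-anchor search cost -- though the total output can
    still be O(N^2) when windows overlap heavily.
    """
    candidates = sorted(candidates)
    half = window / 2
    matches = {}
    for a in anchors:
        lo = bisect.bisect_left(candidates, a - half)
        hi = bisect.bisect_right(candidates, a + half)
        matches[a] = candidates[lo:hi]
    return matches
```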
Why Not Combine Them?
Fundamental Incompatibility: The two engines represent fundamentally different correlation philosophies that cannot be merged without sacrificing the strengths of each:
- Search Space: Identity-Based reduces search space through grouping (efficient for large data), while Time-Based explores the full search space (comprehensive for small data)
- Memory Model: Identity-Based uses streaming (constant memory), while Time-Based requires in-memory sorting (O(N) memory)
- Optimization Target: Identity-Based optimizes for identity tracking and scalability, while Time-Based optimizes for completeness and field-level detail
- Use Case Focus: Identity-Based excels at "follow this application across time," while Time-Based excels at "what happened during this time period"
Design Decision: Rather than creating a compromised hybrid that performs poorly in both scenarios, the system provides specialized engines that excel in their respective domains. The EngineSelector factory automatically chooses the appropriate engine based on dataset size and analysis goals, or allows manual selection for specific investigation needs.
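A size-based selector can be sketched as a tiny factory. The threshold and the returned strategy names are illustrative; per the description above, the real EngineSelector also honors explicit manual selection:

```python
# Dataset-size cutoff between the two strategies (illustrative value,
# matching the > 1,000 record guidance in this document).
LARGE_DATASET_THRESHOLD = 1_000

def select_engine(record_count, forced=None):
    """Pick a correlation strategy from dataset size, unless one is forced."""
    if forced is not None:
        return forced
    return "identity-based" if record_count > LARGE_DATASET_THRESHOLD else "time-based"
```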
Future Enhancements
The correlation system is continuously evolving to become more intelligent and comprehensive. Future development will focus on:
🧠 Enhanced Intelligence
New correlation engines will be added to link more variant data types together, creating a more comprehensive understanding of system activity and relationships between artifacts.
🎯 Behavioral Detection
Advanced engines will analyze patterns to automatically detect system behavior and user activities, identifying anomalies and suspicious patterns without manual configuration.
🌐 Cross-Domain Correlation
Future engines will correlate data across multiple forensic domains (network, file system, registry, memory) to provide holistic insights into complex attack chains and system interactions.
🚀 Modular Architecture: The engine system is designed for extensibility. Each new engine enhances the correlation capabilities without disrupting existing functionality, allowing the system to grow smarter over time while maintaining stability and performance.
Visual Engine Comparison
Understanding the differences between engines helps select the right tool for your investigation:
Identity-Based Engine
O(N log N)
1. Normalize app names, paths, hashes
2. Cluster all evidence per entity
3. Build time-based clusters
4. Write directly to database
Optimized For:
- Large Datasets: > 1,000 records
- Identity Tracking: Follow specific apps/files
- Memory Efficiency: Constant memory usage
- Production Use: Stable & tested
Key Metrics:
- Complexity: O(N log N)
- Memory: O(1) streaming
- Speed: Fast for large data
Time-Based Engine
O(N²)
1. Gather all timestamped records
2. Chronological ordering
3. Window-based searching
4. Weighted field matching
Optimized For:
- Small Datasets: < 1,000 records
- Detailed Analysis: Field-level matching
- Comprehensive: All temporal relationships
- Research: Debugging & validation
Key Metrics:
- Complexity: O(N²)
- Memory: O(N) in-memory
- Speed: Slower for large data
Correlation Methodology
Both engines follow a structured methodology but differ in their approach to organizing and correlating data:
Identity-Based Methodology
- Identity Extraction: Normalizes application names, file paths, and hashes across all artifacts using semantic mappings
- Identity Grouping: Groups all records by their normalized identity (e.g., all evidence related to "chrome.exe")
- Temporal Clustering: Within each identity group, creates temporal anchors representing clusters of related events
- Cross-Artifact Correlation: Links evidence from different artifacts (Prefetch, Registry, Event Logs) for the same identity
- Streaming Output: Writes results directly to database to maintain constant memory usage
- Confidence Scoring: Calculates scores based on identity strength, temporal proximity, and artifact diversity
Key Advantage: Efficiently tracks specific applications or files across the entire forensic timeline without loading all data into memory.
Time-Based Methodology
- Anchor Collection: Gathers all records with valid timestamps from all feathers
- Chronological Sorting: Orders all anchors by timestamp to enable temporal window searches
- Window-Based Matching: For each anchor, searches all other feathers for records within the time window
- Field-Level Comparison: Performs comprehensive semantic field matching (paths, hashes, names)
- Match Combination: Generates all valid combinations of correlated records
- Weighted Scoring: Calculates confidence based on temporal proximity, field matches, and semantic similarity
Key Advantage: Provides comprehensive field-level analysis and detailed semantic matching for in-depth investigations.
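The weighted-scoring step shared by both methodologies can be illustrated as follows. The weights and the linear temporal falloff are assumptions for illustration, not the actual values in weighted_scoring.py:

```python
from datetime import timedelta

# Illustrative factor weights (must sum to 1.0).
WEIGHTS = {"temporal": 0.5, "fields": 0.3, "semantic": 0.2}

def temporal_score(delta, window):
    """1.0 at the anchor, falling linearly to 0.0 at the window edge."""
    half = window / 2
    return max(0.0, 1.0 - abs(delta) / half)

def confidence(delta, field_score, semantic_score,
               window=timedelta(minutes=5), weights=WEIGHTS):
    """Blend temporal proximity, field matches, and semantic similarity
    (each in [0, 1]) into a single confidence score."""
    return (weights["temporal"] * temporal_score(delta, window)
            + weights["fields"] * field_score
            + weights["semantic"] * semantic_score)
```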
Core Components Architecture
The correlation system is built from five core components that work together to enable forensic analysis:
Feather
Normalized Data Container
Purpose: Stores forensic artifact data in a standardized SQLite format
Contains:
- Normalized timestamps
- Semantic field mappings
- Artifact metadata
- Source information
Key Files:
- feather_builder.py
- database.py
- transformer.py
Wings
Correlation Rule Definitions
Purpose: Defines which feathers to correlate and how
Contains:
- Feather specifications
- Time window settings
- Filter conditions
- Match requirements
Key Files:
- wing_model.py
- artifact_detector.py
- wing_validator.py
Engine
Correlation Processing Logic
Purpose: Executes correlation algorithms to find relationships
Contains:
- Identity-Based engine (O(N log N))
- Time-Based engine (O(N²))
- Feather loader
- Scoring algorithms
Key Files:
- engine_selector.py
- identity_correlation_engine.py
- time_based_engine.py
- weighted_scoring.py
Config
Configuration Management
Purpose: Manages all system configurations and mappings
Contains:
- Semantic field mappings
- Feather configurations
- Wing configurations
- Pipeline definitions
Key Files:
- config_manager.py
- semantic_mapping.py
- pipeline_config.py
Pipeline
Workflow Orchestration
Purpose: Automates complete analysis workflows
Contains:
- Feather creation steps
- Wing execution sequence
- Result aggregation
- Report generation
Key Files:
- pipeline_executor.py
- pipeline_loader.py
- discovery_service.py
Component Interaction Flow
Here's how the five core components work together in a typical correlation workflow:
Complete Correlation Workflow
Config Loads Settings
Configuration manager loads semantic mappings, feather configs, and wing definitions from JSON files
Feather Normalizes Data
Feather Builder transforms raw forensic artifacts into normalized SQLite databases with standardized schemas
Wings Define Rules
Wing configuration specifies which feathers to correlate, time windows, filters, and match requirements
Pipeline Orchestrates
Pipeline executor coordinates feather creation and wing execution in the correct sequence
Engine Correlates
Correlation engine (Identity-Based or Time-Based) processes feathers according to wing rules and finds relationships
Results Generated
Correlation matches with confidence scores are stored in database and displayed in GUI timeline
Component Interaction Summary
- Config → Feather: Provides semantic mappings and transformation rules
- Config → Wings: Supplies validation rules and artifact type definitions
- Feather → Engine: Supplies normalized data for correlation processing
- Wings → Engine: Defines correlation parameters and match requirements
- Pipeline → All: Orchestrates the entire workflow from data ingestion to results
- Engine → Results: Produces correlation matches with confidence scores
Directory Structure
The correlation_engine is organized into 7 main directories, each with a specific responsibility:
engine/ - Core Correlation Engine
Purpose: Contains the core correlation logic, feather loading, scoring, and result management.
Key Files: 15 Python files
- correlation_engine.py - Main correlation engine
- feather_loader.py - Loads and queries feather databases
- weighted_scoring.py - Confidence score calculation
- engine_selector.py - Engine factory and selection
- identity_correlation_engine.py - Identity-based correlation
feather/ - Data Normalization
Purpose: Handles importing forensic artifact data from various sources and normalizing it into the feather format.
Key Files: 4 Python files + UI subdirectory
- feather_builder.py - Main application entry point
- database.py - Database operations
- transformer.py - Data transformation pipeline
- ui/ - GUI components for Feather Builder
wings/ - Correlation Rules
Purpose: Defines the data models and validation logic for Wing configurations (correlation rules).
Key Files: 3 Python files in core/ + UI subdirectory
- core/wing_model.py - Wing, FeatherSpec, CorrelationRules data models
- core/artifact_detector.py - Detect artifact types
- core/wing_validator.py - Validate wing configurations
- ui/ - GUI components for Wings Creator
config/ - Configuration Management
Purpose: Manages all configuration files (feathers, wings, pipelines) and semantic mappings.
Key Files: 10 Python files
- config_manager.py - Central configuration management
- semantic_mapping.py - Semantic field mappings
- pipeline_config.py - Pipeline configuration model
pipeline/ - Workflow Orchestration
Purpose: Orchestrates complete analysis workflows, including feather creation, wing execution, and report generation.
Key Files: 7 Python files
- pipeline_executor.py - Main pipeline execution
- pipeline_loader.py - Load pipeline configurations
- discovery_service.py - Discover available configs
gui/ - User Interface
Purpose: Provides all GUI components for the correlation engine, including pipeline management, results visualization, and configuration editing.
Key Files: 26 Python files
- main_window.py - Main application window
- pipeline_management_tab.py - Pipeline creation/management
- timeline_widget.py - Timeline visualization
integration/ - Crow-Eye Bridge
Purpose: Integrates the correlation engine with the main Crow-Eye application, providing auto-generation features and default configurations.
Key Files: 7 Python files + default_wings/ subdirectory
- crow_eye_integration.py - Main integration bridge
- case_initializer.py - Initialize correlation engine for a case
- auto_feather_generator.py - Auto-generate feathers from Crow-Eye data
Identity-Based Correlation Engine
The Identity-Based Engine groups records by identity first, then creates temporal anchors for efficient correlation. ✅ Production Ready
How It Works
- Extract Identities: Normalizes application names, file paths, and other identifiers
- Group by Identity: Organizes records by their normalized identity
- Create Temporal Anchors: Builds temporal clusters of evidence for each identity
- Stream Results: Writes correlation results directly to database for memory efficiency
Performance Characteristics
- Complexity: O(N log N) - Efficient grouping and sorting
- Memory Usage: Constant with streaming mode enabled
- Best For: Large datasets (> 1,000 records)
- Strengths: Identity tracking across artifacts, scalable performance, streaming support
Use Case Scenarios
- Production forensic investigations with large datasets
- Tracking specific applications or files across multiple artifacts
- Enterprise-scale analysis requiring memory efficiency
- Automated correlation pipelines processing millions of records
Time-Based Correlation Engine
The Time-Based Engine uses temporal proximity as the primary factor for correlation with comprehensive field matching. ⚠️ Under Development
How It Works
- Collect All Anchors: Gathers records from all feathers that have valid timestamps
- Sort by Timestamp: Orders all anchors chronologically
- Find Temporal Matches: For each anchor, searches other feathers for records within the specified time window
- Calculate Scores: Computes confidence scores based on temporal proximity and field matching
Performance Characteristics
- Complexity: O(N²) - Compares each anchor against all other records
- Memory Usage: Loads all records into memory
- Best For: Small datasets (< 1,000 records)
- Strengths: Comprehensive field-level analysis, detailed semantic matching
Use Case Scenarios
- Research and debugging correlation logic
- Detailed forensic analysis requiring field-level inspection
- Small-scale investigations with limited artifact data
- Testing and validating correlation rules
Engine Selection Guide
Choose the appropriate correlation engine based on your specific needs:
Choose Identity-Based Engine When:
- ✅ Dataset has > 1,000 records
- ✅ Performance is critical
- ✅ You need identity tracking across artifacts
- ✅ You want to filter by specific applications
- ✅ Memory constraints require streaming mode
- ✅ Production-ready solution needed
Choose Time-Based Engine When: ⚠️ Under Development
- ✅ Dataset has < 1,000 records
- ✅ You need comprehensive field-level analysis
- ✅ You're debugging or doing research
- ✅ Detailed semantic matching is critical
- ✅ Memory is not a constraint
Decision Criteria Table
| Factor | Time-Based | Identity-Based |
|---|---|---|
| Dataset Size | < 1,000 records | > 1,000 records |
| Complexity | O(N²) | O(N log N) |
| Memory Usage | High (all in memory) | Low (streaming mode) |
| Analysis Type | Field-level matching | Identity tracking |
| Use Case | Research, debugging | Production, automation |