Introduction

Built by Investigators, for Investigators

The Crow Eye Correlation Engine is designed to revolutionize the forensic investigation process by making it faster, more accurate, and more focused. Born from the forensic community's need for efficient artifact correlation, this open-source solution enables investigators to quickly identify the most critical information and answer the most pressing questions in their cases.

⚡ Speed & Efficiency

The dual-engine architecture is optimized for performance. The Identity-Based engine processes large datasets with O(N log N) complexity and a streaming mode, enabling analysis of millions of records without memory constraints, so investigators get answers quickly when time is critical.

🎯 Accuracy & Focus

By automatically correlating artifacts across multiple sources (Prefetch, Registry, Event Logs, MFT, SRUM, and more), the engine helps investigators focus on the most needed details. Instead of manually searching through thousands of records, investigators can immediately see relationships and patterns that matter to their case.

🤝 Community-Driven

From the community, to the community. Crow Eye is built as an open-source solution by forensic investigators who understand the challenges of digital forensics. The correlation engine is actively developed with contributions from the forensic community, ensuring it evolves to meet real-world investigation needs.

🚀 Mission: Open-Source Forensic Excellence

Crow Eye aims to be the open-source solution for analyzing and correlating forensic data. Our goal is to provide investigators with engines that deliver the most critical information as quickly as possible, enabling them to:

  • Answer key questions faster: What happened? When? Who was involved?
  • Identify critical evidence: Focus on the artifacts that matter most to your case
  • Correlate across artifacts: See relationships between Prefetch, Registry, Event Logs, and more
  • Scale to any dataset: From small targeted investigations to enterprise-wide incidents
  • Contribute and improve: Join the community in building better forensic engines

Purpose of This Document

This comprehensive documentation serves as the main entry point for developers and contributors who want to understand, modify, or extend the correlation engine system. Whether you're investigating a case or contributing code, this guide will help you leverage the full power of the Correlation Engine.

  • Understand the System: Get a high-level view of how all components work together
  • Navigate the Codebase: Find the right files to modify for specific tasks
  • Visualize Architecture: See diagrams that illustrate system structure and data flow
  • Learn Core Concepts: Master anchors, time windows, feathers, wings, and engines
  • Contribute: Join the community in making forensics faster and more accessible

Who Should Read This

  • Forensic Investigators: Using Crow Eye for case analysis and wanting to understand correlation capabilities
  • Developers: New to the Crow Eye project and wanting to contribute
  • Contributors: Adding new features, engines, or artifact support
  • Maintainers: Debugging issues and optimizing performance
  • Community Members: Anyone passionate about open-source forensic engines

🔄 Active Development: The Correlation Engine is continuously evolving with new features, optimizations, and community contributions. Check the GitHub repository for the latest updates and join us in building the future of open-source forensics.

What is the Correlation Engine?

The Correlation Engine is a forensic analysis system that finds temporal and semantic relationships between different types of forensic artifacts. It helps investigators discover connections between events that occurred on a system by correlating data from multiple sources.

The system implements a dual-engine architecture with two distinct correlation strategies:

1. Identity-Based Correlation Engine

Groups records by identity first, then creates temporal anchors. Optimized for large datasets (> 1,000 records) with O(N log N) performance and streaming support. Production Ready

2. Time-Based Correlation Engine

Utilizes an O(N log N) time-window scanning approach for systematic temporal analysis. Ideal for large datasets (> 1,000 records) requiring performance-critical analysis. Production Ready

Key Capabilities

  • Dual-Engine Architecture: Choose between the Identity-Based engine (groups by identity first) and the Time-Based engine (systematically scans time windows) based on your analysis goals; both offer O(N log N) performance
  • Engine Selection: Automatic or manual engine selection via EngineSelector factory
  • Temporal Correlation: Find events that occurred within a specified time window
  • Identity Tracking: Track applications and files across multiple artifacts (Identity-Based engine)
  • Multi-Artifact Support: Correlate data from Prefetch, ShimCache, AmCache, Event Logs, LNK files, Jumplists, MFT, SRUM, Registry, and more
  • Flexible Rules: Define custom correlation rules (Wings) with configurable parameters
  • Semantic Mapping: Map different column names to common semantic meanings
  • Duplicate Prevention: Automatically detect and prevent duplicate matches
  • Weighted Scoring: Calculate confidence scores based on multiple factors
  • Streaming Mode: Process millions of records with constant memory usage (Identity-Based engine)
  • Pipeline Automation: Execute complete analysis workflows automatically
  • Visual Interface: GUI for building pipelines, viewing results, and exploring timelines

Core Concepts

Understanding these fundamental concepts is essential for working with the Correlation Engine:

🪶 Feather

A data normalization system that accepts various input formats (CSV, JSON, or any forensic engine output) and converts them into a standardized SQLite database. Feathers normalize diverse data sources into a consistent schema, making them ready to serve as input for the correlation engine. Each feather represents a single artifact type (e.g., Prefetch, Registry, Event Logs) with standardized column names and timestamp formats.
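
As a rough sketch of what a feather does, normalization into SQLite might look like the following. The table layout, field names, and `build_feather` function are illustrative assumptions, not the project's actual schema or API.

```python
import sqlite3

# Hypothetical sketch of feather-style normalization: rows from different
# artifacts are mapped onto one standardized schema in SQLite. The table
# layout and field names are illustrative, not the project's real schema.
def build_feather(rows, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(source TEXT, application TEXT, timestamp TEXT)"
    )
    for row in rows:
        # Each source names its fields differently; normalize here.
        app = row.get("ExecutableName") or row.get("ProcessName")
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (row["source"], app, row["timestamp"]),
        )
    conn.commit()
    return conn

conn = build_feather([
    {"source": "Prefetch", "ExecutableName": "CHROME.EXE",
     "timestamp": "2024-01-01T10:00:00"},
    {"source": "EventLog", "ProcessName": "chrome.exe",
     "timestamp": "2024-01-01T10:02:00"},
])
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Once both sources land in the same schema, downstream correlation no longer needs to know where each record came from.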

🪽 Wing

A comprehensive configuration that defines the complete correlation workflow. Wings specify: (1) which feathers to use as input for correlation, (2) the correlation rules and matching criteria, (3) the time window (anchor range) for temporal correlation, (4) semantic mappings to translate different field names to common meanings (e.g., mapping "ExecutableName", "ProcessName", "AppName" to a unified "application" concept), and (5) filters to narrow the dataset. Wings are the control center that orchestrates how the correlation engine processes data.
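
A wing covering the five parts above could be sketched as a plain configuration object. The key names below are invented for illustration and are not the engine's actual schema.

```python
# Hypothetical wing sketch covering the five parts listed above; the key
# names are invented for illustration, not the engine's actual schema.
wing = {
    "feathers": ["prefetch.feather", "eventlog.feather"],  # (1) input feathers
    "rules": {"require_fields": ["application"]},          # (2) matching criteria
    "time_window_minutes": 180,                            # (3) anchor range
    "semantic_mappings": {                                 # (4) field translation
        "application": ["ExecutableName", "ProcessName", "AppName"],
    },
    "filters": {"application": "chrome.exe"},              # (5) narrowing filters
}
```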

⚙️ Engine

The correlation strategy (Time-Based or Identity-Based) used to find relationships between forensic artifacts. Each engine implements a different algorithmic approach optimized for specific use cases and dataset sizes.

⚓ Anchor

A record from one feather that serves as the starting point for finding correlations. In the Time-Based engine, each record with a valid timestamp becomes an anchor. In the Identity-Based engine, anchors are temporal clusters of evidence grouped by identity. Both engines use the same time window concept to find related records.

ā±ļø Time Window

The temporal range used to find related events around an anchor or within a scanning window. Default is 180 minutes, meaning events within this period are considered potentially related. This value can be customized in the Wing configuration to match investigation needs (e.g., 1 minute for precise correlations, 30 minutes for broader analysis).
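
The window test itself is simple; a minimal sketch (the function name is illustrative) might be:

```python
from datetime import datetime, timedelta

# Minimal sketch of the time-window test: two events are potentially
# related if they fall within the window (180 minutes by default).
def within_window(anchor_ts, candidate_ts, window_minutes=180):
    return abs(candidate_ts - anchor_ts) <= timedelta(minutes=window_minutes)

anchor = datetime(2024, 1, 1, 10, 0)
inside = within_window(anchor, datetime(2024, 1, 1, 12, 59))   # 179 min apart
outside = within_window(anchor, datetime(2024, 1, 1, 13, 1))   # 181 min apart
```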

🔗 Semantic Mapping

A translation layer defined in Wings that maps different column names from various artifacts to common semantic meanings. For example, "ExecutableName" (Prefetch), "ProcessName" (Event Logs), and "AppName" (Registry) can all be mapped to "application", enabling the engine to correlate records across different artifact types even when they use different field names.
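
This translation layer can be sketched as a rename pass over each record. The mapping contents mirror the example in the text, but the function is illustrative, not the real API.

```python
# Sketch of a semantic mapping layer; the mapping mirrors the example in
# the text above, while normalize_record itself is an illustrative stand-in.
SEMANTIC_MAP = {
    "ExecutableName": "application",  # Prefetch
    "ProcessName": "application",     # Event Logs
    "AppName": "application",         # Registry
}

def normalize_record(record):
    # Rename known artifact-specific fields to their common meaning.
    return {SEMANTIC_MAP.get(key, key): value for key, value in record.items()}

a = normalize_record({"ExecutableName": "chrome.exe"})
b = normalize_record({"ProcessName": "chrome.exe"})
# a and b now expose the same "application" field and can be compared directly.
```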

🆔 Identity

A normalized representation of an application, file, or entity across artifacts (Identity-Based engine). Identities enable tracking the same entity across different forensic sources, even when represented with different names or formats.

🎯 Match

A set of temporally-related records from different feathers that have been correlated by the engine. Each match includes a confidence score indicating the strength of the correlation based on temporal proximity, field matches, and semantic similarity.
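
A weighted score over those three factors could be sketched as below; the weights and the linear temporal decay are invented for illustration and are not the engine's actual scoring formula.

```python
# Hedged sketch of weighted confidence scoring over the three factors named
# above. The weights and linear decay are invented for illustration only.
def confidence(minutes_apart, field_match_ratio, semantic_similarity,
               window_minutes=180, weights=(0.5, 0.3, 0.2)):
    temporal = max(0.0, 1.0 - minutes_apart / window_minutes)  # closer = stronger
    w_t, w_f, w_s = weights
    return w_t * temporal + w_f * field_match_ratio + w_s * semantic_similarity

score = confidence(minutes_apart=10, field_match_ratio=1.0, semantic_similarity=0.8)
```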

🔄 Pipeline

An automated workflow that creates feathers and executes wings in a coordinated sequence. Pipelines orchestrate the entire correlation process from data ingestion to result generation, enabling repeatable and automated forensic analysis.

💾 Streaming Mode

Memory-efficient processing that writes results directly to database (Identity-Based engine). Streaming mode enables processing of millions of records with constant memory usage, making it suitable for large-scale forensic investigations.

Architecture Diagrams

System Architecture

The Correlation Engine is organized into 7 major subsystems:

⚙️
Correlation Engine System
7 Major Subsystems

🔴 Engine Core

correlation_engine.py
identity_correlation_engine.py
time_based_engine.py
engine_selector.py
weighted_scoring.py

🔵 Feather System

__init__.py
feather_builder.py
database.py
transformer.py

🟢 Wings System

__init__.py
wing_model.py
artifact_detector.py
wing_validator.py

🟠 Configuration

config_manager.py
semantic_mapping.py
session_state.py

🟣 Pipeline

__init__.py
pipeline_executor.py
pipeline_loader.py
feather_auto_registration.py
discovery_service.py

🟡 GUI Components

main_window.py
pipeline_management_tab.py
correlation_results_view.py
timeline_widget.py
settings_dialog.py

🟣 Integration

crow_eye_integration.py
case_initializer.py
auto_feather_generator.py
feather_mappings.py

Color Legend

  • 🔴 Red: Engine Core (correlation logic)
  • 🔵 Blue: Feather System (data normalization)
  • 🟢 Green: Wings System (rule definitions)
  • 🟠 Orange: Configuration (settings management)
  • 🟣 Purple: Pipeline (workflow orchestration)
  • 🟡 Yellow: GUI (user interface)
  • 🟣 Magenta: Integration (Crow-Eye bridge)

Data Flow

This diagram shows how forensic data flows through the system from source to results:

Data Flow Through Correlation Engine

User Configuration

Define correlation rules, select feathers, set time window

Load Pipeline Config

Load feather configs and wing configs

Create Feathers

Transform source data to normalized format

Store Normalized Data

Create SQLite databases with metadata

Execute Wing

Pass wing config and feather paths to engine

Load Feather Data

Query records from each feather database

Correlate by Time or Identity

Based on engine used: Time-Based engine collects anchors and finds temporal matches, Identity-Based engine groups by identity and creates temporal clusters

Return Results

Return matches with confidence scores

Display Results

Format and present correlation matches in GUI

Dependency Graph

This diagram shows how the major directories depend on each other:

config/
wings/
feather/
engine/
pipeline/
gui/
integration/

Dependency Rules

  • config/: No dependencies on other correlation_engine modules (base layer)
  • wings/: Depends only on config/
  • feather/: Depends only on config/
  • engine/: Depends on feather/, wings/, config/
  • pipeline/: Depends on engine/, config/, wings/
  • gui/: Depends on engine/, pipeline/, config/, wings/
  • integration/: Depends on all other modules (top layer)

Dual-Engine Architecture

Why Multiple Engines?

Forensic investigations vary dramatically in scope, data volume, and analytical priorities. A single correlation approach cannot efficiently handle all scenarios. The Crow Eye Correlation Engine implements a modular, multi-engine architecture where each engine is optimized for specific investigation priorities and dataset characteristics.

Design Philosophy

  • Case-Driven Selection: Different investigations require different correlation strategies. A targeted malware analysis needs identity tracking, while a timeline reconstruction needs comprehensive temporal analysis.
  • Performance Optimization: Each engine uses algorithms optimized for its use case. Identity-Based reaches O(N log N) complexity by grouping records by identity, while Time-Based achieves O(N log N) scalability through indexed time-window scanning.
  • Scalability: The architecture allows adding new engines without modifying existing code, following the Open/Closed Principle.
  • Future Extensibility: Additional engines can focus on specific forensic aspects like network correlation, user behavior analysis, or file system relationships.

Current Engines

The system currently implements two engines, with more planned for future releases:

✓ Identity-Based Engine

Status: Production Ready

Priority: Track specific applications, files, or entities across the entire forensic timeline

Best For: Malware analysis, application tracking, large datasets (>1,000 records)

✓ Time-Based Engine

Status: Production Ready

Priority: Comprehensive temporal analysis with detailed field-level matching

Best For: Timeline reconstruction, systematic temporal analysis, large datasets (> 1,000 records)

The Theory: Why Two Separate Engines?

The decision to implement two distinct engines rather than a single unified approach stems from fundamental differences in correlation strategies and their computational trade-offs. Each engine represents a different theoretical approach to the correlation problem:

Identity-Based Engine Theory

Core Principle: "Group first, then correlate within groups"

This engine operates on the principle that forensic artifacts related to the same entity (application, file, user) should be grouped together before temporal analysis. By normalizing identities across different artifact types and creating identity-based clusters, the engine reduces the correlation search space dramatically.

Algorithmic Approach:

  1. Identity Extraction: Normalize all application names, file paths, and hashes across artifacts using semantic mappings
  2. Identity Grouping: Create buckets where each bucket contains all evidence for a single identity (e.g., all "chrome.exe" evidence)
  3. Temporal Clustering: Within each identity group, create temporal anchors representing time-based clusters of activity
  4. Cross-Artifact Linking: Link evidence from different sources (Prefetch, Registry, Event Logs) within the same identity group

Why O(N log N)? By grouping first, the engine only compares records within the same identity group. If you have N records distributed across M identities, you're performing M smaller sorts (O(N/M log N/M) each) instead of one large O(N²) comparison. The sorting and grouping operations dominate, giving O(N log N) complexity.
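
The "group first, then correlate within groups" idea can be sketched in a few lines; the records below are illustrative, and in the real engine the buckets would be built from normalized identities rather than raw names.

```python
from collections import defaultdict

# Sketch of "group first, then correlate within groups": only records that
# share an identity are sorted and compared. The data is illustrative.
records = [
    ("chrome.exe", "Prefetch", 100),
    ("chrome.exe", "EventLog", 105),
    ("winword.exe", "Registry", 200),
    ("chrome.exe", "Registry", 400),
]

groups = defaultdict(list)
for identity, source, ts in records:  # O(N) bucketing by identity
    groups[identity].append((ts, source))

clusters = {}
for identity, events in groups.items():
    events.sort()                     # M smaller sorts instead of one O(N^2) pass
    clusters[identity] = events
```

Each identity's sorted event list is then split into temporal anchors; records in different buckets are never compared, which is what shrinks the search space.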

Memory Efficiency: The streaming mode processes one identity group at a time, writing results directly to the database. This maintains constant memory usage regardless of dataset size.

Trade-off: While highly efficient for identity tracking, this approach may naturally deemphasize or entirely miss direct temporal correlations between disparate artifacts that do not share a common identity.

Time-Based Engine Theory

Core Principle: "Systematic scanning of temporal windows for efficient correlation"

This engine operates on the principle of systematically scanning through predefined time windows across the entire dataset. It leverages optimized database indexing and query techniques to achieve near-linear (O(N log N)) scalability, focusing on all records within a given temporal segment.

Algorithmic Approach:

  1. Automatic Time Range Detection: Dynamically determines the earliest and latest timestamps across all relevant feathers.
  2. Window Generation: Divides the overall time span into fixed-size windows (e.g., 5-minute intervals), typically starting from a predefined epoch (e.g., year 2000).
  3. Indexed Window Querying: For each window, queries all feathers for records falling within that specific time range. This step is highly optimized by pre-built timestamp indexes.
  4. In-Window Correlation: Performs comprehensive field-level matching and semantic rule evaluation on the records retrieved within the current window.
  5. Streaming Results: Correlated matches are processed and written directly to a database, ensuring memory efficiency and scalability.

Why O(N log N)? The engine systematically scans through the entire dataset in discrete time windows. Within each window, it efficiently queries and processes records. The "log N" factor arises primarily from optimized database indexing and sorting operations performed to gather and prepare data within these windows. This approach avoids the exhaustive pairwise comparison of O(N²) by focusing on localized, indexed operations, making it highly efficient for large datasets and achieving near-linear scalability.
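
The window-scanning idea can be sketched with a sorted list standing in for the database's timestamp index; the timestamps and window size below are invented for illustration.

```python
import bisect

# Sketch of systematic window scanning. A sorted list plus bisect stands in
# for the database's timestamp index; timestamps and window size are invented.
timestamps = sorted([10, 12, 47, 51, 52, 300, 304])
window = 60  # fixed window size

matches_per_window = {}
start = 0
while start <= timestamps[-1]:
    lo = bisect.bisect_left(timestamps, start)           # O(log N) index lookups
    hi = bisect.bisect_left(timestamps, start + window)
    if hi > lo:                                          # only non-empty windows
        matches_per_window[start] = timestamps[lo:hi]
    start += window
```

Each window's records would then go through field-level matching; the scan never compares records from distant windows, which is what avoids the O(N²) pairwise pass.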

Memory Usage: Leverages streaming mode where only records within the current time window are loaded into memory, resulting in constant memory usage regardless of the total dataset size.

Trade-off: This approach is optimized for comprehensive temporal coverage and is highly efficient for large datasets. However, it prioritizes systematic time-window analysis over explicit identity grouping, which means identity relationships are established within time windows rather than being the primary grouping mechanism.

Why Not Combine Them?

Fundamental Incompatibility: The two engines represent fundamentally different correlation philosophies that cannot be merged without sacrificing the strengths of each:

  • Primary Approach: Identity-Based first groups data by unique entities, then correlates events within those groups. Time-Based systematically scans the entire timeline in fixed windows, correlating all events within each window.
  • Search Space Management: Identity-Based significantly reduces the search space by pre-filtering based on identity, making it efficient for tracking specific entities. Time-Based explores a localized temporal search space within each window for comprehensive coverage across all artifacts.
  • Memory Model: Both engines leverage streaming for constant memory usage, but their internal data handling strategies differ based on their primary approach.
  • Optimization Focus: Identity-Based optimizes for entity-centric tracking and scalability in large datasets. Time-Based optimizes for a comprehensive temporal overview and efficient processing of all events within a given time frame.
  • Investigative Question: Identity-Based excels at answering "What did this specific application/file do over time?". Time-Based excels at answering "What activity occurred within this time period?".

Design Decision: Rather than creating a compromised hybrid that performs poorly in both scenarios, the system provides specialized engines that excel in their respective domains. The EngineSelector factory automatically chooses the appropriate engine based on dataset size and analysis goals, or allows manual selection for specific investigation needs.
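
The automatic-selection idea behind the EngineSelector factory can be sketched as follows; the function name, thresholds, and goal labels are illustrative assumptions, not the real API.

```python
# Hedged sketch of the automatic-selection idea behind EngineSelector; the
# function, thresholds, and goal labels are illustrative, not the real API.
def select_engine(record_count, goal="auto"):
    if goal == "identity":            # explicit manual selection
        return "identity_based"
    if goal == "timeline":
        return "time_based"
    # auto: very large datasets favor identity grouping
    return "identity_based" if record_count >= 10_000 else "time_based"
```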

Future Enhancements

The correlation system is continuously evolving to become more intelligent and comprehensive. Future development will focus on:

🧠 Enhanced Intelligence

New correlation engines will be added to link more variant data types together, creating a more comprehensive understanding of system activity and relationships between artifacts.

🎯 Behavioral Detection

Advanced engines will analyze patterns to automatically detect system behavior and user activities, identifying anomalies and suspicious patterns without manual configuration.

🔗 Cross-Domain Correlation

Future engines will correlate data across multiple forensic domains (network, file system, registry, memory) to provide holistic insights into complex attack chains and system interactions.

🚀 Modular Architecture: The engine system is designed for extensibility. Each new engine enhances the correlation capabilities without disrupting existing functionality, allowing the system to grow smarter over time while maintaining stability and performance.

Visual Engine Comparison

Understanding the differences between engines helps select the right tool for your investigation:

Identity-Based Engine

O(N log N)
🔍
Extract Identities
Normalize app names, paths, hashes
↓
Group by Identity
Cluster all evidence per entity
↓
Create Temporal Anchors
Build time-based clusters
↓
Stream Results
Write directly to database

Optimized For:

  • Large Datasets: > 1,000 records
  • Identity Tracking: Follow specific apps/files
  • Memory Efficiency: Constant memory usage
  • Production Use: Stable & tested

Key Metrics:

  • Complexity: O(N log N)
  • Memory: O(1) streaming
  • Speed: Fast for large data
✓ Production Ready
VS

Time-Based Engine

O(N log N)
ā±ļø
Systematic Time Scan
Scan fixed time windows
↓
Indexed Window Query
Retrieve records via indexes
↓
In-Window Correlation
Field matching & rules
↓
Stream Results
Write directly to database

Optimized For:

  • Large Datasets: > 1,000 records
  • Systematic Analysis: Comprehensive temporal view
  • Memory Efficiency: Constant memory usage (streaming)
  • Production Use: Stable & tested

Key Metrics:

  • Complexity: O(N log N)
  • Memory: O(1) streaming
  • Speed: Fast for large data
✓ Production Ready

Correlation Methodology

Both engines follow a structured methodology but differ in their approach to organizing and correlating data:

Identity-Based Methodology

  1. Identity Extraction: Normalizes application names, file paths, and hashes across all artifacts using semantic mappings
  2. Identity Grouping: Groups all records by their normalized identity (e.g., all evidence related to "chrome.exe")
  3. Temporal Clustering: Within each identity group, creates temporal anchors representing clusters of related events
  4. Cross-Artifact Correlation: Links evidence from different artifacts (Prefetch, Registry, Event Logs) for the same identity
  5. Streaming Output: Writes results directly to database to maintain constant memory usage
  6. Confidence Scoring: Calculates scores based on identity strength, temporal proximity, and artifact diversity

Key Advantage: Its identity-centric approach ensures highly efficient tracking of specific applications or files across the entire forensic timeline, regardless of their timestamps, without loading all data into memory.

Time-Based Methodology

  1. Time Range Detection: Automatically determines the overall time span of all artifacts.
  2. Window Generation: Divides the entire time range into fixed, overlapping time windows.
  3. Indexed Window Querying: For each window, efficiently queries all relevant records from all feathers using optimized database indexes.
  4. In-Window Correlation: Correlates records within the current time window using comprehensive field-level matching and semantic rules.
  5. Streaming Output: Writes correlated results directly to the database for constant memory usage.
  6. Confidence Scoring: Calculates scores based on temporal proximity, field matches, and semantic similarity within each window.

Key Advantage: Offers a systematic and comprehensive temporal view of all activities within defined time windows, ideal for reconstructing events and analyzing activity patterns across the entire dataset.

Core Components Architecture

The correlation system is built from five core components that work together to enable forensic analysis:

🪶

Feather

Normalized Data Container

Purpose: Stores forensic artifact data in a standardized SQLite format

Contains:

  • Normalized timestamps
  • Semantic field mappings
  • Artifact metadata
  • Source information

Key Files:

  • __init__.py
  • feather_builder.py
  • database.py
  • transformer.py
🪽

Wings

Correlation Rule Definitions

Purpose: Defines which feathers to correlate and how

Contains:

  • Feather specifications
  • Time window settings
  • Filter conditions
  • Match requirements

Key Files:

  • __init__.py
  • wing_model.py
  • artifact_detector.py
  • wing_validator.py
⚙️

Engine

Correlation Processing Logic

Purpose: Executes correlation algorithms to find relationships

Contains:

  • Identity-Based engine (O(N log N))
  • Time-Based engine (O(N log N))
  • Feather loader
  • Scoring algorithms

Key Files:

  • __init__.py
  • correlation_engine.py
  • engine_selector.py
  • identity_correlation_engine.py
  • time_based_engine.py
  • weighted_scoring.py
📋

Config

Configuration Management

Purpose: Manages all system configurations and mappings

Contains:

  • Semantic field mappings
  • Feather configurations
  • Wing configurations
  • Pipeline definitions

Key Files:

  • __init__.py
  • config_manager.py
  • semantic_mapping.py
  • pipeline_config.py
  • integrated_configuration_manager.py
  • case_specific_configuration_manager.py
🔄

Pipeline

Workflow Orchestration

Purpose: Automates complete analysis workflows

Contains:

  • Feather creation steps
  • Wing execution sequence
  • Result aggregation
  • Report generation

Key Files:

  • __init__.py
  • pipeline_executor.py
  • pipeline_loader.py
  • feather_auto_registration.py
  • discovery_service.py
  • database_connection_manager.py

Component Interaction Flow

Here's how the five core components work together in a typical correlation workflow:

Complete Correlation Workflow

⚙️

Config Loads Settings

Configuration manager loads semantic mappings, feather configs, and wing definitions from JSON files

🪶

Feather Normalizes Data

Feather Builder transforms raw forensic artifacts into normalized SQLite databases with standardized schemas

🪽

Wings Define Rules

Wing configuration specifies which feathers to correlate, time windows, filters, and match requirements

🔄

Pipeline Orchestrates

Pipeline executor coordinates feather creation and wing execution in the correct sequence

⚙️

Engine Correlates

Correlation engine (Identity-Based or Time-Based) processes feathers according to wing rules and finds relationships

📊

Results Generated

Correlation matches with confidence scores are stored in database and displayed in GUI timeline

Component Interaction Summary

  • Config → Feather: Provides semantic mappings and transformation rules
  • Config → Wings: Supplies validation rules and artifact type definitions
  • Feather → Engine: Supplies normalized data for correlation processing
  • Wings → Engine: Defines correlation parameters and match requirements
  • Pipeline → All: Orchestrates the entire workflow from data ingestion to results
  • Engine → Results: Produces correlation matches with confidence scores
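
The interactions above can be tied together in a minimal end-to-end sketch. Every function here is a stand-in for illustration, not the project's real API.

```python
# Minimal end-to-end sketch of the component interactions above. Every
# function here is an illustrative stand-in, not the project's real API.
def normalize(source, mappings):
    # Feather step: rename artifact-specific fields to common meanings.
    return [{mappings.get(k, k): v for k, v in row.items()} for row in source]

def correlate(feathers, wing):
    # Engine step (trivial stand-in): count records sharing the match field.
    field = wing["match_field"]
    values = [r[field] for feather in feathers for r in feather if field in r]
    return {v: values.count(v) for v in set(values)}

def run_pipeline(config):
    mappings = config["semantic_mappings"]      # Config supplies mappings
    feathers = [normalize(src, mappings)        # Feather normalizes each source
                for src in config["sources"]]
    return correlate(feathers, config["wing"])  # Engine applies wing rules

result = run_pipeline({
    "semantic_mappings": {"ExecutableName": "application"},
    "sources": [[{"ExecutableName": "chrome.exe"}],
                [{"application": "chrome.exe"}]],
    "wing": {"match_field": "application"},
})
```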

Directory Structure

The correlation_engine is organized into 7 main directories, each with a specific responsibility:

engine/ - Core Correlation Engine

Purpose: Contains the core correlation logic, feather loading, scoring, and result management.

Key Files: 35 Python files

  • correlation_engine.py - Main correlation engine
  • identity_correlation_engine.py - Identity-based correlation
  • time_based_engine.py - Time-based correlation
  • engine_selector.py - Engine factory and selection
  • weighted_scoring.py - Confidence score calculation

feather/ - Data Normalization

Purpose: Handles importing forensic artifact data from various sources and normalizing it into the feather format.

Key Files: 4 Python files + UI subdirectory

  • __init__.py
  • feather_builder.py - Main application entry point
  • database.py - Database operations
  • transformer.py - Data transformation pipeline
  • ui/ - GUI components for Feather Builder

wings/ - Correlation Rules

Purpose: Defines the data models and validation logic for Wing configurations (correlation rules).

Key Files: 4 Python files in core/ + UI subdirectory

  • core/__init__.py
  • core/wing_model.py - Wing, FeatherSpec, CorrelationRules data models
  • core/artifact_detector.py - Detect artifact types
  • core/wing_validator.py - Validate wing configurations
  • ui/ - GUI components for Wings Creator

config/ - Configuration Management

Purpose: Manages all configuration files (feathers, wings, pipelines) and semantic mappings.

Key Files: 20 Python files

  • config_manager.py - Central configuration management
  • semantic_mapping.py - Semantic field mappings
  • pipeline_config.py - Pipeline configuration model
  • integrated_configuration_manager.py - Integrated configuration
  • case_specific_configuration_manager.py - Case-specific configuration

pipeline/ - Workflow Orchestration

Purpose: Orchestrates complete analysis workflows, including feather creation, wing execution, and report generation.

Key Files: 7 Python files

  • pipeline_executor.py - Main pipeline execution
  • pipeline_loader.py - Load pipeline configurations
  • feather_auto_registration.py - Auto-register feathers
  • discovery_service.py - Discover available configs
  • database_connection_manager.py - Manage pipeline database connections

gui/ - User Interface

Purpose: Provides all GUI components for the correlation engine, including pipeline management, results visualization, and configuration editing.

Key Files: 35 Python files

  • main_window.py - Main application window
  • pipeline_management_tab.py - Pipeline creation/management
  • correlation_results_view.py - Correlated results display
  • timeline_widget.py - Timeline visualization
  • settings_dialog.py - User settings and configurations

integration/ - Crow-Eye Bridge

Purpose: Integrates the correlation engine with the main Crow-Eye application, providing auto-generation features and default configurations.

Key Files: 21 Python files + default_wings/ subdirectory

  • crow_eye_integration.py - Main integration bridge
  • case_initializer.py - Initialize correlation engine for a case
  • auto_feather_generator.py - Auto-generate feathers from Crow-Eye data
  • feather_mappings.py - Define mappings for feather integration
  • correlation_integration.py - Core correlation integration logic

Identity-Based Correlation Engine

The Identity-Based Engine groups records by identity first, then creates temporal anchors for efficient correlation. ✓ Production Ready

How It Works

  1. Extract Identities: Normalizes application names, file paths, and other identifiers
  2. Group by Identity: Organizes records by their normalized identity
  3. Create Temporal Anchors: Builds temporal clusters of evidence for each identity
  4. Stream Results: Writes correlation results directly to database for memory efficiency

Performance Characteristics

  • Complexity: O(N log N) - Efficient grouping and sorting
  • Memory Usage: Constant with streaming mode enabled
  • Best For: Large datasets (> 1,000 records)
  • Strengths: Identity tracking across artifacts, scalable performance, streaming support

Use Case Scenarios

  • Production forensic investigations with large datasets
  • Tracking specific applications or files across multiple artifacts
  • Enterprise-scale analysis requiring memory efficiency
  • Automated correlation pipelines processing millions of records

Time-Based Correlation Engine

The Time-Based Engine implements an efficient O(N log N) time-window scanning approach for systematic temporal analysis. It is designed for large datasets and performance-critical environments. ✓ Production Ready

How It Works

  1. Systematic Time Scan: Scans through forensic data systematically in fixed time intervals (windows).
  2. Indexed Queries: Utilizes optimized database indexes for fast retrieval of records within each time window.
  3. Field-Level Matching: Performs comprehensive field-level comparisons for records found in the current window.
  4. Streaming Results: Processes and writes correlated matches directly to the database for memory efficiency.

Performance Characteristics

  • Complexity: O(N log N) - Achieves efficient scalability through time-window scanning and optimized indexing.
  • Memory Usage: Low (streaming mode) - Processes data in chunks, maintaining constant memory usage.
  • Best For: Large datasets (> 1,000 records) - Efficiently handles millions of records.
  • Strengths: Systematic temporal analysis, high performance, memory-efficient.

Use Case Scenarios

  • Large-scale forensic investigations with extensive datasets.
  • Systematic temporal analysis across an entire timeline.
  • Performance-critical environments where speed and memory efficiency are paramount.
  • Automated correlation pipelines.

Engine Selection Guide

Choose the appropriate correlation engine based on your specific needs:

Choose Identity-Based Engine When:

  • ✅ Dataset has > 1,000 records
  • ✅ Performance is critical
  • ✅ You need identity tracking across artifacts
  • ✅ You want to filter by specific applications
  • ✅ Memory constraints require streaming mode
  • ✅ Production-ready solution needed

Choose Time-Based Engine When: ✓ Production Ready

  • ✅ Dataset has > 1,000 records
  • ✅ Systematic temporal analysis across large datasets is needed
  • ✅ Performance and memory efficiency are paramount
  • ✅ Automated correlation pipelines are in use
  • ✅ You require an O(N log N) scalable solution

Decision Criteria Table

Factor        | Time-Based                       | Identity-Based
--------------|----------------------------------|--------------------------------------
Dataset Size  | Large datasets (> 1,000 records) | Very large datasets (10,000+ records)
Complexity    | O(N log N)                       | O(N log N)
Memory Usage  | Low (streaming mode)             | Low (streaming mode)
Analysis Type | Systematic temporal analysis     | Identity tracking
Use Case      | Production, automation           | Production, automation