Introduction

Built by Investigators, for Investigators

The Crow Eye Correlation Engine is designed to revolutionize the forensic investigation process by making it faster, more accurate, and more focused. Born from the forensic community's need for efficient artifact correlation, this open-source solution enables investigators to quickly identify the most critical information and answer the most pressing questions in their cases.

⚔ Speed & Efficiency

The dual-engine architecture is optimized for performance. The Identity-Based engine processes large datasets with O(N log N) complexity and a streaming mode, enabling analysis of millions of records without memory constraints, so investigators can get answers quickly when time is critical.

🎯 Accuracy & Focus

By automatically correlating artifacts across multiple sources (Prefetch, Registry, Event Logs, MFT, SRUM, and more), the engine helps investigators focus on the details that matter most. Instead of manually searching through thousands of records, investigators can immediately see the relationships and patterns relevant to their case.

šŸ¤ Community-Driven

From the community, to the community. Crow Eye is built as an open-source solution by forensic investigators who understand the challenges of digital forensics. The correlation engine is actively developed with contributions from the forensic community, ensuring it evolves to meet real-world investigation needs.

🚀 Mission: Open-Source Forensic Excellence

Crow Eye aims to be the go-to open-source solution for analyzing and correlating forensic data. Our goal is to provide investigators with tools that deliver the most critical information as quickly as possible, enabling them to:

  • Answer key questions faster: What happened? When? Who was involved?
  • Identify critical evidence: Focus on the artifacts that matter most to your case
  • Correlate across artifacts: See relationships between Prefetch, Registry, Event Logs, and more
  • Scale to any dataset: From small targeted investigations to enterprise-wide incidents
  • Contribute and improve: Join the community in building better forensic tools

Purpose of This Document

This comprehensive documentation serves as the main entry point for developers and contributors who want to understand, modify, or extend the correlation engine system. Whether you're investigating a case or contributing code, this guide will help you leverage the full power of the Correlation Engine.

  • Understand the System: Get a high-level view of how all components work together
  • Navigate the Codebase: Find the right files to modify for specific tasks
  • Visualize Architecture: See diagrams that illustrate system structure and data flow
  • Learn Core Concepts: Master anchors, time windows, feathers, wings, and engines
  • Contribute: Join the community in making forensics faster and more accessible

Who Should Read This

  • Forensic Investigators: Using Crow Eye for case analysis and wanting to understand correlation capabilities
  • Developers: New to the Crow Eye project and wanting to contribute
  • Contributors: Adding new features, engines, or artifact support
  • Maintainers: Debugging issues and optimizing performance
  • Community Members: Anyone passionate about open-source forensic tools

🔄 Active Development: The Correlation Engine is continuously evolving with new features, optimizations, and community contributions. Check the GitHub repository for the latest updates and join us in building the future of open-source forensics.

What is the Correlation Engine?

The Correlation Engine is a forensic analysis system that finds temporal and semantic relationships between different types of forensic artifacts. It helps investigators discover connections between events that occurred on a system by correlating data from multiple sources.

The system implements a dual-engine architecture with two distinct correlation strategies:

1. Identity-Based Correlation Engine

Groups records by identity first, then creates temporal anchors. Optimized for large datasets (> 1,000 records) with O(N log N) performance and streaming support. Production Ready

2. Time-Based Correlation Engine

Uses temporal proximity as the primary factor with comprehensive field matching. Ideal for small datasets (< 1,000 records) requiring detailed analysis. ⚠️ Under Development

Key Capabilities

  • Dual-Engine Architecture: Choose between Time-Based (O(N²)) and Identity-Based (O(N log N)) engines based on dataset size and analysis goals
  • Engine Selection: Automatic or manual engine selection via EngineSelector factory
  • Temporal Correlation: Find events that occurred within a specified time window
  • Identity Tracking: Track applications and files across multiple artifacts (Identity-Based engine)
  • Multi-Artifact Support: Correlate data from Prefetch, ShimCache, AmCache, Event Logs, LNK files, Jumplists, MFT, SRUM, Registry, and more
  • Flexible Rules: Define custom correlation rules (Wings) with configurable parameters
  • Semantic Mapping: Map different column names to common semantic meanings
  • Duplicate Prevention: Automatically detect and prevent duplicate matches
  • Weighted Scoring: Calculate confidence scores based on multiple factors
  • Streaming Mode: Process millions of records with constant memory usage (Identity-Based engine)
  • Pipeline Automation: Execute complete analysis workflows automatically
  • Visual Interface: GUI for building pipelines, viewing results, and exploring timelines

Core Concepts

Understanding these fundamental concepts is essential for working with the Correlation Engine:

🪶 Feather

A data normalization system that accepts various input formats (CSV, JSON, or any forensic tool output) and converts them into a standardized SQLite database. Feathers normalize diverse data sources into a consistent schema, making them ready to serve as input for the correlation engine. Each feather represents a single artifact type (e.g., Prefetch, Registry, Event Logs) with standardized column names and timestamp formats.
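
As a sketch of the normalization idea (the table name, columns, and API below are illustrative, not the project's actual schema — see feather_builder.py and database.py for the real implementation), tool-specific columns are mapped into a standardized SQLite table:

```python
import sqlite3

# Hypothetical feather construction: normalize tool-specific rows into a
# standard schema so every artifact type exposes the same columns.
def build_feather(rows, column_map, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE records (timestamp TEXT, application TEXT, path TEXT)"
    )
    for row in rows:
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (
                row.get(column_map["timestamp"]),
                row.get(column_map["application"]),
                row.get(column_map["path"]),
            ),
        )
    conn.commit()
    return conn

# Prefetch-style input with the tool's own column names
prefetch_rows = [
    {"LastRunTime": "2024-01-15T10:30:00", "ExecutableName": "CHROME.EXE",
     "FullPath": r"C:\Program Files\Google\Chrome\chrome.exe"},
]
conn = build_feather(
    prefetch_rows,
    {"timestamp": "LastRunTime", "application": "ExecutableName", "path": "FullPath"},
)
print(conn.execute("SELECT application FROM records").fetchone()[0])  # CHROME.EXE
```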

🪽 Wing

A comprehensive configuration that defines the complete correlation workflow. Wings specify: (1) which feathers to use as input for correlation, (2) the correlation rules and matching criteria, (3) the time window (anchor range) for temporal correlation, (4) semantic mappings to translate different field names to common meanings (e.g., mapping "ExecutableName", "ProcessName", "AppName" to a unified "application" concept), and (5) filters to narrow the dataset. Wings are the control center that orchestrates how the correlation engine processes data.
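
As an illustration, a wing covering all five elements might look like the following Python dict; the key names here are hypothetical, so consult wing_model.py for the actual schema:

```python
# Illustrative wing definition covering the five elements described above.
wing = {
    "name": "execution-evidence",
    "feathers": ["prefetch", "registry", "eventlogs"],   # (1) input feathers
    "rules": {"require_fields": ["application"]},        # (2) matching criteria
    "time_window_seconds": 300,                          # (3) anchor range
    "semantic_mappings": {                               # (4) field translation
        "application": ["ExecutableName", "ProcessName", "AppName"],
    },
    "filters": {"application_contains": "chrome"},       # (5) dataset narrowing
}
```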

āš™ļø Engine

The correlation strategy (Time-Based or Identity-Based) used to find relationships between forensic artifacts. Each engine implements a different algorithmic approach optimized for specific use cases and dataset sizes.

⚓ Anchor

A record from one feather that serves as the starting point for finding correlations. In Time-Based engine, each record with a valid timestamp becomes an anchor. In Identity-Based engine, anchors are temporal clusters of evidence grouped by identity. Both engines use the same time window concept to find related records.

ā±ļø Time Window

The temporal range used to find related events around an anchor. The default is 5 minutes total (±2.5 minutes from the anchor timestamp), meaning events within 2.5 minutes before or after the anchor are considered potentially related. This value can be customized in the Wing configuration to match investigation needs (e.g., 1 minute for precise correlations, 30 minutes for broader analysis).
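
The window arithmetic can be sketched as follows (a minimal illustration; the function name is ours, not the engine's):

```python
from datetime import datetime, timedelta

# A 5-minute (300-second) window means ±2.5 minutes around the anchor.
def window_bounds(anchor_ts: datetime, window_seconds: int = 300):
    half = timedelta(seconds=window_seconds / 2)
    return anchor_ts - half, anchor_ts + half

start, end = window_bounds(datetime(2024, 1, 15, 10, 30, 0))
print(start, end)  # 2024-01-15 10:27:30  2024-01-15 10:32:30
```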

🔗 Semantic Mapping

A translation layer defined in Wings that maps different column names from various artifacts to common semantic meanings. For example, "ExecutableName" (Prefetch), "ProcessName" (Event Logs), and "AppName" (Registry) can all be mapped to "application", enabling the engine to correlate records across different artifact types even when they use different field names.
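
A minimal sketch of such a mapping, using the column names from the example above (the project's real mappings live in semantic_mapping.py and may be structured differently):

```python
# Map artifact-specific column names to a shared semantic concept so records
# can be compared across feathers.
SEMANTIC_MAP = {
    "ExecutableName": "application",  # Prefetch
    "ProcessName": "application",     # Event Logs
    "AppName": "application",         # Registry
}

def semantic_value(record: dict, concept: str):
    """Return the first field in `record` whose column maps to `concept`."""
    for column, meaning in SEMANTIC_MAP.items():
        if meaning == concept and column in record:
            return record[column]
    return None

print(semantic_value({"ProcessName": "chrome.exe"}, "application"))  # chrome.exe
```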

🆔 Identity

A normalized representation of an application, file, or entity across artifacts (Identity-Based engine). Identities enable tracking the same entity across different forensic sources, even when represented with different names or formats.
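
A simplified illustration of the normalization idea (the engine's actual rules are more involved; this only strips the path and case-folds the name):

```python
import ntpath

# Different representations of the same executable collapse to one identity key.
def normalize_identity(value: str) -> str:
    # Strip any Windows path component, then lowercase the file name.
    return ntpath.basename(value).lower()

assert normalize_identity(r"C:\Program Files\Google\Chrome\CHROME.EXE") == \
       normalize_identity("chrome.exe")
```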

🎯 Match

A set of temporally-related records from different feathers that have been correlated by the engine. Each match includes a confidence score indicating the strength of the correlation based on temporal proximity, field matches, and semantic similarity.
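
As a hedged illustration of how such a score might combine factors (the actual formula and weights live in weighted_scoring.py and may differ):

```python
# Toy confidence score: average of temporal proximity (closer in time scores
# higher) and the fraction of compared fields that matched.
def confidence(seconds_apart: float, window_seconds: float,
               fields_matched: int, fields_compared: int) -> float:
    temporal = max(0.0, 1.0 - seconds_apart / window_seconds)
    fields = fields_matched / fields_compared if fields_compared else 0.0
    return round(0.5 * temporal + 0.5 * fields, 3)

print(confidence(30, 300, 2, 3))  # 0.783
```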

🔄 Pipeline

An automated workflow that creates feathers and executes wings in a coordinated sequence. Pipelines orchestrate the entire correlation process from data ingestion to result generation, enabling repeatable and automated forensic analysis.

💾 Streaming Mode

Memory-efficient processing that writes results directly to database (Identity-Based engine). Streaming mode enables processing of millions of records with constant memory usage, making it suitable for large-scale forensic investigations.
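
A minimal sketch of the streaming idea, using an illustrative schema rather than the engine's real output format: each identity group's results are flushed to SQLite before the next group is processed, so memory stays flat regardless of total match count.

```python
import sqlite3

def stream_matches(identity_groups, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE matches (identity TEXT, count INTEGER)")
    for identity, matches in identity_groups:  # one group in memory at a time
        conn.execute("INSERT INTO matches VALUES (?, ?)", (identity, len(matches)))
        conn.commit()                          # flush before the next group
    return conn

conn = stream_matches([("chrome.exe", [1, 2, 3]), ("notepad.exe", [4])])
print(conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0])  # 2
```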

Architecture Diagrams

System Architecture

The Correlation Engine is organized into 7 major subsystems:

⚙️
Correlation Engine System
7 Major Subsystems

🔴 Engine Core

correlation_engine.py
feather_loader.py
weighted_scoring.py
engine_selector.py

🔵 Feather System

feather_builder.py
database.py
transformer.py

🟢 Wings System

wing_model.py
artifact_detector.py
wing_validator.py

🟠 Configuration

config_manager.py
semantic_mapping.py
session_state.py

🟣 Pipeline

pipeline_executor.py
pipeline_loader.py
discovery_service.py

🟡 GUI Components

main_window.py
timeline_widget.py
results_view.py

🟣 Integration

crow_eye_integration.py
case_initializer.py
auto_feather_generator.py

Color Legend

  • 🔴 Red: Engine Core (correlation logic)
  • 🔵 Blue: Feather System (data normalization)
  • 🟢 Green: Wings System (rule definitions)
  • 🟠 Orange: Configuration (settings management)
  • 🟣 Purple: Pipeline (workflow orchestration)
  • 🟡 Yellow: GUI (user interface)
  • 🟣 Magenta: Integration (Crow-Eye bridge)

Data Flow

This diagram shows how forensic data flows through the system from source to results:

Data Flow Through Correlation Engine

User Configuration

Define correlation rules, select feathers, set time window

Load Pipeline Config

Load feather configs and wing configs

Create Feathers

Transform source data to normalized format

Store Normalized Data

Create SQLite databases with metadata

Execute Wing

Pass wing config and feather paths to engine

Load Feather Data

Query records from each feather database

Correlate by Time or Identity

Depending on the selected engine: the Time-Based engine collects anchors and finds temporal matches, while the Identity-Based engine groups records by identity and creates temporal clusters

Return Results

Return matches with confidence scores

Display Results

Format and present correlation matches in GUI

Dependency Graph

This diagram shows how the major directories depend on each other:

config/
wings/
feather/
engine/
pipeline/
gui/
integration/

Dependency Rules

  • config/: No dependencies on other correlation_engine modules (base layer)
  • wings/: Depends only on config/
  • feather/: Depends only on config/
  • engine/: Depends on feather/, wings/, config/
  • pipeline/: Depends on engine/, config/, wings/
  • gui/: Depends on engine/, pipeline/, config/, wings/
  • integration/: Depends on all other modules (top layer)

Dual-Engine Architecture

Why Multiple Engines?

Forensic investigations vary dramatically in scope, data volume, and analytical priorities. A single correlation approach cannot efficiently handle all scenarios. The Crow Eye Correlation Engine implements a modular, multi-engine architecture where each engine is optimized for specific investigation priorities and dataset characteristics.

Design Philosophy

  • Case-Driven Selection: Different investigations require different correlation strategies. A targeted malware analysis needs identity tracking, while a timeline reconstruction needs comprehensive temporal analysis.
  • Performance Optimization: Each engine uses algorithms optimized for its use case: Identity-Based runs in O(N log N) for large datasets, while Time-Based provides detailed O(N²) analysis for smaller scopes.
  • Scalability: The architecture allows adding new engines without modifying existing code, following the Open/Closed Principle.
  • Future Extensibility: Additional engines can focus on specific forensic aspects like network correlation, user behavior analysis, or file system relationships.

Current Engines

The system currently implements two engines, with more planned for future releases:

✓ Identity-Based Engine

Status: Production Ready

Priority: Track specific applications, files, or entities across the entire forensic timeline

Best For: Malware analysis, application tracking, large datasets (>1,000 records)

āš ļø Time-Based Engine

Status: Under Development

Priority: Comprehensive temporal analysis with detailed field-level matching

Best For: Timeline reconstruction, detailed investigations, smaller datasets (<1,000 records)

The Theory: Why Two Separate Engines?

The decision to implement two distinct engines rather than a single unified approach stems from fundamental differences in correlation strategies and their computational trade-offs. Each engine represents a different theoretical approach to the correlation problem:

Identity-Based Engine Theory

Core Principle: "Group first, then correlate within groups"

This engine operates on the principle that forensic artifacts related to the same entity (application, file, user) should be grouped together before temporal analysis. By normalizing identities across different artifact types and creating identity-based clusters, the engine reduces the correlation search space dramatically.

Algorithmic Approach:

  1. Identity Extraction: Normalize all application names, file paths, and hashes across artifacts using semantic mappings
  2. Identity Grouping: Create buckets where each bucket contains all evidence for a single identity (e.g., all "chrome.exe" evidence)
  3. Temporal Clustering: Within each identity group, create temporal anchors representing time-based clusters of activity
  4. Cross-Artifact Linking: Link evidence from different sources (Prefetch, Registry, Event Logs) within the same identity group

Why O(N log N)? By grouping first, the engine only compares records within the same identity group. If N records are distributed across M identities, the engine performs M smaller sorts (each O((N/M) log(N/M))) instead of one large O(N²) comparison. The initial sorting and grouping dominate, giving O(N log N) overall complexity.
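
The group-then-cluster pattern can be sketched as follows, with made-up records and a 300-second cluster gap (the real clustering logic lives in identity_correlation_engine.py):

```python
from itertools import groupby

# (identity, timestamp_seconds, source) tuples standing in for real records.
records = [
    ("chrome.exe", 100, "Prefetch"),
    ("chrome.exe", 130, "EventLog"),
    ("notepad.exe", 500, "Prefetch"),
    ("chrome.exe", 9000, "Registry"),
]
records.sort(key=lambda r: (r[0], r[1]))  # the O(N log N) step dominates

clusters = []
for identity, group in groupby(records, key=lambda r: r[0]):
    group = list(group)
    # Within one identity, a gap of more than 300 s starts a new temporal cluster.
    cluster = [group[0]]
    for rec in group[1:]:
        if rec[1] - cluster[-1][1] > 300:
            clusters.append((identity, cluster))
            cluster = []
        cluster.append(rec)
    clusters.append((identity, cluster))

print(len(clusters))  # 3: two chrome.exe clusters, one notepad.exe cluster
```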

Memory Efficiency: The streaming mode processes one identity group at a time, writing results directly to the database. This maintains constant memory usage regardless of dataset size.

Trade-off: This approach is optimized for tracking specific entities but may miss correlations between unrelated artifacts that happen to occur at the same time.

Time-Based Engine Theory

Core Principle: "Correlate everything within temporal windows"

This engine operates on the principle that temporal proximity is the primary indicator of correlation. It performs comprehensive field-level matching for all records that fall within a specified time window, regardless of identity.

Algorithmic Approach:

  1. Anchor Collection: Gather all records with valid timestamps from all feathers (data sources)
  2. Chronological Sorting: Sort all anchors by timestamp to enable efficient window-based searching
  3. Window-Based Matching: For each anchor, search all other feathers for records within the time window
  4. Comprehensive Field Matching: Perform detailed semantic field comparisons (paths, hashes, names, users)
  5. Weighted Scoring: Calculate confidence based on temporal proximity, field matches, and semantic similarity
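
Assuming each feather's timestamps are pre-sorted, the window search in step 3 can be sketched with bisect (field matching and scoring are omitted here for brevity):

```python
import bisect

# For one anchor, find every timestamp in another feather that falls inside
# the ±window/2 range. Even with this per-anchor speedup, the worst case is
# still O(N²) because every anchor may match every record.
def window_matches(anchor_ts, other_ts_sorted, window=300):
    half = window / 2
    lo = bisect.bisect_left(other_ts_sorted, anchor_ts - half)
    hi = bisect.bisect_right(other_ts_sorted, anchor_ts + half)
    return other_ts_sorted[lo:hi]

other = [100, 140, 200, 900, 1000]
print(window_matches(150, other))  # [100, 140, 200] — all within ±150 s of 150
```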

Why O(N²)? For each of N anchors, the engine must search through potentially all other records to find temporal matches. Even with optimizations like sorted timestamps and time window limits, the worst-case complexity remains O(N²) because every anchor could potentially match with every other record.

Memory Usage: All anchors must be loaded into memory for sorting and searching, resulting in O(N) memory usage.

Trade-off: This approach finds all temporal relationships and provides detailed field-level analysis, but becomes computationally expensive with large datasets.

Why Not Combine Them?

Fundamental Incompatibility: The two engines represent fundamentally different correlation philosophies that cannot be merged without sacrificing the strengths of each:

  • Search Space: Identity-Based reduces search space through grouping (efficient for large data), while Time-Based explores the full search space (comprehensive for small data)
  • Memory Model: Identity-Based uses streaming (constant memory), while Time-Based requires in-memory sorting (O(N) memory)
  • Optimization Target: Identity-Based optimizes for identity tracking and scalability, while Time-Based optimizes for completeness and field-level detail
  • Use Case Focus: Identity-Based excels at "follow this application across time," while Time-Based excels at "what happened during this time period"

Design Decision: Rather than creating a compromised hybrid that performs poorly in both scenarios, the system provides specialized engines that excel in their respective domains. The EngineSelector factory automatically chooses the appropriate engine based on dataset size and analysis goals, or allows manual selection for specific investigation needs.

Future Enhancements

The correlation system is continuously evolving to become more intelligent and comprehensive. Future development will focus on:

🧠 Enhanced Intelligence

New correlation engines will be added to link more variant data types together, creating a more comprehensive understanding of system activity and relationships between artifacts.

🎯 Behavioral Detection

Advanced engines will analyze patterns to automatically detect system behavior and user activities, identifying anomalies and suspicious patterns without manual configuration.

🔗 Cross-Domain Correlation

Future engines will correlate data across multiple forensic domains (network, file system, registry, memory) to provide holistic insights into complex attack chains and system interactions.

🚀 Modular Architecture: The engine system is designed for extensibility. Each new engine enhances the correlation capabilities without disrupting existing functionality, allowing the system to grow smarter over time while maintaining stability and performance.

Visual Engine Comparison

Understanding the differences between engines helps select the right tool for your investigation:

Identity-Based Engine

O(N log N)
🔍
Extract Identities
Normalize app names, paths, hashes
↓
Group by Identity
Cluster all evidence per entity
↓
Create Temporal Anchors
Build time-based clusters
↓
Stream Results
Write directly to database

Optimized For:

  • Large Datasets: > 1,000 records
  • Identity Tracking: Follow specific apps/files
  • Memory Efficiency: Constant memory usage
  • Production Use: Stable & tested

Key Metrics:

  • Complexity: O(N log N)
  • Memory: O(1) streaming
  • Speed: Fast for large data
✓ Production Ready
VS

Time-Based Engine

O(N²)
ā±ļø
Collect All Anchors
Gather all timestamped records
↓
Sort by Timestamp
Chronological ordering
↓
Find Temporal Matches
Window-based searching
↓
Calculate Scores
Weighted field matching

Optimized For:

  • Small Datasets: < 1,000 records
  • Detailed Analysis: Field-level matching
  • Comprehensive: All temporal relationships
  • Research: Debugging & validation

Key Metrics:

  • Complexity: O(N²)
  • Memory: O(N) in-memory
  • Speed: Slower for large data
āš ļø Under Development

Correlation Methodology

Both engines follow a structured methodology but differ in their approach to organizing and correlating data:

Identity-Based Methodology

  1. Identity Extraction: Normalizes application names, file paths, and hashes across all artifacts using semantic mappings
  2. Identity Grouping: Groups all records by their normalized identity (e.g., all evidence related to "chrome.exe")
  3. Temporal Clustering: Within each identity group, creates temporal anchors representing clusters of related events
  4. Cross-Artifact Correlation: Links evidence from different artifacts (Prefetch, Registry, Event Logs) for the same identity
  5. Streaming Output: Writes results directly to database to maintain constant memory usage
  6. Confidence Scoring: Calculates scores based on identity strength, temporal proximity, and artifact diversity

Key Advantage: Efficiently tracks specific applications or files across the entire forensic timeline without loading all data into memory.

Time-Based Methodology

  1. Anchor Collection: Gathers all records with valid timestamps from all feathers
  2. Chronological Sorting: Orders all anchors by timestamp to enable temporal window searches
  3. Window-Based Matching: For each anchor, searches all other feathers for records within the time window
  4. Field-Level Comparison: Performs comprehensive semantic field matching (paths, hashes, names)
  5. Match Combination: Generates all valid combinations of correlated records
  6. Weighted Scoring: Calculates confidence based on temporal proximity, field matches, and semantic similarity

Key Advantage: Provides comprehensive field-level analysis and detailed semantic matching for in-depth investigations.

Core Components Architecture

The correlation system is built from five core components that work together to enable forensic analysis:

🪶

Feather

Normalized Data Container

Purpose: Stores forensic artifact data in a standardized SQLite format

Contains:

  • Normalized timestamps
  • Semantic field mappings
  • Artifact metadata
  • Source information

Key Files:

  • feather_builder.py
  • database.py
  • transformer.py
🪽

Wings

Correlation Rule Definitions

Purpose: Defines which feathers to correlate and how

Contains:

  • Feather specifications
  • Time window settings
  • Filter conditions
  • Match requirements

Key Files:

  • wing_model.py
  • artifact_detector.py
  • wing_validator.py
⚙️

Engine

Correlation Processing Logic

Purpose: Executes correlation algorithms to find relationships

Contains:

  • Identity-Based engine (O(N log N))
  • Time-Based engine (O(N²))
  • Feather loader
  • Scoring algorithms

Key Files:

  • engine_selector.py
  • identity_correlation_engine.py
  • time_based_engine.py
  • weighted_scoring.py
📋

Config

Configuration Management

Purpose: Manages all system configurations and mappings

Contains:

  • Semantic field mappings
  • Feather configurations
  • Wing configurations
  • Pipeline definitions

Key Files:

  • config_manager.py
  • semantic_mapping.py
  • pipeline_config.py
🔄

Pipeline

Workflow Orchestration

Purpose: Automates complete analysis workflows

Contains:

  • Feather creation steps
  • Wing execution sequence
  • Result aggregation
  • Report generation

Key Files:

  • pipeline_executor.py
  • pipeline_loader.py
  • discovery_service.py

Component Interaction Flow

Here's how the five core components work together in a typical correlation workflow:

Complete Correlation Workflow

⚙️

Config Loads Settings

Configuration manager loads semantic mappings, feather configs, and wing definitions from JSON files

🪶

Feather Normalizes Data

Feather Builder transforms raw forensic artifacts into normalized SQLite databases with standardized schemas

🪽

Wings Define Rules

Wing configuration specifies which feathers to correlate, time windows, filters, and match requirements

🔄

Pipeline Orchestrates

Pipeline executor coordinates feather creation and wing execution in the correct sequence

⚙️

Engine Correlates

Correlation engine (Identity-Based or Time-Based) processes feathers according to wing rules and finds relationships

📊

Results Generated

Correlation matches with confidence scores are stored in database and displayed in GUI timeline

Component Interaction Summary

  • Config → Feather: Provides semantic mappings and transformation rules
  • Config → Wings: Supplies validation rules and artifact type definitions
  • Feather → Engine: Supplies normalized data for correlation processing
  • Wings → Engine: Defines correlation parameters and match requirements
  • Pipeline → All: Orchestrates the entire workflow from data ingestion to results
  • Engine → Results: Produces correlation matches with confidence scores

Directory Structure

The correlation_engine is organized into 7 main directories, each with a specific responsibility:

engine/ - Core Correlation Engine

Purpose: Contains the core correlation logic, feather loading, scoring, and result management.

Key Files: 15 Python files

  • correlation_engine.py - Main correlation engine
  • feather_loader.py - Loads and queries feather databases
  • weighted_scoring.py - Confidence score calculation
  • engine_selector.py - Engine factory and selection
  • identity_correlation_engine.py - Identity-based correlation

feather/ - Data Normalization

Purpose: Handles importing forensic artifact data from various sources and normalizing it into the feather format.

Key Files: 4 Python files + UI subdirectory

  • feather_builder.py - Main application entry point
  • database.py - Database operations
  • transformer.py - Data transformation pipeline
  • ui/ - GUI components for Feather Builder

wings/ - Correlation Rules

Purpose: Defines the data models and validation logic for Wing configurations (correlation rules).

Key Files: 3 Python files in core/ + UI subdirectory

  • core/wing_model.py - Wing, FeatherSpec, CorrelationRules data models
  • core/artifact_detector.py - Detect artifact types
  • core/wing_validator.py - Validate wing configurations
  • ui/ - GUI components for Wings Creator

config/ - Configuration Management

Purpose: Manages all configuration files (feathers, wings, pipelines) and semantic mappings.

Key Files: 10 Python files

  • config_manager.py - Central configuration management
  • semantic_mapping.py - Semantic field mappings
  • pipeline_config.py - Pipeline configuration model

pipeline/ - Workflow Orchestration

Purpose: Orchestrates complete analysis workflows, including feather creation, wing execution, and report generation.

Key Files: 7 Python files

  • pipeline_executor.py - Main pipeline execution
  • pipeline_loader.py - Load pipeline configurations
  • discovery_service.py - Discover available configs

gui/ - User Interface

Purpose: Provides all GUI components for the correlation engine, including pipeline management, results visualization, and configuration editing.

Key Files: 26 Python files

  • main_window.py - Main application window
  • pipeline_management_tab.py - Pipeline creation/management
  • timeline_widget.py - Timeline visualization

integration/ - Crow-Eye Bridge

Purpose: Integrates the correlation engine with the main Crow-Eye application, providing auto-generation features and default configurations.

Key Files: 7 Python files + default_wings/ subdirectory

  • crow_eye_integration.py - Main integration bridge
  • case_initializer.py - Initialize correlation engine for a case
  • auto_feather_generator.py - Auto-generate feathers from Crow-Eye data

Identity-Based Correlation Engine

The Identity-Based Engine groups records by identity first, then creates temporal anchors for efficient correlation. ✓ Production Ready

How It Works

  1. Extract Identities: Normalizes application names, file paths, and other identifiers
  2. Group by Identity: Organizes records by their normalized identity
  3. Create Temporal Anchors: Builds temporal clusters of evidence for each identity
  4. Stream Results: Writes correlation results directly to database for memory efficiency

Performance Characteristics

  • Complexity: O(N log N) - Efficient grouping and sorting
  • Memory Usage: Constant with streaming mode enabled
  • Best For: Large datasets (> 1,000 records)
  • Strengths: Identity tracking across artifacts, scalable performance, streaming support

Use Case Scenarios

  • Production forensic investigations with large datasets
  • Tracking specific applications or files across multiple artifacts
  • Enterprise-scale analysis requiring memory efficiency
  • Automated correlation pipelines processing millions of records

Time-Based Correlation Engine

The Time-Based Engine uses temporal proximity as the primary factor for correlation with comprehensive field matching. ⚠️ Under Development

How It Works

  1. Collect All Anchors: Gathers records from all feathers that have valid timestamps
  2. Sort by Timestamp: Orders all anchors chronologically
  3. Find Temporal Matches: For each anchor, searches other feathers for records within the specified time window
  4. Calculate Scores: Computes confidence scores based on temporal proximity and field matching

Performance Characteristics

  • Complexity: O(N²) - Compares each anchor against all other records
  • Memory Usage: Loads all records into memory
  • Best For: Small datasets (< 1,000 records)
  • Strengths: Comprehensive field-level analysis, detailed semantic matching

Use Case Scenarios

  • Research and debugging correlation logic
  • Detailed forensic analysis requiring field-level inspection
  • Small-scale investigations with limited artifact data
  • Testing and validating correlation rules

Engine Selection Guide

Choose the appropriate correlation engine based on your specific needs:

Choose Identity-Based Engine When:

  • ✅ Dataset has > 1,000 records
  • ✅ Performance is critical
  • ✅ You need identity tracking across artifacts
  • ✅ You want to filter by specific applications
  • ✅ Memory constraints require streaming mode
  • ✅ Production-ready solution needed

Choose Time-Based Engine When: ⚠️ Under Development

  • ✅ Dataset has < 1,000 records
  • ✅ You need comprehensive field-level analysis
  • ✅ You're debugging or doing research
  • ✅ Detailed semantic matching is critical
  • ✅ Memory is not a constraint

Decision Criteria Table

Factor         | Time-Based           | Identity-Based
---------------|----------------------|----------------------
Dataset Size   | < 1,000 records      | > 1,000 records
Complexity     | O(N²)                | O(N log N)
Memory Usage   | High (all in memory) | Low (streaming mode)
Analysis Type  | Field-level matching | Identity tracking
Use Case       | Research, debugging  | Production, automation