Data Terms A-Z
Understand the language of data: From Big Data to ETL to Predictive Analytics – all important terms for data-driven marketing and Business Intelligence.
A
Accuracy
A metric in machine learning that measures the proportion of correct predictions made by a model out of all predictions made.
AI-Powered CDP
Customer Data Platforms with integrated AI/ML capabilities for automated segmentation, predictions, and activation.
Analytics
The systematic analysis of data to gain insights and support decision-making.
Anomaly Detection
Identification of unusual patterns or outliers in data.
ARIMA (AutoRegressive Integrated Moving Average)
A classic statistical model for time series forecasting that combines autoregression, differencing, and moving averages.
AUC (Area Under the Curve)
The area under the ROC curve – a single number (0-1) summarizing the overall quality of a binary classifier, where 0.5 corresponds to random guessing and 1 to a perfect ranking.
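AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison; the function name `auc` and the sample scores are illustrative assumptions, and the quadratic loop is only suitable for small inputs.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Assumes labels are 0/1 and that both
    classes are present.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative data: one positive (0.3) is ranked below one negative (0.8).
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 3 of 4 pairs correct
```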
B
Backtesting
Validation of a forecasting model on historical data to estimate out-of-sample performance.
Batch Processing
Processing large amounts of data in collected blocks (batches) rather than in real time.
Benchmark
A reference point or standard against which performance is measured and compared.
Brier Score
A metric measuring the quality of probabilistic predictions – MSE on probabilities (0=perfect).
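As a minimal sketch of the definition above, the Brier score for binary outcomes is the mean squared difference between predicted probabilities and the observed 0/1 outcomes. The function name and sample values are illustrative assumptions.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


perfect = brier_score([1.0, 0.0], [1, 0])      # every prediction certain and right
uninformed = brier_score([0.5, 0.5], [1, 0])   # always predicting 50 %
```

A perfect forecaster scores 0, while always predicting 0.5 scores 0.25 regardless of the outcomes.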
Business Intelligence
Business Intelligence (BI) is the practice and tooling for transforming data into dashboards, reports, and analyses that support business decision-making.
C
Causal Inference
Causal inference is the discipline of estimating cause-and-effect relationships (what would happen if we changed X), not just correlations.
Chain of Custody
Chain of custody is the documented trail of how an artifact (data, evidence, content) was collected, handled, stored, and accessed—ensuring integrity and accountability.
Changepoint Detection
Detection of time points at which the statistical properties of a time series significantly change.
Clickstream Data
A time-ordered record of user interactions (clicks, page views, events) across digital properties such as websites and apps.
Cohen's Kappa
A statistic for measuring inter-rater reliability for categorical ratings, corrected for chance agreement.
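The chance correction can be sketched as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected if both raters labeled independently according to their own label frequencies. The function name and ratings below are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when expected agreement is 1, e.g. when
    both raters use a single identical label throughout.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Raters agree on 3 of 4 items (p_o = 0.75); chance agreement is 0.5.
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```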
Cohort Analysis
Cohort analysis groups users or entities by a shared starting event/time (e.g., signup week) and tracks behavior over time.
Confounding
A confounder is a variable that influences both the independent and dependent variable, creating a spurious association.
Confusion Matrix
A table that summarizes classification performance by counting true positives, false positives, true negatives, and false negatives.
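For a binary classifier, the four cells can be counted in a single pass over the labels. This is a small sketch; the function name, the `positive` parameter, and the sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
matrix = confusion_counts(y_true, y_pred)
```

Metrics such as accuracy, precision, recall, and specificity are all ratios of these four counts.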
Content Fingerprinting
Content fingerprinting creates a compact signature (fingerprint) of content to enable identification, deduplication, similarity detection, or provenance tracking.
Cosine Similarity
A measure of similarity between two vectors that calculates the cosine of the angle between them, independent of their magnitude.
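A minimal sketch of the calculation, dividing the dot product by the product of the vector lengths. The function name and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # scaled copy
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because the second vector in the first call is just a scaled copy of the first, the similarity is 1 (up to floating-point rounding), which illustrates the independence from magnitude.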
Customer Data Platform (CDP)
Central system for unifying customer data from all sources.
D
Dashboard
A visual interface that presents key metrics, trends, and alerts to support decision-making.
Data Catalog
A searchable inventory of an organization's data assets including metadata, ownership, and documentation.
Data Clean Room
A secure environment where multiple parties can combine their data for joint analyses without sharing raw data.
Data Dictionary
Documentation that defines the meaning, format, allowed values, and usage of data fields.
Data Drift
The change in statistical properties of input data over time, which can degrade model performance.
Data Enrichment
Adding additional attributes to existing data—via internal joins or external sources (firmographic providers, geo data).
Data Governance
The framework for policies, processes, and responsibilities to manage data assets in an organization.
Data Labeling
Process of annotating data with ground truth for supervised learning.
Data Lake
Central storage for large amounts of unstructured and structured data.
Data Layout
The physical or logical arrangement of data in memory or on storage media, which influences access speed, cache efficiency, and processing performance.
Data Lineage
Data lineage describes where data comes from, how it moves through systems, and how it is transformed into downstream datasets and outputs.
Data Mesh
Decentralized approach to data architecture with domain-oriented data products.
Data Mining
The process of discovering patterns, anomalies, and relationships in large datasets using statistical and machine learning methods.
Data Pipeline
A sequence of processes that moves and transforms data from sources to destinations (lake, warehouse, feature store, vector index).
Data Preprocessing
Transforming raw data into a form suitable for modeling or analysis (cleaning, normalization, encoding).
Data Processing Agreement (DPA)
A legally binding contract between data controller and data processor that governs the terms for processing personal data according to GDPR.
Data Validation (ML)
Automated checking of data quality, schema conformity, and statistical properties in ML pipelines.
Data Visualization
The graphical representation of data to communicate insights and patterns.
Data Warehouse
A system optimized for structured analytics queries over curated, cleaned data—often with strong governance and performance.
Databricks
Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on Apache Spark.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds clusters based on density of data points and automatically identifies outliers.
Decision Support System (DSS)
A Decision Support System (DSS) helps people make better decisions by combining data, models, and user interfaces.
Decision Threshold
The cutoff used to convert a model score/probability into an action (e.g., approve/deny, route/escalate).
Deduplication
Deduplication is identifying and removing duplicate (or near-duplicate) items to reduce redundancy and improve quality.
Demand Forecasting
Prediction of future demand based on historical data and factors.
Difference-in-Differences (DiD)
Quasi-experimental method that estimates causal effects by comparing changes over time between treatment and control groups.
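In its simplest two-group, two-period form, the estimate is the change in the treated group minus the change in the control group. The sketch below shows only that arithmetic; the function name and the numbers are hypothetical, and the result is a causal estimate only if both groups would have followed parallel trends without the treatment.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """2x2 DiD estimate from four group means."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical group means: treated rises by 6, control rises by 3.
did_effect = difference_in_differences(10.0, 16.0, 9.0, 12.0)
```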
Differential Privacy
A mathematically rigorous definition of privacy that bounds how much any single individual's data can change the output of an analysis, so that their participation in a dataset is nearly undetectable – even against attackers with arbitrary background knowledge.
Dimensionality Reduction
Techniques for reducing the number of features while preserving important information.
Double Machine Learning (DML)
Causal inference method that uses ML models to flexibly control for confounding while enabling valid statistical inference.
E
Effect Size
Quantifies the strength of a difference or relationship – independent of sample size, unlike the p-value.
ELT
ELT (Extract, Load, Transform) is a data integration paradigm where raw data is first loaded into a data warehouse and then transformed there.
Entity Resolution
Entity resolution is the process of identifying, matching, and merging multiple records from different sources that refer to the same real-world entity (person, company, product) — even when spellings, IDs, or fields are not identical.
Error Rate
Error rate is the proportion of outcomes that are incorrect relative to a defined ground truth or acceptance criteria.
ETL (Extract, Transform, Load)
Extract, Transform, Load – the process of extracting data, transforming it, and loading it into target systems.
Euclidean Distance
Geometric distance between two points in vector space.
Event Tracking
The capture and analysis of user interactions and actions on digital platforms.
Exploratory Data Analysis
The process of visually and statistically examining data before model building.
Exponential Smoothing
A family of statistical time series methods that forecast using weighted averages of past observations, with weights that decay exponentially as observations get older.
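The simplest member of the family, simple exponential smoothing, updates a running level as s_t = alpha * x_t + (1 - alpha) * s_(t-1). The sketch below assumes the first observation is used as the initial level, which is one common convention among several; the function name and data are illustrative.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    alpha in (0, 1]: higher values react faster to recent observations.
    The first observation initializes the level (an assumption).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    levels = [float(series[0])]
    for value in series[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels


smoothed = simple_exponential_smoothing([10, 20, 30], alpha=0.5)
```

The last smoothed level serves as the one-step-ahead forecast in this simple variant.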
F
F1 Score
The harmonic mean of precision and recall, a single metric that balances both aspects of classification performance.
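A short sketch of the calculation from confusion-matrix counts, F1 = 2 * precision * recall / (precision + recall). The function name and the counts are illustrative assumptions; returning 0.0 when a denominator is zero is a convention, not the only option.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: precision 0.8, recall 0.5.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Because the harmonic mean is pulled toward the smaller value, the F1 here (about 0.615) sits below the arithmetic mean of 0.65.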
Feature Engineering
The process of selecting, transforming, and creating input variables (features) for machine learning models to improve their predictive power.
Feature Importance
Feature importance quantifies how much each input feature contributes to a model's predictions (globally or for a specific prediction).
FinOps for AI
FinOps for AI applies financial operations practices (cost visibility, optimization, budgeting, accountability) to AI workloads and AI product usage.
First-Party Data
Data that an organization collects directly from its own customers and users.
First-Party Data AI
Strategic approach of using proprietary customer data as a differentiation layer on top of generic foundation models.
Fraud Detection
AI-powered detection of fraudulent activities and transactions.
Fuzzy Matching
Techniques for finding approximate rather than exact matches in data.
G
Gaussian Distribution
A symmetric, bell-shaped probability distribution fully described by its mean and standard deviation; also known as the normal distribution.
GDPR
The EU General Data Protection Regulation (since 2018), establishing uniform rules for processing personal data by companies and granting individuals comprehensive rights.
Great Expectations
Open-source framework for data validation, documentation, and profiling with a declarative expectation system.
H
Heatmap
A visual representation of data where values are encoded by color intensity.
Hit Rate
Measures the proportion of queries for which at least one relevant result was found in the top-k – usually reported as Hit Rate@k, which equals Recall@k when each query has exactly one relevant item.
Hypothesis Testing
Hypothesis testing is a class of statistical procedures used to evaluate whether a claim about a population (alternative hypothesis), based on sample data, is statistically defensible compared with a default assumption (null hypothesis).
I
Insights
Insights are meaningful interpretations of data that reduce uncertainty and enable better decisions (descriptive, diagnostic, predictive, or prescriptive).
Instrumental Variable (IV)
A variable that influences the treatment variable but affects the outcome only through the treatment – not directly. Enables causal estimates despite confounding.
Inter-Annotator Agreement (IAA)
A metric for measuring the agreement between different annotators when evaluating the same data.
K
K-Anonymity
K-anonymity is a privacy property where each record in a dataset is indistinguishable from at least k−1 other records with respect to quasi-identifiers.
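To check the property on a table, group records by their quasi-identifier values and take the size of the smallest group; that size is the k the dataset satisfies. The sketch below uses hypothetical field names (`zip`, `age_band`, `diagnosis`) and made-up records.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous."""
    if not records:
        raise ValueError("need at least one record")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical records; "diagnosis" is the sensitive attribute, not a quasi-identifier.
records = [
    {"zip": "10115", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "10115", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "A"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "C"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "B"},
]
k = k_anonymity(records, ["zip", "age_band"])  # smallest group has 2 records
```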
Kalman Filter
A Kalman filter is an algorithm for estimating the hidden state of a system over time from noisy measurements.
Kaplan-Meier Estimator
The Kaplan–Meier estimator estimates a survival function (probability of "not yet churned" over time), handling censored data.
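The estimator multiplies, over each time at which an event occurs, the fraction of at-risk subjects that survive that time: S(t) is the product of (1 - d_i / n_i) for all event times t_i up to t. The following is a small sketch with made-up durations; subjects censored at a given time are still counted as at risk at that time, which is the usual convention.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each observed event time.

    durations: time until event or censoring for each subject.
    event_observed: True if the event occurred, False if censored.
    """
    event_times = sorted({t for t, e in zip(durations, event_observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, event_observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical churn data in months; False marks customers still active (censored).
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

For this data the survival estimate drops to roughly 0.8, 0.6, and 0.3 at months 1, 2, and 3.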
L
Label Studio
Open-source platform for data annotation and labeling supporting text, images, audio, video, and multi-modal data.
Lift
Lift is the incremental change in an outcome attributable to an intervention.
Lift Chart
A lift chart shows how well a model ranks positives by comparing outcomes across scored segments.
Locality-Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is a technique that hashes similar items into the same "buckets" with high probability, enabling fast approximate similarity search.
Log Loss
A loss function evaluating the quality of predicted probabilities – the penalty is logarithmic and grows without bound as a wrong prediction becomes more confident.
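For binary outcomes, log loss is the average of -[y * log(p) + (1 - y) * log(1 - p)]. The sketch below clips probabilities away from 0 and 1 to avoid log(0); the clipping constant `eps` and the function name are illustrative assumptions.

```python
import math


def log_loss(probabilities, outcomes, eps=1e-15):
    """Average binary cross-entropy between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)


hedged_wrong = log_loss([0.4], [1])      # mildly wrong
confident_wrong = log_loss([0.01], [1])  # confidently wrong, much larger loss
```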
M
MAE (Mean Absolute Error)
The average of absolute differences between predicted and actual values – less sensitive to outliers than MSE or RMSE.
MAP (Mean Average Precision)
The average of Average Precision across all queries – considers both precision and ranking position of all relevant documents.
Master Data Management (MDM)
Master Data Management (MDM) is an approach to ensure critical enterprise data (e.g., customers, products, locations) is consistent, accurate, and governed across systems—often aiming for a "single source/version of truth."
MinHash
MinHash is a technique to efficiently estimate similarity between sets (especially Jaccard similarity), commonly used for near-duplicate detection.
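The idea can be sketched as follows: for each of several hash functions, keep the minimum hash value over a set's items; the share of positions where two signatures agree estimates the Jaccard similarity of the sets. Here the different hash functions are simulated by salting SHA-256 with a seed, which is an assumption chosen for determinism rather than speed.

```python
import hashlib


def _hash(seed, item):
    """64-bit hash of an item under a given seed (salted SHA-256)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash value per seed. Requires a non-empty set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Share of signature positions where both sets have the same minimum."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = {"data", "lake", "warehouse", "pipeline"}
doc_b = {"data", "lake", "mesh", "catalog"}
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

The true Jaccard similarity of the two example sets is 2/6; the estimate approaches it as `num_hashes` grows.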
Minimum Detectable Effect (MDE)
MDE is the smallest true effect size an experiment can reliably detect given traffic, variance, significance level, and power.
MRR (Mean Reciprocal Rank)
The average of the reciprocal ranks of the first relevant result across all queries – MRR = 1/n × Σ(1/rank_i).
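A minimal sketch of the formula, taking the 1-based rank of the first relevant result per query. Treating queries with no relevant result as contributing 0 is a common convention and an assumption here, as are the function name and the sample ranks.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries; each entry is a 1-based rank or None if nothing relevant was found."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    reciprocal = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


# Four queries: first relevant hit at rank 1, rank 2, not found, rank 4.
mrr = mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4
```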
MSE (Mean Squared Error)
The average of squared differences between predicted and actual values – standard loss for regression.
N
NaN (Not a Number)
NaN is a special floating-point value meaning "Not a Number," used to represent undefined or unrepresentable numeric results (e.g., 0/0).
Natural Experiment
A natural experiment uses real-world events or operational changes (not randomized by you) that approximate random assignment, enabling causal inference under assumptions.
NDCG (Normalized Discounted Cumulative Gain)
A ranking metric that considers both relevance grades and positions in the ranking – higher-ranked relevant items are weighted more heavily.
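One common formulation discounts each relevance grade by log2(position + 1) and divides by the score of the ideal ordering. The sketch below uses the linear-gain variant; some implementations use 2^rel - 1 as the gain instead, so treat this as one reasonable choice rather than the only definition.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the best possible ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


best_order = ndcg([3, 2, 1])    # already ideally ordered
worst_order = ndcg([1, 2, 3])   # most relevant item ranked last
```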
NDJSON (Newline-Delimited JSON)
NDJSON is a format where each line is a valid JSON object—making it easy to stream, append, and process logs/events at scale.
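Reading and writing NDJSON needs nothing beyond a JSON library and line-by-line iteration, which is what makes the format stream-friendly. A small sketch with made-up event records and hypothetical helper names:

```python
import io
import json


def write_ndjson(records, stream):
    """Write one compact JSON document per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_ndjson(stream):
    """Yield one parsed object per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


events = [{"event": "page_view", "user": 1}, {"event": "click", "user": 2}]
buffer = io.StringIO()
write_ndjson(events, buffer)
buffer.seek(0)
round_tripped = list(read_ndjson(buffer))
```

Appending a new event is just writing one more line; no enclosing array has to be rewritten.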
Negative Binomial Regression
Negative binomial regression is a statistical model for count data (e.g., clicks, conversions) that handles overdispersion (variance > mean), unlike Poisson regression.
Negative Control
A negative control is a variable, outcome, or test condition that should not be affected by an intervention—used to detect bias, confounding, or measurement artifacts.
NHST (Null Hypothesis Significance Testing)
NHST is the traditional statistical testing framework where you test whether observed data is unlikely under a null hypothesis (often "no effect"), typically using p-values.
NMI (Normalized Mutual Information)
NMI is a metric used to compare clustering assignments by measuring how much information one clustering shares with another, normalized to be scale-friendly.
Noise-to-Signal Ratio
Noise-to-signal ratio measures how much random variation (noise) exists relative to the meaningful pattern (signal) you want to detect.
Non-Negative Matrix Factorization (NMF)
NMF factorizes a non-negative matrix into two smaller non-negative matrices, often used for interpretable topic-like decompositions.
Non-Production Data Masking
Non-production data masking is the practice of anonymizing, tokenizing, or synthesizing sensitive data before it is used in dev/staging/test environments.
Normal Form (Database)
In databases, normal forms (1NF, 2NF, 3NF, BCNF) describe levels of normalization that reduce redundancy and improve data integrity.
Normalized Cost per Answer
Normalized cost per answer is the cost of generating an AI answer adjusted for comparability (e.g., normalized by answer length, tokens, difficulty tier, or traffic segment).
Normalized RMSE (NRMSE)
NRMSE is RMSE normalized by a scale factor (e.g., range, mean, or standard deviation) to make errors comparable across datasets.
Nowcasting
Forecasting the current or imminent state using high-frequency real-time data.
Null Value
A null value represents missing or unknown data (distinct from zero, empty string, or false).
O
Observed vs Expected
Compares actual system behavior to a baseline or model of expected behavior to detect anomalies and regressions.
OLAP
A technology for fast, multidimensional analysis of large datasets, enabling slice, dice, drill-down, and roll-up operations.
One-Hot Encoding
Represents a categorical value as a vector of zeros with a single 1 at the category index.
Ontology
Formal description of concepts, properties, and relationships in a knowledge domain.
Outlier
A data point that deviates significantly from the rest of the distribution.
Outlier Detection
Identifies anomalous data points or behaviors that differ from expected patterns.
P
p-Hacking
Manipulating analysis choices (stopping rules, segmentation, metrics, exclusions) to obtain statistically significant results.
p-Value
The probability of observing results at least as extreme as what you observed if the null hypothesis were true.
PII (Personally Identifiable Information)
Information that can identify a person directly or indirectly (e.g., name, email, phone number, government IDs).
Power Analysis
Calculation of the necessary sample size to detect an effect of a given size with desired probability (power).
Precision
The proportion of correctly classified positive cases out of all cases classified as positive.
Precision and Recall
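Two complementary metrics for evaluating classification models, especially on imbalanced data: precision asks how many predicted positives are correct, recall asks how many actual positives were found.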
Precision@k
Measures how many of the top-k retrieved items are relevant (relevant items in top-k ÷ k).
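A minimal sketch of the ratio in parentheses above; the function name and the document IDs are illustrative assumptions.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved items that are relevant.

    Divides by k even if fewer than k items were retrieved.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


retrieved = ["d1", "d7", "d3", "d9"]   # ranked results for one query
relevant = {"d1", "d3", "d5"}          # ground-truth relevant documents
p_at_3 = precision_at_k(retrieved, relevant, k=3)  # d1 and d3 are hits
```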
Privacy Budget
A quantitative measure (epsilon, ε) of the total privacy loss accumulated through repeated queries on privacy-protected data.
Prophet (Facebook/Meta)
An open-source forecasting tool developed by Meta that automatically models trend, seasonality, and holiday effects.
Provenance
Provenance is metadata that describes the origin, history, and transformation path of data or content—where it came from, how it changed, and who/what changed it.
Pseudonymization
Replaces identifiers with pseudonyms so data can't be directly attributed to a person without additional information kept separately.
Q
Quality-Adjusted Cost per Answer
Quality-adjusted cost per answer is cost-per-answer interpreted alongside quality metrics, ensuring cost savings don't come from degraded outputs.
Quantile
A quantile is a value below which a certain percentage of observations fall (e.g., p50/median, p95, p99).
Quantile Regression
Quantile regression predicts a chosen quantile of the target distribution (e.g., p90 outcome) rather than the mean.
Quasi-Experiment
A quasi-experiment estimates causal effects without random assignment, using designs like difference-in-differences, regression discontinuity, or matching.
Quasi-Identifier
A quasi-identifier is a data attribute (or combination) that may not uniquely identify someone alone, but can identify them when combined with other attributes.
Query Optimizer
A query optimizer is the system component that chooses an efficient query plan, often based on statistics and heuristics.
Query Plan
A query plan is the execution strategy a database/search engine uses to answer a query (joins, index usage, filters, scan order).
R
R-Squared (Coefficient of Determination)
The proportion of variance in the target variable explained by the model (typically 0-1; it can be negative when the model fits worse than simply predicting the mean).
Recall
The proportion of correctly identified positive cases out of all actual positive cases.
Redaction
Redaction is removing or masking sensitive information (PII, secrets, credentials) from text, logs, documents, or outputs.
Reporting
The process of collecting, organizing, and presenting data in structured formats (reports, dashboards) to inform stakeholders and support decisions.
RMSE (Root Mean Squared Error)
The square root of MSE – has the same unit as the target variable.
ROC Curve
A plot showing the True Positive Rate vs False Positive Rate across all classification thresholds.
S
Sampling
Sampling is selecting a subset of data (or outcomes) from a larger population/process to estimate properties, reduce cost, or enable exploration.
Scenario Analysis
Scenario analysis evaluates outcomes under a set of coherent, plausible future conditions (scenarios), rather than changing one variable at a time.
Schema
A Schema defines the structure, organization, and constraints of data – whether in databases, APIs, or structured data formats.
Schema-on-Read
Schema-on-Read is a data management approach where the structure of data is applied only at query time, not when storing.
Seasonality
Regularly recurring patterns in time series that repeat at fixed intervals.
Segment Analysis
Segment analysis breaks metrics down by meaningful groups (segments) such as channel, device, region, customer tier, or intent.
Sensitivity Analysis
Sensitivity analysis evaluates how changes in inputs affect outputs, to understand robustness and key drivers.
Sentiment Score
Numerical value that quantifies the emotional polarity of a text.
Session
Period of user interaction with a website or app.
Sessionization
Sessionization groups user events into sessions to analyze behavior over time (page flows, search sequences, conversions).
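A common approach is an inactivity timeout: a new session starts whenever the gap to the previous event exceeds a threshold. The sketch below assumes a 30-minute timeout and Unix-style timestamps in seconds for a single user; both are illustrative choices.

```python
def sessionize(timestamps, timeout_seconds=1800):
    """Split one user's event timestamps into sessions by inactivity gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # gap too large: start a new session
    return sessions


# Events at 0 s and 600 s belong together; the next event comes 3400 s later.
sessions = sessionize([0, 600, 4000, 4100])
```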
SimHash
SimHash is a fingerprinting method that produces a compact hash where similar documents tend to have similar hashes (small Hamming distance).
Simpson's Paradox
Simpson's paradox is when a trend appears in multiple groups but reverses or disappears when the groups are combined, due to confounding and aggregation.
Snorkel
Snorkel is a framework for programmatic data labeling that uses labeling functions instead of manual annotation to efficiently create large training datasets.
Snowflake
Snowflake is a cloud-native data warehouse platform that separates storage and compute, enabling scalable data analysis with SQL.
Specificity
The proportion of correctly classified negative cases out of all actual negative cases.
Stationarity
A time series is stationary when its statistical properties remain constant over time.
Statistical Significance
A result is statistically significant when its p-value falls below a predefined threshold (usually 0.05), meaning data at least this extreme would be unlikely if the null hypothesis were true. It is not the probability that the observed effect is real.
Streaming Data
Continuous data flow that is processed in real-time.
Survival Analysis
Statistical method for analyzing time until an event occurs (e.g., churn, conversion, failure), accounting for censored data.
Synthetic Data
Artificially generated data that replicates statistical properties of real data – used for training, testing, and privacy protection when real data is scarce, sensitive, or expensive.
T
Taxonomy
A Taxonomy is a hierarchical classification system that organizes concepts, content, or entities into ordered categories and subcategories.
Time Series
Sequence of data points ordered in time.
Time Series Analysis
Analysis of data points collected over time to identify patterns.
Topic Modeling
Unsupervised ML method for discovering abstract topics in document collections.
Treatment Effect (ATE/CATE)
The causal effect of an intervention (treatment) on an outcome. ATE is the average, CATE the conditional effect for subgroups.
U
UDF (User-Defined Function)
A UDF is a custom function written by users to extend a platform (e.g., SQL engines, data warehouses) beyond its built-in functions.
Unit Economics
Unit economics measures profitability per unit (customer, query, workflow) by comparing the revenue each unit generates with the variable costs it incurs.
Unstructured Data
Unstructured data is not stored in a predefined schema (PDFs, emails, chats, wikis, tickets).
Usage Telemetry
Usage telemetry captures how a product is used (events, funnels, intent patterns).
V
Validation Set
A validation set is a held-out dataset used during model development to tune hyperparameters and select model versions without touching the final test set.
Variance
Variance is the degree to which a model's performance changes across different datasets/samples; high variance often indicates sensitivity to training data (overfitting risk).
Vector Database
A vector database stores embeddings and supports fast similarity search (nearest neighbors), often with metadata filtering and indexing for scale.
Vector Embedding
A vector embedding is a numerical representation (array of floats) of text, images, or other data that encodes semantic meaning in a high-dimensional space.
Vector Index
A vector index is the data structure/algorithm used to speed up nearest-neighbor search over embeddings at scale.
Vector Quantization
Vector quantization (VQ) compresses continuous vectors by mapping them to a finite set of representative vectors (a codebook).
Vector Search
Vector search retrieves items by similarity in an embedding space rather than exact keyword match.
Vector Similarity
Vector similarity is a measure of how close two embeddings are (commonly cosine similarity or dot product).
Vector Store
A vector store is the storage layer (database or service) that holds embeddings plus metadata for retrieval and similarity search.
Vector Store Hygiene
Vector store hygiene is the operational discipline of keeping a vector store accurate, secure, performant, and up-to-date (dedupe, versioning, ACL correctness, drift monitoring, purge workflows).
Y
Y-Axis Compression
Y-axis compression is a visualization issue where scaling choices flatten differences, making changes look smaller (or larger) than they are.
Yield
Yield is the proportion of inputs that successfully produce acceptable outputs (e.g., successful runs, valid records, passing artifacts).
Yield Rate
Yield rate is yield expressed as a percentage over a defined population and time window.
Yottabyte
A yottabyte (YB) is a unit of data equal to 10²⁴ bytes (a septillion bytes).
YoY (Year-over-Year)
Year-over-Year (YoY) compares a metric to the same period in the previous year (e.g., Jan 2026 vs Jan 2025).
YTD (Year-to-Date)
Year-to-Date (YTD) measures performance from the start of the current year up to today.
Yule–Simpson Paradox
The Yule–Simpson paradox (often called Simpson's paradox) occurs when a trend appears in several groups but reverses or disappears when the groups are combined.
Z
Z-Order Curve
A Z-order curve (Morton order) is a space-filling curve that maps multi-dimensional data into a one-dimensional ordering while preserving locality.
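The mapping is obtained by interleaving the bits of the coordinates into a single integer, often called a Morton code. The sketch below handles two non-negative integer coordinates with a fixed bit width; the function name and the `bits` parameter are illustrative assumptions.

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of x and y: x fills even bit positions, y odd ones."""
    if x < 0 or y < 0 or x >= 1 << bits or y >= 1 << bits:
        raise ValueError("coordinates must fit in the given number of bits")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# The four cells of a 2x2 grid are visited in a "Z" shape.
z_order = [morton_encode(x, y) for y in range(2) for x in range(2)]
example_code = morton_encode(3, 5)
```

Sorting points by this code keeps many, though not all, spatial neighbors close together in the one-dimensional order.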
Z-Ordering
Z-ordering is the practice of physically organizing stored data using Z-order curves so that related values are colocated on disk.
Z-Score
A z-score is the number of standard deviations a data point is from the mean.
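In formula form, z = (x - mean) / standard deviation. The sketch below uses the population standard deviation from Python's `statistics` module; using the sample standard deviation instead is equally valid depending on context, and the data is made up.

```python
import statistics


def z_scores(values):
    """Z-score of every value relative to the mean and population std of the list."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / std for v in values]


# Mean is 5 and population standard deviation is 2 for this data.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

Here the value 9 lies two standard deviations above the mean, and the value 2 lies one and a half below it.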
Z-Test
A z-test is a statistical hypothesis test used to determine whether a sample mean differs from a known population mean (or whether two means differ), assuming the population variance is known or the sample is large enough for the normal approximation to hold.
Zero-ETL
Zero-ETL refers to architectures that minimize or eliminate traditional ETL pipelines by enabling near-direct data access/replication between systems with low operational overhead.
Zero-Party Data
Zero-party data is data a customer intentionally and proactively shares with a brand (preferences, intents, goals), rather than inferred or tracked.
Zettabyte
A zettabyte (ZB) is a unit of data equal to 10²¹ bytes.
Zipf's Law
Zipf's law describes how, in many natural datasets (language, queries), a few items are extremely frequent while most items are rare (long-tail distribution).