Data Terms A-Z
Understand the language of data: from Big Data to ETL to Predictive Analytics – the key terms for data-driven marketing and Business Intelligence.
B
Batch Processing
Processing large amounts of data in accumulated batches rather than in real time.
Benchmark
A reference point or standard against which performance is measured and compared.
Business Intelligence
Business Intelligence (BI) is the practice and tooling for transforming data into dashboards, reports, and analyses that support business decision-making.
C
Causal Inference
Causal inference is the discipline of estimating cause-and-effect relationships (what would happen if we changed X), not just correlations.
Chain of Custody
Chain of custody is the documented trail of how an artifact (data, evidence, content) was collected, handled, stored, and accessed—ensuring integrity and accountability.
Clickstream Data
A time-ordered record of user interactions (clicks, page views, events) across digital properties such as websites and apps.
Cohen's Kappa
A statistic for measuring inter-rater reliability for categorical ratings, corrected for chance agreement.
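A minimal sketch in plain Python of the formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed and p_e is chance-expected agreement; the rater labels are made up for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two categorical raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["spam", "ham", "spam", "ham"],
                   ["spam", "spam", "spam", "ham"]))  # 0.5
```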
Cohort Analysis
Cohort analysis groups users or entities by a shared starting event/time (e.g., signup week) and tracks behavior over time.
Confounding
Confounding occurs when a third variable influences both the "cause" and the "effect," creating a misleading association between them.
Confusion Matrix
A table that summarizes classification performance by counting true positives, false positives, true negatives, and false negatives.
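A minimal sketch for the binary case (labels are made up; 1 = positive class):

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

print(confusion_matrix([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
# {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```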
Content Fingerprinting
Content fingerprinting creates a compact signature (fingerprint) of content to enable identification, deduplication, similarity detection, or provenance tracking.
Cosine Similarity
Measure of similarity between two vectors based on the angle between them.
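A minimal sketch in plain Python, assuming two equal-length, non-zero numeric vectors:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # ~0.707 (45-degree angle)
```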
Customer Data Platform (CDP)
Central system for unifying customer data from all sources.
D
Dashboard
A visual interface that presents key metrics, trends, and alerts to support decision-making.
Data Catalog
A searchable inventory of an organization's data assets including metadata, ownership, and documentation.
Data Dictionary
Documentation that defines the meaning, format, allowed values, and usage of data fields.
Data Drift
The change in statistical properties of input data over time, which can degrade model performance.
Data Enrichment
Adding additional attributes to existing data—via internal joins or external sources (firmographic providers, geo data).
Data Governance
Data governance is the framework of policies, roles, processes, and controls that ensure data is accurate, secure, compliant, and usable across an organization.
Data Labeling
Process of annotating data with ground truth for supervised learning.
Data Lake
Central storage for large amounts of unstructured and structured data.
Data Layout
The physical or logical arrangement of data in memory or on storage media, which influences access speed, cache efficiency, and processing performance.
Data Lineage
Data lineage describes where data comes from, how it moves through systems, and how it is transformed into downstream datasets and outputs.
Data Mesh
Decentralized approach to data architecture with domain-oriented data products.
Data Mining
The process of discovering patterns, anomalies, and relationships in large datasets using statistical and machine learning methods.
Data Pipeline
A sequence of processes that moves and transforms data from sources to destinations (lake, warehouse, feature store, vector index).
Data Preprocessing
Transforming raw data into a form suitable for modeling or analysis (cleaning, normalization, encoding).
Data Visualization
The graphical representation of data to communicate insights and patterns.
Data Warehouse
A system optimized for structured analytics queries over curated, cleaned data—often with strong governance and performance.
Databricks
Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on Apache Spark.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds clusters based on density of data points and automatically identifies outliers.
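A usage sketch with scikit-learn (an assumption – the library is not named above); the toy points are made up:

```python
from sklearn.cluster import DBSCAN  # assumes scikit-learn is installed

# Three dense points, two dense points, and one isolated point.
points = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
          [8.0, 8.0], [8.1, 8.0],
          [50.0, 50.0]]

# eps: neighborhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # [0 0 0 1 1 -1] -- label -1 marks the outlier (noise)
```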
Decision Support System (DSS)
A Decision Support System (DSS) helps people make better decisions by combining data, models, and user interfaces.
Decision Threshold
The cutoff used to convert a model score/probability into an action (e.g., approve/deny, route/escalate).
Deduplication
Deduplication is identifying and removing duplicate (or near-duplicate) items to reduce redundancy and improve quality.
Demand Forecasting
Prediction of future demand based on historical data and influencing factors such as seasonality, promotions, and trends.
Differential Privacy
A mathematical framework providing formal guarantees that individual data points cannot be inferred from aggregates or models.
Dimensionality Reduction
Techniques for reducing the number of features while preserving important information.
Distribution Shift
A mismatch between the data distribution seen in training and the distribution encountered in deployment.
E
ELT
ELT (Extract, Load, Transform) is a data integration paradigm where raw data is first loaded into a data warehouse and then transformed there.
Entity Resolution
Entity resolution is the process of identifying, matching, and merging multiple records from different sources that refer to the same real-world entity (person, company, product) — even when spellings, IDs, or fields are not identical.
Error Rate
Error rate is the proportion of outcomes that are incorrect relative to a defined ground truth or acceptance criteria.
ETL (Extract, Transform, Load)
The process of extracting data from source systems, transforming it (cleaning, mapping, aggregating), and loading it into target systems such as a data warehouse.
Euclidean Distance
Geometric distance between two points in vector space.
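A minimal sketch in plain Python, assuming two points of equal dimensionality:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 triangle)
```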
Event Tracking
The capture and analysis of user interactions and actions on digital platforms.
Exploratory Data Analysis
The process of visually and statistically examining data before model building.
F
F1 Score
The harmonic mean of precision and recall, a single metric that balances both aspects of classification performance.
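As a formula, F1 = 2PR / (P + R); a minimal sketch with made-up precision and recall values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; punishes imbalance between the two."""
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.5))  # ~0.615, well below the arithmetic mean of 0.65
```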
Feature Engineering
The process of selecting, transforming, and creating input variables (features) for machine learning models to improve their predictive power.
Feature Importance
Feature importance quantifies how much each input feature contributes to a model's predictions (globally or for a specific prediction).
FinOps for AI
FinOps for AI applies financial operations practices (cost visibility, optimization, budgeting, accountability) to AI workloads and AI product usage.
First-Party Data
Data collected directly from a company's own customers and users.
First-Party Data AI
Strategic approach of using proprietary customer data as a differentiation layer on top of generic foundation models.
Fraud Detection
AI-powered detection of fraudulent activities and transactions.
Fuzzy Matching
Techniques for finding approximate rather than exact matches in data.
H
Heatmap
A visual representation of data where values are encoded by color intensity.
Hit Rate
Measures the proportion of queries for which at least one relevant result appears in the top-k; for k = 1 this coincides with Recall@1.
Hypothesis Testing
Hypothesis testing is a class of statistical procedures that use sample data to evaluate whether a claim about a population (the alternative hypothesis) is statistically defensible compared with a default assumption (the null hypothesis).
K
K-Anonymity
K-anonymity is a privacy property where each record in a dataset is indistinguishable from at least k−1 other records with respect to quasi-identifiers.
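A minimal sketch that computes the k of a toy dataset (columns and values are made up): the dataset's k is the size of its smallest equivalence class over the quasi-identifiers.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size over the quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"zip": "941**", "age": "30-39", "diagnosis": "A"},
    {"zip": "941**", "age": "30-39", "diagnosis": "B"},
    {"zip": "100**", "age": "40-49", "diagnosis": "A"},
    {"zip": "100**", "age": "40-49", "diagnosis": "C"},
]
print(k_anonymity(records, ["zip", "age"]))  # 2 -> the data is 2-anonymous
```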
Kalman Filter
A Kalman filter is an algorithm for estimating the hidden state of a system over time from noisy measurements.
Kaplan-Meier Estimator
The Kaplan–Meier estimator estimates a survival function (probability of "not yet churned" over time), handling censored data.
L
Lift
Lift is the incremental change in an outcome attributable to an intervention.
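One common convention is relative lift over the control baseline; a minimal sketch with made-up conversion rates:

```python
def relative_lift(treatment_rate, control_rate):
    """Incremental outcome expressed relative to the control baseline."""
    return (treatment_rate - control_rate) / control_rate

print(relative_lift(0.055, 0.050))  # 0.10 -> +10% lift over control
```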
Lift Chart
A lift chart shows how well a model ranks positives by comparing outcomes across scored segments.
Locality-Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is a technique that hashes similar items into the same "buckets" with high probability, enabling fast approximate similarity search.
M
MAP (Mean Average Precision)
The average of Average Precision across all queries – considers both precision and ranking position of all relevant documents.
Master Data Management (MDM)
Master Data Management (MDM) is an approach to ensure critical enterprise data (e.g., customers, products, locations) is consistent, accurate, and governed across systems—often aiming for a "single source/version of truth."
MinHash
MinHash is a technique to efficiently estimate similarity between sets (especially Jaccard similarity), commonly used for near-duplicate detection.
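A minimal sketch using seeded SHA-1 hashes (a simplification; production implementations use faster hash families). The fraction of matching signature positions approximates the Jaccard similarity of the underlying sets:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function over the set's items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"data", "lake", "warehouse"})
b = minhash_signature({"data", "lake", "mesh"})
print(estimated_jaccard(a, b))  # roughly 0.5 (true Jaccard = 2/4)
```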
Minimum Detectable Effect (MDE)
MDE is the smallest true effect size an experiment can reliably detect given traffic, variance, significance level, and power.
MRR (Mean Reciprocal Rank)
The average of the reciprocal ranks of the first relevant result across all queries – MRR = 1/n × Σ(1/rank_i).
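A minimal sketch, assuming each query's first relevant rank is already known (None when no relevant result was returned):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR = (1/n) * sum(1/rank_i); a miss contributes 0."""
    n = len(first_relevant_ranks)
    return sum(1.0 / r for r in first_relevant_ranks if r is not None) / n

# First relevant result at rank 1, rank 3, and no hit for the third query:
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```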
N
NaN (Not a Number)
NaN is a special floating-point value meaning "Not a Number," used to represent undefined or unrepresentable numeric results (e.g., 0/0).
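A quick illustration of NaN's defining quirk in Python:

```python
import math

x = float("nan")
print(x == x)         # False: NaN compares unequal to everything, itself included
print(math.isnan(x))  # True: the reliable way to test for NaN
```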
Natural Experiment
A natural experiment uses real-world events or operational changes (not randomized by you) that approximate random assignment, enabling causal inference under assumptions.
NDCG (Normalized Discounted Cumulative Gain)
A ranking metric that considers both relevance grades and positions in the ranking – higher-ranked relevant items are weighted more heavily.
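A minimal sketch using one common DCG formulation (linear gain, log2 position discount); the relevance grades are made up:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (descending-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Highly relevant doc first, irrelevant second, relevant third:
print(round(ndcg([3, 0, 2]), 3))  # ~0.939
```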
NDJSON (Newline-Delimited JSON)
NDJSON is a format where each line is a valid JSON object—making it easy to stream, append, and process logs/events at scale.
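A minimal sketch parsing an inline NDJSON string (made-up events) line by line:

```python
import json

ndjson = '{"event": "page_view", "user": "u1"}\n{"event": "click", "user": "u2"}\n'

# Each line is an independent JSON object, so the stream can be processed lazily.
for line in ndjson.splitlines():
    record = json.loads(line)
    print(record["event"], record["user"])
```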
Negative Binomial Regression
Negative binomial regression is a statistical model for count data (e.g., clicks, conversions) that handles overdispersion (variance > mean), unlike Poisson regression.
Negative Control
A negative control is a variable, outcome, or test condition that should not be affected by an intervention—used to detect bias, confounding, or measurement artifacts.
NHST (Null Hypothesis Significance Testing)
NHST is the traditional statistical testing framework where you test whether observed data is unlikely under a null hypothesis (often "no effect"), typically using p-values.
NMI (Normalized Mutual Information)
NMI is a metric used to compare clustering assignments by measuring how much information one clustering shares with another, normalized to lie between 0 (independent) and 1 (identical).
Noise-to-Signal Ratio
Noise-to-signal ratio measures how much random variation (noise) exists relative to the meaningful pattern (signal) you want to detect.
Non-Negative Matrix Factorization (NMF)
NMF factorizes a non-negative matrix into two smaller non-negative matrices, often used for interpretable topic-like decompositions.
Non-Production Data Masking
Non-production data masking is the practice of anonymizing, tokenizing, or synthesizing sensitive data before it is used in dev/staging/test environments.
Normal Form (Database)
In databases, normal forms (1NF, 2NF, 3NF, BCNF) describe levels of normalization that reduce redundancy and improve data integrity.
Normalized Cost per Answer
Normalized cost per answer is the cost of generating an AI answer adjusted for comparability (e.g., normalized by answer length, tokens, difficulty tier, or traffic segment).
Normalized RMSE (NRMSE)
NRMSE is RMSE normalized by a scale factor (e.g., range, mean, or standard deviation) to make errors comparable across datasets.
Null Value
A null value represents missing or unknown data (distinct from zero, empty string, or false).
O
Observed vs Expected
Compares actual system behavior to a baseline or model of expected behavior to detect anomalies and regressions.
OLAP
OLAP (Online Analytical Processing) is a technology for fast, multidimensional analysis of large datasets, enabling slice, dice, drill-down, and roll-up operations.
One-Hot Encoding
Represents a categorical value as a vector of zeros with a single 1 at the category index.
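A minimal sketch, assuming a fixed, known vocabulary of categories:

```python
def one_hot(category, vocabulary):
    """Vector of zeros with a single 1 at the category's index."""
    return [1 if c == category else 0 for c in vocabulary]

print(one_hot("blue", ["red", "green", "blue"]))  # [0, 0, 1]
```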
Ontology
Formal description of concepts, properties, and relationships in a knowledge domain.
Outlier
A data point that deviates significantly from the rest of the distribution.
Outlier Detection
Identifies anomalous data points or behaviors that differ from expected patterns.
P
p-Hacking
Manipulating analysis choices (stopping rules, segmentation, metrics, exclusions) to obtain statistically significant results.
p-Value
The probability of observing results at least as extreme as what you observed if the null hypothesis were true.
PII (Personally Identifiable Information)
Information that can identify a person directly or indirectly (e.g., name, email, phone number, government IDs).
Precision
The proportion of correctly classified positive cases out of all cases classified as positive.
Precision and Recall
Two complementary metrics for evaluating classification models on imbalanced data.
Precision@k
Measures how many of the top-k retrieved items are relevant (relevant items in top-k ÷ k).
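A minimal sketch with made-up document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are in the relevant set."""
    return sum(item in relevant for item in retrieved[:k]) / k

print(precision_at_k(["d1", "d7", "d3", "d9"], {"d1", "d3"}, k=3))  # 2/3
```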
Provenance
Provenance is metadata that describes the origin, history, and transformation path of data or content—where it came from, how it changed, and who/what changed it.
Pseudonymization
Replaces identifiers with pseudonyms so data can't be directly attributed to a person without additional information kept separately.
Q
Quality-Adjusted Cost per Answer
Quality-adjusted cost per answer is cost-per-answer interpreted alongside quality metrics, ensuring cost savings don't come from degraded outputs.
Quantile
A quantile is a value below which a certain percentage of observations fall (e.g., p50/median, p95, p99).
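A minimal nearest-rank sketch (real libraries typically interpolate between points); the latencies are made up:

```python
def quantile(values, q):
    """Nearest-rank quantile: value below which ~q of observations fall."""
    s = sorted(values)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

latencies_ms = [12, 15, 17, 20, 22, 25, 30, 45, 80, 200]
print(quantile(latencies_ms, 0.50))  # 25  -> p50
print(quantile(latencies_ms, 0.95))  # 200 -> p95
```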
Quantile Regression
Quantile regression predicts a chosen quantile of the target distribution (e.g., p90 outcome) rather than the mean.
Quasi-Experiment
A quasi-experiment estimates causal effects without random assignment, using designs like difference-in-differences, regression discontinuity, or matching.
Quasi-Identifier
A quasi-identifier is a data attribute (or combination) that may not uniquely identify someone alone, but can identify them when combined with other attributes.
Query Optimizer
A query optimizer is the system component that chooses an efficient query plan, often based on statistics and heuristics.
Query Plan
A query plan is the execution strategy a database/search engine uses to answer a query (joins, index usage, filters, scan order).
R
Recall
The proportion of correctly identified positive cases out of all actual positive cases.
Redaction
Redaction is removing or masking sensitive information (PII, secrets, credentials) from text, logs, documents, or outputs.
Reporting
The process of collecting, organizing, and presenting data in structured formats (reports, dashboards) to inform stakeholders and support decisions.
S
Sampling
Sampling is selecting a subset of data (or outcomes) from a larger population/process to estimate properties, reduce cost, or enable exploration.
Scenario Analysis
Scenario analysis evaluates outcomes under a set of coherent, plausible future conditions (scenarios), rather than changing one variable at a time.
Schema
A schema defines the structure, organization, and constraints of data – whether in databases, APIs, or structured data formats.
Schema-on-Read
Schema-on-Read is a data management approach where the structure of data is applied only at query time, not when storing.
Segment Analysis
Segment analysis breaks metrics down by meaningful groups (segments) such as channel, device, region, customer tier, or intent.
Sensitivity Analysis
Sensitivity analysis evaluates how changes in inputs affect outputs, to understand robustness and key drivers.
Sentiment Score
Numerical value that quantifies the emotional polarity of a text.
Session
Period of user interaction with a website or app.
Sessionization
Sessionization groups user events into sessions to analyze behavior over time (page flows, search sequences, conversions).
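A minimal sketch using a 30-minute inactivity timeout, a common but arbitrary convention; the timestamps are made up:

```python
SESSION_TIMEOUT_S = 30 * 60  # 30 minutes of inactivity ends a session

def sessionize(event_timestamps):
    """Split a sorted list of event timestamps (in seconds) into sessions."""
    sessions, current = [], [event_timestamps[0]]
    for ts in event_timestamps[1:]:
        if ts - current[-1] > SESSION_TIMEOUT_S:
            sessions.append(current)  # gap too large: close the session
            current = [ts]
        else:
            current.append(ts)
    sessions.append(current)
    return sessions

print(sessionize([0, 60, 120, 4000, 4100]))  # two sessions
```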
SimHash
SimHash is a fingerprinting method that produces a compact hash where similar documents tend to have similar hashes (small Hamming distance).
Simpson's Paradox
Simpson's paradox is when a trend appears in multiple groups but reverses or disappears when the groups are combined, due to confounding and aggregation.
Snorkel
Snorkel is a framework for programmatic data labeling that uses labeling functions instead of manual annotation to efficiently create large training datasets.
Snowflake
Snowflake is a cloud-native data warehouse platform that separates storage and compute, enabling scalable data analysis with SQL.
Statistical Significance
A result is statistically significant when it would be unlikely to occur by chance alone under the null hypothesis – that is, when its p-value falls below a defined threshold (usually 0.05).
Streaming Data
Continuous flow of data that is processed in real time.
T
Taxonomy
A taxonomy is a hierarchical classification system that organizes concepts, content, or entities into ordered categories and subcategories.
Time Series
Sequence of data points ordered in time.
Time Series Analysis
Analysis of data points collected over time to identify patterns.
Topic Modeling
Unsupervised ML method for discovering abstract topics in document collections.
U
UDF (User-Defined Function)
A UDF is a custom function written by the user to extend a platform's built-in capabilities (e.g., SQL engines, data warehouses).
Unit Economics
Unit economics measures revenue against variable cost per unit (customer, query, workflow) to assess profitability at the unit level.
Unstructured Data
Unstructured data is not stored in a predefined schema (PDFs, emails, chats, wikis, tickets).
Usage Telemetry
Usage telemetry captures how a product is used (events, funnels, intent patterns).
V
Validation Set
A validation set is a held-out dataset used during model development to tune hyperparameters and select model versions without touching the final test set.
Variance
Variance is the degree to which a model's performance changes across different datasets/samples; high variance often indicates sensitivity to training data (overfitting risk).
Vector Database
A vector database stores embeddings and supports fast similarity search (nearest neighbors), often with metadata filtering and indexing for scale.
Vector Embedding
A vector embedding is a numerical representation (array of floats) of text, images, or other data that encodes semantic meaning in a high-dimensional space.
Vector Index
A vector index is the data structure/algorithm used to speed up nearest-neighbor search over embeddings at scale.
Vector Quantization
Vector quantization (VQ) compresses continuous vectors by mapping them to a finite set of representative vectors (a codebook).
Vector Search
Vector search retrieves items by similarity in an embedding space rather than exact keyword match.
Vector Similarity
Vector similarity is a measure of how close two embeddings are (commonly cosine similarity or dot product).
Vector Store
A vector store is the storage layer (database or service) that holds embeddings plus metadata for retrieval and similarity search.
Vector Store Hygiene
Vector store hygiene is the operational discipline of keeping a vector store accurate, secure, performant, and up-to-date (dedupe, versioning, ACL correctness, drift monitoring, purge workflows).
Y
Y-Axis Compression
Y-axis compression is a visualization issue where scaling choices flatten differences, making changes look smaller (or larger) than they are.
Yield
Yield is the proportion of inputs that successfully produce acceptable outputs (e.g., successful runs, valid records, passing artifacts).
Yield Rate
Yield rate is yield expressed as a percentage over a defined population and time window.
Yottabyte
A yottabyte (YB) is a unit of data equal to 10²⁴ bytes (a septillion bytes).
YoY (Year-over-Year)
Year-over-Year (YoY) compares a metric to the same period in the previous year (e.g., Jan 2026 vs Jan 2025).
YTD (Year-to-Date)
Year-to-Date (YTD) measures performance from the start of the current year up to today.
Yule–Simpson Paradox
The Yule–Simpson paradox (often called Simpson's paradox) occurs when a trend appears in several groups but reverses or disappears when the groups are combined.
Z
Z-Order Curve
A Z-order curve (Morton order) is a space-filling curve that maps multi-dimensional data into a one-dimensional ordering while preserving locality.
Z-Ordering
Z-ordering is the practice of physically organizing stored data using Z-order curves so that related values are colocated on disk.
Z-Score
A z-score is the number of standard deviations a data point is from the mean.
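A minimal sketch using the sample standard deviation:

```python
import statistics

def z_scores(values):
    """Standardize values: (x - mean) / stdev."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    return [(x - mu) / sigma for x in values]

print(z_scores([10, 20, 30]))  # [-1.0, 0.0, 1.0]
```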
Z-Test
A z-test is a statistical hypothesis test of whether a sample mean differs from a known population mean (or whether two means differ), assuming the population variance is known or the sample is large.
Zero-ETL
Zero-ETL refers to architectures that minimize or eliminate traditional ETL pipelines by enabling near-direct data access/replication between systems with low operational overhead.
Zero-Party Data
Zero-party data is data a customer intentionally and proactively shares with a brand (preferences, intents, goals), rather than inferred or tracked.
Zettabyte
A zettabyte (ZB) is a unit of data equal to 10²¹ bytes.
Zipf's Law
Zipf's law describes how, in many natural datasets (language, queries), a few items are extremely frequent while most items are rare (long-tail distribution).
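A minimal sketch of the idealized distribution, where frequency is proportional to 1/rank:

```python
# Share of occurrences per rank under an idealized Zipf distribution
# over a made-up vocabulary of 1000 items.
n_items = 1000
norm = sum(1 / r for r in range(1, n_items + 1))
for rank in [1, 2, 3, 10, 100]:
    print(f"rank {rank:>3}: {(1 / rank) / norm:.2%} of all occurrences")
```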