Data & Analytics

Data Terms A-Z

Understand the language of data: From Big Data to ETL to Predictive Analytics – all important terms for data-driven marketing and Business Intelligence.


161 terms in Data & Analytics

A

Accuracy

A metric in machine learning that measures the proportion of correct predictions made by a model out of all predictions made.

Analytics

The systematic analysis of data to gain insights and support decision-making.

Anomaly Detection

Identification of unusual patterns or outliers in data.

B

Batch Processing

Processing large amounts of data in collected batches rather than in real time.

Benchmark

A reference point or standard against which performance is measured and compared.

Business Intelligence

Business Intelligence (BI) is the practice and tooling for transforming data into dashboards, reports, and analyses that support business decision-making.

C

Causal Inference

Causal inference is the discipline of estimating cause-and-effect relationships (what would happen if we changed X), not just correlations.

Chain of Custody

Chain of custody is the documented trail of how an artifact (data, evidence, content) was collected, handled, stored, and accessed—ensuring integrity and accountability.

Clickstream Data

A time-ordered record of user interactions (clicks, page views, events) across digital properties such as websites and apps.

Cohen's Kappa

A statistic for measuring inter-rater reliability for categorical ratings, corrected for chance agreement.
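
As a rough sketch of the chance correction, kappa is commonly computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. The function name and the sample ratings below are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings, made up for illustration.
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

Here the raters agree on 3 of 4 items (0.75), but half of that agreement would be expected by chance, so kappa is lower than the raw agreement rate.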

Cohort Analysis

Cohort analysis groups users or entities by a shared starting event/time (e.g., signup week) and tracks behavior over time.

Confounding

Confounding occurs when a third variable influences both the "cause" and the "effect," creating a misleading association between them.

Confusion Matrix

A table that summarizes classification performance by counting true positives, false positives, true negatives, and false negatives.
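
A minimal sketch of how the four cells can be counted for a binary task, and how accuracy, precision, and recall follow from them. The helper name and the labels are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}


# Hypothetical labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
```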

Content Fingerprinting

Content fingerprinting creates a compact signature (fingerprint) of content to enable identification, deduplication, similarity detection, or provenance tracking.

Cosine Similarity

Measure of similarity between two vectors based on the angle between them.
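
In formula terms this is the dot product of the two vectors divided by the product of their lengths. A small sketch using only the standard library (the function name is an assumption for illustration):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2], [2, 4])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])      # perpendicular vectors
```

Parallel vectors score close to 1 regardless of their magnitudes, and perpendicular vectors score 0.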

Customer Data Platform (CDP)

Central system for unifying customer data from all sources.

D

Dashboard

A visual interface that presents key metrics, trends, and alerts to support decision-making.

Data Catalog

A searchable inventory of an organization's data assets including metadata, ownership, and documentation.

Data Dictionary

Documentation that defines the meaning, format, allowed values, and usage of data fields.

Data Drift

The change in statistical properties of input data over time, which can degrade model performance.

Data Enrichment

Adding additional attributes to existing data—via internal joins or external sources (firmographic providers, geo data).

Data Governance

Data governance is the framework of policies, roles, processes, and controls that ensure data is accurate, secure, compliant, and usable across an organization.

Data Labeling

Process of annotating data with ground truth for supervised learning.

Data Lake

Central storage for large amounts of unstructured and structured data.

Data Layout

The physical or logical arrangement of data in memory or on storage media, which influences access speed, cache efficiency, and processing performance.

Data Lineage

Data lineage describes where data comes from, how it moves through systems, and how it is transformed into downstream datasets and outputs.

Data Mesh

Decentralized approach to data architecture with domain-oriented data products.

Data Mining

The process of discovering patterns, anomalies, and relationships in large datasets using statistical and machine learning methods.

Data Pipeline

A sequence of processes that moves and transforms data from sources to destinations (lake, warehouse, feature store, vector index).

Data Preprocessing

Transforming raw data into a form suitable for modeling or analysis (cleaning, normalization, encoding).

Data Visualization

The graphical representation of data to communicate insights and patterns.

Data Warehouse

A system optimized for structured analytics queries over curated, cleaned data—often with strong governance and performance.

Databricks

Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on Apache Spark.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds clusters based on density of data points and automatically identifies outliers.

Decision Support System (DSS)

A Decision Support System (DSS) helps people make better decisions by combining data, models, and user interfaces.

Decision Threshold

The cutoff used to convert a model score/probability into an action (e.g., approve/deny, route/escalate).

Deduplication

Deduplication is identifying and removing duplicate (or near-duplicate) items to reduce redundancy and improve quality.

Demand Forecasting

Prediction of future demand based on historical data and factors.

Differential Privacy

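A mathematical framework that formally bounds how much any single individual's data can influence the output of an analysis or model, typically by adding calibrated noise, so that little can be inferred about any one individual from aggregates or models.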

Dimensionality Reduction

Techniques for reducing the number of features while preserving important information.

Distribution Shift

A mismatch between the data distribution seen in training and the distribution encountered in deployment.

E

ELT

ELT (Extract, Load, Transform) is a data integration paradigm where raw data is first loaded into a data warehouse and then transformed there.

Entity Resolution

Entity resolution is the process of identifying, matching, and merging multiple records from different sources that refer to the same real-world entity (person, company, product) — even when spellings, IDs, or fields are not identical.

Error Rate

Error rate is the proportion of outcomes that are incorrect relative to a defined ground truth or acceptance criteria.

ETL (Extract, Transform, Load)

Extract, Transform, Load – the process of extracting data, transforming it, and loading it into target systems.

Euclidean Distance

Geometric distance between two points in vector space.

Event Tracking

The capture and analysis of user interactions and actions on digital platforms.

Exploratory Data Analysis

The process of visually and statistically examining data before model building.

F

F1 Score

The harmonic mean of precision and recall, a single metric that balances both aspects of classification performance.
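
A short sketch of the harmonic mean in code, assuming precision and recall are already available as numbers between 0 and 1. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: perfect recall but only 50% precision.
f1 = f1_score(0.5, 1.0)
```

The result (about 0.67) sits below the arithmetic mean of 0.75, which reflects how the harmonic mean penalizes an imbalance between the two metrics.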

Feature Engineering

The process of selecting, transforming, and creating input variables (features) for machine learning models to improve their predictive power.

Feature Importance

Feature importance quantifies how much each input feature contributes to a model's predictions (globally or for a specific prediction).

FinOps for AI

FinOps for AI applies financial operations practices (cost visibility, optimization, budgeting, accountability) to AI workloads and AI product usage.

First-Party Data

Data that an organization collects directly from its own customers and users.

First-Party Data AI

Strategic approach of using proprietary customer data as a differentiation layer on top of generic foundation models.

Fraud Detection

AI-powered detection of fraudulent activities and transactions.

Fuzzy Matching

Techniques for finding approximate rather than exact matches in data.
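
One widely used building block is edit distance. The sketch below implements Levenshtein distance, which is only one of several possible techniques (others include phonetic codes and token-based similarity); the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

A matching rule might then treat two strings as the same when the distance is small relative to their length; the right threshold depends on the data.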

G

Gaussian Distribution

A symmetric, bell-shaped continuous probability distribution defined by its mean and standard deviation; also known as the normal distribution.

H

Heatmap

A visual representation of data where values are encoded by color intensity.

Hit Rate

Measures the proportion of queries for which at least one relevant result appears in the top-k results; it equals Recall@k when each query has exactly one relevant item.

Hypothesis Testing

Hypothesis testing is a class of statistical procedures used to evaluate whether a claim about a population (alternative hypothesis), based on sample data, is statistically defensible compared with a default assumption (null hypothesis).

I

Insights

Insights are meaningful interpretations of data that reduce uncertainty and enable better decisions (descriptive, diagnostic, predictive, or prescriptive).

Inter-Annotator Agreement (IAA)

A metric for measuring the agreement between different annotators when evaluating the same data.

K

K-Anonymity

K-anonymity is a privacy property where each record in a dataset is indistinguishable from at least k−1 other records with respect to quasi-identifiers.
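
A minimal sketch of how the k of a dataset could be measured: group records by their quasi-identifier values and take the size of the smallest group. The field names and records are hypothetical.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    quasi-identifier values; the dataset is k-anonymous for this k."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical, already generalized records.
records = [
    {"zip": "101**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "101**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "102**", "age_band": "40-49", "diagnosis": "A"},
    {"zip": "102**", "age_band": "40-49", "diagnosis": "C"},
    {"zip": "102**", "age_band": "40-49", "diagnosis": "A"},
]
k = k_anonymity_level(records, ["zip", "age_band"])
```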

Kalman Filter

A Kalman filter is an algorithm for estimating the hidden state of a system over time from noisy measurements.

Kaplan-Meier Estimator

The Kaplan–Meier estimator estimates a survival function (probability of "not yet churned" over time), handling censored data.

L

Lift

Lift is the incremental change in an outcome attributable to an intervention.

Lift Chart

A lift chart shows how well a model ranks positives by comparing outcomes across scored segments.

Locality-Sensitive Hashing (LSH)

Locality-Sensitive Hashing (LSH) is a technique that hashes similar items into the same "buckets" with high probability, enabling fast approximate similarity search.

M

MAP (Mean Average Precision)

The average of Average Precision across all queries – considers both precision and ranking position of all relevant documents.

Master Data Management (MDM)

Master Data Management (MDM) is an approach to ensure critical enterprise data (e.g., customers, products, locations) is consistent, accurate, and governed across systems—often aiming for a "single source/version of truth."

MinHash

MinHash is a technique to efficiently estimate similarity between sets (especially Jaccard similarity), commonly used for near-duplicate detection.
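
The idea can be sketched as follows: for each of many hash functions, keep the minimum hash value over the set; the fraction of positions where two signatures agree estimates the Jaccard similarity. Deriving the hash family from seeded `hashlib.blake2b` digests is an assumption made here for simplicity, not a prescribed scheme.

```python
import hashlib


def _seeded_hash(seed, token):
    """64-bit hash of a token, varied by seed to simulate a hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("items must not be empty")
    return [
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Share of signature positions where both sets have the same minimum."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical token sets; their true Jaccard similarity is 2/6.
set_a = {"a", "b", "c", "d"}
set_b = {"c", "d", "e", "f"}
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
```

More hash functions give a tighter estimate at the cost of longer signatures.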

Minimum Detectable Effect (MDE)

MDE is the smallest true effect size an experiment can reliably detect given traffic, variance, significance level, and power.

MRR (Mean Reciprocal Rank)

The average of the reciprocal ranks of the first relevant result across all queries – MRR = 1/n × Σ(1/rank_i).
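
The formula translates directly into code. In this sketch, a query with no relevant result is passed as `None` and contributes 0, which is a common convention but an assumption here.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR from the 1-based rank of the first relevant result per query.

    Use None for a query with no relevant result; it contributes 0.
    """
    if not first_relevant_ranks:
        raise ValueError("at least one query is required")
    total = sum(1 / rank if rank else 0.0 for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


# Hypothetical ranks for four queries.
mrr = mean_reciprocal_rank([1, 2, None, 4])
```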

N

NaN (Not a Number)

NaN is a special floating-point value meaning "Not a Number," used to represent undefined or unrepresentable numeric results (e.g., 0/0).

Natural Experiment

A natural experiment uses real-world events or operational changes (not randomized by you) that approximate random assignment, enabling causal inference under assumptions.

NDCG (Normalized Discounted Cumulative Gain)

A ranking metric that considers both relevance grades and positions in the ranking – higher-ranked relevant items are weighted more heavily.
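
A small sketch of the calculation, assuming the linear-gain variant in which each relevance grade is divided by log2(position + 1); some implementations use 2^rel - 1 as the gain instead. Function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades in the order the system returned them.
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

The ideal ordering scores 1.0, and the reversed ordering scores lower because the most relevant item is discounted most heavily.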

NDJSON (Newline-Delimited JSON)

NDJSON is a format where each line is a valid JSON object—making it easy to stream, append, and process logs/events at scale.
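
Because every line stands alone, reading and writing NDJSON needs nothing beyond a JSON parser and a line split. A minimal sketch with the standard library (helper names are assumptions):

```python
import json


def parse_ndjson(text):
    """Parse NDJSON text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def to_ndjson(records):
    """Serialize records as one compact JSON document per line."""
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


# Hypothetical event log.
events = parse_ndjson('{"event":"click","id":1}\n{"event":"view","id":2}\n')
```

For large files, iterating over the file object line by line keeps memory use constant instead of loading everything at once.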

Negative Binomial Regression

Negative binomial regression is a statistical model for count data (e.g., clicks, conversions) that handles overdispersion (variance > mean), unlike Poisson regression.

Negative Control

A negative control is a variable, outcome, or test condition that should not be affected by an intervention—used to detect bias, confounding, or measurement artifacts.

NHST (Null Hypothesis Significance Testing)

NHST is the traditional statistical testing framework where you test whether observed data is unlikely under a null hypothesis (often "no effect"), typically using p-values.

NMI (Normalized Mutual Information)

NMI is a metric used to compare clustering assignments by measuring how much information one clustering shares with another, normalized to a range from 0 (no shared information) to 1 (identical clusterings).

Noise-to-Signal Ratio

Noise-to-signal ratio measures how much random variation (noise) exists relative to the meaningful pattern (signal) you want to detect.

Non-Negative Matrix Factorization (NMF)

NMF factorizes a non-negative matrix into two smaller non-negative matrices, often used for interpretable topic-like decompositions.

Non-Production Data Masking

Non-production data masking is the practice of anonymizing, tokenizing, or synthesizing sensitive data before it is used in dev/staging/test environments.

Normal Form (Database)

In databases, normal forms (1NF, 2NF, 3NF, BCNF) describe levels of normalization that reduce redundancy and improve data integrity.

Normalized Cost per Answer

Normalized cost per answer is the cost of generating an AI answer adjusted for comparability (e.g., normalized by answer length, tokens, difficulty tier, or traffic segment).

Normalized RMSE (NRMSE)

NRMSE is RMSE normalized by a scale factor (e.g., range, mean, or standard deviation) to make errors comparable across datasets.

Null Value

A null value represents missing or unknown data (distinct from zero, empty string, or false).

O

Observed vs Expected

Compares actual system behavior to a baseline or model of expected behavior to detect anomalies and regressions.

OLAP

A technology for fast, multidimensional analysis of large datasets, enabling slice, dice, drill-down, and roll-up operations.

One-Hot Encoding

Represents a categorical value as a vector of zeros with a single 1 at the category index.

Ontology

Formal description of concepts, properties, and relationships in a knowledge domain.

Outlier

A data point that deviates significantly from the rest of the distribution.

Outlier Detection

Identifies anomalous data points or behaviors that differ from expected patterns.

P

p-Hacking

Manipulating analysis choices (stopping rules, segmentation, metrics, exclusions) to obtain statistically significant results.

p-Value

The probability of observing results at least as extreme as what you observed if the null hypothesis were true.

PII (Personally Identifiable Information)

Information that can identify a person directly or indirectly (e.g., name, email, phone number, government IDs).

Precision

The proportion of correctly classified positive cases out of all cases classified as positive.

Precision and Recall

Two complementary metrics for evaluating classification models, especially useful on imbalanced data where accuracy alone can be misleading.

Precision@k

Measures how many of the top-k retrieved items are relevant (relevant items in top-k ÷ k).
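
A direct sketch of that ratio; the document IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


# Hypothetical ranked results and relevance judgments.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p_at_3 = precision_at_k(retrieved, relevant, k=3)
p_at_4 = precision_at_k(retrieved, relevant, k=4)
```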

Provenance

Provenance is metadata that describes the origin, history, and transformation path of data or content—where it came from, how it changed, and who/what changed it.

Pseudonymization

Replaces identifiers with pseudonyms so data can't be directly attributed to a person without additional information kept separately.

Q

Quality-Adjusted Cost per Answer

Quality-adjusted cost per answer is cost-per-answer interpreted alongside quality metrics, ensuring cost savings don't come from degraded outputs.

Quantile

A quantile is a value below which a certain percentage of observations fall (e.g., p50/median, p95, p99).

Quantile Regression

Quantile regression predicts a chosen quantile of the target distribution (e.g., p90 outcome) rather than the mean.

Quasi-Experiment

A quasi-experiment estimates causal effects without random assignment, using designs like difference-in-differences, regression discontinuity, or matching.

Quasi-Identifier

A quasi-identifier is a data attribute (or combination) that may not uniquely identify someone alone, but can identify them when combined with other attributes.

Query Optimizer

A query optimizer is the system component that chooses an efficient query plan, often based on statistics and heuristics.

Query Plan

A query plan is the execution strategy a database/search engine uses to answer a query (joins, index usage, filters, scan order).

R

Recall

The proportion of correctly identified positive cases out of all actual positive cases.

Redaction

Redaction is removing or masking sensitive information (PII, secrets, credentials) from text, logs, documents, or outputs.

Reporting

The process of collecting, organizing, and presenting data in structured formats (reports, dashboards) to inform stakeholders and support decisions.

S

Sampling

Sampling is selecting a subset of data (or outcomes) from a larger population/process to estimate properties, reduce cost, or enable exploration.

Scenario Analysis

Scenario analysis evaluates outcomes under a set of coherent, plausible future conditions (scenarios), rather than changing one variable at a time.

Schema

A Schema defines the structure, organization, and constraints of data – whether in databases, APIs, or structured data formats.

Schema-on-Read

Schema-on-Read is a data management approach where the structure of data is applied only at query time, not when storing.

Segment Analysis

Segment analysis breaks metrics down by meaningful groups (segments) such as channel, device, region, customer tier, or intent.

Sensitivity Analysis

Sensitivity analysis evaluates how changes in inputs affect outputs, to understand robustness and key drivers.

Sentiment Score

Numerical value that quantifies the emotional polarity of a text.

Session

Period of user interaction with a website or app.

Sessionization

Sessionization groups user events into sessions to analyze behavior over time (page flows, search sequences, conversions).
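
One common way to do this is an inactivity timeout: a new session starts whenever the gap since the previous event exceeds a threshold. The 30-minute default below is a widespread convention, used here as an assumption; timestamps are in seconds.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Group event timestamps (seconds) into sessions split by inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)  # continues the current session
        else:
            sessions.append([ts])    # gap too large: start a new session
    return sessions


# Hypothetical event times for a single user.
sessions = sessionize([0, 600, 4000, 4100, 10000])
```

In practice the events would first be partitioned per user or device before applying the gap rule.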

SimHash

SimHash is a fingerprinting method that produces a compact hash where similar documents tend to have similar hashes (small Hamming distance).

Simpson's Paradox

Simpson's paradox is when a trend appears in multiple groups but reverses or disappears when the groups are combined, due to confounding and aggregation.

Snorkel

Snorkel is a framework for programmatic data labeling that uses labeling functions instead of manual annotation to efficiently create large training datasets.

Snowflake

Snowflake is a cloud-native data warehouse platform that separates storage and compute, enabling scalable data analysis with SQL.

Statistical Significance

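Statistical significance indicates that an observed effect would be unlikely to arise by chance alone if the null hypothesis were true. It is judged by comparing the p-value against a predefined threshold (usually 0.05); it is not the probability that the effect is real.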

Streaming Data

A continuous flow of data that is processed in real time.

T

Taxonomy

A Taxonomy is a hierarchical classification system that organizes concepts, content, or entities into ordered categories and subcategories.

Time Series

Sequence of data points ordered in time.

Time Series Analysis

Analysis of data points collected over time to identify patterns.

Topic Modeling

Unsupervised ML method for discovering abstract topics in document collections.

U

UDF (User-Defined Function)

A UDF is a custom function written by users to extend the built-in functionality of a platform (SQL engines, data warehouses).

Unit Economics

Unit economics measures profitability per unit (customer, query, workflow) by comparing per-unit revenue with per-unit variable costs.

Unstructured Data

Unstructured data is not stored in a predefined schema (PDFs, emails, chats, wikis, tickets).

Usage Telemetry

Usage telemetry captures how a product is used (events, funnels, intent patterns).

V

Validation Set

A validation set is a held-out dataset used during model development to tune hyperparameters and select model versions without touching the final test set.

Variance

In model evaluation, variance is the degree to which a model's performance changes across different datasets/samples; high variance often indicates sensitivity to training data (overfitting risk).

Vector Database

A vector database stores embeddings and supports fast similarity search (nearest neighbors), often with metadata filtering and indexing for scale.

Vector Embedding

A vector embedding is a numerical representation (array of floats) of text, images, or other data that encodes semantic meaning in a high-dimensional space.

Vector Index

A vector index is the data structure/algorithm used to speed up nearest-neighbor search over embeddings at scale.

Vector Quantization

Vector quantization (VQ) compresses continuous vectors by mapping them to a finite set of representative vectors (a codebook).

Vector Search

Vector search retrieves items by similarity in an embedding space rather than exact keyword match.

Vector Similarity

Vector similarity is a measure of how close two embeddings are (commonly cosine similarity or dot product).

Vector Store

A vector store is the storage layer (database or service) that holds embeddings plus metadata for retrieval and similarity search.

Vector Store Hygiene

Vector store hygiene is the operational discipline of keeping a vector store accurate, secure, performant, and up-to-date (dedupe, versioning, ACL correctness, drift monitoring, purge workflows).

W

What-If Analysis

What-if analysis explores how outcomes change when you alter inputs, assumptions, or decisions.

Y

Y-Axis Compression

Y-axis compression is a visualization issue where scaling choices flatten differences, making changes look smaller (or larger) than they are.

Yield

Yield is the proportion of inputs that successfully produce acceptable outputs (e.g., successful runs, valid records, passing artifacts).

Yield Rate

Yield rate is yield expressed as a percentage over a defined population and time window.

Yottabyte

A yottabyte (YB) is a unit of data equal to 10²⁴ bytes (a septillion bytes).

YoY (Year-over-Year)

Year-over-Year (YoY) compares a metric to the same period in the previous year (e.g., Jan 2026 vs Jan 2025).

YTD (Year-to-Date)

Year-to-Date (YTD) measures performance from the start of the current year up to today.

Yule–Simpson Paradox

The Yule–Simpson paradox (often called Simpson's paradox) occurs when a trend appears in several groups but reverses or disappears when the groups are combined.

Z

Z-Order Curve

A Z-order curve (Morton order) is a space-filling curve that maps multi-dimensional data into a one-dimensional ordering while largely preserving locality.
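
For two dimensions, the mapping can be sketched as bit interleaving: the bits of x fill the even positions of the result and the bits of y fill the odd positions. The function name and the 16-bit default are assumptions.

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bit goes to an even position
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bit goes to an odd position
    return code


# The four cells of a 2x2 grid are visited in a "Z" shape: 0, 1, 2, 3.
z_path = [morton_encode(x, y) for y in (0, 1) for x in (0, 1)]
```

Sorting records by this code tends to keep points that are close in (x, y) close in the one-dimensional order, although neighbors across a quadrant boundary can end up far apart.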

Z-Ordering

Z-ordering is the practice of physically organizing stored data using Z-order curves so that related values are colocated on disk.

Z-Score

A z-score is the number of standard deviations a data point is from the mean.
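
As a formula, z = (x - mean) / standard deviation. The sketch below uses the population standard deviation; using the sample standard deviation instead is an equally common choice.

```python
import statistics


def z_scores(values):
    """Z-score of each value relative to the mean and population std dev."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    if std_dev == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(value - mean) / std_dev for value in values]


# Hypothetical data with mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```

The largest value, 9, lies two standard deviations above the mean, so its z-score is 2.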

Z-Test

A z-test is a statistical hypothesis test used to determine whether a sample mean differs from a known population mean (or whether two means differ), typically assuming the population variance is known or the sample is large enough for the normal approximation to hold.
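
A sketch of the one-sample, two-sided case, assuming the population standard deviation is known. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, population_mean, population_sd, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-sided p-value: probability of a result at least this extreme
    # in either direction under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample: mean 52 from n=100, tested against mean 50, sd 10.
z, p_value = one_sample_z_test(52, 50, 10, 100)
```

With these inputs the z statistic is 2 and the p-value is a little under 0.05, so the difference would count as significant at the usual 0.05 threshold.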

Zero-ETL

Zero-ETL refers to architectures that minimize or eliminate traditional ETL pipelines by enabling near-direct data access/replication between systems with low operational overhead.

Zero-Party Data

Zero-party data is data a customer intentionally and proactively shares with a brand (preferences, intents, goals), rather than inferred or tracked.

Zettabyte

A zettabyte (ZB) is a unit of data equal to 10²¹ bytes.

Zipf's Law

Zipf's law describes how, in many natural datasets (language, queries), a few items are extremely frequent while most items are rare (long-tail distribution).
