    Data & Analytics

    Data Terms A-Z

    Understand the language of data: From Big Data to ETL to Predictive Analytics – all important terms for data-driven marketing and Business Intelligence.

    Big Data
    Data Lakes
    ETL Processes
    Business Intelligence
    Predictive Analytics
    Data Governance
    193 terms in Data & Analytics

    C

    Causal Inference

    Causal inference is the discipline of estimating cause-and-effect relationships (what would happen if we changed X), not just correlations.

    Chain of Custody

    Chain of custody is the documented trail of how an artifact (data, evidence, content) was collected, handled, stored, and accessed—ensuring integrity and accountability.

    Changepoint Detection

    Detection of time points at which the statistical properties of a time series significantly change.

    Clickstream Data

    A time-ordered record of user interactions (clicks, page views, events) across digital properties such as websites and apps.

    Cohen's Kappa

    A statistic for measuring inter-rater reliability for categorical ratings, corrected for chance agreement.
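
As a minimal sketch of the chance correction, assuming two raters who each label the same items: kappa compares the observed agreement with the agreement expected from each rater's label frequencies. The function name `cohens_kappa` and the example labels are illustrative, not taken from any library.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

In this example the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is 0.5.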

    Cohort Analysis

    Cohort analysis groups users or entities by a shared starting event/time (e.g., signup week) and tracks behavior over time.

    Confounding

    Confounding occurs when a third variable (a confounder) influences both the independent and the dependent variable, creating a spurious association between them.

    Confusion Matrix

    A table that summarizes classification performance by counting true positives, false positives, true negatives, and false negatives.
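
The four counts can be tallied directly from paired labels. This is a small sketch for the binary case; the function name `confusion_counts` and the `positive` parameter are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```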

    Content Fingerprinting

    Content fingerprinting creates a compact signature (fingerprint) of content to enable identification, deduplication, similarity detection, or provenance tracking.

    Cosine Similarity

    A measure of similarity between two vectors that calculates the cosine of the angle between them, independent of their magnitude.
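
A minimal sketch of the calculation in plain Python, assuming two equal-length lists of numbers; the function name is illustrative. The result is the dot product divided by the product of the vector lengths.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

Because of the division by both norms, `[1, 2]` and `[2, 4]` score approximately 1 even though their magnitudes differ, while orthogonal vectors score 0.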

    Customer Data Platform (CDP)

    Central system for unifying customer data from all sources.

    D

    Dashboard

    A visual interface that presents key metrics, trends, and alerts to support decision-making.

    Data Catalog

    A searchable inventory of an organization's data assets including metadata, ownership, and documentation.

    Data Clean Room

    A secure environment where multiple parties can combine their data for joint analyses without sharing raw data.

    Data Dictionary

    Documentation that defines the meaning, format, allowed values, and usage of data fields.

    Data Drift

    The change in statistical properties of input data over time, which can degrade model performance.

    Data Enrichment

    Adding additional attributes to existing data—via internal joins or external sources (firmographic providers, geo data).

    Data Governance

    The framework for policies, processes, and responsibilities to manage data assets in an organization.

    Data Labeling

    Process of annotating data with ground truth for supervised learning.

    Data Lake

    Central storage for large amounts of unstructured and structured data.

    Data Layout

    The physical or logical arrangement of data in memory or on storage media, which influences access speed, cache efficiency, and processing performance.

    Data Lineage

    Data lineage describes where data comes from, how it moves through systems, and how it is transformed into downstream datasets and outputs.

    Data Mesh

    Decentralized approach to data architecture with domain-oriented data products.

    Data Mining

    The process of discovering patterns, anomalies, and relationships in large datasets using statistical and machine learning methods.

    Data Pipeline

    A sequence of processes that moves and transforms data from sources to destinations (lake, warehouse, feature store, vector index).

    Data Preprocessing

    Transforming raw data into a form suitable for modeling or analysis (cleaning, normalization, encoding).

    Data Processing Agreement (DPA)

    A legally binding contract between data controller and data processor that governs the terms for processing personal data according to GDPR.

    Data Validation (ML)

    Automated checking of data quality, schema conformity, and statistical properties in ML pipelines.

    Data Visualization

    The graphical representation of data to communicate insights and patterns.

    Data Warehouse

    A system optimized for structured analytics queries over curated, cleaned data—often with strong governance and performance.

    Databricks

    Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on Apache Spark.

    DBSCAN

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds clusters based on density of data points and automatically identifies outliers.

    Decision Support System (DSS)

    A Decision Support System (DSS) helps people make better decisions by combining data, models, and user interfaces.

    Decision Threshold

    The cutoff used to convert a model score/probability into an action (e.g., approve/deny, route/escalate).

    Deduplication

    Deduplication is identifying and removing duplicate (or near-duplicate) items to reduce redundancy and improve quality.

    Demand Forecasting

    Prediction of future demand based on historical data and factors.

    Difference-in-Differences (DiD)

    Quasi-experimental method that estimates causal effects by comparing changes over time between treatment and control groups.
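
In its simplest two-group, two-period form, the estimate is the change in the treatment group minus the change in the control group. The sketch below assumes plain lists of outcome values and uses made-up numbers; the function name is hypothetical, and the estimate is only meaningful if both groups would have followed parallel trends without the treatment.

```python
from statistics import fmean

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Two-period difference-in-differences on group means."""
    treatment_change = fmean(treat_post) - fmean(treat_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    # The control group's change stands in for what would have happened
    # to the treatment group without the intervention.
    return treatment_change - control_change

# Illustrative values only, not real measurements.
effect = did_estimate(
    treat_pre=[10, 12], treat_post=[15, 17],
    control_pre=[9, 11], control_post=[11, 13],
)
```

Here the treatment group rises by 5 and the control group by 2, so the estimated effect is 3.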

    Differential Privacy

    A mathematically rigorous definition of privacy that bounds how much any single individual's data can change the output of an analysis, so that their participation is hard to infer even for attackers with arbitrary background knowledge.

    Dimensionality Reduction

    Techniques for reducing the number of features while preserving important information.

    Double Machine Learning (DML)

    Causal inference method that uses ML models to flexibly control for confounding while enabling valid statistical inference.

    N

    NaN (Not a Number)

    NaN is a special floating-point value meaning "Not a Number," used to represent undefined or unrepresentable numeric results (e.g., 0/0).
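
A short sketch of how NaN behaves in Python floats. One caveat: Python raises `ZeroDivisionError` for `0.0 / 0.0` instead of returning NaN, so the example produces NaN from `inf - inf`. The helper `mean_ignoring_nan` is a hypothetical name used for illustration.

```python
import math

# IEEE 754 defines inf - inf as NaN; Python follows this for floats.
undefined = math.inf - math.inf

# NaN is not equal to anything, including itself, so test with math.isnan().
is_nan = math.isnan(undefined)
equals_itself = undefined == undefined

def mean_ignoring_nan(values):
    """Average the values that are real numbers, skipping NaN entries."""
    clean = [v for v in values if not math.isnan(v)]
    if not clean:
        return math.nan
    return sum(clean) / len(clean)
```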

    Natural Experiment

    A natural experiment uses real-world events or operational changes (not randomized by you) that approximate random assignment, enabling causal inference under assumptions.

    NDCG (Normalized Discounted Cumulative Gain)

    A ranking metric that considers both relevance grades and positions in the ranking – higher-ranked relevant items are weighted more heavily.
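
A minimal sketch, assuming the linear-gain variant (relevance divided by log2 of rank + 1); another common variant uses a gain of 2^relevance - 1. Function names and the `k` cutoff parameter are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ranked = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # With no relevant items at all, define the score as 0.0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking already sorted by relevance, such as `[3, 2, 1]`, scores 1.0; the reversed ranking `[1, 2, 3]` scores lower because the most relevant item sits at the bottom.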

    NDJSON (Newline-Delimited JSON)

    NDJSON is a format where each line is a valid JSON object—making it easy to stream, append, and process logs/events at scale.
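
Reading and writing the format needs only the standard `json` module. The sketch below uses hypothetical helper names and an in-memory stream; with a real file, the same functions would work on an open file handle.

```python
import io
import json

def write_ndjson(records, stream):
    """Write one compact JSON document per line."""
    for record in records:
        # json.dumps without indent emits no raw newlines, so one record = one line.
        stream.write(json.dumps(record) + "\n")

def read_ndjson(stream):
    """Yield one parsed record per non-empty line, without loading the whole stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative events only.
events = [{"event": "click", "user": 1}, {"event": "view", "user": 2}]
buffer = io.StringIO()
write_ndjson(events, buffer)
buffer.seek(0)
round_trip = list(read_ndjson(buffer))
```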

    Negative Binomial Regression

    Negative binomial regression is a statistical model for count data (e.g., clicks, conversions) that handles overdispersion (variance > mean), unlike Poisson regression.

    Negative Control

    A negative control is a variable, outcome, or test condition that should not be affected by an intervention—used to detect bias, confounding, or measurement artifacts.

    NHST (Null Hypothesis Significance Testing)

    NHST is the traditional statistical testing framework where you test whether observed data is unlikely under a null hypothesis (often "no effect"), typically using p-values.

    NMI (Normalized Mutual Information)

    NMI is a metric used to compare clustering assignments by measuring how much information one clustering shares with another, normalized to the range 0 (no shared information) to 1 (identical clusterings).

    Noise-to-Signal Ratio

    Noise-to-signal ratio measures how much random variation (noise) exists relative to the meaningful pattern (signal) you want to detect.

    Non-Negative Matrix Factorization (NMF)

    NMF factorizes a non-negative matrix into two smaller non-negative matrices, often used for interpretable topic-like decompositions.

    Non-Production Data Masking

    Non-production data masking is the practice of anonymizing, tokenizing, or synthesizing sensitive data before it is used in dev/staging/test environments.

    Normal Form (Database)

    In databases, normal forms (1NF, 2NF, 3NF, BCNF) describe levels of normalization that reduce redundancy and improve data integrity.

    Normalized Cost per Answer

    Normalized cost per answer is the cost of generating an AI answer adjusted for comparability (e.g., normalized by answer length, tokens, difficulty tier, or traffic segment).

    Normalized RMSE (NRMSE)

    NRMSE is RMSE normalized by a scale factor (e.g., range, mean, or standard deviation) to make errors comparable across datasets.
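
The choice of scale factor changes the number, so it helps to state it explicitly. A small sketch with the three normalizations named above; the function names and the `method` parameter are assumptions for illustration, and the population standard deviation is used for the `"std"` option.

```python
import math
from statistics import fmean, pstdev

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(fmean((a - p) ** 2 for a, p in zip(actual, predicted)))

def nrmse(actual, predicted, method="range"):
    """RMSE divided by the range, mean, or standard deviation of the actual values."""
    if method == "range":
        scale = max(actual) - min(actual)
    elif method == "mean":
        scale = fmean(actual)
    elif method == "std":
        scale = pstdev(actual)
    else:
        raise ValueError(f"unknown method: {method}")
    if scale == 0:
        raise ValueError("scale factor is zero; NRMSE is undefined")
    return rmse(actual, predicted) / scale
```

For actual values `[1, 2, 3, 4, 5]` and predictions that are each off by 1, the RMSE is 1, so the range-normalized value is 0.25 and the mean-normalized value is about 0.33.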

    Nowcasting

    Forecasting the current or imminent state using high-frequency real-time data.

    Null Value

    A null value represents missing or unknown data (distinct from zero, empty string, or false).

    S

    Sampling

    Sampling is selecting a subset of data (or outcomes) from a larger population/process to estimate properties, reduce cost, or enable exploration.

    Scenario Analysis

    Scenario analysis evaluates outcomes under a set of coherent, plausible future conditions (scenarios), rather than changing one variable at a time.

    Schema

    A Schema defines the structure, organization, and constraints of data – whether in databases, APIs, or structured data formats.

    Schema-on-Read

    Schema-on-Read is a data management approach where the structure of data is applied only at query time, not when storing.

    Seasonality

    Regularly recurring patterns in time series that repeat at fixed intervals.

    Segment Analysis

    Segment analysis breaks metrics down by meaningful groups (segments) such as channel, device, region, customer tier, or intent.

    Sensitivity Analysis

    Sensitivity analysis evaluates how changes in inputs affect outputs, to understand robustness and key drivers.

    Sentiment Score

    Numerical value that quantifies the emotional polarity of a text.

    Session

    Period of user interaction with a website or app.

    Sessionization

    Sessionization groups user events into sessions to analyze behavior over time (page flows, search sequences, conversions).
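
One common approach is to start a new session after a period of inactivity. The sketch below assumes event timestamps in seconds for a single user and a 30-minute gap; both the threshold and the function name are illustrative choices, not a fixed standard.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Split one user's event timestamps into sessions at gaps longer than gap_seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap exceeded: start a new session
    return sessions

sessions = sessionize([0, 600, 1200, 5000, 5100])
```

The gap between 1200 and 5000 exceeds 1800 seconds, so the five events fall into two sessions.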

    SimHash

    SimHash is a fingerprinting method that produces a compact hash where similar documents tend to have similar hashes (small Hamming distance).
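
A simplified sketch of the idea, assuming whitespace tokens with equal weight and a 64-bit fingerprint. MD5 is used here only as a stable token hash, not for security; production variants typically weight tokens and use shingles. All names are illustrative.

```python
import hashlib

def simhash(text):
    """64-bit SimHash over lowercase whitespace tokens (unweighted)."""
    weights = [0] * 64
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Identical texts always have distance 0; near-duplicates tend to differ in only a few bits, which is why a small Hamming-distance threshold is used to flag them.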

    Simpson's Paradox

    Simpson's paradox is when a trend appears in multiple groups but reverses or disappears when the groups are combined, due to confounding and aggregation.
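
A worked example with made-up numbers (not real measurements) shows the reversal: variant A converts better within each segment, yet worse overall, because A's traffic is concentrated in the low-converting segment. The segment and variant names are hypothetical.

```python
# (conversions, visitors) per segment and variant; illustrative values only.
data = {
    "mobile":  {"A": (8, 10),   "B": (70, 100)},
    "desktop": {"A": (20, 100), "B": (1, 10)},
}

def segment_rate(segment, variant):
    """Conversion rate of one variant within one segment."""
    conversions, visitors = data[segment][variant]
    return conversions / visitors

def pooled_rate(variant):
    """Conversion rate of one variant with all segments combined."""
    conversions = sum(data[seg][variant][0] for seg in data)
    visitors = sum(data[seg][variant][1] for seg in data)
    return conversions / visitors
```

Within segments, A wins 80% to 70% on mobile and 20% to 10% on desktop. Pooled, A converts 28 of 110 visitors (about 25%) while B converts 71 of 110 (about 65%).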

    Snorkel

    Snorkel is a framework for programmatic data labeling that uses labeling functions instead of manual annotation to efficiently create large training datasets.

    Snowflake

    Snowflake is a cloud-native data warehouse platform that separates storage and compute, enabling scalable data analysis with SQL.

    Specificity

    The proportion of correctly classified negative cases out of all actual negative cases.
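
Expressed with confusion-matrix counts, this is TN / (TN + FP). A minimal sketch with an illustrative function name:

```python
def specificity(true_negatives, false_positives):
    """Share of actual negatives that were correctly classified as negative."""
    actual_negatives = true_negatives + false_positives
    if actual_negatives == 0:
        raise ValueError("specificity is undefined without actual negative cases")
    return true_negatives / actual_negatives
```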

    Stationarity

    A time series is stationary when its statistical properties remain constant over time.

    Statistical Significance

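    An effect is called statistically significant when its p-value, the probability of observing data at least as extreme as what was seen if the null hypothesis were true, falls below a predefined threshold (usually 0.05). The p-value is not the probability that the effect is real.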

    Streaming Data

    A continuous flow of data that is processed as it arrives, in real time or near real time.

    Survival Analysis

    Statistical method for analyzing time until an event occurs (e.g., churn, conversion, failure), accounting for censored data.

    Synthetic Data

    Artificially generated data that replicates statistical properties of real data – used for training, testing, and privacy protection when real data is scarce, sensitive, or expensive.

    V

    Validation Set

    A validation set is a held-out dataset used during model development to tune hyperparameters and select model versions without touching the final test set.

    Variance

    Variance is the degree to which a model's predictions change when it is trained on different samples of the data; high variance indicates sensitivity to the training data and a risk of overfitting.

    Vector Database

    A vector database stores embeddings and supports fast similarity search (nearest neighbors), often with metadata filtering and indexing for scale.

    Vector Embedding

    A vector embedding is a numerical representation (array of floats) of text, images, or other data that encodes semantic meaning in a high-dimensional space.

    Vector Index

    A vector index is the data structure/algorithm used to speed up nearest-neighbor search over embeddings at scale.

    Vector Quantization

    Vector quantization (VQ) compresses continuous vectors by mapping them to a finite set of representative vectors (a codebook).
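
The mapping step can be sketched as a nearest-neighbor lookup against the codebook. In practice the codebook is usually learned from data (for example with k-means); here it is hand-picked for illustration, and all function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to the vector (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))

def quantize(vectors, codebook):
    """Replace each vector with the index of its nearest codebook entry."""
    return [nearest_code(v, codebook) for v in vectors]

def reconstruct(codes, codebook):
    """Map stored indices back to their representative vectors (lossy)."""
    return [codebook[c] for c in codes]

codebook = [[0.0, 0.0], [10.0, 10.0]]
codes = quantize([[1.0, 2.0], [9.0, 8.0]], codebook)
```

Storing the small integer codes instead of the full vectors is what yields the compression; reconstruction returns the representative vector, not the original.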

    Vector Search

    Vector search retrieves items by similarity in an embedding space rather than exact keyword match.

    Vector Similarity

    Vector similarity is a measure of how close two embeddings are (commonly cosine similarity or dot product).

    Vector Store

    A vector store is the storage layer (database or service) that holds embeddings plus metadata for retrieval and similarity search.

    Vector Store Hygiene

    Vector store hygiene is the operational discipline of keeping a vector store accurate, secure, performant, and up-to-date (dedupe, versioning, ACL correctness, drift monitoring, purge workflows).
