    Data & Analytics

    Data Terms A-Z

    Understand the language of data: from Big Data to ETL to Predictive Analytics – all the key terms for data-driven marketing and Business Intelligence.

    Big Data
    Data Lakes
    ETL Processes
    Business Intelligence
    Predictive Analytics
    Data Governance
    161 terms in Data & Analytics

    D

    Dashboard

    A visual interface that presents key metrics, trends, and alerts to support decision-making.

    Data Catalog

    A searchable inventory of an organization's data assets including metadata, ownership, and documentation.

    Data Dictionary

    Documentation that defines the meaning, format, allowed values, and usage of data fields.

    Data Drift

    The change in statistical properties of input data over time, which can degrade model performance.
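
    A common way to check one numeric feature for drift is a two-sample test between the training and production distributions. Below is a minimal sketch using SciPy's Kolmogorov–Smirnov test; the synthetic data and the 0.05 cutoff are illustrative assumptions, not a recommended monitoring setup.

```python
# Minimal sketch: detecting drift in one numeric feature with a two-sample KS test.
# The synthetic feature values and the 0.05 cutoff are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)    # distribution at training time
production_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean has shifted in production

statistic, p_value = stats.ks_2samp(training_feature, production_feature)
print(p_value < 0.05)  # True -> the distributions likely differ, i.e. drift
```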

    Data Enrichment

    Augmenting existing data with additional attributes—via internal joins or external sources (firmographic providers, geo data).

    Data Governance

    Data governance is the framework of policies, roles, processes, and controls that ensure data is accurate, secure, compliant, and usable across an organization.

    Data Labeling

    The process of annotating data with ground-truth labels for supervised learning.

    Data Lake

    Central storage for large amounts of unstructured and structured data.

    Data Layout

    The physical or logical arrangement of data in memory or on storage media, which influences access speed, cache efficiency, and processing performance.

    Data Lineage

    Data lineage describes where data comes from, how it moves through systems, and how it is transformed into downstream datasets and outputs.

    Data Mesh

    Decentralized approach to data architecture with domain-oriented data products.

    Data Mining

    The process of discovering patterns, anomalies, and relationships in large datasets using statistical and machine learning methods.

    Data Pipeline

    A sequence of processes that moves and transforms data from sources to destinations (lake, warehouse, feature store, vector index).

    Data Preprocessing

    Transforming raw data into a form suitable for modeling or analysis (cleaning, normalization, encoding).

    Data Visualization

    The graphical representation of data to communicate insights and patterns.

    Data Warehouse

    A system optimized for structured analytics queries over curated, cleaned data—often with strong governance and performance.

    Databricks

    Databricks is a unified analytics platform that combines data engineering, data science, and machine learning on Apache Spark.

    DBSCAN

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds clusters based on density of data points and automatically identifies outliers.
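
    A minimal sketch using scikit-learn's DBSCAN; the points, eps, and min_samples values are illustrative choices, not tuned settings.

```python
# Minimal sketch of DBSCAN with scikit-learn; eps and min_samples are illustrative choices.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # first dense cluster
              [8.0, 8.1], [8.2, 7.9],               # second cluster
              [25.0, 0.0]])                         # isolated outlier

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # outliers are labeled -1
```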

    Decision Support System (DSS)

    A Decision Support System (DSS) helps people make better decisions by combining data, models, and user interfaces.

    Decision Threshold

    The cutoff used to convert a model score/probability into an action (e.g., approve/deny, route/escalate).
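
    A minimal sketch of applying a threshold to model scores; the scores and the 0.7 cutoff are illustrative assumptions.

```python
# Minimal sketch: converting model scores into actions with a fixed cutoff.
# The scores and the 0.7 threshold are illustrative assumptions.
scores = [0.92, 0.41, 0.73, 0.68]
THRESHOLD = 0.7

decisions = ["approve" if s >= THRESHOLD else "deny" for s in scores]
print(decisions)  # ['approve', 'deny', 'approve', 'deny']
```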

    Deduplication

    Deduplication is identifying and removing duplicate (or near-duplicate) items to reduce redundancy and improve quality.
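
    A minimal sketch of exact deduplication by a normalized key; the records and the email-based key are illustrative assumptions.

```python
# Minimal sketch: exact deduplication by a normalized key; the records are illustrative.
records = [
    {"email": "Ana@Example.com ", "name": "Ana"},
    {"email": "ana@example.com", "name": "Ana M."},
    {"email": "bob@example.com", "name": "Bob"},
]

seen, unique = set(), []
for record in records:
    key = record["email"].strip().lower()  # the normalization defines what counts as a duplicate
    if key not in seen:
        seen.add(key)
        unique.append(record)

print(len(unique))  # 2
```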

    Demand Forecasting

    Prediction of future demand based on historical data and influencing factors such as seasonality, price, or promotions.

    Differential Privacy

    A mathematical framework providing formal guarantees that individual data points cannot be inferred from aggregates or models.
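
    One standard building block is the Laplace mechanism: add noise scaled to the query's sensitivity divided by the privacy budget epsilon. A minimal sketch with illustrative values follows.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# The epsilon value and the true count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def private_count(true_count, epsilon=1.0, sensitivity=1.0):
    # A single individual changes a count by at most `sensitivity`;
    # noise is scaled to sensitivity / epsilon.
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(private_count(128))  # a noisy count near 128
```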

    Dimensionality Reduction

    Techniques for reducing the number of features while preserving important information.
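
    A minimal sketch using scikit-learn's PCA on synthetic 4-D data; the data and the choice of two components are illustrative.

```python
# Minimal sketch: PCA with scikit-learn on synthetic 4-D data with one redundant feature.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)  # feature 2 is mostly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```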

    Distribution Shift

    A mismatch between the data distribution seen in training and the distribution encountered in deployment.

    N

    NaN (Not a Number)

    NaN is a special floating-point value meaning "Not a Number," used to represent undefined or unrepresentable numeric results (e.g., 0/0).
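
    A minimal sketch of NaN behavior in Python floats; note that NaN never compares equal to itself.

```python
# Minimal sketch of NaN behavior in Python floats.
import math

x = float("nan")
print(x == x)                 # False -- NaN is not equal to itself
print(math.isnan(x))          # True -- the reliable way to test for NaN
print(0.0 * float("inf"))     # nan -- an undefined result propagates as NaN
```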

    Natural Experiment

    A natural experiment uses real-world events or operational changes (not randomized by you) that approximate random assignment, enabling causal inference under assumptions.

    NDCG (Normalized Discounted Cumulative Gain)

    A ranking metric that considers both relevance grades and positions in the ranking – higher-ranked relevant items are weighted more heavily.
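
    A minimal sketch computing NDCG for a single ranked list, using the linear-gain DCG variant; the relevance grades are illustrative.

```python
# Minimal sketch of NDCG for one ranked list (linear-gain DCG variant).
# The relevance grades are illustrative.
import math

def dcg(relevances):
    # Positions are discounted logarithmically: rank 1 -> log2(2), rank 2 -> log2(3), ...
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

ranked = [3, 2, 0, 1]                 # relevance grades in the order the system returned them
ideal = sorted(ranked, reverse=True)  # the best possible ordering
print(round(dcg(ranked) / dcg(ideal), 3))  # 1.0 only if the ranking is already ideal
```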

    NDJSON (Newline-Delimited JSON)

    NDJSON is a format where each line is a valid JSON object—making it easy to stream, append, and process logs/events at scale.
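
    A minimal sketch of writing and streaming NDJSON; the file name and records are illustrative.

```python
# Minimal sketch: writing and streaming NDJSON; the file name and records are illustrative.
import json

events = [{"user": "a", "action": "click"}, {"user": "b", "action": "view"}]

with open("events.ndjson", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")  # exactly one JSON object per line

with open("events.ndjson") as f:
    for line in f:                         # processed line by line, without loading the whole file
        print(json.loads(line))
```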

    Negative Binomial Regression

    Negative binomial regression is a statistical model for count data (e.g., clicks, conversions) that handles overdispersion (variance > mean), unlike Poisson regression.

    Negative Control

    A negative control is a variable, outcome, or test condition that should not be affected by an intervention—used to detect bias, confounding, or measurement artifacts.

    NHST (Null Hypothesis Significance Testing)

    NHST is the traditional statistical testing framework where you test whether observed data is unlikely under a null hypothesis (often "no effect"), typically using p-values.

    NMI (Normalized Mutual Information)

    NMI is a metric used to compare clustering assignments by measuring how much information one clustering shares with another, normalized to lie between 0 and 1.
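
    A minimal sketch comparing two clusterings with scikit-learn's NMI; the label vectors are illustrative.

```python
# Minimal sketch: comparing two clusterings with scikit-learn's NMI; labels are illustrative.
from sklearn.metrics import normalized_mutual_info_score

clustering_a = [0, 0, 1, 1, 2, 2]
clustering_b = [1, 1, 0, 0, 2, 2]  # the same grouping under different label names

print(normalized_mutual_info_score(clustering_a, clustering_b))  # 1.0 -- identical structure
```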

    Noise-to-Signal Ratio

    Noise-to-signal ratio measures how much random variation (noise) exists relative to the meaningful pattern (signal) you want to detect.

    Non-Negative Matrix Factorization (NMF)

    NMF factorizes a non-negative matrix into two smaller non-negative matrices, often used for interpretable topic-like decompositions.
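
    A minimal sketch using scikit-learn's NMF on a small non-negative matrix; the matrix values and the choice of two components are illustrative.

```python
# Minimal sketch: NMF with scikit-learn on a small non-negative matrix (illustrative values).
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 4, 4]], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)  # row factors (e.g. documents x topics)
H = model.components_       # column factors (e.g. topics x terms)
print(np.round(W @ H, 1))   # approximate, non-negative reconstruction of X
```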

    Non-Production Data Masking

    Non-production data masking is the practice of anonymizing, tokenizing, or synthesizing sensitive data before it is used in dev/staging/test environments.

    Normal Form (Database)

    In databases, normal forms (1NF, 2NF, 3NF, BCNF) describe levels of normalization that reduce redundancy and improve data integrity.

    Normalized Cost per Answer

    Normalized cost per answer is the cost of generating an AI answer adjusted for comparability (e.g., normalized by answer length, tokens, difficulty tier, or traffic segment).

    Normalized RMSE (NRMSE)

    NRMSE is RMSE normalized by a scale factor (e.g., range, mean, or standard deviation) to make errors comparable across datasets.
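
    A minimal sketch computing NRMSE with the range of the observed values as the scale factor; the data is illustrative.

```python
# Minimal sketch: RMSE normalized by the range of the observed values; the data is illustrative.
import math

y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 11.5, 16.0, 18.0]

rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
nrmse = rmse / (max(y_true) - min(y_true))  # alternatives: divide by the mean or standard deviation
print(round(nrmse, 3))
```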

    Null Value

    A null value represents missing or unknown data (distinct from zero, empty string, or false).

    S

    Sampling

    Sampling is selecting a subset of data (or outcomes) from a larger population/process to estimate properties, reduce cost, or enable exploration.

    Scenario Analysis

    Scenario analysis evaluates outcomes under a set of coherent, plausible future conditions (scenarios), rather than changing one variable at a time.

    Schema

    A schema defines the structure, organization, and constraints of data – whether in databases, APIs, or structured data formats.

    Schema-on-Read

    Schema-on-Read is a data management approach where the structure of data is applied only at query time, not when storing.

    Segment Analysis

    Segment analysis breaks metrics down by meaningful groups (segments) such as channel, device, region, customer tier, or intent.

    Sensitivity Analysis

    Sensitivity analysis evaluates how changes in inputs affect outputs, to understand robustness and key drivers.

    Sentiment Score

    Numerical value that quantifies the emotional polarity of a text.

    Session

    Period of user interaction with a website or app.

    Sessionization

    Sessionization groups user events into sessions to analyze behavior over time (page flows, search sequences, conversions).
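
    A minimal sketch that splits one user's sorted events into sessions after a period of inactivity; the timestamps and the 30-minute rule are illustrative assumptions.

```python
# Minimal sketch: splitting one user's sorted events into sessions after 30 minutes of inactivity.
# The timestamps and the 30-minute rule are illustrative assumptions.
from datetime import datetime, timedelta

events = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10),
          datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 5)]
GAP = timedelta(minutes=30)

sessions, current = [], [events[0]]
for prev, curr in zip(events, events[1:]):
    if curr - prev > GAP:      # a long pause closes the current session
        sessions.append(current)
        current = []
    current.append(curr)
sessions.append(current)

print(len(sessions))  # 2 sessions
```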

    SimHash

    SimHash is a fingerprinting method that produces a compact hash where similar documents tend to have similar hashes (small Hamming distance).
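
    A minimal sketch of SimHash over word tokens. It uses Python's built-in hash() for brevity, which is randomized across interpreter runs; a real implementation would use a stable hash function.

```python
# Minimal sketch of SimHash over word tokens. Python's built-in hash() is used for brevity;
# it is randomized across interpreter runs, so a real implementation would use a stable hash.
def simhash(text, bits=64):
    counts = [0] * bits
    for token in text.lower().split():
        h = hash(token) & ((1 << bits) - 1)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Similar texts tend to get a smaller Hamming distance than unrelated texts.
print(hamming(simhash("the quick brown fox"), simhash("the quick brown foxes")))
print(hamming(simhash("the quick brown fox"), simhash("completely unrelated text here")))
```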

    Simpson's Paradox

    Simpson's paradox is when a trend appears in multiple groups but reverses or disappears when the groups are combined, due to confounding and aggregation.
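
    A minimal sketch with illustrative numbers: variant A wins inside every segment, yet B wins overall, because the two variants see very different segment mixes.

```python
# Minimal sketch with illustrative numbers: A wins in each segment, B wins overall,
# because the two variants see very different segment mixes.
data = {
    "segment 1": {"A": (9, 10),   "B": (80, 100)},  # (successes, trials)
    "segment 2": {"A": (30, 100), "B": (2, 10)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for segment, variants in data.items():
    for variant, (successes, trials) in variants.items():
        totals[variant][0] += successes
        totals[variant][1] += trials
        print(segment, variant, f"{successes / trials:.0%}")

for variant, (successes, trials) in totals.items():
    print("overall", variant, f"{successes / trials:.0%}")  # the per-segment ordering reverses
```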

    Snorkel

    Snorkel is a framework for programmatic data labeling that uses labeling functions instead of manual annotation to efficiently create large training datasets.

    Snowflake

    Snowflake is a cloud-native data warehouse platform that separates storage and compute, enabling scalable data analysis with SQL.

    Statistical Significance

    Statistical significance indicates that an observed effect would be unlikely to occur by chance if the null hypothesis were true — assessed by comparing the p-value against a defined threshold (usually 0.05).
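
    A minimal sketch using a two-sample t-test from SciPy with illustrative data, judged against alpha = 0.05.

```python
# Minimal sketch: a two-sample t-test with illustrative data, judged against alpha = 0.05.
from scipy import stats

control   = [2.1, 2.5, 2.3, 2.8, 2.4, 2.6, 2.2, 2.7]
treatment = [2.9, 3.1, 2.8, 3.3, 3.0, 2.7, 3.2, 3.4]

t_stat, p_value = stats.ttest_ind(treatment, control)
print(p_value < 0.05)  # True here -> the difference is statistically significant at the 5% level
```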

    Streaming Data

    Continuous data flow that is processed in real time.

    V

    Validation Set

    A validation set is a held-out dataset used during model development to tune hyperparameters and select model versions without touching the final test set.

    Variance

    Variance is the degree to which a model's performance changes across different datasets/samples; high variance often indicates sensitivity to training data (overfitting risk).

    Vector Database

    A vector database stores embeddings and supports fast similarity search (nearest neighbors), often with metadata filtering and indexing for scale.

    Vector Embedding

    A vector embedding is a numerical representation (array of floats) of text, images, or other data that encodes semantic meaning in a high-dimensional space.

    Vector Index

    A vector index is the data structure/algorithm used to speed up nearest-neighbor search over embeddings at scale.

    Vector Quantization

    Vector quantization (VQ) compresses continuous vectors by mapping them to a finite set of representative vectors (a codebook).
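
    A minimal sketch of vector quantization with a k-means codebook; the vector count, dimensionality, and codebook size are illustrative assumptions.

```python
# Minimal sketch: vector quantization with a k-means codebook; sizes are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 8))   # 200 vectors, 8 dimensions each

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(vectors)
codebook = kmeans.cluster_centers_    # 16 representative vectors
codes = kmeans.predict(vectors)       # each vector is stored as a small integer code
print(codes[:5], codebook.shape)      # five codes and the codebook shape (16, 8)
```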

    Vector Search

    Vector search retrieves items by similarity in an embedding space rather than exact keyword match.
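
    A minimal sketch of brute-force vector search by cosine similarity over a tiny illustrative index; the embeddings and document names are made up for the example.

```python
# Minimal sketch: brute-force vector search by cosine similarity over a tiny illustrative index.
import numpy as np

index = np.array([[0.9, 0.1, 0.0],    # "market report"
                  [0.1, 0.9, 0.1],    # "football results"
                  [0.1, 0.8, 0.3]])   # "league table"
docs = ["market report", "football results", "league table"]

query = np.array([0.0, 0.85, 0.2])    # an embedding of a sports-related query

scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
top = np.argsort(scores)[::-1][:2]
print([docs[i] for i in top])         # the two most similar documents
```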

    Vector Similarity

    Vector similarity is a measure of how close two embeddings are (commonly cosine similarity or dot product).

    Vector Store

    A vector store is the storage layer (database or service) that holds embeddings plus metadata for retrieval and similarity search.

    Vector Store Hygiene

    Vector store hygiene is the operational discipline of keeping a vector store accurate, secure, performant, and up-to-date (dedupe, versioning, ACL correctness, drift monitoring, purge workflows).

    Term not found?

    Browse the full glossary with over 1407 terms from all categories.

    View Full Glossary