Specification v1.1

Tabular Manifold Spec

A cognitive transmission format for AI agents.
Feature engineering as an interface contract.

1. What Is TMS?

The Tabular Manifold Spec defines a structured, multi-resolution data format optimized for consumption by AI agents and LLMs.

TMS is:

  • Not a storage format (use Parquet/Delta for that)
  • Not a visualization format (use dashboards for humans)
  • A cognitive transmission format — the interface layer between data pipelines and agent reasoning

TMS manifolds sit alongside dashboards, not instead of them:

                    ┌─────────────────┐
                    │   Data Lake     │
                    │ (Parquet/Delta) │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌─────────────────┐           ┌─────────────────┐
    │   Dashboards    │           │  TMS Manifolds  │
    │  (for humans)   │           │  (for agents)   │
    └─────────────────┘           └─────────────────┘

2. Core Principles

2.1 Columnar Encoding

Keys appear once. Values are dense arrays aligned to column order.

{
  "format": "columnar_json_v1",
  "schema": {
    "columns": [
      { "name": "period", "type": "string" },
      { "name": "value", "type": "double" }
    ]
  },
  "rows": [
    ["2025-01", 10.5],
    ["2025-02", 11.2]
  ]
}

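Why: Eliminates key repetition. In row-object JSON, a 1000-row table with 10 columns repeats every key 1000 times (10,000 key occurrences); columnar encoding states each key once, saving 9,990 of them.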
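
The round trip between row objects and the columnar layout can be sketched as follows. This is a minimal illustration, not part of the spec; the helper names to_columnar and from_columnar are hypothetical.

```python
from typing import Any


def to_columnar(records: list[dict[str, Any]], columns: list[dict[str, str]]) -> dict[str, Any]:
    """Pack row-object records into a columnar_json_v1 payload."""
    names = [col["name"] for col in columns]
    return {
        "format": "columnar_json_v1",
        "schema": {"columns": columns},
        # Each row is a dense array aligned to the column order above.
        "rows": [[record.get(name) for name in names] for record in records],
    }


def from_columnar(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand a columnar_json_v1 payload back into row objects."""
    names = [col["name"] for col in payload["schema"]["columns"]]
    return [dict(zip(names, row)) for row in payload["rows"]]


records = [
    {"period": "2025-01", "value": 10.5},
    {"period": "2025-02", "value": 11.2},
]
columns = [
    {"name": "period", "type": "string"},
    {"name": "value", "type": "double"},
]
payload = to_columnar(records, columns)
```

Keys missing from a record are emitted as null in this sketch, which keeps every row the same width as the column list.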

2.2 Progressive Disclosure (Three Levels)

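Level   | Name      | Purpose                       | Token Cost
--------+-----------+-------------------------------+-------------------
Level 0 | Summary   | Instant situational awareness | 200-500 tokens
Level 1 | Geometry  | Aggregated structure/trends   | 500-2000 tokens
Level 2 | Telemetry | Raw evidence for forensics    | 2000-50000+ tokens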

Agents start at Level 0. They drill down only when anomalies or quality flags demand it.

2.3 Self-Describing

Every manifold contains enough metadata that an agent can interpret it without external documentation:

  • Column types and descriptions
  • Quality flags and reliability scores
  • Interpretation hints in natural language

3. Manifold Envelope Schema

{
  "artifact_type": "tabular_manifold",
  "artifact_version": "1.1",
  "manifold_kind": "<canonical_kind>",

  "subject": { },
  "time_window": { },

  "level_0_summary": { },
  "level_1_geometry": { },
  "level_2_telemetry": { },

  "token_budget": { },
  "lineage": { }
}

3.1 Required Fields

Field            | Type   | Description
-----------------+--------+--------------------------------------------
artifact_type    | string | Always "tabular_manifold"
artifact_version | string | Spec version (e.g., "1.1")
manifold_kind    | enum   | One of the canonical kinds (see §4)
subject          | object | What this manifold describes
level_0_summary  | object | Required. The cheap cognitive entry point.

3.2 Optional Fields

Field             | Type   | Description
------------------+--------+-----------------------------------
time_window       | object | For time-based manifolds
level_1_geometry  | object | Aggregated data in columnar format
level_2_telemetry | object | Raw data in columnar format
token_budget      | object | Hints for agent token management
lineage           | object | Provenance metadata

4. Canonical Manifold Kinds

TMS defines five canonical manifold kinds. Each has a fixed Level 0 schema with optional extensions.

timeseries_metric

For any metric observed over time (prices, counts, rates, etc.)

Level 0 required fields:

{
  "observation_count": 150,
  "time_coverage": {
    "expected_periods": 12,
    "observed_periods": 10,
    "coverage_ratio": 0.833
  },
  "distribution": {
    "min": 9.85,
    "max": 18.40,
    "mean": 12.62,
    "median": 10.90,
    "stddev": 3.44,
    "cv": 0.273
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}
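
As a rough sketch, the time_coverage and distribution fields above could be derived from raw values like this. The function name summarize_timeseries is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the spec does not state which convention applies. The reliability, quality_flags, and interpretation_hints fields are left out here.

```python
import statistics


def summarize_timeseries(values: list[float], observed_periods: int, expected_periods: int) -> dict:
    """Compute the numeric part of a timeseries_metric Level 0 summary."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation; undefined for a single value, so use 0.0.
    stddev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {
        "observation_count": len(values),
        "time_coverage": {
            "expected_periods": expected_periods,
            "observed_periods": observed_periods,
            "coverage_ratio": round(observed_periods / expected_periods, 3),
        },
        "distribution": {
            "min": min(values),
            "max": max(values),
            "mean": round(mean, 2),
            "median": round(statistics.median(values), 2),
            "stddev": round(stddev, 2),
            # Coefficient of variation = stddev / mean; None when the mean is zero.
            "cv": round(stddev / mean, 3) if mean else None,
        },
    }


summary = summarize_timeseries([10.0, 12.0, 14.0], observed_periods=10, expected_periods=12)
```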

funnel_conversion

For sequential stage-based processes (sales funnels, onboarding flows, etc.)

Level 0 required fields:

{
  "stage_count": 5,
  "total_entered": 10000,
  "total_converted": 342,
  "overall_conversion_rate": 0.0342,
  "bottleneck_stage": "checkout",
  "bottleneck_drop_rate": 0.67,
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}
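
One way the funnel fields above might be computed from per-stage counts is sketched below. The spec does not define how the bottleneck is chosen, so this assumes it is the stage with the highest drop rate, where drop rate = 1 - completed / entered. The function name and the intermediate stage counts are hypothetical, chosen only so the totals match the example.

```python
def summarize_funnel(stages: list[tuple[str, int, int]]) -> dict:
    """stages: ordered (name, entered, completed) tuples, one per funnel stage."""
    # Drop rate per stage; stages nobody entered are skipped to avoid division by zero.
    drop_rates = {
        name: 1 - completed / entered
        for name, entered, completed in stages
        if entered
    }
    # Assumption: the bottleneck is the stage that loses the largest share of its entrants.
    bottleneck = max(drop_rates, key=drop_rates.get)
    total_entered = stages[0][1]
    total_converted = stages[-1][2]
    return {
        "stage_count": len(stages),
        "total_entered": total_entered,
        "total_converted": total_converted,
        "overall_conversion_rate": round(total_converted / total_entered, 4),
        "bottleneck_stage": bottleneck,
        "bottleneck_drop_rate": round(drop_rates[bottleneck], 2),
    }


# Hypothetical stage counts for illustration.
funnel = summarize_funnel([
    ("visit", 10000, 4000),
    ("signup", 4000, 2500),
    ("cart", 2500, 1500),
    ("checkout", 1500, 495),
    ("payment", 495, 342),
])
```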

cohort_behavior

For tracking groups over time (user cohorts, customer segments, etc.)

Level 0 required fields:

{
  "cohort_count": 12,
  "total_subjects": 5000,
  "observation_periods": 6,
  "retention_summary": {
    "period_1": 0.85,
    "period_3": 0.62,
    "period_6": 0.41
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

inventory_snapshot

For point-in-time inventory or resource states

Level 0 required fields:

{
  "snapshot_timestamp": "2025-01-14T00:00:00Z",
  "total_skus": 1500,
  "total_units": 125000,
  "total_value": 2500000.00,
  "stockout_skus": 45,
  "overstock_skus": 120,
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

anomaly_detection

For systems monitoring and alerting contexts

Level 0 required fields:

{
  "detection_window": {
    "start": "2025-01-01T00:00:00Z",
    "end": "2025-01-14T00:00:00Z"
  },
  "anomaly_count": 3,
  "severity_distribution": {
    "critical": 1,
    "warning": 2,
    "info": 0
  },
  "top_anomaly": {
    "timestamp": "2025-01-10T14:30:00Z",
    "metric": "cpu_usage",
    "observed": 98.5,
    "expected_range": [20, 60],
    "severity": "critical"
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}
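
A possible way to build severity_distribution and top_anomaly from a list of detected anomalies is shown below. The spec does not say how the "top" anomaly is ranked; this sketch assumes highest severity first, with ties broken by the most recent timestamp. The function name and the two warning records are hypothetical.

```python
from collections import Counter

# Higher number = more severe. The three levels come from the example above.
SEVERITY_RANK = {"critical": 2, "warning": 1, "info": 0}


def summarize_anomalies(anomalies: list[dict]) -> dict:
    counts = Counter(a["severity"] for a in anomalies)
    # ISO-8601 UTC timestamps in the same format sort correctly as plain strings.
    top = max(
        anomalies,
        key=lambda a: (SEVERITY_RANK[a["severity"]], a["timestamp"]),
        default=None,
    )
    return {
        "anomaly_count": len(anomalies),
        "severity_distribution": {level: counts.get(level, 0) for level in SEVERITY_RANK},
        "top_anomaly": top,
    }


anomaly_summary = summarize_anomalies([
    {"timestamp": "2025-01-10T14:30:00Z", "metric": "cpu_usage", "observed": 98.5,
     "expected_range": [20, 60], "severity": "critical"},
    {"timestamp": "2025-01-04T09:00:00Z", "metric": "memory_usage", "observed": 81.0,
     "expected_range": [30, 75], "severity": "warning"},
    {"timestamp": "2025-01-12T18:15:00Z", "metric": "disk_io", "observed": 410.0,
     "expected_range": [50, 300], "severity": "warning"},
])
```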

5. Reliability Block (Required in Level 0)

Every Level 0 must include a reliability block that quantifies confidence in the summary statistics.

"reliability": {
  "sample_size_class": "sparse|adequate|robust",
  "sample_size_n": 5,
  "sample_size_threshold_adequate": 30,
  "sample_size_threshold_robust": 100,

  "confidence_in_mean": {
    "level": 0.95,
    "margin_of_error": 2.4,
    "interval": [10.22, 15.02]
  },

  "data_quality_score": 0.85,
  "data_quality_notes": "3% of rows had imputed values",

  "staleness": {
    "last_observation": "2025-11-21T00:00:00Z",
    "days_since_last": 54,
    "is_stale": true,
    "stale_threshold_days": 30
  }
}

Sample Size Classes

Class    | Criteria     | Implication
---------+--------------+------------------------------------------------
sparse   | n < 30       | Summary stats are unstable. Treat with caution.
adequate | 30 ≤ n < 100 | Stats are reasonable but not rock-solid.
robust   | n ≥ 100      | High confidence in summary statistics.
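
The class thresholds and the staleness fields can be expressed in a few lines. This is a sketch under two assumptions the spec leaves open: staleness is measured against a caller-supplied as_of time, and a series counts as stale only when days_since_last strictly exceeds the threshold. The function names are hypothetical.

```python
from datetime import datetime


def classify_sample_size(n: int, adequate: int = 30, robust: int = 100) -> str:
    """Map an observation count to sparse / adequate / robust."""
    if n >= robust:
        return "robust"
    if n >= adequate:
        return "adequate"
    return "sparse"


def _parse_utc(ts: str) -> datetime:
    # Replace the trailing "Z" so fromisoformat accepts it on older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def staleness(last_observation: str, as_of: str, stale_threshold_days: int = 30) -> dict:
    """Build the staleness sub-block relative to the as_of timestamp."""
    days = (_parse_utc(as_of) - _parse_utc(last_observation)).days
    return {
        "last_observation": last_observation,
        "days_since_last": days,
        # Assumption: strictly greater than the threshold means stale.
        "is_stale": days > stale_threshold_days,
        "stale_threshold_days": stale_threshold_days,
    }


stale_block = staleness("2025-11-21T00:00:00Z", as_of="2026-01-14T00:00:00Z")
```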

6. Quality Flags (Required in Level 0)

Standardized boolean flags that trigger agent attention:

"quality_flags": {
  "low_sample_size": true,
  "missing_periods": true,
  "suspected_outliers": true,
  "data_staleness": true,
  "high_variance": false,
  "imputation_applied": false,
  "schema_drift_detected": false
}

Flag                  | Trigger Condition
----------------------+-------------------------------------------------
low_sample_size       | reliability.sample_size_class == "sparse"
missing_periods       | time_coverage.coverage_ratio < 0.8
suspected_outliers    | Any value > 3σ from mean, or IQR-based detection
data_staleness        | reliability.staleness.is_stale == true
high_variance         | distribution.cv > 0.3
imputation_applied    | Any values were filled/estimated
schema_drift_detected | Column types or names changed from baseline
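
The flags that depend only on other Level 0 fields can be derived mechanically from the trigger conditions. The sketch below assumes a timeseries_metric Level 0 that already carries time_coverage, distribution, and reliability; the three flags that need pipeline knowledge (outliers, imputation, schema drift) are passed in by the caller. The function name is hypothetical.

```python
def derive_quality_flags(
    level_0: dict,
    *,
    suspected_outliers: bool = False,
    imputation_applied: bool = False,
    schema_drift_detected: bool = False,
) -> dict:
    """Apply the documented trigger conditions to a Level 0 summary."""
    reliability = level_0["reliability"]
    return {
        "low_sample_size": reliability["sample_size_class"] == "sparse",
        "missing_periods": level_0["time_coverage"]["coverage_ratio"] < 0.8,
        "suspected_outliers": suspected_outliers,
        "data_staleness": reliability["staleness"]["is_stale"] is True,
        "high_variance": level_0["distribution"]["cv"] > 0.3,
        "imputation_applied": imputation_applied,
        "schema_drift_detected": schema_drift_detected,
    }


level_0 = {
    "time_coverage": {"coverage_ratio": 0.333},
    "distribution": {"cv": 0.273},
    "reliability": {"sample_size_class": "sparse", "staleness": {"is_stale": True}},
}
flags = derive_quality_flags(level_0, suspected_outliers=True)
```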

7. Token Budget Block

Helps agents decide whether to load deeper levels:

"token_budget": {
  "level_0_tokens_approx": 450,
  "level_1_tokens_approx": 1200,
  "level_2_tokens_approx": 8500,

  "level_2_row_count": 517,
  "level_2_inline_row_limit": 50,
  "level_2_inline_strategy": "preview_outliers",

  "compression_ratios": {
    "level_1_vs_level_2": 7.1,
    "level_0_vs_level_2": 18.9
  },

  "recommended_strategy": "Load Level 0 first. If quality_flags has any true values, load Level 1. Only load Level 2 if investigating specific anomalies."
}
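
The compression ratios in the example are consistent with dividing the Level 2 token estimate by the estimate for the shallower level. The sketch below reproduces them and adds a hypothetical helper, levels_within_budget, showing how an agent might use the estimates; the cumulative-cost rule it applies is an assumption, not something the spec prescribes.

```python
def compression_ratios(level_0_tokens: int, level_1_tokens: int, level_2_tokens: int) -> dict:
    """Ratio of Level 2 size to each shallower level, rounded to one decimal."""
    return {
        "level_1_vs_level_2": round(level_2_tokens / level_1_tokens, 1),
        "level_0_vs_level_2": round(level_2_tokens / level_0_tokens, 1),
    }


def levels_within_budget(token_budget: dict, remaining_tokens: int) -> list[int]:
    """Return the levels an agent can load, shallowest first, without exceeding its budget."""
    loadable: list[int] = []
    spent = 0
    for level in (0, 1, 2):
        cost = token_budget.get(f"level_{level}_tokens_approx")
        # Stop at the first level that is missing or would overrun the budget.
        if cost is None or spent + cost > remaining_tokens:
            break
        spent += cost
        loadable.append(level)
    return loadable


budget = {
    "level_0_tokens_approx": 450,
    "level_1_tokens_approx": 1200,
    "level_2_tokens_approx": 8500,
}
ratios = compression_ratios(450, 1200, 8500)
```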

Level 2 Inline Strategies

Strategy         | Behavior
-----------------+---------------------------------------------------------------------------
preview_outliers | Inline only rows flagged as outliers or anomalies
preview_recent   | Inline only the N most recent rows
preview_sample   | Inline a random sample of N rows
full_inline      | Inline all rows (use only if level_2_row_count ≤ level_2_inline_row_limit)
none             | No rows inlined; agent must use retrieval

8. Level 1 Geometry Schema

Level 1 uses columnar encoding for aggregated data.

"level_1_geometry": {
  "format": "columnar_json_v1",
  "granularity": "month",

  "schema": {
    "columns": [
      { "name": "period", "type": "string", "description": "Aggregation bucket (YYYY-MM)" },
      { "name": "n", "type": "integer", "description": "Observation count in period" },
      { "name": "min", "type": "double", "description": "Minimum value in period" },
      { "name": "max", "type": "double", "description": "Maximum value in period" },
      { "name": "mean", "type": "double", "description": "Mean value in period" },
      { "name": "median", "type": "double", "description": "Median value in period" },
      { "name": "stddev", "type": "double", "description": "Standard deviation", "nullable": true },
      { "name": "flag", "type": "string", "description": "Optional anomaly flag", "nullable": true }
    ],
    "primary_sort": ["period"]
  },

  "rows": [
    ["2025-01", 15, 9.80, 11.20, 10.45, 10.40, 0.35, null],
    ["2025-02", 12, 10.10, 11.50, 10.80, 10.75, 0.42, null],
    ["2025-07", 3, 17.50, 18.90, 18.27, 18.40, 0.71, "spike_detected"]
  ],

  "missing_periods": ["2025-03", "2025-04", "2025-05", "2025-06"]
}
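
A Level 1 block of this shape could be produced from raw (timestamp, value) observations roughly as follows. This is a sketch with several assumptions: month buckets are taken from the first seven characters of an ISO-8601 timestamp, stddev is the sample standard deviation and is null for single-observation periods, and the flag column is left null because anomaly flagging is out of scope here. The name build_level_1 is hypothetical.

```python
import statistics
from collections import defaultdict

LEVEL_1_COLUMNS = [
    {"name": "period", "type": "string"},
    {"name": "n", "type": "integer"},
    {"name": "min", "type": "double"},
    {"name": "max", "type": "double"},
    {"name": "mean", "type": "double"},
    {"name": "median", "type": "double"},
    {"name": "stddev", "type": "double", "nullable": True},
    {"name": "flag", "type": "string", "nullable": True},
]


def build_level_1(observations: list[tuple[str, float]], expected_periods: list[str]) -> dict:
    buckets: dict[str, list[float]] = defaultdict(list)
    for ts, value in observations:
        buckets[ts[:7]].append(value)  # "YYYY-MM" prefix of the timestamp

    rows = []
    for period in sorted(buckets):  # honors primary_sort on period
        values = buckets[period]
        stddev = round(statistics.stdev(values), 2) if len(values) > 1 else None
        rows.append([
            period, len(values), min(values), max(values),
            round(statistics.mean(values), 2),
            round(statistics.median(values), 2),
            stddev, None,
        ])
    return {
        "format": "columnar_json_v1",
        "granularity": "month",
        "schema": {"columns": LEVEL_1_COLUMNS, "primary_sort": ["period"]},
        "rows": rows,
        "missing_periods": [p for p in expected_periods if p not in buckets],
    }


level_1 = build_level_1(
    [("2025-01-14T00:00:00Z", 10.10), ("2025-03-02T00:00:00Z", 10.70),
     ("2025-03-28T00:00:00Z", 10.90), ("2025-07-09T00:00:00Z", 18.40),
     ("2025-11-21T00:00:00Z", 9.85)],
    expected_periods=[f"2025-{month:02d}" for month in range(1, 13)],
)
```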

9. Level 2 Telemetry Schema

Level 2 contains raw observations. For large datasets, use preview + retrieval.

"level_2_telemetry": {
  "format": "columnar_json_v1",

  "schema": {
    "columns": [
      { "name": "ts", "type": "timestamp", "description": "Observation timestamp" },
      { "name": "value", "type": "double", "description": "Observed value" },
      { "name": "document_id", "type": "string", "description": "Source document reference", "nullable": true },
      { "name": "notes", "type": "string", "description": "Human or ETL notes", "nullable": true }
    ],
    "primary_sort": ["ts"]
  },

  "row_count_total": 517,

  "inline_rows": {
    "strategy": "preview_outliers",
    "rows": [
      ["2025-07-09T00:00:00Z", 18.40, "INV-12002", "Expedite fee applied"],
      ["2025-07-15T00:00:00Z", 17.90, "INV-12015", "Small lot surcharge"]
    ]
  },

  "retrieval": {
    "method": "mcp_tool",
    "tool_name": "get_timeseries_telemetry",
    "tool_args": {
      "manifold_id": "mfld_abc123",
      "level": 2,
      "filters": {}
    },
    "pagination": {
      "default_page_size": 100,
      "max_page_size": 500
    }
  }
}

When to Inline vs. Retrieve

Row Count | Recommendation
----------+------------------------------------------------------
≤ 50      | Full inline ("strategy": "full_inline")
51-500    | Preview inline + retrieval available
> 500     | Preview inline only; retrieval required for full data
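
A generator might apply the thresholds above along these lines. The sketch assumes rows are already sorted ascending by timestamp (per primary_sort) and falls back to preview_recent when the data is too large to inline in full; any of the other preview strategies would be equally valid. The function name choose_inline_rows is hypothetical.

```python
def choose_inline_rows(rows: list[list], inline_row_limit: int = 50) -> dict:
    """Build the inline_rows block for Level 2 based on row count."""
    if len(rows) <= inline_row_limit:
        # Small enough to ship everything.
        return {"strategy": "full_inline", "rows": rows}
    # Assumption: rows are sorted oldest to newest, so the tail holds the most recent ones.
    return {"strategy": "preview_recent", "rows": rows[-inline_row_limit:]}


small = choose_inline_rows([["2025-01-14T00:00:00Z", 10.10]] * 5)
large = choose_inline_rows([[f"row-{i}", float(i)] for i in range(517)])
```

For the preview case, the full data set stays reachable through the retrieval block rather than the inline rows.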

10. Interpretation Hints

Natural language guidance for agents, always an array of strings:

"interpretation_hints": [
  "Sparse series: only 5 observations across 12 months. Summary statistics are unreliable.",
  "July 2025 shows a price spike (18.40 vs median 10.90). Investigate Level 2 for evidence.",
  "Coefficient of variation (0.27) is close to the 0.3 high_variance threshold, which may indicate inconsistent pricing or mixed product types.",
  "Coverage ratio is 0.42, meaning 58% of expected periods have no data."
]

Guidelines for Hint Authoring

  • Lead with the most actionable insight
  • Reference specific numbers from the manifold
  • Suggest next steps (e.g., "Investigate Level 2")
  • Keep each hint to 1-2 sentences

11. Lineage Block

Provenance metadata for auditability:

"lineage": {
  "manifold_id": "mfld_abc123",
  "computed_at": "2026-01-14T22:10:00Z",
  "computed_by": "tms_generator_v1.2",
  "computation_duration_ms": 450,

  "inputs": [
    {
      "dataset_id": "purchase_orders_silver",
      "dataset_version": "v5.2",
      "row_count": 125000,
      "as_of_timestamp": "2026-01-14T00:00:00Z"
    }
  ],

  "filters_applied": [
    { "field": "supplier_id", "operator": "eq", "value": "S-789" },
    { "field": "part_id", "operator": "eq", "value": "P-123456" }
  ],

  "transformations": [
    "Converted unit_price from cents to dollars",
    "Excluded cancelled PO lines",
    "Imputed missing facility codes as 'UNKNOWN'"
  ]
}

12. Complete Example

A full timeseries_metric manifold:

{
  "artifact_type": "tabular_manifold",
  "artifact_version": "1.1",
  "manifold_kind": "timeseries_metric",

  "subject": {
    "entity_type": "part_supplier_price",
    "part_id": "P-123456",
    "part_number": "WIDGET-SS-075",
    "part_description": "Widget, stainless steel, 3/4 inch",
    "supplier_id": "S-789",
    "supplier_name": "Acme Industrial Supply",
    "metric_name": "unit_price",
    "currency": "USD",
    "unit_of_measure": "EA"
  },

  "time_window": {
    "start": "2025-01-01",
    "end": "2025-12-31",
    "timezone": "UTC",
    "granularity": "month"
  },

  "level_0_summary": {
    "observation_count": 5,
    "time_coverage": {
      "expected_periods": 12,
      "observed_periods": 4,
      "coverage_ratio": 0.333
    },
    "distribution": {
      "min": 9.85,
      "max": 18.40,
      "mean": 11.99,
      "median": 10.70,
      "stddev": 3.44,
      "cv": 0.287
    },
    "reliability": {
      "sample_size_class": "sparse",
      "sample_size_n": 5
    },
    "quality_flags": {
      "low_sample_size": true,
      "missing_periods": true,
      "suspected_outliers": true,
      "data_staleness": true
    },
    "interpretation_hints": [
      "Sparse data: only 5 observations across 12 months.",
      "One outlier detected: $18.40 in July (72% above median).",
      "Recommended: inspect Level 1 to identify spike month."
    ]
  },

  "level_1_geometry": {
    "format": "columnar_json_v1",
    "granularity": "month",
    "schema": {
      "columns": [
        { "name": "period", "type": "string" },
        { "name": "n", "type": "integer" },
        { "name": "mean", "type": "double" },
        { "name": "flag", "type": "string", "nullable": true }
      ]
    },
    "rows": [
      ["2025-01", 1, 10.10, null],
      ["2025-03", 2, 10.80, null],
      ["2025-07", 1, 18.40, "outlier_spike"],
      ["2025-11", 1, 9.85, null]
    ]
  },

  "level_2_telemetry": {
    "format": "columnar_json_v1",
    "schema": {
      "columns": [
        { "name": "ts", "type": "timestamp" },
        { "name": "value", "type": "double" },
        { "name": "document_id", "type": "string", "nullable": true }
      ]
    },
    "row_count_total": 5,
    "inline_rows": {
      "strategy": "full_inline",
      "rows": [
        ["2025-01-14T00:00:00Z", 10.10, "PO-555"],
        ["2025-03-02T00:00:00Z", 10.70, "PO-612"],
        ["2025-03-28T00:00:00Z", 10.90, "PO-640"],
        ["2025-07-09T00:00:00Z", 18.40, "PO-777"],
        ["2025-11-21T00:00:00Z", 9.85, "PO-901"]
      ]
    }
  },

  "lineage": {
    "manifold_id": "mfld_price_P123456_S789_2025",
    "computed_at": "2026-01-14T22:30:00Z",
    "computed_by": "tms_generator_v1.1"
  }
}

13. JSON Schema

A formal JSON Schema is available for validation. Key validation rules:

  • artifact_type must equal "tabular_manifold"
  • manifold_kind must be one of the five canonical kinds
  • level_0_summary is required and must include reliability and quality_flags
  • If level_1_geometry or level_2_telemetry exists, it must have format: "columnar_json_v1"
  • Every row in rows must have the same length, equal to the length of the columns array
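
For readers without the schema file at hand, the key rules listed above can be approximated in plain Python. This is an illustrative sketch, not the formal schema: the function name validate_manifold is hypothetical, and the row-width check is skipped when a level carries no schema block.

```python
CANONICAL_KINDS = {
    "timeseries_metric", "funnel_conversion", "cohort_behavior",
    "inventory_snapshot", "anomaly_detection",
}


def _rows_of(block: dict) -> list:
    # Level 1 keeps rows at the top level; Level 2 nests them under inline_rows.
    if "rows" in block:
        return block["rows"]
    return block.get("inline_rows", {}).get("rows", [])


def validate_manifold(manifold: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors: list[str] = []
    if manifold.get("artifact_type") != "tabular_manifold":
        errors.append('artifact_type must equal "tabular_manifold"')
    if manifold.get("manifold_kind") not in CANONICAL_KINDS:
        errors.append("manifold_kind must be one of the five canonical kinds")

    level_0 = manifold.get("level_0_summary")
    if not isinstance(level_0, dict):
        errors.append("level_0_summary is required")
    else:
        for key in ("reliability", "quality_flags"):
            if key not in level_0:
                errors.append(f"level_0_summary must include {key}")

    for level in ("level_1_geometry", "level_2_telemetry"):
        block = manifold.get(level)
        if block is None:
            continue  # optional levels may be absent
        if block.get("format") != "columnar_json_v1":
            errors.append(f'{level}.format must be "columnar_json_v1"')
        columns = block.get("schema", {}).get("columns")
        if columns is not None:
            for i, row in enumerate(_rows_of(block)):
                if len(row) != len(columns):
                    errors.append(f"{level}: row {i} has {len(row)} values, expected {len(columns)}")
    return errors


good = {
    "artifact_type": "tabular_manifold",
    "manifold_kind": "timeseries_metric",
    "level_0_summary": {"reliability": {}, "quality_flags": {}},
    "level_1_geometry": {
        "format": "columnar_json_v1",
        "schema": {"columns": [{"name": "period", "type": "string"},
                               {"name": "n", "type": "integer"}]},
        "rows": [["2025-01", 1]],
    },
}
bad = dict(good, manifold_kind="unknown_kind",
           level_1_geometry=dict(good["level_1_geometry"], rows=[["2025-01"]]))
```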

14. MCP Integration Pattern

TMS manifolds are designed to be returned by MCP tools. Recommended pattern:

# Assumes `mcp_tool` is the tool-registration decorator provided by your MCP server framework.
@mcp_tool
def get_price_manifold(part_id: str, supplier_id: str, year: int) -> dict:
    """
    Returns a TMS manifold for part/supplier price history.
    
    The manifold includes:
    - Level 0: Summary statistics and quality flags
    - Level 1: Monthly aggregates
    - Level 2: Raw transactions (preview only if >50 rows)
    
    Start with level_0_summary. Check quality_flags to decide 
    whether deeper investigation is needed.
    """
    # ... generate manifold ...
    return manifold

Tool Documentation Should Instruct the Agent:

  1. Always read level_0_summary first
  2. Check quality_flags for any true values
  3. If flags are set, read level_1_geometry to locate the issue
  4. Only load level_2_telemetry when investigating specific anomalies
  5. Use interpretation_hints as reasoning guidance
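
On the agent side, the reading order above might translate into logic like the following. This is a sketch of one possible policy; the function name plan_levels and the investigating_anomaly parameter are hypothetical, standing in for whatever signal the agent uses to decide it is chasing a specific anomaly.

```python
def plan_levels(manifold: dict, investigating_anomaly: bool = False) -> list[str]:
    """Decide which manifold levels to read, shallowest first."""
    # Step 1: Level 0 is always read.
    plan = ["level_0_summary"]

    # Steps 2-3: any raised quality flag justifies reading Level 1, if present.
    flags = manifold["level_0_summary"].get("quality_flags", {})
    if any(flags.values()) and "level_1_geometry" in manifold:
        plan.append("level_1_geometry")

    # Step 4: Level 2 is loaded only for a targeted anomaly investigation.
    if investigating_anomaly and "level_2_telemetry" in manifold:
        plan.append("level_2_telemetry")
    return plan


manifold = {
    "level_0_summary": {
        "quality_flags": {"low_sample_size": True, "high_variance": False},
        "interpretation_hints": ["Sparse data: only 5 observations across 12 months."],
    },
    "level_1_geometry": {},
    "level_2_telemetry": {},
}
default_plan = plan_levels(manifold)
deep_plan = plan_levels(manifold, investigating_anomaly=True)
```

The interpretation_hints stay in Level 0, so they are available to guide reasoning no matter how far the agent drills down.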

Changelog

v1.1 (2026-01-14)

  • Added required reliability block to Level 0
  • Defined five canonical manifold_kind values with fixed Level 0 schemas
  • Added token_budget block with compression ratios
  • Specified inline_rows.strategy enum for Level 2
  • Removed underspecified drilldown_policy DSL (replaced with interpretation_hints and recommended_strategy)
  • Added outlier_summary to Level 0 for timeseries_metric
  • Clarified inline vs retrieval thresholds

v1.0 (2026-01-10)

  • Initial spec

License

TMS is released under Apache 2.0. Use it freely.