Specificationv1.1

Tabular Manifold Spec

A cognitive transmission format for AI agents.
Feature engineering as an interface contract.

1. What Is TMS?

The Tabular Manifold Spec defines a structured, multi-resolution data format optimized for consumption by AI agents and LLMs.

TMS is:

Not a storage format (use Parquet/Delta for that)
Not a visualization format (use dashboards for humans)
A cognitive transmission format — the interface layer between data pipelines and agent reasoning

TMS manifolds sit alongside dashboards, not instead of them:

                    ┌─────────────────┐
                    │   Data Lake     │
                    │ (Parquet/Delta) │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌─────────────────┐           ┌─────────────────┐
    │   Dashboards    │           │  TMS Manifolds  │
    │  (for humans)   │           │  (for agents)   │
    └─────────────────┘           └─────────────────┘

2. Core Principles

2.1 Columnar Encoding

Keys appear once. Values are dense arrays aligned to column order.

{
  "format": "columnar_json_v1",
  "schema": {
    "columns": [
      { "name": "period", "type": "string" },
      { "name": "value", "type": "double" }
    ]
  },
  "rows": [
    ["2025-01", 10.5],
    ["2025-02", 11.2]
  ]
}

Why: Eliminates key repetition. A 1000-row table with 10 columns saves ~9000 key tokens.

2.2 Progressive Disclosure (Three Levels)

Level	Name	Purpose	Token Cost
Level 0	Summary	Instant situational awareness	200-500 tokens
Level 1	Geometry	Aggregated structure/trends	500-2000 tokens
Level 2	Telemetry	Raw evidence for forensics	2000-50000+ tokens

Agents start at Level 0. They drill down only when anomalies or quality flags demand it.

2.3 Self-Describing

Every manifold contains enough metadata that an agent can interpret it without external documentation:

Column types and descriptions
Quality flags and reliability scores
Interpretation hints in natural language

3. Manifold Envelope Schema

{
  "artifact_type": "tabular_manifold",
  "artifact_version": "1.1",
  "manifold_kind": "<canonical_kind>",

  "subject": { },
  "time_window": { },

  "level_0_summary": { },
  "level_1_geometry": { },
  "level_2_telemetry": { },

  "token_budget": { },
  "lineage": { }
}

3.1 Required Fields

Field	Type	Description
artifact_type	string	Always "tabular_manifold"
artifact_version	string	Spec version (e.g., "1.1")
manifold_kind	enum	One of the canonical kinds (see §4)
subject	object	What this manifold describes
level_0_summary	object	Required. The cheap cognitive entry point.

3.2 Optional Fields

Field	Type	Description
time_window	object	For time-based manifolds
level_1_geometry	object	Aggregated data in columnar format
level_2_telemetry	object	Raw data in columnar format
token_budget	object	Hints for agent token management
lineage	object	Provenance metadata

4. Canonical Manifold Kinds

TMS defines five canonical manifold kinds. Each has a fixed Level 0 schema with optional extensions.

`timeseries_metric`

For any metric observed over time (prices, counts, rates, etc.)

Level 0 required fields:

{
  "observation_count": 150,
  "time_coverage": {
    "expected_periods": 12,
    "observed_periods": 10,
    "coverage_ratio": 0.833
  },
  "distribution": {
    "min": 9.85,
    "max": 18.40,
    "mean": 12.62,
    "median": 10.90,
    "stddev": 3.44,
    "cv": 0.273
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

`funnel_conversion`

For sequential stage-based processes (sales funnels, onboarding flows, etc.)

Level 0 required fields:

{
  "stage_count": 5,
  "total_entered": 10000,
  "total_converted": 342,
  "overall_conversion_rate": 0.0342,
  "bottleneck_stage": "checkout",
  "bottleneck_drop_rate": 0.67,
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

`cohort_behavior`

For tracking groups over time (user cohorts, customer segments, etc.)

Level 0 required fields:

{
  "cohort_count": 12,
  "total_subjects": 5000,
  "observation_periods": 6,
  "retention_summary": {
    "period_1": 0.85,
    "period_3": 0.62,
    "period_6": 0.41
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

`inventory_snapshot`

For point-in-time inventory or resource states

Level 0 required fields:

{
  "snapshot_timestamp": "2025-01-14T00:00:00Z",
  "total_skus": 1500,
  "total_units": 125000,
  "total_value": 2500000.00,
  "stockout_skus": 45,
  "overstock_skus": 120,
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

`anomaly_detection`

For systems monitoring and alerting contexts

Level 0 required fields:

{
  "detection_window": {
    "start": "2025-01-01T00:00:00Z",
    "end": "2025-01-14T00:00:00Z"
  },
  "anomaly_count": 3,
  "severity_distribution": {
    "critical": 1,
    "warning": 2,
    "info": 0
  },
  "top_anomaly": {
    "timestamp": "2025-01-10T14:30:00Z",
    "metric": "cpu_usage",
    "observed": 98.5,
    "expected_range": [20, 60],
    "severity": "critical"
  },
  "reliability": { },
  "quality_flags": { },
  "interpretation_hints": []
}

5. Reliability Block (Required in Level 0)

Every Level 0 must include a reliability block that quantifies confidence in the summary statistics.

"reliability": {
  "sample_size_class": "sparse|adequate|robust",
  "sample_size_n": 5,
  "sample_size_threshold_adequate": 30,
  "sample_size_threshold_robust": 100,

  "confidence_in_mean": {
    "level": 0.95,
    "margin_of_error": 2.4,
    "interval": [10.22, 15.02]
  },

  "data_quality_score": 0.85,
  "data_quality_notes": "3% of rows had imputed values",

  "staleness": {
    "last_observation": "2025-11-21T00:00:00Z",
    "days_since_last": 54,
    "is_stale": true,
    "stale_threshold_days": 30
  }
}

Sample Size Classes

Class	Criteria	Implication
sparse	n < 30	Summary stats are unstable. Treat with caution.
adequate	30 ≤ n < 100	Stats are reasonable but not rock-solid.
robust	n ≥ 100	High confidence in summary statistics.

6. Quality Flags (Required in Level 0)

Standardized boolean flags that trigger agent attention:

"quality_flags": {
  "low_sample_size": true,
  "missing_periods": true,
  "suspected_outliers": true,
  "data_staleness": true,
  "high_variance": false,
  "imputation_applied": false,
  "schema_drift_detected": false
}

Flag	Trigger Condition
low_sample_size	reliability.sample_size_class == "sparse"
missing_periods	time_coverage.coverage_ratio < 0.8
suspected_outliers	Any value > 3σ from mean, or IQR-based detection
data_staleness	reliability.staleness.is_stale == true
high_variance	distribution.cv > 0.3
imputation_applied	Any values were filled/estimated
schema_drift_detected	Column types or names changed from baseline

7. Token Budget Block

Helps agents decide whether to load deeper levels:

"token_budget": {
  "level_0_tokens_approx": 450,
  "level_1_tokens_approx": 1200,
  "level_2_tokens_approx": 8500,

  "level_2_row_count": 517,
  "level_2_inline_row_limit": 50,
  "level_2_inline_strategy": "preview_outliers",

  "compression_ratios": {
    "level_1_vs_level_2": 7.1,
    "level_0_vs_level_2": 18.9
  },

  "recommended_strategy": "Load Level 0 first. If quality_flags has any true values, load Level 1. Only load Level 2 if investigating specific anomalies."
}

Level 2 Inline Strategies

Strategy	Behavior
preview_outliers	Inline only rows flagged as outliers or anomalies
preview_recent	Inline only the N most recent rows
preview_sample	Inline a random sample of N rows
full_inline	Inline all rows (use only if row_count < limit)
none	No rows inlined; agent must use retrieval

8. Level 1 Geometry Schema

Level 1 uses columnar encoding for aggregated data.

"level_1_geometry": {
  "format": "columnar_json_v1",
  "granularity": "month",

  "schema": {
    "columns": [
      { "name": "period", "type": "string", "description": "Aggregation bucket (YYYY-MM)" },
      { "name": "n", "type": "integer", "description": "Observation count in period" },
      { "name": "min", "type": "double", "description": "Minimum value in period" },
      { "name": "max", "type": "double", "description": "Maximum value in period" },
      { "name": "mean", "type": "double", "description": "Mean value in period" },
      { "name": "median", "type": "double", "description": "Median value in period" },
      { "name": "stddev", "type": "double", "description": "Standard deviation", "nullable": true },
      { "name": "flag", "type": "string", "description": "Optional anomaly flag", "nullable": true }
    ],
    "primary_sort": ["period"]
  },

  "rows": [
    ["2025-01", 15, 9.80, 11.20, 10.45, 10.40, 0.35, null],
    ["2025-02", 12, 10.10, 11.50, 10.80, 10.75, 0.42, null],
    ["2025-07", 3, 17.50, 18.90, 18.20, 18.40, 0.70, "spike_detected"]
  ],

  "missing_periods": ["2025-03", "2025-04", "2025-05", "2025-06"]
}

9. Level 2 Telemetry Schema

Level 2 contains raw observations. For large datasets, use preview + retrieval.

"level_2_telemetry": {
  "format": "columnar_json_v1",

  "schema": {
    "columns": [
      { "name": "ts", "type": "timestamp", "description": "Observation timestamp" },
      { "name": "value", "type": "double", "description": "Observed value" },
      { "name": "document_id", "type": "string", "description": "Source document reference", "nullable": true },
      { "name": "notes", "type": "string", "description": "Human or ETL notes", "nullable": true }
    ],
    "primary_sort": ["ts"]
  },

  "row_count_total": 517,

  "inline_rows": {
    "strategy": "preview_outliers",
    "rows": [
      ["2025-07-09T00:00:00Z", 18.40, "INV-12002", "Expedite fee applied"],
      ["2025-07-15T00:00:00Z", 17.90, "INV-12015", "Small lot surcharge"]
    ]
  },

  "retrieval": {
    "method": "mcp_tool",
    "tool_name": "get_timeseries_telemetry",
    "tool_args": {
      "manifold_id": "mfld_abc123",
      "level": 2,
      "filters": {}
    },
    "pagination": {
      "default_page_size": 100,
      "max_page_size": 500
    }
  }
}

When to Inline vs. Retrieve

Row Count	Recommendation
≤ 50	Full inline ("strategy": "full_inline")
51-500	Preview inline + retrieval available
> 500	Preview inline only; retrieval required for full data

10. Interpretation Hints

Natural language guidance for agents, always an array of strings:

"interpretation_hints": [
  "Sparse series: only 5 observations across 12 months. Summary statistics are unreliable.",
  "July 2025 shows a price spike (18.40 vs median 10.90). Investigate Level 2 for evidence.",
  "High coefficient of variation (0.27) suggests inconsistent pricing or mixed product types.",
  "Coverage ratio is 0.42, meaning 58% of expected periods have no data."
]

Guidelines for Hint Authoring

Lead with the most actionable insight
Reference specific numbers from the manifold
Suggest next steps (e.g., "Investigate Level 2")
Keep each hint to 1-2 sentences

11. Lineage Block

Provenance metadata for auditability:

"lineage": {
  "manifold_id": "mfld_abc123",
  "computed_at": "2026-01-14T22:10:00Z",
  "computed_by": "tms_generator_v1.2",
  "computation_duration_ms": 450,

  "inputs": [
    {
      "dataset_id": "purchase_orders_silver",
      "dataset_version": "v5.2",
      "row_count": 125000,
      "as_of_timestamp": "2026-01-14T00:00:00Z"
    }
  ],

  "filters_applied": [
    { "field": "supplier_id", "operator": "eq", "value": "S-789" },
    { "field": "part_id", "operator": "eq", "value": "P-123456" }
  ],

  "transformations": [
    "Converted unit_price from cents to dollars",
    "Excluded cancelled PO lines",
    "Imputed missing facility codes as 'UNKNOWN'"
  ]
}

12. Complete Example

A full timeseries_metric manifold:

{
  "artifact_type": "tabular_manifold",
  "artifact_version": "1.1",
  "manifold_kind": "timeseries_metric",

  "subject": {
    "entity_type": "part_supplier_price",
    "part_id": "P-123456",
    "part_number": "WIDGET-SS-075",
    "part_description": "Widget, stainless steel, 3/4 inch",
    "supplier_id": "S-789",
    "supplier_name": "Acme Industrial Supply",
    "metric_name": "unit_price",
    "currency": "USD",
    "unit_of_measure": "EA"
  },

  "time_window": {
    "start": "2025-01-01",
    "end": "2025-12-31",
    "timezone": "UTC",
    "granularity": "month"
  },

  "level_0_summary": {
    "observation_count": 5,
    "time_coverage": {
      "expected_periods": 12,
      "observed_periods": 4,
      "coverage_ratio": 0.333
    },
    "distribution": {
      "min": 9.85,
      "max": 18.40,
      "mean": 12.01,
      "median": 10.70,
      "stddev": 3.44,
      "cv": 0.286
    },
    "reliability": {
      "sample_size_class": "sparse",
      "sample_size_n": 5
    },
    "quality_flags": {
      "low_sample_size": true,
      "missing_periods": true,
      "suspected_outliers": true,
      "data_staleness": true
    },
    "interpretation_hints": [
      "Sparse data: only 5 observations across 12 months.",
      "One outlier detected: $18.40 in July (82% above median).",
      "Recommended: inspect Level 1 to identify spike month."
    ]
  },

  "level_1_geometry": {
    "format": "columnar_json_v1",
    "granularity": "month",
    "schema": {
      "columns": [
        { "name": "period", "type": "string" },
        { "name": "n", "type": "integer" },
        { "name": "mean", "type": "double" },
        { "name": "flag", "type": "string", "nullable": true }
      ]
    },
    "rows": [
      ["2025-01", 1, 10.10, null],
      ["2025-03", 2, 10.80, null],
      ["2025-07", 1, 18.40, "outlier_spike"],
      ["2025-11", 1, 9.85, null]
    ]
  },

  "level_2_telemetry": {
    "format": "columnar_json_v1",
    "row_count_total": 5,
    "inline_rows": {
      "strategy": "full_inline",
      "rows": [
        ["2025-01-14T00:00:00Z", 10.10, "PO-555"],
        ["2025-03-02T00:00:00Z", 10.70, "PO-612"],
        ["2025-03-28T00:00:00Z", 10.90, "PO-640"],
        ["2025-07-09T00:00:00Z", 18.40, "PO-777"],
        ["2025-11-21T00:00:00Z", 9.85, "PO-901"]
      ]
    }
  },

  "lineage": {
    "manifold_id": "mfld_price_P123456_S789_2025",
    "computed_at": "2026-01-14T22:30:00Z",
    "computed_by": "tms_generator_v1.1"
  }
}

13. JSON Schema

The formal JSON Schema for validation. Key validation rules:

artifact_type must equal "tabular_manifold"
manifold_kind must be one of the five canonical kinds
level_0_summary is required and must include reliability and quality_flags
If level_1_geometry or level_2_telemetry exists, it must have format: "columnar_json_v1"
rows array length must match for all rows (equal to columns array length)

Download JSON Schema

14. MCP Integration Pattern

TMS manifolds are designed to be returned by MCP tools. Recommended pattern:

@mcp_tool
def get_price_manifold(part_id: str, supplier_id: str, year: int) -> dict:
    """
    Returns a TMS manifold for part/supplier price history.
    
    The manifold includes:
    - Level 0: Summary statistics and quality flags
    - Level 1: Monthly aggregates
    - Level 2: Raw transactions (preview only if >50 rows)
    
    Start with level_0_summary. Check quality_flags to decide 
    whether deeper investigation is needed.
    """
    # ... generate manifold ...
    return manifold

Tool Documentation Should Instruct the Agent:

Always read level_0_summary first
Check quality_flags for any true values
If flags are set, read level_1_geometry to locate the issue
Only load level_2_telemetry when investigating specific anomalies
Use interpretation_hints as reasoning guidance

Changelog

v1.1 (2026-01-14)

Added required reliability block to Level 0
Defined five canonical manifold_kind values with fixed Level 0 schemas
Added token_budget block with compression ratios
Specified inline_rows.strategy enum for Level 2
Removed underspecified drilldown_policy DSL (replaced with interpretation_hints and recommended_strategy)
Added outlier_summary to Level 0 for timeseries_metric
Clarified inline vs retrieval thresholds

v1.0 (2026-01-10)

Initial spec

License

TMS is released under Apache 2.0. Use it freely.