
The ASTRA Specification Explained

As agents make it easier to generate analyses, the bottleneck shifts from producing results to inspecting whether each result should be trusted. An astra.yaml file is a scientific record that chains together inputs, outputs, methodological choices, evidence, and claims. It allows an experiment to be quickly checked and expanded upon. astra.yaml is meant to be written and read by agents as readily as by people.

Here, we explain each part of the format piece by piece.

What an ASTRA document describes

An ASTRA document describes the purpose of an experiment, which artifacts matter, which choices shape those artifacts, and how a reader can trace claims back to evidence.

An astra.yaml file contains an Analysis, which declares:

| Section | Question it answers |
| --- | --- |
| description | How is this analysis explained in prose? |
| inputs | What data or prior analyses does this analysis consume? |
| outputs | What metrics, figures, tables, data products, or reports does it produce? |
| decisions | Which methodological choice points shape the outputs? |
| prior_insights | Which existing claims or sources inform the analysis? |
| findings | What claims does this analysis make after its outputs are produced? |
| analyses | Which nested sub-analyses make up a larger analysis tree? |

Minimal ASTRA document

A minimal useful ASTRA document names the analysis and declares an input, an output, and the decisions that affect that output. In the example below, the analysis consumes the catalog_data input, produces the fit_params output, and exposes one methodological choice, fit_method.

version: "1.0"
name: Period-Luminosity Fit

description: |
  Fit a period-luminosity relation from a measurement catalog.

inputs:
  - id: catalog_data
    type: data
    source: data/catalog_data.csv
    description: Periods and mean apparent magnitudes.

outputs:
  - id: fit_params
    type: table
    description: Slope, intercept, and scatter for the period-luminosity relation.
    inputs: [catalog_data]
    decisions: [fit_method]
    recipe:
      command: >-
        python src/fit_period_luminosity.py
        --catalog {inputs.catalog_data}
        --method {decisions.fit_method}
        --out {output}

decisions:
  fit_method:
    label: Fitting method
    rationale: The fitting method determines how outliers influence the inferred relation.
    default: ordinary_least_squares
    options:
      ordinary_least_squares:
        label: Ordinary least squares
      robust_linear:
        label: Robust linear fit

Reading the example from top to bottom

version: "1.0"
name: Period-Luminosity Fit

The version field records the ASTRA schema version the document expects. The name field gives the analysis a human-readable title. Real projects usually also include id, tags, and sometimes a node-level container used as the default execution environment for recipes.

Description

description: |
  Fit a period-luminosity relation from a measurement catalog.

description is a single optional free-prose field — the same field every other content object carries (Input, Output, Option, Universe). It gives readers a short orientation to the analysis. ASTRA deliberately keeps this lightweight: a richer write-up — with figures, citations, live numbers, and multi-page structure — is authored outside astra.yaml as a report that references the analysis's elements rather than restating them. ASTRA is not prescriptive about the authoring framework; MySTRA is one example — a MyST plugin that renders ASTRA components straight from astra.yaml. See RFC-0002 for the rationale.

Inputs

inputs:
  - id: catalog_data
    type: data
    source: data/catalog_data.csv
    description: Periods and mean apparent magnitudes.

An input is something the analysis consumes. It can be a dataset, a file, an external resource, or the outputs of another ASTRA analysis. The id is the local name used by outputs and recipes. The type says whether the input is data or an external analysis. The source is usually a path, a URI, a loader name, or another data locator. It is descriptive rather than prescriptive: it records enough information for an agent or a reader to know where the input comes from.

Outputs

outputs:
  - id: fit_params
    type: table
    description: Slope, intercept, and scatter for the period-luminosity relation.
    inputs: [catalog_data]
    decisions: [fit_method]
    recipe:
      command: >-
        python src/fit_period_luminosity.py
        --catalog {inputs.catalog_data}
        --method {decisions.fit_method}
        --out {output}

An output is a scientific artifact the analysis produces: a metric, figure, table, data product, or report. Each output states what it depends on: inputs names the upstream artifacts required to produce it, and decisions names the methodological choices that parameterize it. Finally, recipe gives the shell command template the runner invokes, which in this example calls a Python script.

A recipe may reference only the inputs and decisions that its output declares, so it cannot introduce hidden dependencies. This makes each output a reviewable unit.

Decisions

A reviewer might ask: "What if you used fitting method B instead of method A?" In ASTRA, you can codify this decision and track how it changes the outputs.

decisions:
  fit_method:
    label: Fitting method
    rationale: The fitting method determines how outliers influence the inferred relation.
    default: ordinary_least_squares
    options:
      ordinary_least_squares:
        label: Ordinary least squares
      robust_linear:
        label: Robust linear fit

The fit_method decision captures the fitting-method choice. default records the baseline choice used by the analysis, and options records the alternatives.

Use a decision when changing a methodological choice could change an output. Give the choice an id, record the baseline with default, and list the allowed options. Then attach the decision to each affected output. In this example, fit_params.decisions: [fit_method] tells the reader that the fitted parameters depend on the selected fitting method.

Building up the specification

The minimal document above is enough to explain the basic shape. The rest of the specification adds structure that becomes important in real analyses: constraints between options, conditional outputs, evidence-backed claims, and nested sub-analyses.

Options

Usually, a methodological question has a small set of plausible answers. ASTRA records those answers as options inside a decision. In the example below, the photometric_band decision asks which measurement band should be used in the fit, and the options are g_band, i_band, and w1_band.

decisions:
  photometric_band:
    label: Photometric band
    rationale: The chosen band changes the fitted relation and its scatter.
    default: g_band
    options:
      g_band:
        label: G band
        description: Use mean G-band magnitudes.
      i_band:
        label: I band
        description: Use mean I-band magnitudes.
      w1_band:
        label: W1 band
        description: Use near-infrared W1 magnitudes.

Constraints between options

Sometimes, methodological choices are linked: one choice may require another, or two choices may not make sense together. ASTRA records these relationships with requires and incompatible_with. Constraint references use decision_id.option_id, so each rule points to a specific option inside a specific decision.

decisions:
  fit_method:
    label: Line fitting method
    default: ordinary_least_squares
    options:
      ordinary_least_squares:
        label: Ordinary least squares
      robust_linear:
        label: Robust linear fit

  outlier_handling:
    label: Outlier handling
    default: keep_all
    options:
      keep_all:
        label: Keep all points
      sigma_clip:
        label: Remove extreme outliers
        incompatible_with:
          - fit_method.robust_linear

Here, outlier_handling.sigma_clip is incompatible with fit_method.robust_linear. Both choices reduce the influence of points far from the fitted trend: sigma_clip removes extreme points before fitting, while robust_linear keeps them but downweights them. A universe that selects both is invalid, so ASTRA makes that methodological boundary explicit.
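The constraint rule above can be sketched in a few lines of Python. This is a minimal illustration, not the astra validator: it assumes the decisions block has already been parsed into plain dictionaries, and the function name constraint_violations is hypothetical.

```python
def constraint_violations(decisions, selection):
    """Return readable violations of requires / incompatible_with rules.

    decisions: the parsed `decisions` mapping of one analysis node.
    selection: a universe's mapping of decision_id -> option_id.
    """
    # Every selected option, written in the decision_id.option_id reference form.
    selected = {f"{d}.{o}" for d, o in selection.items()}
    violations = []
    for decision_id, option_id in selection.items():
        option = decisions[decision_id]["options"][option_id]
        here = f"{decision_id}.{option_id}"
        for ref in option.get("requires", []):
            if ref not in selected:
                violations.append(f"{here} requires {ref}")
        for ref in option.get("incompatible_with", []):
            if ref in selected:
                violations.append(f"{here} is incompatible with {ref}")
    return violations


# The two decisions from the example above, as parsed dictionaries.
decisions = {
    "fit_method": {
        "options": {
            "ordinary_least_squares": {"label": "Ordinary least squares"},
            "robust_linear": {"label": "Robust linear fit"},
        },
    },
    "outlier_handling": {
        "options": {
            "keep_all": {"label": "Keep all points"},
            "sigma_clip": {
                "label": "Remove extreme outliers",
                "incompatible_with": ["fit_method.robust_linear"],
            },
        },
    },
}

baseline = {"fit_method": "ordinary_least_squares", "outlier_handling": "keep_all"}
invalid = {"fit_method": "robust_linear", "outlier_handling": "sigma_clip"}

print(constraint_violations(decisions, baseline))  # no violations
print(constraint_violations(decisions, invalid))   # one incompatible_with violation
```

The check reads constraints only from the option that declares them, so a rule written once on sigma_clip is enough to reject the combination.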

Universes

A universe is one complete selection of decision options, stored separately from astra.yaml. A project usually keeps one YAML file per universe in a universes/ directory, such as universes/baseline.yaml or universes/cleaned-data.yaml. If the analysis defines fit_method and outlier_handling, then each universe file chooses one option for each decision.

# universes/baseline.yaml
id: baseline
description: Fit the relation with ordinary least squares and keep all points.

decisions:
  fit_method: ordinary_least_squares
  outlier_handling: keep_all

One analysis can have many universes. A baseline universe might keep all points and use ordinary least squares. A robustness universe might switch to fit_method: robust_linear. A cleaned-data universe might use outlier_handling: sigma_clip. Each universe produces its own outputs under one declared choice configuration. For example, the plot from the baseline universe might be written to results/baseline/, while the cleaned-data universe plot might be written to results/clean/.
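As a rough sketch of what "one option for each decision" means in practice, the check below compares a parsed universe file against the parsed decisions of an analysis. It assumes both have been loaded into plain dictionaries, uses a hypothetical function name check_universe, and deliberately ignores constraints, conditions, and excluded options.

```python
def check_universe(decisions, universe):
    """List problems with a universe's selections for one analysis node."""
    problems = []
    selections = universe.get("decisions", {})
    for decision_id, decision in decisions.items():
        if decision_id not in selections:
            problems.append(f"no option selected for {decision_id}")
        elif selections[decision_id] not in decision["options"]:
            problems.append(
                f"{decision_id}: unknown option {selections[decision_id]}"
            )
    # Selections that name a decision the analysis never declared.
    for decision_id in selections:
        if decision_id not in decisions:
            problems.append(f"unknown decision {decision_id}")
    return problems


decisions = {
    "fit_method": {"options": {"ordinary_least_squares": {}, "robust_linear": {}}},
    "outlier_handling": {"options": {"keep_all": {}, "sigma_clip": {}}},
}

baseline = {
    "id": "baseline",
    "decisions": {
        "fit_method": "ordinary_least_squares",
        "outlier_handling": "keep_all",
    },
}
# A deliberately broken universe, made up for illustration.
incomplete = {"id": "incomplete", "decisions": {"fit_method": "quadratic"}}

print(check_universe(decisions, baseline))
print(check_universe(decisions, incomplete))
```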

Recipes and command templates

After an output has declared what it depends on, the recipe says how a runner should produce it. In this example, the recipe passes the declared catalog and selected fitting method to a Python script, then writes the fitted parameters to {output}.

outputs:
  - id: fit_params
    type: table
    inputs: [catalog_data]
    decisions: [fit_method]
    recipe:
      command: >-
        python src/fit_period_luminosity.py
        --catalog {inputs.catalog_data}
        --method {decisions.fit_method}
        --out {output}

The command is the only required part of a recipe. You can add optional container and resources elements when the runner needs execution context, for example a Docker image for the software environment, or CPU, memory, and wall-time requests for compute.

Prior insights, findings, and evidence

Scientific review is not only about checking the final result. It is also about checking why the analysis was set up the way it was and what the analysis claimed afterward. ASTRA separates claims that motivate the analysis from claims produced by the analysis. A prior_insight records an imported claim used to justify a choice, while a finding records a claim made by the current analysis. Both use the shared Insight model and can point to evidence.

prior_insights:
  calibration_reference:
    claim: External calibration information can shift the fitted relation.
    created_at: "2026-05-11T00:00:00Z"
    evidence:
      - id: ev_calibration_reference
        doi: "10.1051/0004-6361/202244775"

decisions:
  correction_mode:
    label: Correction mode
    rationale: Calibration choices can shift the fitted intercept.
    options:
      calibrated:
        label: Apply calibration
        insights: [calibration_reference]

findings:
  cleaned_fit:
    claim: The cleaned-data universe reduced the fit scatter.
    created_at: "2026-05-11T00:00:00Z"
    evidence:
      - id: ev_fit_params
        artifact: fit_params
        quote:
          exact: "scatter = 0.18 mag"
    derived: true

Evidence is what lets a reader check a claim instead of simply accepting it. It can cite a paper, identify a passage in a source document, or point to an artifact produced by the analysis. The same structure works for prior insights and findings: a decision can say which insight supports it, and each insight can say exactly which source or output supports the claim. With evidence verification enabled, tools can check whether quoted text actually appears in the cited source.

Excluded options

A rejected option can still be scientifically important. ASTRA lets authors keep it in the record while marking it as unavailable for valid universes.

options:
  quadratic_fit:
    label: Quadratic relation
    excluded: true
    excluded_reason: Pilot residuals did not justify adding curvature.

In this way, a reviewer can see not only what was chosen, but what was considered and why it was rejected.

Sub-analyses

Experiments are usually made of smaller analyses. In ASTRA, you can build them up as nested analyses: a cleaning stage can feed a fitting stage, which can feed a summary plot.

analyses:
  catalog_cleaning:
    id: catalog_cleaning
    inputs:
      - id: raw_catalog
        from: ../source_catalog
    outputs:
      - id: cleaned_catalog
        type: data
        decisions: [outlier_handling]
        recipe:
          command: python src/clean_catalog.py --out {output}
    decisions:
      outlier_handling:
        label: Outlier handling
        default: keep_all
        options:
          keep_all: { label: Keep all points }
          sigma_clip: { label: Remove extreme outliers }

A sub-analysis is itself an Analysis. It can have its own description, inputs, outputs, decisions, findings, and nested children. This self-similar structure lets authors describe a project at multiple levels of detail without switching formats.

Sub-analyses can be written inline, as above, or split into their own directories when a project becomes large. If path is set, that child analysis is read from another astra.yaml, while still belonging to the same conceptual analysis tree.

analyses:
  catalog_cleaning:
    path: stages/catalog_cleaning

If the dependency is a separate ASTRA record rather than a child of the current analysis, declare it as an input with type: analysis and ref.

inputs:
  - id: prior_fit
    type: analysis
    ref: analyses/baseline_fit
    ref_version: "v1.2"
    use_outputs: [fit_params, residual_plot]

Conditional elements

Use the when field when a choice creates a branch of the analysis. For example, one choice may require an extra assumption, diagnostic, or output that should not appear in every universe. Conditions use the decision_id.option_id form, with a ~ prefix for negation. Multiple conditions are ANDed together.

decisions:
  correction_mode:
    label: Correction mode
    default: none
    options:
      none: { label: No correction }
      calibrated: { label: Apply calibration }

  calibration_prior:
    label: Calibration prior
    when:
      - correction_mode.calibrated
    default: weak
    options:
      weak: { label: Weak prior }
      informative: { label: Informative prior }

outputs:
  - id: calibrated_table
    type: table
    when:
      - correction_mode.calibrated
    recipe:
      command: python src/apply_calibration.py --out {output}

In the example, calibration_prior and calibrated_table exist only for the calibrated branch. A baseline universe that selects correction_mode.none stays simpler: it does not carry a prior or output that it never uses.
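The condition semantics described above (a reference, an optional ~ for negation, and AND across entries) can be sketched as a small predicate. This is an illustrative reading of the rules, not the astra implementation; the function name is_active is hypothetical, and the universe is assumed to be a plain mapping of decision ID to option ID.

```python
def is_active(when, selection):
    """True when every condition in `when` holds for the selected options."""
    selected = {f"{d}.{o}" for d, o in selection.items()}
    for condition in when:
        negated = condition.startswith("~")
        ref = condition[1:] if negated else condition
        holds = ref in selected
        # A plain condition must hold; a negated condition must not.
        if holds == negated:
            return False
    return True


calibrated = {"correction_mode": "calibrated"}
baseline = {"correction_mode": "none"}

print(is_active(["correction_mode.calibrated"], calibrated))  # active
print(is_active(["correction_mode.calibrated"], baseline))    # inactive
print(is_active(["~correction_mode.calibrated"], baseline))   # active
```

An element with no when list is treated as always active, because an empty list has no condition that can fail.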

Validation model

ASTRA validation is designed to catch both syntax errors and scientific-record errors.

| Stage | What it checks |
| --- | --- |
| Schema validation | YAML shape, types, enums, version and DOI patterns. |
| Semantic validation | Duplicate IDs, references, from paths, recipe placeholders, and constraint satisfaction. |
| Evidence verification | Optional quote matching against cited sources. |

Run validation with:

astra validate astra.yaml

Evidence verification is opt-in:

astra validate astra.yaml --verify-evidence

Validation does not prove that the science is correct. It shows only that the record is structured enough to inspect.

Conclusion

That covers the ASTRA format. If you're curious, try asking your agent to turn a piece of your own research into an astra.yaml and see what comes back. The rest of this page is a field reference for the individual schema elements.


Field reference

For generated class-level documentation, see the schema reference.

Analysis

The Analysis object is the root of astra.yaml and the type used for every sub-analysis.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | No | Identifier for this analysis, especially when nested. |
| version | string | No | ASTRA schema version, e.g. "1.0" or "1.0.0". |
| name | string | No | Human-readable analysis name. |
| description | string | No | Free-prose description of the analysis. |
| tags | string[] | No | Free-form categorization tags. |
| container | string | No | Default container for recipes in this analysis node. |
| inputs | Input[] | No | Data or prior analyses consumed by this analysis. |
| outputs | Output[] | No | Artifacts produced or re-exported by this analysis. |
| decisions | map of Decision | No | Methodological choice points. |
| prior_insights | map of Insight | No | Existing claims used to motivate choices. |
| findings | map of Insight | No | Claims produced by this analysis. |
| analyses | map of Analysis | No | Nested sub-analyses. |
| path | string | No | External directory containing a sub-analysis ASTRA file. |

path is for nested analyses only. It is mutually exclusive with inline content fields on that sub-analysis.

Input

An input declares something the analysis consumes, or aliases an upstream artifact with from.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | Yes | Local identifier. |
| label | string | No | Short display name. |
| type | data or analysis | Yes when from is absent | Kind of input. |
| description | string | No | Human-readable explanation. |
| source | string | No | URI, path, loader, or other data locator for type: data. |
| ref | string | No | Reference to another ASTRA analysis for type: analysis. |
| ref_version | string | No | Version of the referenced analysis. |
| use_outputs | string[] | No | Outputs to consume from a referenced analysis. |
| from | string | No | Path alias to an upstream input or sibling output. |

When from is present, the input is a pure alias. Only id and from may be declared, and content is inherited from the source.

Output

An output is an artifact produced locally or re-exported from a child sub-analysis.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | Yes | Local identifier. |
| label | string | No | Short display name. |
| type | metric, figure, table, data, or report | Yes when from is absent | Artifact kind. |
| description | string | No | What the output represents. |
| from | string | No | Path alias to a child output. |
| when | string[] | No | Conditions under which the output is active. |
| inputs | string[] | No | Local input or sibling output IDs this output depends on. |
| decisions | string[] | No | Local decision IDs that parameterize the output. |
| recipe | Recipe | No | Command and execution context for producing the output. |

Output types:

| Type | Use for |
| --- | --- |
| metric | A scalar or categorical measurement such as fit scatter, p-value, likelihood, or score. |
| figure | A visual artifact such as a plot, map, diagnostic, or image. |
| table | Structured tabular output. |
| data | A processed dataset, catalog, calibration table, or intermediate artifact. |
| report | Textual or document output. |

When from is present, the output is a pure re-export. Only id, from, and when may be declared locally.

Recipe

A recipe is a command plus optional execution context.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| command | string | Yes | POSIX shell command template. |
| resources | Resources | No | Compute requirements. |
| container | string | No | Container image or path to a Containerfile. |

Recipe placeholders:

| Placeholder | Meaning |
| --- | --- |
| {inputs.<id>} | Path to a named input declared in the parent output's inputs. |
| {inputs} | Space-separated paths to all parent output inputs in declaration order. |
| {decisions.<id>} | Active option ID for a decision declared in the parent output's decisions. |
| {output} | Path where the runner should write the produced artifact. |
| {{ and }} | Literal braces. |

The command string is a typed template. Every {inputs.<id>} placeholder must name an input or sibling output listed in the parent Output.inputs. Every {decisions.<id>} placeholder must name a decision listed in the parent Output.decisions. Validators reject unresolved or undeclared placeholders.
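To make the placeholder rules concrete, here is a minimal sketch of a template renderer that fills the documented placeholders and rejects anything undeclared. It is an assumption-laden illustration rather than the runner's actual code: the function name render_command and its arguments are hypothetical, and the caller is expected to pass only the inputs and decisions listed on the parent output.

```python
import re

# Matches literal-brace escapes first, then a single {placeholder}.
_TOKEN = re.compile(r"\{\{|\}\}|\{([^{}]*)\}")


def render_command(template, input_paths, selection, output_path):
    """Fill a command template.

    input_paths: path for each ID in the parent output's `inputs`,
                 in declaration order.
    selection:   option ID for each ID in the parent output's `decisions`.
    """
    def substitute(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        name = match.group(1)
        if name == "output":
            return output_path
        if name == "inputs":
            return " ".join(input_paths.values())
        kind, _, ident = name.partition(".")
        if kind == "inputs" and ident in input_paths:
            return input_paths[ident]
        if kind == "decisions" and ident in selection:
            return selection[ident]
        raise ValueError(f"undeclared placeholder: {token}")

    return _TOKEN.sub(substitute, template)


template = (
    "python src/fit_period_luminosity.py "
    "--catalog {inputs.catalog_data} "
    "--method {decisions.fit_method} "
    "--out {output}"
)
rendered = render_command(
    template,
    {"catalog_data": "/work/baseline/catalog.csv"},
    {"fit_method": "ordinary_least_squares"},
    "/work/baseline/fit_params.csv",
)
print(rendered)
```

The sketch substitutes values verbatim and does no shell quoting, so a real runner would need to handle paths that contain spaces or shell metacharacters.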

Placeholders always use local IDs in the surrounding analysis scope. If an input or decision is aliased from another scope with from, the recipe still names the local alias. Recipes do not use ../ path syntax. Cross-scope wiring is declared once on the Input, Output, or Decision.

Decision placeholders resolve to the selected option ID in the current universe. If a script needs a numeric value, such as a seed, either map the option ID inside the script or choose option IDs that are usable directly.
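One way to map an option ID to a numeric value inside the script is a small lookup table next to the argument parser. The sketch below is illustrative only: the random_seed decision, its option IDs seed_a and seed_b, the seed values, and the --seed-option flag are all hypothetical names invented for this example.

```python
import argparse

# Hypothetical option IDs of a hypothetical `random_seed` decision,
# mapped to the numeric seeds the script actually needs.
SEED_BY_OPTION = {"seed_a": 11, "seed_b": 23}


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # The recipe would pass the option ID here via {decisions.random_seed}.
    parser.add_argument(
        "--seed-option", choices=sorted(SEED_BY_OPTION), required=True
    )
    return parser.parse_args(argv)


def seed_from_option(option_id):
    """Translate the selected option ID into the numeric seed."""
    return SEED_BY_OPTION[option_id]


args = parse_args(["--seed-option", "seed_b"])
print(seed_from_option(args.seed_option))
```

Keeping the mapping in the script means the option IDs in astra.yaml can stay descriptive, at the cost of the numeric value living outside the record.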

Resources:

| Field | Type | Meaning |
| --- | --- | --- |
| cpus | number | Requested CPU cores. Fractional values are allowed. |
| memory | string | Memory with units, e.g. "16Gi" or "8GB". |
| time_limit | string | Wall-time duration, e.g. "30m" or "2h". |
| disk | string | Disk with units. |
| gpus | integer | Number of GPUs. |

A node-level container on Analysis sets the default for recipes in that node. A recipe-level container overrides it. Image names such as python:3.12-slim or ghcr.io/org/image:latest are interpreted as pre-built images. Paths such as Containerfile or containers/Dockerfile are interpreted as build contexts by runners that support them.

Example rendered command after a runner materializes paths and selects a universe:

python src/fit_period_luminosity.py \
  --catalog /work/baseline/catalog.csv \
  --method ordinary_least_squares \
  --out /work/baseline/fit_params.csv

Decision

A decision is a methodological choice point.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| label | string | Yes when from is absent | Human-readable name. |
| rationale | string | No | Why this choice matters scientifically. |
| tags | string[] | No | Grouping labels. |
| when | string[] | No | Conditions under which this decision is active. |
| from | string | No | Alias to an ancestor decision. |
| default | string | No | Default option ID. |
| options | map of Option | Yes when from is absent | Available choices. |

When from is present, the decision is a pure alias to an ancestor decision. Only from and, where needed, when may be declared locally.

Option

An option is one possible selection for a decision.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| label | string | Yes | Human-readable name. |
| description | string | No | Explanation of what the option does. |
| notes | string | No | Additional author notes. |
| insights | string[] | No | Prior insight IDs supporting the option. |
| requires | string[] | No | Other options that must be selected with this one. |
| incompatible_with | string[] | No | Other options that cannot be selected with this one. |
| excluded | boolean | No | Marks an option as considered but unavailable. |
| excluded_reason | string | No | Why the option was excluded. |

Constraint references use decision_id.option_id and are scoped within the same analysis node.

Insight and Evidence

Insight is used for both prior_insights and findings.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | Yes | Unique insight identifier. |
| claim | string | Yes | The scientific claim. |
| label | string | No | Short display name. |
| created_at | datetime | Yes | Creation timestamp. |
| evidence | Evidence[] | Yes | Sources or artifacts supporting the claim. |
| derived | boolean | No | Whether the claim was produced by this analysis. |
| scope | string | No | Applicability conditions for the claim. |
| tags | string[] | No | Categorization tags. |
| notes | string | No | Additional prose. |

Each evidence item references either literature with doi or an analysis artifact with artifact. Exactly one of those source fields must be set. Literature evidence should include a text quote for verifiability.
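The "exactly one of doi or artifact" rule, together with the DOI pattern listed under ID conventions, could be checked roughly as follows. This is a sketch over an already-parsed evidence item, with a hypothetical function name evidence_problems; it does not reproduce the schema validator.

```python
import re

# DOI pattern as listed in the ID conventions table.
DOI = re.compile(r"^10\.\d{4,}/.*$")


def evidence_problems(item):
    """List problems with the source fields of one parsed evidence item."""
    problems = []
    has_doi = "doi" in item
    has_artifact = "artifact" in item
    # Both present or both absent violates "exactly one".
    if has_doi == has_artifact:
        problems.append("set exactly one of doi or artifact")
    if has_doi and not DOI.match(item["doi"]):
        problems.append("doi does not match the DOI pattern")
    return problems


literature = {"id": "ev_calibration_reference", "doi": "10.1051/0004-6361/202244775"}
artifact = {"id": "ev_fit_params", "artifact": "fit_params"}
both = {"id": "ev_both", "doi": "10.1051/0004-6361/202244775", "artifact": "fit_params"}

print(evidence_problems(literature))
print(evidence_problems(artifact))
print(evidence_problems(both))
```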

Evidence fields:

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | Yes | Local evidence identifier. |
| doi | string | Exactly one of doi or artifact | DOI for a cited paper or source. |
| artifact | string | Exactly one of doi or artifact | ASTRA artifact ID, often an output. |
| version | integer | No | Source paper version, especially for arXiv papers. |
| snapshot | string | No | Path to an immutable copy of an artifact. |
| source_commit | string | No | Commit that produced an artifact. |
| quote | TextQuoteSelector | No | Exact text quote and optional prefix/suffix. |
| location | FragmentSelector | No | Source location hint such as a PDF page. |

ASTRA follows the spirit of W3C selectors: evidence should identify not just a source, but the specific location or text that supports the claim.

TextQuoteSelector fields:

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| exact | string | Yes | Exact quoted text. |
| prefix | string | No | Text before the quote, used for disambiguation. |
| suffix | string | No | Text after the quote, used for disambiguation. |
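A quote selector of this shape can be matched against source text with plain string operations. The sketch below is one possible reading, not the behavior of the astra evidence verifier: it assumes the source has already been extracted to text, that prefix and suffix must sit immediately next to the quote, and that no whitespace normalization is applied. The function name quote_matches and the sample source text are hypothetical.

```python
def quote_matches(source_text, quote):
    """True if quote['exact'] occurs with the optional prefix/suffix around it."""
    exact = quote["exact"]
    prefix = quote.get("prefix", "")
    suffix = quote.get("suffix", "")
    start = source_text.find(exact)
    while start != -1:
        end = start + len(exact)
        before_ok = source_text[:start].endswith(prefix)
        after_ok = source_text[end:].startswith(suffix)
        if before_ok and after_ok:
            return True
        # Try the next occurrence of the exact text.
        start = source_text.find(exact, start + 1)
    return False


# Made-up source text for illustration only.
source_text = "baseline scatter = 0.30 mag; cleaned scatter = 0.18 mag"

print(quote_matches(source_text, {"exact": "scatter = 0.18 mag"}))
print(quote_matches(source_text, {"exact": "scatter = 0.18 mag", "prefix": "cleaned "}))
print(quote_matches(source_text, {"exact": "scatter = 0.18 mag", "prefix": "baseline "}))
```

The prefix is what separates a quote that belongs to the cleaned fit from the same wording attached to a different result.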

FragmentSelector fields:

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| value | string | No | Fragment value, such as page=6. |
| page | integer | No | One-indexed page number. |
Universe

A universe selects one option for every active decision.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| id | string | Yes | Universe identifier. |
| description | string | No | Human-readable explanation. |
| decisions | map of decision_id: option_id | No | Selections at the current analysis scope. |
| analyses | map of UniverseNode | No | Nested selections mirroring sub-analyses. |

Universe IDs may use lowercase letters, numbers, underscores, and hyphens. Decision and option IDs use lowercase snake_case.

Conditions and constraints

when conditions use:

decision_id.option_id
~decision_id.option_id

The first form means “active when this option is selected.” The second means “active when this option is not selected.” Multiple entries are ANDed.

Option constraints use the same decision_id.option_id reference form:

| Field | Meaning |
| --- | --- |
| requires | The referenced option must also be selected. |
| incompatible_with | The referenced option must not be selected. |

For example, using a calibrated input can require a selected correction method:

decisions:
  data_version:
    options:
      calibrated:
        label: Calibrated input
        requires:
          - correction_mode.calibrated

An option can also rule out another option:

decisions:
  outlier_handling:
    options:
      sigma_clip:
        label: Remove extreme outliers
        incompatible_with:
          - fit_method.robust_linear

Negated conditions use the ~decision_id.option_id form:

decisions:
  residual_summary:
    label: Residual summary
    when:
      - ~outlier_handling.sigma_clip
    options:
      all_points:
        label: Use all residuals

Bridges and path grammar

from aliases elements across analysis scopes. The grammar is shared by inputs, outputs, and decisions, but each slot restricts which directions are legal.

| Form | Meaning |
| --- | --- |
| ../id | Move up one scope and reference id. |
| ../../id | Move up two scopes and reference id. |
| ../scope.id | Move up, then descend into a named child scope. |
| scope.id | Descend into a named child scope. |
| scope.sub.id | Descend through nested child scopes. |

Legal directions:

| Slot | Legal forms | Purpose |
| --- | --- | --- |
| Input.from | ../id, ../../id, ../scope.out_id | Alias an ancestor input or sibling sub-analysis output. |
| Output.from | child.out_id, child.sub.out_id | Re-export a child output. |
| Decision.from | ../id, ../../id | Inherit an ancestor decision. |

from is the only primitive for crossing analysis scopes. Recipe templates, Output.inputs, and Output.decisions continue to use local IDs in their surrounding scope. When from is set, the node is a pure alias: only id, from, and, where applicable, when may be declared locally. Content fields such as type, description, label, source, options, default, and recipe are inherited from the source.

Artifacts can flow upward, when a parent re-exports a child output, or laterally, when an input aliases a sibling sub-analysis output. An input can also alias an ancestor input. Decisions flow only downward, from ancestors into descendants. To share a decision between siblings, declare it on their common ancestor and alias it with from inside each child.
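The path grammar and the per-slot direction rules can be sketched as a small parser. This follows a literal reading of the two tables above and is not the astra validator: the names parse_from and is_legal are hypothetical, and whether deeper forms such as ../../scope.id are accepted for inputs is an assumption (they are rejected here because the table does not list them).

```python
import re

_NAME = r"[a-z][a-z0-9_]*"
# Zero or more "../" steps, then one or more dot-separated names.
_PATH = re.compile(rf"^((?:\.\./)*)({_NAME}(?:\.{_NAME})*)$")


def parse_from(path):
    """Split a `from` path into (levels_up, child_scopes, element_id)."""
    match = _PATH.match(path)
    if match is None:
        raise ValueError(f"not a valid from path: {path!r}")
    levels_up = match.group(1).count("../")
    *scopes, element_id = match.group(2).split(".")
    if levels_up == 0 and not scopes:
        # A bare local ID does not cross any scope.
        raise ValueError(f"path does not leave the current scope: {path!r}")
    return levels_up, scopes, element_id


def is_legal(slot, path):
    """Check a path against the legal forms for 'input', 'output', or 'decision'."""
    levels_up, scopes, _ = parse_from(path)
    if slot == "input":
        ancestor = levels_up >= 1 and not scopes
        sibling_output = levels_up == 1 and len(scopes) == 1
        return ancestor or sibling_output
    if slot == "output":
        return levels_up == 0 and len(scopes) >= 1
    if slot == "decision":
        return levels_up >= 1 and not scopes
    raise ValueError(f"unknown slot: {slot!r}")


print(parse_from("../preprocessing.cleaned_catalog"))
print(is_legal("input", "../preprocessing.cleaned_catalog"))
print(is_legal("decision", "../preprocessing.cleaned_catalog"))
```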

Sibling output alias:

analyses:
  preprocessing:
    outputs:
      - id: cleaned_catalog
        type: data

  fitting:
    inputs:
      - id: catalog
        from: ../preprocessing.cleaned_catalog

Child output re-export:

outputs:
  - id: fit_parameters
    from: fitting.fit_parameters

Ancestor decision alias:

analyses:
  fitting:
    decisions:
      magnitude:
        from: ../magnitude

External analysis dependencies are separate from from. Use type: analysis with ref when the dependency is a different ASTRA analysis rather than an element inside the current analysis tree:

inputs:
  - id: prior_fit
    type: analysis
    ref: analyses/baseline_fit
    ref_version: "1.2"
    use_outputs: [fit_parameters, residual_plot]

ID conventions

| Context | Pattern | Example |
| --- | --- | --- |
| Input, output, decision, option, sub-analysis, insight, evidence IDs | ^[a-z][a-z0-9_]*$ | catalog_data, fit_method |
| Universe IDs | ^[a-z][a-z0-9_-]*$ | baseline, cleaned-data |
| Constraint references | decision_id.option_id | fit_method.robust_linear |
| Version | ^\d+\.\d+(\.\d+)?$ | "1.0", "1.0.0" |
| DOI | ^10\.\d{4,}/.*$ | "10.48550/arXiv.1706.03762" |

These category names are reserved and cannot be used as entity IDs:

inputs   outputs   decisions   findings   prior_insights
analyses options   content

The reserved names prevent ambiguity in tree-path references and element addressing.
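The ID patterns and reserved names above translate directly into a pair of checks. This is a minimal sketch with hypothetical function names; it assumes the reserved-name rule applies to entity IDs only, since the text does not say whether it also covers universe IDs.

```python
import re

# Patterns copied from the ID conventions table.
ENTITY_ID = re.compile(r"^[a-z][a-z0-9_]*$")
UNIVERSE_ID = re.compile(r"^[a-z][a-z0-9_-]*$")

RESERVED = {
    "inputs", "outputs", "decisions", "findings",
    "prior_insights", "analyses", "options", "content",
}


def is_valid_entity_id(value):
    """Lowercase snake_case and not a reserved category name."""
    return ENTITY_ID.fullmatch(value) is not None and value not in RESERVED


def is_valid_universe_id(value):
    """Like entity IDs, but hyphens are also allowed."""
    return UNIVERSE_ID.fullmatch(value) is not None


for candidate in ["catalog_data", "cleaned-data", "outputs", "Fit"]:
    print(candidate, is_valid_entity_id(candidate), is_valid_universe_id(candidate))
```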

Schema artifacts

ASTRA is defined in LinkML. The source schema files live in src/astra/schema/ and generate datamodels and validation artifacts for multiple ecosystems.

Generated artifacts:

| Artifact | Purpose |
| --- | --- |
| LinkML YAML | Merged source schema definition. |
| JSON Schema | YAML/JSON validation artifact. |
| JSON-LD Context | Linked-data context. |
| Python datamodels | Generated classes distributed with the package. |

Source schema files:

| File | Defines |
| --- | --- |
| analysis.yaml | Analysis, Input, Output, Decision, Option, Recipe, Resources, and cross-scope aliases. |
| universe.yaml | Universe, UniverseNode, DecisionSelection, and decision selections. |
| insight.yaml | Insight, Evidence, InsightCollection, quote selectors, and fragment selectors. |

Generated Python datamodels are distributed with the package. The documentation site also includes the auto-generated schema reference for exact class and slot details.