
Build — How-to

Download and run the pipeline yourself

The pipeline is open source. You can download a small sample of the brightest stars in minutes with no account required, or run a full all-sky build from the Gaia Archive. Everything you need is in the repository.

Prerequisites

The pipeline requires Python ≥ 3.13 and uv for environment and dependency management. Clone the repository and install:

git clone https://github.com/Found-in-Space/pipeline.git
cd pipeline
uv sync

Generate a starter project configuration file — this is the single source of truth for all input and output paths:

uv run fis-pipeline project init project.toml
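What `project init` writes depends on the pipeline version; as an illustrative sketch only (these field names are hypothetical, not the pipeline's actual schema — edit the file the command actually generates), a path-centric configuration might look like:

```toml
# Hypothetical shape only — the real keys come from `project init`.
[inputs]
gaia_votables = ["gaia_bright.vot.gz"]  # downloaded VOTable file(s)

[outputs]
output_dir = "out"  # root for the HEALPix-partitioned merge output
```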

Getting Gaia data

Gaia data is accessed through the Gaia Archive using ADQL (Astronomical Data Query Language — essentially SQL against the catalogue tables). The pipeline reads VOTable files (the archive's native format, available gzip-compressed), so you download first, then run the pipeline against the local files.

The query joins two tables: gaiadr3.gaia_source for astrometry and photometry, and external.gaiaedr3_distance for Bailer-Jones probabilistic distances. The filter astrometric_params_solved IN (31, 95) keeps only stars with five- or six-parameter astrometric solutions — the ones with reliable parallaxes.

Start here

Bright stars — no account required

Limiting to G ≤ 9 gives around 175,000 stars in a ~16 MB download. This fits within the Gaia Archive's anonymous query limit and covers every star visible to the naked eye from a dark site, plus a large number of telescopically visible ones.

SELECT
  g.source_id,
  g.ra,
  g.dec,
  g.parallax,
  g.parallax_error,
  g.pmra,
  g.pmdec,
  g.phot_g_mean_mag,
  g.phot_bp_mean_mag,
  g.phot_rp_mean_mag,
  g.ruwe,
  g.mg_gspphot,
  g.ag_gspphot,
  g.mg_gspphot_upper,
  g.mg_gspphot_lower,
  g.teff_esphs,
  g.teff_gspspec,
  g.teff_espucd,
  g.teff_gspphot,
  d.r_med_geo,
  d.r_lo_geo,
  d.r_hi_geo,
  d.r_med_photogeo,
  d.r_lo_photogeo,
  d.r_hi_photogeo
FROM gaiadr3.gaia_source AS g
JOIN external.gaiaedr3_distance AS d
  ON d.source_id = g.source_id
WHERE
  g.astrometric_params_solved IN (31, 95)
  AND (
    d.r_med_photogeo IS NOT NULL
    OR d.r_med_geo IS NOT NULL
  )
  AND g.phot_g_mean_mag <= 9.0

Paste this into the ADQL tab of the Gaia Archive, run it, and download the result as VOTable (gzip). You can also submit it programmatically via astroquery:

from astroquery.gaia import Gaia

# `query` is the ADQL statement shown above, as a Python string
query = """<paste the ADQL statement here>"""

job = Gaia.launch_job_async(query, output_format="votable_gzip")
result = job.get_results()
result.write("gaia_bright.vot.gz", format="votable", overwrite=True)
Magnitude limit | Approximate star count | Approximate size | Account required?
G ≤ 9           | 175,000                | 16 MB            | No
G ≤ 12          | 3,060,000              | 277 MB           | Yes (free)
G ≤ 15          | 36,600,000             | 3.2 GB           | Yes (free)
Full sky        | ~1.3 billion           | ~1.1 TB          | Yes — batched download required

The Gaia Archive public API caps results at around 3 million rows per query. G ≤ 12 exceeds this, so a free account is needed. Registration is straightforward via the archive web interface. For G ≤ 15 and beyond, results still fit in a single (large) async job. The full all-sky download requires partitioning — see below.

Full all-sky download

Without a magnitude limit the query returns around 1.3 billion rows — far beyond a single TAP job. The data must be partitioned and downloaded in batches.

The partition key is HEALPix level 3, encoded directly in each Gaia source_id via integer division by 2^53 (the level-12 HEALPix index occupies the top bits of the 64-bit identifier). Level 3 gives 768 sky tiles. Star counts vary enormously across the sky (the galactic plane vs. the poles), so tiles are grouped into batches that stay under a row cap of ~55 million — a trade-off between the number of async jobs and output file size.
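As a concrete sketch of the tile arithmetic, assuming the standard Gaia source_id layout (level-12 HEALPix pixel in the high bits):

```python
# Gaia source_id packs the level-12 HEALPix pixel into the high bits
# (pixel = source_id // 2**35). Each coarser level drops 2 more bits,
# so level 3 is a division by 2**(35 + 2 * (12 - 3)) = 2**53.
LEVEL3_DIVISOR = 2 ** 53

def healpix3(source_id: int) -> int:
    """Level-3 HEALPix tile (0..767) for a Gaia source_id."""
    return source_id // LEVEL3_DIVISOR

# 12 base pixels, each split 4 ways per level: 12 * 4**3 = 768 tiles
print(healpix3(2 ** 53))  # → 1 (first source_id of the second tile)
```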

The repository's technical notes in docs/gaia-downloads.md document this batch planning strategy in full, including the dynamic-programming tile packer used to fill each batch as efficiently as possible.
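For illustration only — the repository's planner is the dynamic-programming packer described in those notes, but a much simpler first-fit-decreasing sketch conveys the idea of filling batches under a row cap:

```python
def plan_batches(tile_rows: dict[int, int], cap: int = 55_000_000) -> list[list[int]]:
    """Greedy first-fit-decreasing: place each tile (largest first) into
    the first batch with room, opening a new batch when none fits."""
    used: list[int] = []            # rows consumed per batch
    batches: list[list[int]] = []   # tile ids per batch
    for tile, rows in sorted(tile_rows.items(), key=lambda kv: -kv[1]):
        for i in range(len(batches)):
            if used[i] + rows <= cap:
                used[i] += rows
                batches[i].append(tile)
                break
        else:
            used.append(rows)
            batches.append([tile])
    return batches

# Toy example: four tiles with row counts, cap of 60 rows
print(plan_batches({0: 30, 1: 40, 2: 50, 3: 10}, cap=60))  # → [[2, 3], [1], [0]]
```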

Running the pipeline

With your VOTable files downloaded, point the project configuration at them and run:

# Process Gaia VOTable(s) → Parquet
uv run fis-pipeline gaia build --project project.toml gaia_bright.vot.gz

# Download Hipparcos (fetched automatically from Vizier)
uv run fis-pipeline hip build --project project.toml

# Download the Gaia–Hipparcos crossmatch table
uv run fis-pipeline gaia-to-hip build --project project.toml

# Build the overrides Parquet (Sun, Alpha Cen, etc.)
uv run fis-pipeline overrides build --project project.toml

# Merge everything into HEALPix-partitioned output
uv run fis-pipeline merge build --project project.toml

Hipparcos and the crossmatch table are downloaded automatically from the CDS VizieR service and the ESA Gaia TAP endpoint respectively — no manual steps needed. Downloads are cached; re-running skips them.

The merge step also builds the identifiers sidecar (HD numbers, Bayer/Flamsteed designations, proper names) if you run identifiers build first (uv run fis-pipeline identifiers build --project project.toml). It's optional — the core merge output works without it.

What the pipeline produces

The final output is HEALPix-partitioned Parquet under {output_dir}/healpix/{pixel}/. Each file uses the fixed schema below, compressed with zstd.

Column             | Type    | Description
source             | string  | "gaia", "hip", or "manual"
source_id          | string  | Catalogue identifier within that namespace
x_icrs_pc          | float64 | Sun-centred x coordinate (parsecs, ICRS J2016.0)
y_icrs_pc          | float64 | Sun-centred y coordinate (parsecs)
z_icrs_pc          | float64 | Sun-centred z coordinate (parsecs)
ra_deg             | float64 | Right ascension (degrees, J2016.0)
dec_deg            | float64 | Declination (degrees, J2016.0)
r_pc               | float64 | Distance from the Sun (parsecs)
mag_abs            | float32 | Absolute magnitude
teff               | float32 | Effective temperature (K)
quality_flags      | uint16  | Packed provenance and status bits
astrometry_quality | float32 | Fractional parallax error (or tier sentinel)
photometry_quality | float32 | GSP-Phot MG confidence width

Alongside the Parquet shards, the merge step writes merge_report.json (aggregate counts by category — how many Gaia-only, HIP-only, matched pairs, and override actions) and merge_decisions.parquet (one row per matched pair or override, with quality scores and diagnostics).