Prerequisites
The pipeline requires Python ≥ 3.13 and uv for environment and dependency management. Clone the repository and install:
git clone https://github.com/Found-in-Space/pipeline.git
cd pipeline
uv sync

Generate a starter project configuration file — this is the single source of truth for all input and output paths:
uv run fis-pipeline project init project.toml

Getting Gaia data
Gaia data is accessed through the Gaia Archive using ADQL (Astronomical Data Query Language — essentially SQL against the catalogue tables). The pipeline reads VOTable files (the archive's native format, available gzip-compressed), so you download first, then run the pipeline against the local files.
The query joins two tables: gaiadr3.gaia_source for astrometry and
photometry, and external.gaiaedr3_distance for Bailer-Jones probabilistic
distances. The filter astrometric_params_solved IN (31, 95) keeps only
stars with five- or six-parameter astrometric solutions — the ones with reliable
parallaxes.
Start here
Bright stars — no account required
Limiting to G ≤ 9 gives around 175,000 stars in a ~16 MB download. This fits within the Gaia Archive's anonymous query limit and covers every star visible to the naked eye from a dark site, plus a large number of telescopically visible ones.
SELECT
g.source_id,
g.ra,
g.dec,
g.parallax,
g.parallax_error,
g.pmra,
g.pmdec,
g.phot_g_mean_mag,
g.phot_bp_mean_mag,
g.phot_rp_mean_mag,
g.ruwe,
g.mg_gspphot,
g.ag_gspphot,
g.mg_gspphot_upper,
g.mg_gspphot_lower,
g.teff_esphs,
g.teff_gspspec,
g.teff_espucd,
g.teff_gspphot,
d.r_med_geo,
d.r_lo_geo,
d.r_hi_geo,
d.r_med_photogeo,
d.r_lo_photogeo,
d.r_hi_photogeo
FROM gaiadr3.gaia_source AS g
JOIN external.gaiaedr3_distance AS d
ON d.source_id = g.source_id
WHERE
g.astrometric_params_solved IN (31, 95)
AND (
d.r_med_photogeo IS NOT NULL
OR d.r_med_geo IS NOT NULL
)
AND g.phot_g_mean_mag <= 9.0
Paste this into the ADQL tab of the Gaia Archive, run it, and
download the result as VOTable (gzip). You can also submit it programmatically
via astroquery:
from astroquery.gaia import Gaia

# `query` holds the ADQL string shown above
job = Gaia.launch_job_async(query, output_format="votable_gzip")
result = job.get_results()
result.write("gaia_bright.vot.gz", format="votable", overwrite=True)

| Magnitude limit | Approximate star count | Approximate size | Account required? |
|---|---|---|---|
| G ≤ 9 | 175,000 | 16 MB | No |
| G ≤ 12 | 3,060,000 | 277 MB | Yes (free) |
| G ≤ 15 | 36,600,000 | 3.2 GB | Yes (free) |
| Full sky | ~1.3 billion | ~1.1 TB | Yes — batched download required |
The Gaia Archive public API caps results at around 3 million rows per query. G ≤ 12 exceeds this, so a free account is needed. Registration is straightforward via the archive web interface. For G ≤ 15 and beyond, results still fit in a single (large) async job. The full all-sky download requires partitioning — see below.
Full all-sky download
Without a magnitude limit the query returns around 1.3 billion rows — far beyond a single TAP job. The data must be partitioned and downloaded in batches.
The partition key is HEALPix level 3, encoded directly in each Gaia
source_id via integer division by 2⁵³. Level 3 gives 768 sky
tiles. Star counts vary enormously across the sky (the galactic plane vs. the poles),
so tiles are grouped into batches that stay under a row cap of ~55 million — a
trade-off between the number of async jobs and output file size.
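The division described above is simple to apply directly. A short sketch (the helper names and the example source_id are mine, for illustration only): dividing a source_id by 2⁵³ yields its level-3 tile, and the inverse gives a per-tile ADQL range filter that can be appended to the query's WHERE clause.

```python
HEALPIX3_DIVISOR = 2 ** 53  # the level-3 pixel lives in the top bits of source_id

def healpix3_tile(source_id: int) -> int:
    """Level-3 HEALPix tile (0-767) encoded in a Gaia source_id."""
    return source_id // HEALPIX3_DIVISOR

def tile_predicate(tile: int) -> str:
    """ADQL range filter selecting every source in one level-3 tile."""
    lo = tile * HEALPIX3_DIVISOR
    hi = (tile + 1) * HEALPIX3_DIVISOR - 1
    return f"g.source_id BETWEEN {lo} AND {hi}"
```

Because the tile index is a contiguous range of source_id values, each batch's query is a simple `OR` of such `BETWEEN` predicates and needs no HEALPix library at all.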
The repository's technical notes in docs/gaia-downloads.md document this
batch planning strategy in full, including the dynamic-programming tile packer used to
fill each batch as efficiently as possible.
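To make the batching idea concrete, here is a deliberately simplified greedy first-fit-decreasing packer — not the repository's dynamic-programming packer, just a sketch of the same constraint: group tiles so that no batch exceeds the row cap.

```python
def plan_batches(tile_counts: dict[int, int],
                 row_cap: int = 55_000_000) -> list[list[int]]:
    """Group level-3 tiles into batches whose summed row counts stay
    under row_cap, using greedy first-fit-decreasing. Illustrative only;
    the repository uses a dynamic-programming packer instead."""
    batches: list[tuple[int, list[int]]] = []  # (rows used, tiles in batch)
    for tile, rows in sorted(tile_counts.items(), key=lambda kv: -kv[1]):
        for i, (used, tiles) in enumerate(batches):
            if used + rows <= row_cap:
                batches[i] = (used + rows, tiles + [tile])
                break
        else:
            batches.append((rows, [tile]))  # no batch has room: open a new one
    return [tiles for _, tiles in batches]
```

Sorting descending first matters: dense galactic-plane tiles get placed while batches are empty, and the many sparse polar tiles then fill the remaining headroom.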
Running the pipeline
With your VOTable files downloaded, point the project configuration at them and run:
# Process Gaia VOTable(s) → Parquet
uv run fis-pipeline gaia build --project project.toml gaia_bright.vot.gz
# Download Hipparcos (fetched automatically from Vizier)
uv run fis-pipeline hip build --project project.toml
# Download the Gaia–Hipparcos crossmatch table
uv run fis-pipeline gaia-to-hip build --project project.toml
# Build the overrides Parquet (Sun, Alpha Cen, etc.)
uv run fis-pipeline overrides build --project project.toml
# Merge everything into HEALPix-partitioned output
uv run fis-pipeline merge build --project project.toml

Hipparcos and the crossmatch table are downloaded automatically from the CDS VizieR service and the Gaia TAP endpoint respectively — no manual steps needed. Downloads are cached; re-running skips them.
The merge step also builds the identifiers sidecar (HD numbers, Bayer/Flamsteed
designations, proper names) if you run identifiers build first. It's
optional — the core merge output works without it.
What the pipeline produces
The final output is HEALPix-partitioned Parquet under
{output_dir}/healpix/{pixel}/.
Each file uses the fixed schema below, compressed with zstd.
| Column | Type | Description |
|---|---|---|
| source | string | "gaia", "hip", or "manual" |
| source_id | string | Catalogue identifier within that namespace |
| x_icrs_pc | float64 | Sun-centred x coordinate (parsecs, ICRS J2016.0) |
| y_icrs_pc | float64 | Sun-centred y coordinate (parsecs) |
| z_icrs_pc | float64 | Sun-centred z coordinate (parsecs) |
| ra_deg | float64 | Right ascension (degrees, J2016.0) |
| dec_deg | float64 | Declination (degrees, J2016.0) |
| r_pc | float64 | Distance from the Sun (parsecs) |
| mag_abs | float32 | Absolute magnitude |
| teff | float32 | Effective temperature (K) |
| quality_flags | uint16 | Packed provenance and status bits |
| astrometry_quality | float32 | Fractional parallax error (or tier sentinel) |
| photometry_quality | float32 | GSP-Phot MG confidence width |
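The Cartesian and spherical columns are redundant by construction: (x, y, z) is the standard ICRS unit vector scaled by r_pc. A sketch of the relationship (my own illustration, not pipeline code), useful as a consistency check on any row you read back:

```python
import math

def icrs_cartesian(ra_deg: float, dec_deg: float,
                   r_pc: float) -> tuple[float, float, float]:
    """Sun-centred ICRS Cartesian coordinates (parsecs)
    from the ra_deg / dec_deg / r_pc columns."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    x = r_pc * math.cos(dec) * math.cos(ra)
    y = r_pc * math.cos(dec) * math.sin(ra)
    z = r_pc * math.sin(dec)
    return x, y, z
```

For any row, sqrt(x² + y² + z²) should reproduce r_pc to within floating-point tolerance.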
Alongside the Parquet shards, the merge step writes merge_report.json
(aggregate counts by category — how many Gaia-only, HIP-only, matched pairs, and
override actions) and merge_decisions.parquet (one row per matched pair
or override, with quality scores and diagnostics).