Prerequisites
The pipeline requires Python ≥ 3.13 and uv for environment and dependency management. Clone the repository and install:
git clone https://github.com/Found-in-Space/pipeline.git
cd pipeline
uv sync

Generate a starter project configuration file — this is the single source of truth for all input and output paths:
uv run fis-pipeline project init project.toml

Getting Gaia data
Gaia data is accessed through the Gaia Archive using ADQL (Astronomical Data Query Language — essentially SQL against the catalogue tables). The pipeline reads VOTable files (the archive's native format, available gzip-compressed), so you download first, then run the pipeline against the local files.
The query joins two tables: gaiadr3.gaia_source for astrometry and
photometry, and external.gaiaedr3_distance for Bailer-Jones probabilistic
distances. The filter astrometric_params_solved IN (31, 95) keeps only
stars with five- or six-parameter astrometric solutions — the ones with reliable
parallaxes.
Start here
Bright stars — no account required
Limiting to G ≤ 9 gives around 175,000 stars in a ~16 MB download. This fits within the Gaia Archive's anonymous query limit and covers every star visible to the naked eye from a dark site, plus a large number of telescopically visible ones.
SELECT
g.source_id,
g.ra,
g.dec,
g.parallax,
g.parallax_error,
g.pmra,
g.pmdec,
g.phot_g_mean_mag,
g.phot_bp_mean_mag,
g.phot_rp_mean_mag,
g.ruwe,
g.mg_gspphot,
g.ag_gspphot,
g.mg_gspphot_upper,
g.mg_gspphot_lower,
g.teff_esphs,
g.teff_gspspec,
g.teff_espucd,
g.teff_gspphot,
d.r_med_geo,
d.r_lo_geo,
d.r_hi_geo,
d.r_med_photogeo,
d.r_lo_photogeo,
d.r_hi_photogeo
FROM gaiadr3.gaia_source AS g
JOIN external.gaiaedr3_distance AS d
ON d.source_id = g.source_id
WHERE
g.astrometric_params_solved IN (31, 95)
AND (
d.r_med_photogeo IS NOT NULL
OR d.r_med_geo IS NOT NULL
)
AND g.phot_g_mean_mag <= 9.0
Paste this into the ADQL tab of the Gaia Archive, run it, and
download the result as VOTable (gzip). You can also submit it programmatically
via astroquery:
from astroquery.gaia import Gaia

# `query` holds the ADQL string shown above
job = Gaia.launch_job_async(query, output_format="votable_gzip")
result = job.get_results()
result.write("gaia_bright.vot.gz", format="votable", overwrite=True)

| Magnitude limit | Approximate star count | Approximate size | Account required? |
|---|---|---|---|
| G ≤ 9 | 175,000 | 16 MB | No |
| G ≤ 12 | 3,060,000 | 277 MB | Yes (free) |
| G ≤ 15 | 36,600,000 | 3.2 GB | Yes (free) |
| Full sky | ~1.3 billion | ~1.1 TB | Yes — batched download required |
The Gaia Archive public API caps results at around 3 million rows per query. G ≤ 12 exceeds this, so a free account is needed. Registration is straightforward via the archive web interface. For G ≤ 15 and beyond, results still fit in a single (large) async job. The full all-sky download requires partitioning — see below.
Full all-sky download
Without a magnitude limit the query returns around 1.3 billion rows — far beyond a single TAP job. The data must be partitioned and downloaded in batches.
The partition key is HEALPix level 3, encoded directly in each Gaia
source_id via integer division by 2⁵³. Level 3 gives 768 sky
tiles. Star counts vary enormously across the sky (the galactic plane vs. the poles),
so tiles are grouped into batches that stay under a row cap of ~55 million — a
trade-off between the number of async jobs and output file size.
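The division described above is simple to apply directly. A short sketch (the helper names and the example source_id are mine, for illustration only): dividing a source_id by 2⁵³ yields its level-3 tile, and the inverse gives a per-tile ADQL range filter that can be appended to the query's WHERE clause.

```python
HEALPIX3_DIVISOR = 2 ** 53  # the level-3 pixel lives in the top bits of source_id

def healpix3_tile(source_id: int) -> int:
    """Level-3 HEALPix tile (0-767) encoded in a Gaia source_id."""
    return source_id // HEALPIX3_DIVISOR

def tile_predicate(tile: int) -> str:
    """ADQL range filter selecting every source in one level-3 tile."""
    lo = tile * HEALPIX3_DIVISOR
    hi = (tile + 1) * HEALPIX3_DIVISOR - 1
    return f"g.source_id BETWEEN {lo} AND {hi}"
```

Because the tile index is a contiguous range of source_id values, each batch's query is a simple `OR` of such `BETWEEN` predicates and needs no HEALPix library at all.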
The repository's technical notes in docs/gaia-downloads.md document this
batch planning strategy in full, including the dynamic-programming tile packer used to
fill each batch as efficiently as possible.
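To make the batching idea concrete, here is a deliberately simplified greedy first-fit-decreasing packer — not the repository's dynamic-programming packer, just a sketch of the same constraint: group tiles so that no batch exceeds the row cap.

```python
def plan_batches(tile_counts: dict[int, int],
                 row_cap: int = 55_000_000) -> list[list[int]]:
    """Group level-3 tiles into batches whose summed row counts stay
    under row_cap, using greedy first-fit-decreasing. Illustrative only;
    the repository uses a dynamic-programming packer instead."""
    batches: list[tuple[int, list[int]]] = []  # (rows used, tiles in batch)
    for tile, rows in sorted(tile_counts.items(), key=lambda kv: -kv[1]):
        for i, (used, tiles) in enumerate(batches):
            if used + rows <= row_cap:
                batches[i] = (used + rows, tiles + [tile])
                break
        else:
            batches.append((rows, [tile]))  # no batch has room: open a new one
    return [tiles for _, tiles in batches]
```

Sorting descending first matters: dense galactic-plane tiles get placed while batches are empty, and the many sparse polar tiles then fill the remaining headroom.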
Running the pipeline
With your VOTable files downloaded, point the project configuration at them and run:
# Process Gaia VOTable(s) → Parquet
uv run fis-pipeline gaia build --project project.toml gaia_bright.vot.gz
# Download Hipparcos (fetched automatically from Vizier)
uv run fis-pipeline hip build --project project.toml
# Download the Gaia–Hipparcos crossmatch table
uv run fis-pipeline gaia-to-hip build --project project.toml
# Build the overrides Parquet (Sun, Alpha Cen, etc.)
uv run fis-pipeline overrides build --project project.toml
# Merge everything into HEALPix-partitioned output
uv run fis-pipeline merge build --project project.toml

Hipparcos and the crossmatch table are downloaded automatically from the CDS VizieR service and the Gaia TAP endpoint respectively — no manual steps needed. Downloads are cached; re-running skips them.
The merge step also builds the identifiers sidecar (HD numbers, Bayer/Flamsteed
designations, proper names) if you run identifiers build first. It's
optional — the core merge output works without it.
What the pipeline produces
The final output is HEALPix-partitioned Parquet under
{output_dir}/healpix/{pixel}/.
Each file uses the fixed schema below, compressed with zstd.
| Column | Type | Description |
|---|---|---|
| source | string | "gaia", "hip", or "manual" |
| source_id | string | Catalogue identifier within that namespace |
| x_icrs_pc | float64 | Sun-centred x coordinate (parsecs, ICRS J2016.0) |
| y_icrs_pc | float64 | Sun-centred y coordinate (parsecs) |
| z_icrs_pc | float64 | Sun-centred z coordinate (parsecs) |
| ra_deg | float64 | Right ascension (degrees, J2016.0) |
| dec_deg | float64 | Declination (degrees, J2016.0) |
| r_pc | float64 | Distance from the Sun (parsecs) |
| mag_abs | float32 | Absolute magnitude |
| teff | float32 | Effective temperature (K) |
| quality_flags | uint16 | Packed provenance and status bits |
| astrometry_quality | float32 | Fractional parallax error (or tier sentinel) |
| photometry_quality | float32 | GSP-Phot MG confidence width |
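The Cartesian and spherical columns are redundant by construction: (x, y, z) is the standard ICRS unit vector scaled by r_pc. A sketch of the relationship (my own illustration, not pipeline code), useful as a consistency check on any row you read back:

```python
import math

def icrs_cartesian(ra_deg: float, dec_deg: float,
                   r_pc: float) -> tuple[float, float, float]:
    """Sun-centred ICRS Cartesian coordinates (parsecs)
    from the ra_deg / dec_deg / r_pc columns."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    x = r_pc * math.cos(dec) * math.cos(ra)
    y = r_pc * math.cos(dec) * math.sin(ra)
    z = r_pc * math.sin(dec)
    return x, y, z
```

For any row, sqrt(x² + y² + z²) should reproduce r_pc to within floating-point tolerance.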
Alongside the Parquet shards, the merge step writes merge_report.json
(aggregate counts by category — how many Gaia-only, HIP-only, matched pairs, and
override actions) and merge_decisions.parquet (one row per matched pair
or override, with quality scores and diagnostics).