Descriptive Statistics: First-Look Summary of Any CSV in Lattice

Descriptive statistics are the first thing to compute on a new dataset. Lattice profiles every numeric column for you — central tendency, spread, distribution shape, and data-quality red flags — so you start any analysis with a sober view of what you're working with.

Why this is always step one

Before any test, plot, or model, you need to know the basics: how many rows, how many missing, what range each variable spans, whether anything looks off. Five minutes here saves hours of misleading downstream results. A skewed predictor changes which regression you should fit; a column with 30% missingness changes whether to impute or drop; a constant column means you have a bug in extraction, not a finding.

What Lattice computes

The engine runs pandas's underlying numeric summaries plus scipy's skew / kurtosis, then packages them into a Pydantic-typed result. For each numeric column you get count, mean, standard deviation, min, 25th percentile, median, 75th percentile, max, skewness, kurtosis, missing count, and missing fraction. For each non-numeric column you get count, unique count, top value, and top-value count.
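A rough sketch of that numeric profile, built directly on pandas and scipy — the function name, layout, and sample data here are illustrative, not Lattice's internals:

```python
import pandas as pd
from scipy import stats

def profile_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise every numeric column: central tendency, spread, shape, missingness."""
    rows = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        valid = s.dropna()
        rows[col] = {
            "count": valid.size,
            "mean": valid.mean(),
            "std": valid.std(),
            "min": valid.min(),
            "q25": valid.quantile(0.25),
            "median": valid.median(),
            "q75": valid.quantile(0.75),
            "max": valid.max(),
            "skewness": stats.skew(valid),
            "kurtosis": stats.kurtosis(valid),
            "missing_count": int(s.isna().sum()),
            "missing_fraction": s.isna().mean(),
        }
    return pd.DataFrame(rows).T  # one row per numeric column

df = pd.DataFrame({"yield": [4.1, 4.3, None, 5.0, 4.8],
                   "batch": ["a", "b", "c", "d", "e"]})
print(profile_numeric(df))
```

Non-numeric columns like `batch` are skipped here; Lattice summarises those separately (count, unique count, top value, top count).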

Every number carries a trace_id. The interpreter writes a plain-language summary that highlights anything unusual: outliers via IQR, skew exceeding ±1, missingness over 5%, or a numeric column that is actually constant.
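Those red-flag thresholds amount to a few simple checks per column. A minimal illustration using the thresholds stated above (the function and flag wording are hypothetical, not Lattice's actual rules):

```python
import pandas as pd

def quality_flags(s: pd.Series) -> list[str]:
    """Flag a numeric column: constant, IQR outliers, heavy skew, high missingness."""
    flags = []
    valid = s.dropna()
    if valid.nunique() <= 1:
        return ["constant column"]
    q1, q3 = valid.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the 1.5 x IQR fences
    if ((valid < lo) | (valid > hi)).any():
        flags.append("outliers (IQR rule)")
    if abs(valid.skew()) > 1:                  # skew beyond +/-1
        flags.append("heavy skew")
    if s.isna().mean() > 0.05:                 # missingness over 5%
        flags.append("missingness over 5%")
    return flags

print(quality_flags(pd.Series([1, 2, 3, 2, 100])))  # the 100 trips the IQR rule
```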

1 · Intent → method

An LLM picks stats_describe from a fixed catalog.

2 · Method → numbers

Deterministic Python engine runs the math. Same input → same output.

3 · Numbers → plain language

A second LLM translates the result into your domain’s vocabulary.

How to phrase your request

The planner picks stats_describe when your request asks for a first-look summary — phrasings like "describe my data" or "give me summary statistics" map to it.

If you only want a subset of columns, name them ("describe yield, temperature, and pressure only") and the engine respects the filter.

Reading the result

Skim the missingness column first — anything above 5% deserves a decision before you proceed. Then check skew and kurtosis: heavy skew often means a log transform is sensible, and high kurtosis means heavy tails (relevant if you're about to run a t-test). Compare median and mean — if they differ a lot, the distribution is asymmetric.
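The mean-versus-median check and the log-transform remedy are easy to see on a small right-skewed sample (synthetic data, for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))  # right-skewed

# Mean is dragged above the median by the long right tail.
print(f"mean={s.mean():.2f}  median={s.median():.2f}  skew={s.skew():.2f}")

# A log transform pulls the skew back toward zero.
logged = np.log(s)
print(f"after log: skew={logged.skew():.2f}")
```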

The summary also calls out columns where the unique count equals the row count (likely an id column, not a measurement) or where it equals one (constant — probably a data extraction bug).
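Both call-outs reduce to comparing a column's unique count against the row count. A minimal sketch with hypothetical column names:

```python
import pandas as pd

def cardinality_notes(df: pd.DataFrame) -> dict[str, str]:
    """Note columns whose unique count equals the row count, or equals one."""
    notes = {}
    for col in df.columns:
        n_unique = df[col].nunique(dropna=False)
        if n_unique == len(df):
            notes[col] = "unique per row — likely an id, not a measurement"
        elif n_unique == 1:
            notes[col] = "constant — possibly a data extraction bug"
    return notes

df = pd.DataFrame({"sample_id": [101, 102, 103],
                   "temperature": [20.1, 21.4, 20.1],
                   "site": ["A", "A", "A"]})
print(cardinality_notes(df))  # flags sample_id and site, not temperature
```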

Common mistakes Lattice guards against

People sometimes treat descriptive statistics as the answer instead of the starting point — reporting just the mean and standard deviation in a paper without a histogram or test. Lattice's summary always points you to a follow-up: a histogram for shape, a box plot for outliers, or a t-test / ANOVA if a comparison is implied. The other slip is ignoring the missingness fraction; Lattice highlights it so the decision to impute or drop is explicit, not silent.

FAQ

  • What stats does Lattice return?

    Per numeric column: count, mean, std, min, q25, median, q75, max, skewness, kurtosis, missing count, and missing fraction. Per non-numeric column: count, unique count, top value, top count.

  • How does this differ from pandas .describe()?

    It returns the same core numbers in a Pydantic-typed schema, plus skew, kurtosis, and missingness — and every value carries a trace_id so you can replay it. The interpreter also writes a plain-language summary that flags anything unusual.

  • Will Lattice warn me about outliers automatically?

    Yes. The summary flags columns where the IQR rule (1.5 × IQR) finds outliers, where skew exceeds ±1, or where missingness exceeds 5%. The actual outlier removal still requires your explicit confirmation.

Tool input schema

Schema for stats_describe not exported yet (run pnpm export:registry).