PAC Learning #
This file defines the Probably Approximately Correct (PAC) learning model
introduced by Valiant [Val84], generalized to an arbitrary label type β
and parameterized by a family of distributions 𝒟 on α × β.
A concept class C over domain α with labels in β is a collection of
functions α → β. A learning algorithm receives a labeled sample drawn i.i.d.
from an unknown joint distribution D on α × β and must produce a hypothesis
whose 0-1 error is within ε of the best concept in C, with probability at
least 1 - δ.
The single definition IsPACLearnerFor captures the realizable, agnostic, and
noise-tolerant settings by varying the distribution family 𝒟:
- Agnostic [Hau92]:
𝒟 = Set.univ— the learner must work for all distributions. - Realizable:
𝒟consists of pushforwards of arbitrary probability measuresPonαalong the graphx ↦ (x, c x)of some conceptc ∈ C, so thatoptimalError D C = 0. - Noise-tolerant [AL88]:
𝒟consists of noisy versions of realizable distributions, where each label is corrupted independently with some probabilityη.
The accuracy and confidence parameters ε and δ are elements of the subtype
Set.Ioo (0 : ℝ≥0) 1, which bundles the value together with the proof that it
lies in the open interval (0, 1), ensuring the learning condition is non-vacuous.
All declarations live under the Cslib.MachineLearning.PACLearning namespace so that
generic names like error and optimalError do not pollute the parent namespace.
Main definitions #
ConceptClass: a set of functionsα → β(classifiers).LabeledSample: a finite sequence of(point, label)pairs.Learner: a function from labeled samples to hypotheses.error: the 0-1 error of a hypothesis under a joint distribution.optimalError: the infimum oferrorover a concept class.IsPACLearnerFor: deterministic(ε, δ)-PAC learner over a distribution family.IsRPACLearnerFor: randomized variant ofIsPACLearnerFor. Universe-polymorphic in the randomness spaceΩ : Type*.IsPACLearnable: a concept class is PAC learnable ifIsPACLearnerForholds for allε, δ : Set.Ioo (0 : ℝ≥0) 1with some sample sizem.IsRPACLearnable: randomized variant ofIsPACLearnable. Pins the randomness space toType 0;IsRPACLearnerForitself remains universe-polymorphic for users who need it.LearnerModel: the common predicate shapeℕ → ε → δ → C → 𝒟 → Propabstracting both the deterministic and randomized learners so sample-complexity lemmas can be shared.sampleComplexity: sample complexity of a generic learner model.rsampleComplexity: randomized sample complexity, i.e.sampleComplexity IsRPACLearnerFor.
Binary classification #
When β = Bool, concepts correspond to subsets of α. The section
Binary Classification provides:
hypothesisError: the symmetric-difference errorP(h ∆ c).falsePositiveError,falseNegativeError: its decomposition.hypothesisError_eq_add: the decomposition theorem.error_map_eq_hypothesisError: bridge between the generalerrorand the binaryhypothesisErrorunder a realizable distribution.
Main statements #
IsPACLearnerFor.toIsRPACLearnerFor: every deterministic PAC learner is a randomized one (via the trivial randomness spacePUnit).IsPACLearnerFor.antitone_family,.antitone_C: the deterministic PAC learner predicate is antitone in the distribution family and concept class.IsPACLearnerFor.mono_δ,.mono_ε: the predicate is monotone in the confidence and accuracy parameters (a weaker bound still holds).IsRPACLearnerFor.antitone_family,.mono_δ: analogues for the randomized predicate. (mono_εandantitone_Care not provided because they change the integrand and would require an extra measurability assumption.)IsPACLearnable.toIsRPACLearnable: deterministic learnability implies randomized.IsPACLearnable.antitone_family,.antitone_C,IsRPACLearnable.antitone_family: PAC learnability is antitone in the distribution family and concept class.sampleComplexity_antitone_δ,_antitone_ε,_mono_family,_mono_C: variation of deterministic sample complexity in confidence, accuracy, distribution family, and concept class (antitone in the numeric parameters, monotone under⊆in the set parameters). The randomized analoguesrsampleComplexity_antitone_δand_mono_familyare provided.IsPACLearnable.sampleComplexity_*,IsRPACLearnable.rsampleComplexity_*: the same monotonicity facts phrased with a learnability hypothesis in place of the ad-hoc∃ m, IsPACLearnerFor m …existence witness, so callers who already know the class is learnable need not thread it through.hypothesisError_eq_add: total error = false positive + false negative.
References #
- L. G. Valiant, A Theory of the Learnable
- A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant, A General Lower Bound on the Number of Examples Needed for Learning
- M. J. Kearns, U. V. Vazirani, An Introduction to Computational Learning Theory
- D. Haussler, Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications
- D. Angluin, P. Laird, Learning from Noisy Examples
Core Definitions #
A concept class over domain α with label type β is a set of functions α → β.
For binary classification (β = Bool), this is equivalent to a collection of subsets of α
via the characteristic function.
Equations
- Cslib.MachineLearning.PACLearning.ConceptClass α β = Set (α → β)
Instances For
A labeled sample of size m over domain α with label type β is a finite sequence
of (point, label) pairs.
Equations
- Cslib.MachineLearning.PACLearning.LabeledSample α β m = (Fin m → α × β)
Instances For
A learner using m samples is a function that takes a labeled sample and produces
a hypothesis (a function from the domain to the label type).
Equations
- Cslib.MachineLearning.PACLearning.Learner α β m = (Cslib.MachineLearning.PACLearning.LabeledSample α β m → α → β)
Instances For
The prediction error (0-1 loss) of a hypothesis h under a joint distribution D
on α × β, defined as the probability that the prediction disagrees with the label:
D({(x, y) | h(x) ≠ y}).
Instances For
The optimal error of a concept class C under a joint distribution D, defined as the
infimum of error D c over all concepts c ∈ C. When C is empty this is ⊤, making the
PAC learning condition vacuously true.
Equations
Instances For
PAC Learners #
IsPACLearnerFor m ε δ C 𝒟 asserts that there exists a learner using m samples
that is (ε, δ)-correct for the concept class C over the distribution family 𝒟: for every
probability measure D ∈ 𝒟 on α × β, the probability (over i.i.d. samples from D) that
the learner's hypothesis has error exceeding opt_C(D) + ε is at most δ.
The parameters ε and δ are elements of Set.Ioo (0 : ℝ≥0) 1, bundling the value with
the proof that it lies in (0, 1). This ensures the condition is non-vacuous:
ε < 1 prevents the error threshold from exceeding the maximum possible error under a
probability measure, and δ < 1 prevents the confidence bound from being trivially
satisfied.
Equations
- One or more equations did not get rendered due to their size.
Instances For
IsRPACLearnerFor m ε δ C 𝒟 asserts that there exists a randomized learner using
m samples that is (ε, δ)-correct for the concept class C over the distribution family
𝒟. A randomized learner draws internal randomness ω from a probability space (Ω, Q) and
acts as the deterministic learner A(ω).
For every probability measure D ∈ 𝒟, the failure probability function
ω ↦ D^m{S | error(A(ω)(S)) > opt_C(D) + ε} must be Q-a.e. measurable, and its
expectation over ω must be at most δ.
The randomness space Ω : Type* is universe-polymorphic; the universe is an implicit
parameter of IsRPACLearnerFor, and downstream statements reference it via the pattern
IsRPACLearnerFor.{_, _, u}. Fix u := 0 for the usual case of a standard randomness space.
A deterministic learner (IsPACLearnerFor) is the special case Ω = PUnit;
see IsPACLearnerFor.toIsRPACLearnerFor.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Every deterministic PAC learner is in particular a randomized PAC learner
(with the trivial one-point randomness space PUnit).
The deterministic PAC learner predicate is antitone in the distribution family: a
learner for a larger family 𝒟' is also a learner for any subfamily 𝒟 ⊆ 𝒟'.
A PAC learner with confidence δ₁ is also a PAC learner with any weaker confidence
δ₂ ≥ δ₁: the failure-probability bound only gets looser.
A PAC learner with accuracy ε₁ is also a PAC learner with any weaker accuracy
ε₂ ≥ ε₁: the bad event {error > opt + ε} only shrinks.
The deterministic PAC learner predicate is antitone in the concept class: a learner
for a larger class C' is also a learner for any subclass C ⊆ C', since the agnostic
benchmark optimalError _ C ≥ optimalError _ C' makes the error requirement easier.
The randomized PAC learner predicate is antitone in the distribution family. The
universe of the randomness space Ω is pinned so the hypothesis and conclusion share it.
A randomized PAC learner with confidence δ₁ is also a randomized PAC learner with
any weaker confidence δ₂ ≥ δ₁. Unlike mono_ε or antitone_C, this does not touch the
integrand, so it carries the AEMeasurable part through unchanged.
PAC Learnability #
A concept class C is PAC learnable over the distribution family 𝒟 if for every
accuracy ε ∈ (0, 1) and confidence δ ∈ (0, 1), there exists a sample size m admitting
a deterministic (ε, δ)-PAC learner for C. Here ε and δ are elements of the subtype
Set.Ioo (0 : ℝ≥0) 1.
Equations
- Cslib.MachineLearning.PACLearning.IsPACLearnable C 𝒟 = ∀ (ε δ : ↑(Set.Ioo 0 1)), ∃ (m : ℕ), Cslib.MachineLearning.PACLearning.IsPACLearnerFor m ε δ C 𝒟
Instances For
A concept class C is randomized PAC learnable over the distribution family 𝒟 if for
every accuracy ε ∈ (0, 1) and confidence δ ∈ (0, 1), there exists a sample size m
admitting a randomized (ε, δ)-PAC learner for C. The randomness space is pinned to
Type 0 at the learnability level; IsRPACLearnerFor itself remains universe-polymorphic.
Equations
- Cslib.MachineLearning.PACLearning.IsRPACLearnable C 𝒟 = ∀ (ε δ : ↑(Set.Ioo 0 1)), ∃ (m : ℕ), Cslib.MachineLearning.PACLearning.IsRPACLearnerFor m ε δ C 𝒟
Instances For
Deterministic PAC learnability implies randomized PAC learnability.
PAC learnability is antitone in the distribution family: a subfamily of a learnable family is learnable.
PAC learnability is antitone in the concept class: a subclass of a learnable class is learnable.
Randomized PAC learnability is antitone in the distribution family.
Sample Complexity #
A learner model is a predicate on (sample size, accuracy, confidence, concept class,
distribution family) that classifies which sample sizes admit a learner of the given kind.
Instantiating with IsPACLearnerFor gives the deterministic model; with IsRPACLearnerFor
gives the randomized one.
Equations
- One or more equations did not get rendered due to their size.
Instances For
The sample complexity of a concept class C under a learner model L, at accuracy
ε ∈ (0, 1) and confidence δ ∈ (0, 1) over distribution family 𝒟, is the smallest sample
size m with L m ε δ C 𝒟. Specialize with L := IsPACLearnerFor for the deterministic model
and L := IsRPACLearnerFor for the randomized one.
Caveat: because sInf on ℕ returns 0 for the empty set, this definition returns 0
when no learner exists (e.g., a concept class of infinite VC dimension). It is only meaningful
when the defining set {m | L m ε δ C 𝒟} is nonempty. The IsPACLearnable.sampleComplexity_*
variants below discharge this nonemptiness from a learnability hypothesis.
Instances For
The randomized sample complexity of C, i.e. sampleComplexity instantiated at the
randomized learner model IsRPACLearnerFor. The randomness space is pinned to Type 0.
Equations
Instances For
Monotonicity of Sample Complexity #
These lemmas are all special cases of the following observation: if {m | L₁ m ε₁ δ₁ C₁ 𝒟₁} ⊆ {m | L₂ m ε₂ δ₂ C₂ 𝒟₂} and the first set is nonempty, then the sample complexity under
(L₂, ε₂, δ₂, C₂, 𝒟₂) is at most the sample complexity under (L₁, ε₁, δ₁, C₁, 𝒟₁). The
nonemptiness hypothesis is essential: sInf on ℕ returns 0 for an empty set, so without
it the inequality can fail at the degenerate boundary. The IsPACLearnable-flavoured variants
at the end of this section discharge that witness from a learnability hypothesis.
General pointwise monotonicity of sampleComplexity: if every witness sample size for
(L₁, ε₁, δ₁, C₁, 𝒟₁) is also a witness for (L₂, ε₂, δ₂, C₂, 𝒟₂), then the latter's
sample complexity is at most the former's (provided the former is attained).
Deterministic sample complexity is antitone in the confidence parameter δ: weaker
confidence requires no more samples.
Deterministic sample complexity is antitone in the accuracy parameter ε: weaker
accuracy requires no more samples.
Deterministic sample complexity is monotone in the distribution family under ⊆: a
smaller family (fewer distributions to cover) requires no more samples.
Deterministic sample complexity is monotone in the concept class under ⊆: a smaller
class (weaker agnostic benchmark) requires no more samples.
Randomized sample complexity is antitone in the confidence parameter δ.
Randomized sample complexity is monotone in the distribution family under ⊆.
Convenience variants conditional on learnability, which discharge the nonemptiness
hypothesis (∃ m, IsPACLearnerFor m …) from an IsPACLearnable / IsRPACLearnable witness.
Bodies go through sampleComplexity_le_of_forall directly rather than the top-level
sampleComplexity_* lemmas, whose unqualified names would resolve as self-recursion inside
these theorems' IsPACLearnable.* / IsRPACLearnable.* namespaces.
sampleComplexity_antitone_δ for a learnable class: the nonemptiness hypothesis comes
for free from IsPACLearnable.
sampleComplexity_antitone_ε for a learnable class.
sampleComplexity_mono_family for a learnable class (learnability at the larger
family 𝒟' is the hypothesis).
sampleComplexity_mono_C for a learnable class (learnability at the larger class
C' is the hypothesis).
rsampleComplexity_antitone_δ for a randomized-learnable class.
rsampleComplexity_mono_family for a randomized-learnable class.
Binary Classification #
When β = Bool, concepts correspond to subsets of α via the characteristic function.
The symmetric-difference error P(h ∆ c) is the natural error metric, and it decomposes
into false positive and false negative components.
The bridge lemma error_map_eq_hypothesisError connects the general error on α × Bool
to the binary hypothesisError on α, showing they coincide for realizable distributions.
The symmetric-difference error of a hypothesis h with respect to a target concept c
(both viewed as subsets of α) under distribution P, defined as P(h ∆ c).
Equations
- Cslib.MachineLearning.PACLearning.hypothesisError P h c = P (symmDiff h c)
Instances For
The false positive error P(h \ c) — points classified positive but not in the
concept.
Equations
- Cslib.MachineLearning.PACLearning.falsePositiveError P h c = P (h \ c)
Instances For
The false negative error P(c \ h) — points in the concept but classified negative.
Equations
- Cslib.MachineLearning.PACLearning.falseNegativeError P h c = P (c \ h)
Instances For
The total hypothesis error decomposes as the sum of false positive and false negative
errors, since h ∆ c = (h \ c) ∪ (c \ h) is a disjoint union.
Under a realizable distribution P.map (x ↦ (x, c(x))), the general 0-1 error
coincides with the binary hypothesisError P h c, where h and c are viewed as subsets
of α via the characteristic function decide (· ∈ ·).