General Prerequisites:
Course Term: Trinity
Course Lecture Information: 16 lectures
Course Overview:
The course introduces the concept of likelihood for a probabilistic model and its use in estimating unknown model parameters. Models covered will include linear regression with one or two regressors. In many examples, confidence intervals may be found using the Central Limit Theorem (statement only). Model checking and outlier detection are core concepts that are broadly relevant across many aspects of mathematical modelling; they will be explored here in the context of regression with one or two regressors. Regression models are an example of supervised learning; however, a large part of statistics and data analysis can be classified as unsupervised learning, i.e. finding structure in data sets, such as data from financial markets, medical imaging, retail, population genetics and social networks. Techniques for finding structure in data sets are relevant to many parts of applied mathematics; in particular, this course will cover principal components analysis and clustering techniques.
Learning Outcomes:
Students should have an understanding of likelihood, the use of maximum likelihood to find estimators, and some properties of the resulting estimators; of confidence intervals and their construction using the Central Limit Theorem; of linear regression with one or two regressors; and of finding structure in data sets using principal components and some clustering techniques.
Course Synopsis:
Random samples, concept of a statistic and its distribution, sample mean as a measure of location and sample variance as a measure of spread.
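For concreteness, these two statistics have the standard definitions (with the \(n-1\) divisor making the sample variance an unbiased estimator of the population variance):

\[ \bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_{t}, \qquad S^{2} = \frac{1}{n-1}\sum_{t=1}^{n}\left(X_{t}-\bar{X}\right)^{2}. \]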

Concept of likelihood; examples of likelihood for simple distributions. Estimation for a single unknown parameter by maximising likelihood. Examples drawn from: Bernoulli, binomial, geometric, Poisson, exponential (parametrized by mean), normal (mean only, variance known). Data to include simple surveys, opinion polls, archaeological studies, etc. Properties of estimators---unbiasedness, Mean Squared Error = (bias\(^{2}\) + variance). Statement of Central Limit Theorem (excluding proof). Confidence intervals using CLT. Simple straight line fit, \(Y_{t}=a+bx_{t}+\varepsilon_{t}\), with \(\varepsilon _{t}\) normal independent errors of zero mean and common known variance. Estimators for \(a\), \(b\) by maximising likelihood using partial differentiation, unbiasedness and calculation of variance as linear sums of \(Y_{t}\). (No confidence intervals). Examples (use scatter plots to show suitability of linear regression).
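As a standard worked example of the maximum likelihood recipe (illustrative, not prescribed notation): for a Bernoulli(\(p\)) sample \(x_{1},\dots,x_{n}\), the log-likelihood is

\[ \ell(p) = \Big(\sum_{i} x_{i}\Big)\log p + \Big(n-\sum_{i} x_{i}\Big)\log(1-p), \]

and setting \(\ell'(p)=0\) gives \(\hat{p}=\bar{x}\). For the straight-line model, maximising the likelihood (equivalently, minimising the sum of squared errors, since the errors are normal with known variance) gives

\[ \hat{b} = \frac{\sum_{t}(x_{t}-\bar{x})(Y_{t}-\bar{Y})}{\sum_{t}(x_{t}-\bar{x})^{2}}, \qquad \hat{a} = \bar{Y}-\hat{b}\,\bar{x}, \]

both linear combinations of the \(Y_{t}\), which is what makes the unbiasedness and variance calculations direct.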

Linear regression with two regressors. Special case of quadratic regression, \(Y_{t} = a + bx_{t} + cx_{t}^{2} + \varepsilon_{t}\). Model diagnostics and outlier detection. Residual plots. Heteroscedasticity. Outliers and studentized residuals. High-leverage points and leverage statistics. [2.5]
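A minimal numerical sketch of these diagnostics (the simulated data and variable names below are illustrative, not from the course materials):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated quadratic-regression data: Y = a + b*x + c*x^2 + noise (illustrative).
    n = 50
    x = rng.uniform(-2.0, 2.0, n)
    y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, n)

    # Design matrix with columns 1, x, x^2 (regressors x and x^2).
    X = np.column_stack([np.ones(n), x, x**2])

    # Least-squares fit; hat matrix H = X (X'X)^{-1} X'.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    leverage = np.diag(H)                  # h_ii: large values flag high-leverage points

    # (Internally) studentized residuals: e_i / (s * sqrt(1 - h_ii)).
    e = y - X @ beta
    s2 = e @ e / (n - X.shape[1])          # residual variance estimate
    studentized = e / np.sqrt(s2 * (1.0 - leverage))

    print("max leverage:", leverage.max())                  # compare with the average p/n
    print("max |studentized residual|:", np.abs(studentized).max())

A residual plot is then simply the studentized (or raw) residuals against the fitted values \(X\hat{\beta}\); a funnel shape in such a plot suggests heteroscedasticity.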

Introduction to unsupervised learning with real-world examples. Principal components analysis (PCA). Proof, using Lagrange multipliers, that the principal components are mutually orthogonal directions of maximum variance. PCA as an eigendecomposition of the covariance matrix. Eigenvalues as variances. Choosing the number of PCs. The multivariate normal distribution pdf. Examples of PCA on multivariate normal data and on clustered data. [3]
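A short sketch of PCA as an eigendecomposition (the data and variable names are illustrative; numpy's eigh is used because the covariance matrix is symmetric):

    import numpy as np

    rng = np.random.default_rng(1)

    # Correlated bivariate normal data (illustrative).
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)

    Xc = X - X.mean(axis=0)                # centre the data
    S = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # returned in ascending order

    order = np.argsort(eigvals)[::-1]      # re-sort into descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Xc @ eigvecs                  # projections onto the principal components
    print("eigenvalues (PC variances):", eigvals)
    print("variance of the PC scores: ", scores.var(axis=0, ddof=1))  # matches eigenvalues

The proportion of variance explained by the first \(k\) components, \(\sum_{i\le k}\lambda_{i}/\sum_{i}\lambda_{i}\), is the usual quantity examined when choosing the number of PCs.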

Clustering techniques; K-means clustering. Minimisation of the within-cluster variance. The K-means algorithm, with a proof that it decreases the objective function. Local versus global optima and the use of random initialisations. Hierarchical clustering techniques. Agglomerative clustering using complete, average and single linkage. [2.5]
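A compact sketch of the K-means (Lloyd) iteration, with the monotone decrease of the objective checked empirically (the function and variable names are illustrative, not from the course materials):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Plain Lloyd iteration; records the within-cluster sum of
        # squares (the objective) after each assignment step.
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]  # random initialisation
        objective = []
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest centre.
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            objective.append(d2[np.arange(len(X)), labels].sum())
            # Update step: each centre moves to the mean of its cluster
            # (an empty cluster keeps its old centre).
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
            if np.allclose(new, centres):
                break
            centres = new
        return labels, centres, objective

    # Three well-separated Gaussian blobs (illustrative).
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(m, 0.4, size=(100, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    labels, centres, objective = kmeans(X, k=3)
    assert all(a >= b for a, b in zip(objective, objective[1:]))  # never increases

Since each step can only lower (or keep) the objective, the iteration converges, but in general only to a local optimum; in practice one reruns from several random initialisations and keeps the run with the smallest objective. Agglomerative clustering, by contrast, starts from singleton clusters and repeatedly merges the closest pair under the chosen linkage.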