Course Term: Michaelmas
Course Overview:

This course covers a number of topics in Data Science, combining theoretical and practical approaches. The goal of the course is, on the one hand, to understand (at least at a high level) the mathematical foundations behind some of the state-of-the-art algorithms for a wide range of tasks, including organization and visualization of data clouds, dimensionality reduction, network analysis, clustering, classification, regression, and ranking. On the other hand, students will be exposed to numerous practical examples drawn from a wide range of topics including social network analysis, finance, statistics, etc.

Course Syllabus:

A tentative list of topics:
1. Review of basic statistics and probability; introduction to statistical learning
2. Bias-variance decomposition

Measures of correlation and dependence in data:
3. Pearson, Spearman, Kendall, and Hoeffding’s D
4. Maximal correlation, and review of characteristic functions; Distance correlation
5. Information theory (entropy, mutual information), and Maximal Information Coefficient (MIC) (Detecting Novel Associations in Large Data Sets, Reshef et al., Science 2011)
6. Simple/multiple linear regression, proof that OLS is BLUE
7. Linear regression - practical considerations
8. Singular Value Decomposition (SVD), rank-k approximation, Principal Component Analysis (PCA)
9. PCA derivation (best d-dimensional affine fit/projection that preserves the most variance)
10. PCA in high dimensions and random matrix theory (Marcenko-Pastur); applications to finance

Nonlinear dimensionality reduction methods:
11. Diffusion Maps and Laplacian Eigenmaps
12. Multidimensional scaling and ISOMAP
13. Locally Linear Embedding (LLE)
14. Kernel PCA
15. Ranking with pairwise incomplete noisy measurements, and applications; PageRank, Serial-Rank, Rank-Centrality, SVD-Rank
16. Group synchronization, Synchronization-Ranking, and applications

Clustering:
17. k-means, k-means++, hierarchical clustering
18. Spectral clustering, isoperimetry, conductance, Cheeger’s Inequality
19. Constrained clustering, clustering of signed networks and directed networks

Modern regression:
20. Ridge regression
21. The LASSO

Lecturer(s):

Prof. Mihai Cucuringu

Learning Outcomes:

This is a 16-hour course held in the first two weeks of the CDT in Random Systems.