Probability and Statistics for Data Science
This website contains a free preprint, code, videos, and exercise solutions for the book Probability and Statistics for Data Science, published by Cambridge University Press. The book is a self-contained guide to the two pillars of data science, probability theory and statistics, which are presented side by side to illuminate the connections between statistical techniques and the probabilistic concepts they are based on.
The topics covered in the book include random variables, nonparametric and parametric models, correlation, estimation of population parameters, hypothesis testing, principal component analysis, and both linear and nonlinear methods for regression and classification. Examples throughout the book draw from real-world datasets to demonstrate concepts in practice and confront readers with fundamental challenges in data science, such as overfitting, the curse of dimensionality, and causal inference.
If you have any comments or suggestions, or you find any typos, please reach out to ps4ds.book@gmail.com.
The author, Carlos Fernandez-Granda, is an Associate Professor of Mathematics and Data Science at the Courant Institute and the Center for Data Science at New York University, where he has spent the last 10 years teaching probability and statistics to data-science students. His research focuses on applications of artificial intelligence and machine learning to medicine, imaging, climate, and other scientific domains.
The development of these materials was made possible in part by the generous support of the Division of Mathematical Sciences at the National Science Foundation, through grants 1616340 and 2009752.