ML Resources


A list of technical readings that I enjoy

An opinionated collection of Python-centric ML resources that I have found to be particularly useful during my time in Data Science.

Data Visualization

  • Altair - Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite. The documentation for this library can be found here.
  • Bokeh - Bokeh is a Python library for creating interactive visualizations for modern web browsers. The documentation for this library can be found here.
  • Dash - Written on top of Plotly.js and React.js, Dash is ideal for building and deploying data apps with customized user interfaces. The documentation for this library can be found here.
  • diagrams - diagrams lets you draw the cloud system architecture in Python code. The documentation for this library can be found here.
  • folium - makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map. The documentation for this library can be found here.
  • igraph - a library for creating and manipulating graphs. It is intended to be as powerful (i.e., fast) as possible to enable the analysis of large graphs. The documentation for this library can be found here.
  • matplotlib - a comprehensive library for creating static, animated, and interactive visualizations in Python. The documentation for this library can be found here.
  • networkx - a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. The documentation for this library can be found here.
  • pandas-profiling - Create HTML profiling reports from pandas DataFrame objects. The documentation for this library can be found here.
  • Plotly - Plotly’s Python graphing library makes interactive, publication-quality graphs. The documentation for this library can be found here.
  • plotnine - an implementation of a grammar of graphics in Python, based on R’s ggplot2 library. The documentation for this library can be found here.
  • seaborn - a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The documentation for this library can be found here.
  • Streamlit - Streamlit turns data scripts into shareable web apps in minutes. All in pure Python. No front‑end experience required. The documentation for this library can be found here.

General Purpose (Tabular) Machine Learning

  • annoy - approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk.
  • imbalanced-learn - a Python package to tackle the curse of imbalanced datasets in Machine Learning. The documentation for this library can be found here.
  • hummingbird - a library for compiling trained traditional ML models into tensor computations. The documentation for this library can be found here.
  • lifetimes - a Python library to help model customer behavior and measure Customer Lifetime Value. The documentation for this library can be found here.
  • metric-learn - efficient Python implementations of several popular supervised and weakly-supervised metric learning algorithms. The documentation for this library can be found here.
  • milk - a machine learning toolkit in Python with a strong emphasis on speed and low memory usage. Its focus is on supervised classification, with several classifiers available: SVMs (based on libsvm), k-NN, random forests, and decision trees. The documentation for this library can be found here.
  • pyBrain - a Python library to develop and implement neural networks. The documentation for this library can be found [here](http://pybrain.org/docs/index.html).
  • pycaret - a low-code machine learning library in Python that automates machine learning workflows. The documentation for this library can be found here.
  • pymc3 - a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API. The documentation for this library can be found here.
  • scikit-learn - Multi-purpose Machine Learning library in Python. The documentation for this library can be found here.
  • statsmodels - a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data. The documentation for this library can be found here.
  • XGBoost - XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. The documentation for this library can be found here.

ML Explainability and Feature Interpretation

  • eli5 - a Python library for debugging/inspecting machine learning classifiers and explaining their predictions. The documentation for this library can be found here.
  • lime - a Python library to help explain the predictions of any machine learning classifier. A more thorough explanation of the methodology is available here.
  • omniXAI - a Python machine-learning library for explainable AI (XAI), offering omni-way explainable AI and interpretable machine learning capabilities. The documentation for this library can be found here.
  • shap - a game theoretic approach to explain the output of any machine learning model.
  • yellowbrick - a Python library that provides a suite of visual analysis and diagnostic tools to facilitate machine learning model selection. The documentation for this library can be found here.

Hyper-parameter Optimization

  • hyperopt - distributed Asynchronous Hyper-parameter Optimization. The documentation for this library can be found here.
  • optuna - an open source hyperparameter optimization framework to automate hyperparameter search. The documentation for this library can be found here.
  • ray - an open source framework packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. The documentation for this library can be found here.
  • scikit-optimize - a simple and efficient library that implements several methods for sequential model-based optimization. The documentation for this library can be found here.
  • tpot - a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. The documentation for this library can be found here.

Time Series

  • Auto_TS - automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost models on time series data sets with a single line of code. The documentation for this library can be found here.
  • darts - a Python library for easy manipulation and forecasting of time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The documentation for this library can be found here.
  • luminol - a lightweight Python library for time series data analysis. The two major functionalities it supports are anomaly detection and correlation. The documentation for this library can be found here.
  • Prophet - a library for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. The documentation for this library can be found here.
  • sktime - provides an easy-to-use, flexible and modular open-source framework for a wide range of time series machine learning tasks. The documentation for this library can be found here.
  • statsforecast - lightning fast forecasting with statistical and econometric models. The documentation for this library can be found here.
  • tsfresh - automates the extraction of relevant features from time series data. The documentation for this library can be found here.
  • pyod - a Python toolkit that provides access to a wide range of outlier detection algorithms for detecting outliers in multivariate data. The documentation for this library can be found here.
  • pyts - a Python package dedicated to time series classification. It aims to make time series classification easily accessible by providing preprocessing and utility tools, and implementations of several time series classification algorithms. The documentation for this library can be found here.

Survival Analysis

  • lifelines - a complete survival analysis library, written in pure Python. The documentation for this library can be found here.
  • scikit-survival - a Python module for survival analysis built on top of scikit-learn. The documentation for this library can be found here.
  • pysurvival - an open source Python package for survival analysis modeling, built upon commonly used machine learning packages such as NumPy, SciPy and PyTorch. The documentation for this library can be found here.

Causal Inference

  • Causal ML - provides a suite of uplift modeling and causal inference methods that allow users to estimate the Conditional Average Treatment Effect (CATE) or Individual Treatment Effect (ITE) from experimental or observational data. The documentation for this library can be found here.
  • doWhy - An end-to-end library for causal inference. The documentation for this library can be found here.
  • EconML - applies machine learning techniques to estimate individualized causal responses from observational or experimental data. The suite of estimation methods provided in EconML represents the latest advances in causal machine learning. The documentation for this library can be found here.
  • scikit-uplift - an uplift modeling Python package that provides fast sklearn-style model implementations, evaluation metrics and visualization tools. The documentation for this library can be found here.
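
To make the CATE idea above concrete, here is a minimal sketch using only the Python standard library. It assumes treatment was randomly assigned, so that the difference in mean outcomes between treated and control units within a segment is a reasonable estimate of that segment's treatment effect. The helper `cate_by_segment` and the toy rows are hypothetical and purely illustrative; this is not the API of Causal ML, doWhy, EconML, or scikit-uplift, which implement far more robust estimators.

```python
from collections import defaultdict
from statistics import mean


def cate_by_segment(rows):
    """Estimate a per-segment treatment effect as a difference in means.

    rows: iterable of (segment, treated, outcome) tuples.
    Assumes randomized treatment assignment (hypothetical helper).
    """
    outcomes = defaultdict(lambda: {True: [], False: []})
    for segment, treated, outcome in rows:
        outcomes[segment][bool(treated)].append(outcome)

    # Only report segments that contain both treated and control units.
    return {
        segment: mean(groups[True]) - mean(groups[False])
        for segment, groups in outcomes.items()
        if groups[True] and groups[False]
    }


# Toy, made-up experiment data: (segment, treated, converted)
toy_rows = [
    ("new", 1, 1), ("new", 1, 1), ("new", 0, 0), ("new", 0, 1),
    ("returning", 1, 1), ("returning", 1, 0),
    ("returning", 0, 1), ("returning", 0, 0),
]

effects = cate_by_segment(toy_rows)
print(effects)
```

With observational rather than experimental data, a plain difference in means is generally biased by confounding, which is the gap the libraries in this section aim to address.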

Recommendation & Ranking

  • lightFM - LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback. The documentation for this library can be found here.
  • surprise - a Python scikit for building and analyzing recommender systems. The documentation for this library can be found here.
  • pyTerrier - a Python framework for performing information retrieval experiments and implementing learn-to-rank pipelines. The documentation for this library can be found here.
  • python-recsys - a Python library for implementing a recommender system. The documentation for this library can be found here.

Natural Language Processing

  • allennlp - an open-source NLP research library, built on PyTorch. The documentation for this library can be found here.
  • bert-embedding - token level embeddings from BERT model on mxnet and gluonnlp. The documentation for this library can be found here.
  • fastText - a library for efficient learning of word representations and sentence classification. The documentation for this library can be found here.
  • flair - a very simple framework for NLP that ships with state-of-the-art models for a range of NLP tasks. The documentation for this library can be found here.
  • fuzzywuzzy - fuzzy string matching in Python. The documentation for this library can be found here.
  • gensim - a free open-source Python library for representing documents as semantic vectors. The documentation for this library can be found here.
  • NLTK - a leading platform for building Python programs to work with human language data. The documentation for this library can be found here.
  • Stanford CoreNLP Python - a Python wrapper for Stanford CoreNLP tools.
  • spacy - a library for advanced Natural Language Processing in Python and Cython. The documentation for this library can be found here.
  • stanza - the Stanford NLP Group’s official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. The documentation for this library can be found here.
  • textblob - a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging and noun phrase extraction. The documentation for this library can be found here.
  • transformers - provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. The documentation for this library can be found here.
  • pattern - a web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization. The documentation for this library can be found here.
  • polyglot - supports various multilingual applications and offers a wide range of analysis and broad language coverage. The documentation for this library can be found here.
  • vaderSentiment - a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
  • word_forms - accurately generate all possible forms of an English word. The documentation for this library can be found here.
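
As a rough illustration of what fuzzy string matching means, here is a minimal sketch built on the standard library's `difflib.SequenceMatcher`. It is not fuzzywuzzy's own API; the helper names `similarity` and `best_match` are hypothetical, and the 0 to 100 scale is simply a convenient convention for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two strings (hypothetical helper)."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(100 * ratio)


def best_match(query: str, choices: list[str]) -> str:
    """Return the candidate most similar to the query (hypothetical helper)."""
    return max(choices, key=lambda choice: similarity(query, choice))


libraries = ["spacy", "gensim", "textblob", "stanza"]
print(best_match("spaCy", libraries))
print(best_match("genism", libraries))
```

A score-based lookup like this tolerates typos and casing differences, which is the typical use case for the dedicated libraries listed above.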

Computer Vision

  • NiLearn - makes it easy to use many advanced machine learning, pattern recognition and multivariate statistical techniques on neuroimaging data. The documentation for this library can be found here.
  • OpenCV - an open-source library that includes several hundreds of computer vision algorithms. The documentation for this library can be found here.

Miscellaneous