Data scientists love creating models and competing to slightly improve accuracy on datasets. By now, so many data scientists have used these popular datasets that it has become difficult to learn anything new from them. It also turns data science from a problem-solving subject into more of an engineering one: tweaking models to slightly improve scores.
To do something unique in data science, you will have to create a dataset yourself and solve a new problem! Because most of us data scientists do not know much about data engineering or web scraping, this guide will show you how…
It is becoming more and more important to have an online presence. From data scientists, artists, and writers all the way to small businesses, having a portfolio website is becoming critical to success.
Jekyll is a static site generator that helps you create blogs or, as Tom Preston-Werner (the developer) would say, blog “like a hacker”. Because Jekyll powers GitHub Pages, it has become a popular solution for many blogs and portfolios.
This post will not go into how to create a Jekyll site, as there are many tutorials online that cover this. However, this will be a short…
Choosing the right architecture for your deep learning model can drastically change the results achieved. Using too few neurons can stop the model from finding complex relationships in the data, whereas using too many neurons can lead to overfitting.
With tabular data it is usually understood that not many layers are required; one or two will suffice. To help understand why this is enough, look at the Universal Approximation Theorem, which proves (in simple terms) that a neural network with a single hidden layer and a finite number of neurons can approximate any continuous function on a bounded domain to any desired accuracy.
However, how do you…
Finding the optimal sample size can be important in many contexts, from collecting voting intentions in an election to assessing the quality of machinery in a company. Finding a sample that best represents a population can help reduce costs and time while still providing conclusions that can be applied to the entire population.
One way to understand this comes from George Gallup, a pioneer of survey sampling techniques:
“If you have cooked a large pan of soup, you do not need to eat it all to find out if it needs more seasoning. …
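As a rough illustration of how a sample size might be chosen in practice, the sketch below uses Cochran's formula with an optional finite population correction. The choice of formula, the function name `cochran_sample_size`, and the default values are assumptions made for illustration; they are not necessarily the method the post goes on to describe.

```python
import math


def cochran_sample_size(margin_of_error, z_score=1.96, proportion=0.5,
                        population_size=None):
    """Estimate a sample size for a proportion (illustrative sketch).

    margin_of_error: acceptable error, e.g. 0.05 for +/- 5 points.
    z_score: z value for the confidence level (1.96 is roughly 95%).
    proportion: expected proportion; 0.5 is the most conservative choice.
    population_size: if given, apply the finite population correction.
    """
    # Cochran's formula for a large (effectively infinite) population.
    n = (z_score ** 2) * proportion * (1 - proportion) / margin_of_error ** 2

    # The finite population correction shrinks the estimate for small populations.
    if population_size is not None:
        n = n / (1 + (n - 1) / population_size)

    # Round up, since a fraction of a respondent cannot be sampled.
    return math.ceil(n)


print(cochran_sample_size(0.05))                         # 385
print(cochran_sample_size(0.05, population_size=10000))  # 370
```

With these assumed defaults, a few hundred responses are enough for a 5 point margin of error at roughly 95% confidence, however large the pan of soup.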
When studying mathematics at university, I was introduced to both Python and R in a statistics module. Since then I have stuck to those two languages and only dabbled in others when needed.
Recently, I have wanted to improve my programming foundations and learn more about the underlying concepts that we data scientists take for granted with Python and R, while also finding ways to improve my workflows. So I took on the challenge of learning C++, which led me to Cling.
Cling is an interactive interpreter for C++ which helps give a similar…
fastai is a deep learning library that simplifies training neural networks using modern best practices [1]. While fastai provides users with a high-level neural network API, it is designed to let researchers and users easily mix in low-level methods while still keeping the overall training process easy and accessible to all.
This post covers how to set up an autoencoder in fastai. It goes through creating a basic autoencoder model, setting up the data in fastai, and finally putting all this together into a learner model.
Note: a basic understanding of fastai and PyTorch…
UK-based data scientist
Personal website: henriwoodcock.github.io
The opinions expressed are my own and not those of my employer.
@henriwoodcock