A full end-to-end tutorial to create a dataset without web scraping!

Photo by Taryn Elliott from Pexels

Data scientists love creating models and competing to slightly improve accuracy on datasets. By now, so many data scientists have used these popular datasets that it has become difficult to learn anything new from them. It also turns data science from a problem-solving subject into more of an engineering one: slightly tweaking models to slightly improve scores.

To do something unique in data science, you will have to create a dataset yourself and solve a new problem! Because most of us data scientists do not know much about data engineering or web scraping, this guide will show you how…

Customise your Jekyll site to show off your portfolio.

Photo by @jesuskiteque on Unsplash.

It is becoming more and more important to have an online presence. For everyone from data scientists, artists and writers all the way to small businesses, having a portfolio website is becoming critical to success.

Jekyll is a static site generator that helps you create blogs or, as Tom Preston-Werner (the developer) would say, blog “like a hacker”. Because Jekyll powers GitHub Pages, it has become a popular solution for many blogs and portfolios.

This post will not go into how to create a Jekyll site as there are many tutorials online to do this (like this). However, this will be a short…

How to make complex neural networks without overfitting!

Photo from Startup Stock Photos.


Choosing the right architecture for your deep learning model can drastically change the results achieved. Using too few neurons can stop the model from finding complex relationships in the data, whereas using too many neurons can lead to overfitting.

With tabular data it is usually understood that not many layers are required; one or two will suffice. To see why this is enough, look at the Universal Approximation Theorem, which shows (in simple terms) that a neural network with one hidden layer and a finite number of neurons can approximate any continuous function on a closed, bounded input range to any desired accuracy.
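To make the theorem more concrete, here is a minimal sketch in plain Python. It is an illustration rather than a proof: it hand-builds (without any training) a one-hidden-layer ReLU network for a one-dimensional function and shows the worst-case error shrinking as neurons are added. The function names (`build_relu_network`, `predict`, `max_error`) and the choice of `math.sin` as the target are assumptions made for this example only.

```python
import math


def relu(z):
    """Rectified linear unit, the hidden-layer activation."""
    return max(0.0, z)


def build_relu_network(f, a, b, n_neurons):
    """Build a one-hidden-layer ReLU network that linearly interpolates f
    at n_neurons + 1 evenly spaced points on [a, b].

    Returns (output_bias, knots, output_weights), where hidden unit i
    computes relu(x - knots[i]).
    """
    step = (b - a) / n_neurons
    knots = [a + i * step for i in range(n_neurons)]
    values = [f(a + i * step) for i in range(n_neurons + 1)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(n_neurons)]
    # Each output weight is the change in slope introduced at that knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_neurons)]
    return values[0], knots, weights


def predict(network, x):
    """Forward pass: output bias plus a weighted sum of hidden activations."""
    bias, knots, weights = network
    return bias + sum(w * relu(x - k) for k, w in zip(knots, weights))


def max_error(f, network, a, b, n_points=1000):
    """Largest absolute error between f and the network on a grid over [a, b]."""
    xs = [a + (b - a) * i / n_points for i in range(n_points + 1)]
    return max(abs(f(x) - predict(network, x)) for x in xs)


if __name__ == "__main__":
    for n in (4, 16, 64):
        net = build_relu_network(math.sin, 0.0, 2 * math.pi, n)
        print(n, "neurons -> max error", max_error(math.sin, net, 0.0, 2 * math.pi))
```

The theorem only says that a good enough network exists. It does not say how many neurons are needed for a given problem, nor that training will find those weights, which is why architecture choice still matters in practice.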

However, how do you…

How to statistically find the sample size required to make accurate, high-confidence generalisations. (With examples!)

Photo by Morning Brew on Unsplash.


Finding the optimal sample size can be important in many different contexts, from collecting voting intentions in an election to assessing the quality of machinery in a company. Finding a sample which best represents a population can help reduce costs and time while also providing conclusions which can be applied to the entire population.
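As a rough illustration of what such a calculation can look like, the sketch below uses Cochran's formula for estimating a proportion, with an optional finite population correction. This is one common textbook approach and is an assumption here: the preview does not show which method the full article uses. The function name `required_sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(margin_of_error, confidence=0.95, proportion=0.5, population=None):
    """Sample size needed to estimate a proportion (Cochran's formula).

    margin_of_error: half-width of the confidence interval, e.g. 0.05 for +/-5%.
    confidence: two-sided confidence level, e.g. 0.95.
    proportion: expected proportion; 0.5 is the most conservative choice.
    population: population size; if given, apply the finite population correction.
    """
    # z-score for the two-sided confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        # Smaller populations need fewer samples for the same precision.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(required_sample_size(0.05))                    # 385
    print(required_sample_size(0.05, population=10000))  # 370
```

Note that the required sample size depends on the margin of error and confidence level far more than on the population size, which is why a few hundred responses can be enough even for a very large population.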

George Gallup, a pioneer of survey sampling techniques, offered an analogy that helps explain why a sample can stand in for a whole population.

“If you have cooked a large pan of soup, you do not need to eat it all to find out if it needs more seasoning. …

An introduction to Cling to help learn C++

Image taken from https://negativespace.co/.

When studying mathematics at university I was introduced to both Python and R in a statistics module. Since then I have stuck to those two languages and only dabbled in others when needed.

Recently, I have wanted to improve my programming foundations and learn more about the underlying concepts that we data scientists take for granted with Python and R, while also finding some ways to improve my workflows. And so I undertook the challenge of learning C++, which in turn led me to Cling.


Cling is an interactive interpreter for C++ which helps give a similar…

Step-by-step guide to implementing an autoencoder in fastai.

Autoencoder Architecture. Image made using NN-SVG.


fastai is a deep learning library that simplifies training neural networks using modern best practices [1]. While fastai provides users with a high-level neural network API, it is designed to let researchers and users easily mix in low-level methods while still keeping the overall training process easy and accessible to all.

This post covers how to set up an autoencoder in fastai. It goes through creating a basic autoencoder model, setting up the data in fastai, and finally putting all this together into a learner model.

Note: a basic understanding of fastai and PyTorch…

Henri Woodcock

UK-based data scientist \\ Personal website: henriwoodcock.github.io \\ The opinions expressed are my own and not those of my employer. \\ @henriwoodcock
