A guide to your first machine learning project — Hands on

What is machine learning?

Data Science, Big Data Analytics, Artificial Intelligence, Predictive Analytics, Computational Statistics… do all these fancy words spin around your world? They sure do. So let me put it plainly and simply. Machine learning is about teaching computers to learn from data in order to make decisions or predictions. It gives the computer the ability to learn without being explicitly programmed. In short, you teach your computer how to think.

More formally, machine learning is a branch of artificial intelligence (AI) and computer science that focuses on using data and algorithms to imitate the way humans learn, gradually improving in accuracy.
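
To make "learning without being explicitly programmed" a bit more concrete, here is a toy sketch. Instead of hard-coding a cut-off, the program picks the cut-off that best fits some labelled examples. The data (hours studied and pass/fail) and the function names are made up purely for illustration, and this is far simpler than a real machine learning algorithm.

```python
# Toy illustration: learn a decision threshold from labelled examples
# instead of hard-coding it. All data below is made up.

def learn_threshold(values, labels):
    """Return the cut-off that classifies the most training examples correctly.

    A value is predicted as 1 when it is >= the cut-off, otherwise 0.
    """
    best_cut, best_correct = None, -1
    for cut in sorted(set(values)):
        correct = sum((v >= cut) == bool(y) for v, y in zip(values, labels))
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def predict(value, cut):
    """Apply the learned cut-off to a new value."""
    return int(value >= cut)


# Hypothetical training data: hours studied -> passed (1) or failed (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

cut = learn_threshold(hours, passed)
print("learned cut-off:", cut)
print("prediction for 6.5 hours:", predict(6.5, cut))
```

Nobody told the program where the cut-off is; it came from the examples. Give it different examples and it will learn a different rule.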

What’s the Problem at hand?

Titanic — Machine Learning from Disaster

The famous Titanic problem is one of the best entry-level projects for people getting into machine learning, me included.

The Challenge (Kaggle)

The sinking of the Titanic is one of the most infamous shipwrecks in history. (https://www.kaggle.com/c/titanic)

Fictional representation of the sinking of the Titanic

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

An overview of the problem

We are provided with 3 files:

1. gender_submission.csv: An example of what a submission file should look like. It contains two columns, PassengerId and Survived.

2. test.csv: The passengers whose survival we have to predict. It has the same columns as train.csv except Survived, so it cannot be used to measure accuracy locally; Kaggle scores the predictions once they are submitted.

3. train.csv: The training data we will use for our modelling experiment. It contains 12 columns.
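
To show how the files fit together, here is a small sketch that turns test-style rows into a submission-style file. As far as I know, Kaggle describes gender_submission.csv as a set of predictions that assumes all and only female passengers survive, so the sketch uses that same rule. The sample rows, ids and the function name make_gender_submission are made up for illustration, and it uses only Python's standard library.

```python
import csv
import io


def make_gender_submission(test_csv, out_csv):
    """Write a PassengerId,Survived file that predicts survival for women only."""
    reader = csv.DictReader(test_csv)
    writer = csv.writer(out_csv, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        writer.writerow([row["PassengerId"], int(row["Sex"] == "female")])


# Made-up rows using a subset of the test.csv column names (no Survived column)
sample_test = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age\n"
    '1001,3,"Doe, Mr. John",male,34.5\n'
    '1002,3,"Doe, Mrs. Jane",female,47\n'
)

out = io.StringIO()
make_gender_submission(sample_test, out)
submission_text = out.getvalue()
print(submission_text)
```

With the real files you would pass open file objects instead of the in-memory samples, for example open("test.csv", newline="") for reading and open("my_submission.csv", "w", newline="") for writing (the output file name is just a placeholder).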

Approach

These are some of the most common steps you will encounter with any machine learning workflow whether it be simple or complex.

1. Load the data

2. Exploratory data analysis (EDA)

3. Pre-processing data (filling missing values, etc.)

4. Feature Engineering

5. Modelling

6. Prediction
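
The six steps above can be sketched end to end on a handful of made-up rows. This is only a rough illustration using Python's standard library. The "model" is a stand-in that learns the survival rate per sex, not a real classifier, and the FamilySize feature and the median fill for Age are my own assumptions about what pre-processing and feature engineering could look like here.

```python
import csv
import io
from statistics import median

# Made-up rows using a subset of the train.csv column names
TRAIN = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch
1,0,3,male,22,1,0
2,1,1,female,38,1,0
3,1,3,female,,0,0
4,0,3,male,35,0,0
5,0,2,male,,0,0
6,1,2,female,14,1,2
"""

# 1. Load the data
rows = list(csv.DictReader(io.StringIO(TRAIN)))

# 2. EDA: how many ages are missing?
missing_age = sum(1 for r in rows if r["Age"] == "")

# 3. Pre-processing: fill missing ages with the median of the known ages
known_ages = [float(r["Age"]) for r in rows if r["Age"] != ""]
age_fill = median(known_ages)
for r in rows:
    r["Age"] = float(r["Age"]) if r["Age"] != "" else age_fill

# 4. Feature engineering: siblings/spouses + parents/children + the passenger
for r in rows:
    r["FamilySize"] = int(r["SibSp"]) + int(r["Parch"]) + 1


# 5. Modelling: "learn" the survival rate per sex from the training rows
def fit_rates(train_rows):
    totals, survived = {}, {}
    for r in train_rows:
        totals[r["Sex"]] = totals.get(r["Sex"], 0) + 1
        survived[r["Sex"]] = survived.get(r["Sex"], 0) + int(r["Survived"])
    return {sex: survived[sex] / totals[sex] for sex in totals}


rates = fit_rates(rows)


# 6. Prediction: predict survival when the learned rate is above 0.5
def predict(passenger, learned_rates):
    return int(learned_rates.get(passenger["Sex"], 0.0) > 0.5)


predictions = [predict(r, rates) for r in rows]
print("missing ages:", missing_age, "| fill value:", age_fill)
print("learned rates:", rates)
print("predictions:", predictions)
```

Here the predictions are made on the same rows the rates were learned from, so they look better than they would on unseen passengers. In the real project you would predict on test.csv and let Kaggle score the result.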

Initial EDA:

1. PassengerId (Integer): A sequential number that gives each passenger a unique id

2. Survived (Integer, used as a Boolean): Whether the passenger survived. 0 = Did not survive, 1 = Survived

3. Pclass (Integer): The ticket class the passenger belongs to. 1 = 1st class, 2 = 2nd class, 3 = 3rd class

4. Name (String): The name of the passenger

5. Sex (String): The sex of the passenger, male or female

6. Age (Float): The age of the passenger in years. It can be fractional and is missing for some passengers

7. SibSp (Integer): The number of siblings or spouses the passenger had aboard the ship

8. Parch (Integer): The number of parents or children the passenger had aboard the ship

9. Ticket (String): The ticket number of the passenger. Some ticket numbers contain letters as well as digits

10. Fare (Float): The fare the passenger paid for the ticket

11. Cabin (String): The cabin number of the passenger on the ship

12. Embarked (String): The port at which the passenger boarded the Titanic. C = Cherbourg, Q = Queenstown, S = Southampton
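
A first look at these columns usually means two things: counting the gaps and comparing survival across groups. Below is a minimal sketch of both, using only the standard library on four made-up rows that follow the 12-column layout. The helper names missing_counts and survival_rate_by are hypothetical; in a notebook you would more likely reach for a data-frame library.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows in the 12-column train.csv layout
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,PC 0002,71.2833,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,0003,7.925,,S
4,0,1,"Moe, Mr. Sam",male,54,0,0,0004,51.8625,E46,
"""


def missing_counts(rows):
    """Count empty cells per column (columns without gaps are left out)."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] += 1
    return dict(counts)


def survival_rate_by(rows, column):
    """Fraction of survivors for each distinct value of `column`."""
    totals, survivors = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        survivors[row[column]] += int(row["Survived"])
    return {key: survivors[key] / totals[key] for key in totals}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = missing_counts(rows)
by_class = survival_rate_by(rows, "Pclass")
by_sex = survival_rate_by(rows, "Sex")

print("missing values per column:", missing)
print("survival rate by Pclass:", by_class)
print("survival rate by Sex:", by_sex)
```

On the real train.csv, Age, Cabin and Embarked are the columns where I would expect gaps, which is why the pre-processing step matters before any modelling.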

How to get hands on!

The part you have all been waiting for: how to get started.

Step 1:

The first step is to find the environment in which you will write code. I prefer Google Colaboratory notebooks because they are very easy to set up, free to use, and give you access to a free GPU for running your machine learning models faster.

Just click on this link and then click New notebook to get started (you will need a Google account).

Step 2:

Once you have created the notebook, the next step is to load the data into it.

!wget https://raw.githubusercontent.com/KavinRajagopal/Kaggle_titanic_machine_learning/main/titanic/gender_submission.csv
!wget https://raw.githubusercontent.com/KavinRajagopal/Kaggle_titanic_machine_learning/main/titanic/test.csv
!wget https://raw.githubusercontent.com/KavinRajagopal/Kaggle_titanic_machine_learning/main/titanic/train.csv
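
One thing to watch out for when downloading from GitHub: an address of the form github.com/<user>/<repo>/blob/<branch>/<path> points at the web page that displays the file, so downloading it saves HTML rather than the CSV itself. The file contents live under raw.githubusercontent.com. Here is a small sketch that converts one form into the other and checks whether a download is really HTML. Both function names are hypothetical helpers, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    user, repo, _, *rest = segments
    raw_path = "/" + "/".join([user, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


def looks_like_html(text):
    """Cheap check for a 'CSV' download that is actually a web page."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


page_url = (
    "https://github.com/KavinRajagopal/Kaggle_titanic_machine_learning"
    "/blob/main/titanic/train.csv"
)
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

If a downloaded train.csv opens with an HTML tag instead of a header row starting with PassengerId, the page was downloaded instead of the data.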

Step 3:

Now to open the notebook:

Go to File -> Open notebook -> GitHub -> paste the following link

https://github.com/KavinRajagopal/Kaggle_titanic_machine_learning/blob/main/Kaggle%20Titanic%20submission%20Notebook.ipynb

Voilà! This will set up your first machine learning project environment. This notebook is your playground. Do play around with the code and twist some knobs, for that's what machine learning is all about.

Kavinrajagopal

I have a passion for everything data, so you will see me write about AI/ML, DevOps, cloud computing and data engineering.