What is EDA? Why Do We Need It? — Step by Step Guide

What is exploratory data analysis, and how can we apply it with Python?

Exploratory data analysis (EDA) is one of the most important stages for anyone interested in data science.
Defining and understanding the data set enables us to draw meaningful conclusions and prepare the data for machine learning models. I divide the process into 3 main steps, although you can split it into more topics, such as data importing or data visualization. For me, it basically comes down to 3 steps:

1. Understanding the data

2. Clean the data

3. Analyze the relationship between variables

1. Understanding the data

Everything starts with importing the data. You can import it in different ways; one of the most common is using the Pandas library.

import numpy as np
import pandas as pd

# use whichever line matches your file format
df = pd.read_csv("my_dataset.csv")     # for CSV files
df = pd.read_excel("my_dataset.xlsx")  # for Excel files

We need to understand our data, so we should take a first look at the data set. We can start by viewing its head and tail, which gives us a chance to see the column names, the variables, and the data set's length. Additionally, we can check the shape of the data to see the data set's size.

df.head()
df.tail()
df.shape
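
Since the info function is mentioned later on, it is worth running here as well; it lists every column together with its non-null count and data type in a single call.

# column names, non-null counts, and dtypes in one overview
df.info()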

We shouldn't forget the statistical summary. We can check it with the describe function, which gives us basic statistics such as the mean, min, max, and quartiles (the 50% quartile is the median) with just one line of code.

df.describe()

But if you prefer to calculate these values by hand, you can use the code block below :)

q1 = df['Column_name'].quantile(0.25)  # first quartile
q2 = df['Column_name'].quantile(0.50)  # second quartile, equal to the median
q3 = df['Column_name'].quantile(0.75)  # third quartile
IQR = q3 - q1                          # interquartile range
mean = df["Column_name"].mean()
median = df["Column_name"].median()
# Tukey's fences: values outside this range are usually treated as outliers
lower_fence = q1 - 1.5 * IQR
upper_fence = q3 + 1.5 * IQR
print("mean : " + str(mean))
print("q1 : " + str(q1))
print("q2 : " + str(q2))
print("q3 : " + str(q3))
print("IQR : " + str(IQR))
print("lower fence : " + str(lower_fence))
print("upper fence : " + str(upper_fence))

What about a data set that has more than 10, 20, or 30 columns? We can't grasp the whole data set from the info function alone, so we need to check the column names.

df.columns

Also, we need to check the unique values to understand the data set. At this point I know the columns, but I don't know how many different values each column contains. We can handle this with:

df.nunique()

or you can directly see the values of a single column with:

df["my_column"].unique()

2. Clean the data

We don't always need every piece of data in the data set; some of it can distort the analysis or make our ML model useless. That's when we should clean the data set. The first thing to check is null values: with the isnull function we can count the total null values per column.

df.isnull().sum()

There are many ways to handle missing values, such as dropping them or replacing them with zero, the median, or the mean. You need to find the best solution for your data set.
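
Here is a minimal sketch of these options, assuming a numeric column named "Column_name" (a placeholder, replace it with your own column):

# fill missing values with zero, the median, or the mean (pick the one that fits your data)
df["Column_name"] = df["Column_name"].fillna(0)
df["Column_name"] = df["Column_name"].fillna(df["Column_name"].median())
df["Column_name"] = df["Column_name"].fillna(df["Column_name"].mean())

# or simply drop every row that still contains a missing value
df = df.dropna()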

If you don't need one of the columns, you can drop it.

df = df.drop(["column_1", "column_2"], axis=1)

Let's say you need the column and there are no null values; is everything okay? Unfortunately, no :) You also need to check for outliers: some of your data can take values far away from the rest, and this can also break your model. You may want to remove these outlier values, as in the sketch below.
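
As a minimal sketch, again assuming a numeric column named "Column_name" (a placeholder), one common approach is to keep only the rows that fall inside the IQR fences computed earlier:

# keep only the rows inside the IQR fences (Tukey's rule)
q1 = df["Column_name"].quantile(0.25)
q3 = df["Column_name"].quantile(0.75)
IQR = q3 - q1
lower_fence = q1 - 1.5 * IQR
upper_fence = q3 + 1.5 * IQR
df = df[(df["Column_name"] >= lower_fence) & (df["Column_name"] <= upper_fence)]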

3. Analyze the relationship between variables

We will look at a few measures. This part also includes data visualization. We need to look at the correlation between our columns; it's the fastest and easiest way to understand a relationship. Please make sure you have a good understanding of positive and negative correlations.
You can display the correlation matrix in many ways, but it is usually shown with a heatmap, which is available in the seaborn library.

import seaborn as sns
sns.heatmap(df.corr())

The second most popular method is the scatter plot. A scatter plot shows the relationship between variables as a graph, and we can see all of the relationships with one line of code.

import seaborn as sns
sns.pairplot(df)

And you can keep going with data visualization; here you can find all my articles about data visualization.

Last words

Thanks for reading this blog. Your comments and likes help me grow. If you want to see more content like this, you can follow my Medium profile.

You can find all the source code on my GitHub profile, and you can keep in touch with me through my LinkedIn profile.

If I have made any mistakes, please feel free to comment.
