Univariate, Bivariate, and Multivariate Data Analysis in Python

Gaurav Singh Tanwar
8 min read · Apr 28, 2022

Keep Calm and learn Data Analysis

Photo by Myriam Jessier on Unsplash

Max Levchin, the co-founder of PayPal, once said: “The world is now awash in data and we can see consumers in a lot clearer ways.” The statement is simple yet meaningful. In the age of the Internet, data is everywhere around us: in spreadsheets, on social media platforms, on e-commerce websites, and more. Organizations spend considerable resources collecting data and benefit from analyzing it.

In a nutshell, Data Analysis is the process of cleaning, transforming, visualizing, and analyzing data to gain insights that support more effective business decisions.

In this article, we will look at data analysis techniques and see which techniques can be used with which kinds of variables. Specifically, we will cover:

  1. Univariate, Bivariate, and Multivariate Data Analysis Meanings
  2. Univariate Analysis for Continuous Variables and Categorical Variables
  3. Bivariate Analysis for Continuous vs Continuous, Categorical vs Continuous, and Categorical vs Categorical Variables
  4. Multivariate Analysis for Numerical-Numerical-Categorical Variables
  5. Create Contingency Tables
  6. Interpret Results of Analysis

So let’s get started.

To understand the definitions and the steps involved in data analysis, we will import a dataset and perform the data analysis operations on it.

Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math

Importing the Dataset

Here, we will be using the Credit Card Approvals dataset available on Kaggle.

card_approval_data=pd.read_csv(<PATH TO CSV FILE>)
print(card_approval_data.head())

Output:

Head of the DataFrame containing the card approval data

Now let’s get a summary of the data using the info method of the data frame.

# info() prints its summary directly, so no print() is needed
card_approval_data.info()

Output:

We can see that the data frame has 690 entries and 16 columns. Also for each of the columns, the non-null count is 690 which implies that no column contains null values.

We also call the duplicated method of the pandas data frame to check for duplicate rows.

No duplicate rows
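A minimal sketch of that check is shown here. The small toy_df frame and its values are made up for illustration; on the real data, the same calls would be made on the card approval data frame.

```python
import pandas as pd

# Made-up toy data; the third row repeats the first one.
toy_df = pd.DataFrame({
    "Age": [30.5, 22.0, 30.5],
    "Income": [100, 0, 100],
    "Approved": [1, 0, 1],
})

# duplicated() marks every row that repeats an earlier row.
duplicate_mask = toy_df.duplicated()
n_duplicates = int(duplicate_mask.sum())
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy_df.drop_duplicates()
```

A count of zero on the real data means there is nothing to drop.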

Now let’s note which columns hold categorical data and which columns hold continuous data.

Columns holding categorical data: Gender, Married, BankCustomer, Industry, Ethnicity, PriorDefault, Employed, DrivingLicense, Citizen, Approved
Columns holding continuous data: Age, Debt, YearsEmployed, CreditScore, Income

Note: I have dropped the ZipCode column because that column won’t help in analysis.
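This bookkeeping could be sketched as follows. The raw_df frame, its values, and the two list names are assumptions made for illustration, and only a handful of the real columns are included.

```python
import pandas as pd

# Made-up two-row stand-in for the raw CSV (only a few columns shown).
raw_df = pd.DataFrame({
    "Gender": [1, 0],
    "Age": [30.83, 58.67],
    "Debt": [0.0, 4.46],
    "ZipCode": [202, 43],
    "Income": [0, 560],
    "Approved": [1, 1],
})

# Drop the column that will not be analysed.
card_approval_data = raw_df.drop(columns=["ZipCode"])

# Keep the two kinds of columns in lists so later steps can reuse them.
categorical_cols = ["Gender", "Approved"]
continuous_cols = ["Age", "Debt", "Income"]

print(card_approval_data[continuous_cols].dtypes)
```

Keeping the column names in lists makes it easy to select all continuous columns at once in the steps that follow.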

Alright!!! Now we begin our analysis on the dataset. We will start with Univariate Analysis.

Univariate Analysis

Univariate analysis is the most basic form of data analysis. It is used when we want to understand the data contained in a single variable, without looking at cause-and-effect relationships.

Univariate Analysis of continuous Variables

First, we will do the univariate analysis of continuous variables. We will first use the describe function to get the descriptive statistics of continuous variables.

card_approval_data[['Age','Debt','YearsEmployed','CreditScore','Income']].describe()

By using the describe function on the selected columns, we get the count, mean, std, min, max, and the 25th, 50th, and 75th percentile values of each column.
We can see that the minimum age among the applicants is 13.75. Also, the minimum value of the YearsEmployed column is 0, which tells us that people without any employment history also applied for a credit card. Similar observations can be made for the other continuous columns.

Now we will plot histograms for continuous columns to see the frequency distribution of values of columns.

The histogram for the Age column can be plotted with the following line of code:

sns.histplot(card_approval_data.Age,kde=True)
Histogram for the Age column

By analyzing the above plot, we find that very few people applied for credit cards after turning 50, and that people between the ages of 20 and 40 applied the most. This suggests that people tend to apply for credit cards in the early phase of their adult lives, so credit card issuers could target the 20–40 age group.

The histogram for the YearsEmployed column is shown below.

Frequency distribution for YearsEmployed Column

The above histogram shows that people tend to apply for credit cards at a very early stage of their careers. The lower frequency above 10 years of experience (YOE) may be because people who applied early in their careers already hold credit cards by the time they are professionally experienced (>10 YOE), and so don’t need to apply again at that stage.

Now I will move on to the univariate analysis of categorical variables. But feel free to draw histograms for the other continuous columns!

Univariate Analysis of Categorical Variables

First, we will draw count plots of the categorical variables.

# plot a count plot for the Gender column (recent seaborn versions expect x= as a keyword)
sns.countplot(x=card_approval_data.Gender)
Count Plots of Some Categorical Features

Observations:

  1. Men (Gender = 1) applied more often than women (Gender = 0).
  2. People with bank accounts applied more often than people without them. This is no surprise.
  3. The trend in ethnicity might be due to the region the data was collected from.
  4. The majority of applications were rejected, i.e., less than 50% of the applications were approved.

Bivariate Analysis

Bivariate analysis is slightly more analytical than univariate analysis. When the data set contains two variables and we want to compare them or study the relationship between them, bivariate analysis is the right technique.

Bivariate Analysis of Continuous Variables:

The first step in performing bivariate analysis between continuous variables is to calculate the correlations between them. Use the corr function to construct the correlation matrix.

Correlation Matrix

Though in this dataset, we don’t see any strong correlation between any two continuous variables, in some datasets, continuous variables could be strongly correlated and the values of one might depend on others.
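Here is a minimal sketch of building the correlation matrix with corr. The toy values are made up; the column names follow the continuous columns listed earlier.

```python
import pandas as pd

continuous_cols = ["Age", "Debt", "YearsEmployed", "CreditScore", "Income"]

# Made-up values standing in for the real continuous columns.
toy = pd.DataFrame({
    "Age": [22.0, 35.5, 41.0, 28.25, 50.0],
    "Debt": [1.0, 4.5, 0.5, 2.0, 3.0],
    "YearsEmployed": [0.5, 3.0, 10.0, 1.5, 7.0],
    "CreditScore": [0, 6, 1, 0, 3],
    "Income": [0, 560, 824, 3, 100],
})

# corr() computes pairwise Pearson correlation by default.
corr_matrix = toy[continuous_cols].corr()
print(corr_matrix.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.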

We can also draw line plots and scatterplots to see a relation between the two continuous variables.

Scatter Plot

The points in the above scatter plot don’t follow any specific pattern. This might be due to people applying for cards coming from different professions with varying payscales.
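A scatter plot like this can be drawn with seaborn's scatterplot. The sketch below assumes YearsEmployed on the x-axis and Income on the y-axis, which is only a guess at the pair being plotted, and it uses randomly generated toy data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated toy data, not the real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "YearsEmployed": rng.uniform(0, 15, 50),
    "Income": rng.uniform(0, 5000, 50),
})

# One point per applicant; the column pair is an assumption.
ax = sns.scatterplot(data=toy, x="YearsEmployed", y="Income")
ax.set_title("YearsEmployed vs Income (toy data)")
```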

Bivariate Analysis of Categorical Variables vs Continuous Variables:

Now we will try to see how values of continuous variables behave for different values of categorical variables.

We will use the ‘Approved’ column of the data as the categorical variable for our analysis. Comparing the ‘Approved’ column with other columns can provide some useful insights.

GroupBy: First, we will perform the GroupBy operation on the continuous variables. Groupby allows us to split our data into separate groups to perform computations for better analysis.

Group By operation on data

In the above table, we can see that the average credit score of people who got approval is higher than that of people who didn’t. The same pattern is observed for the Income and YearsEmployed columns. This is understandable, because companies are reluctant to issue credit cards to people with low credit scores and low income, and they prefer applicants with a decent employment history.
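A group-by of this kind might look like the sketch below. The toy values are made up, and taking the mean of each group is an assumption based on the averages discussed above.

```python
import pandas as pd

# Made-up toy data: two approved (1) and two rejected (0) applications.
toy = pd.DataFrame({
    "Approved": [1, 1, 0, 0],
    "CreditScore": [6, 4, 0, 2],
    "Income": [1000, 3000, 0, 200],
    "YearsEmployed": [5.0, 3.0, 0.5, 1.5],
})

# Split the rows by approval status, then average each continuous column.
group_means = toy.groupby("Approved")[["CreditScore", "Income", "YearsEmployed"]].mean()
print(group_means)
```

Each row of the result describes one approval class, so the two rows can be compared column by column.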

KDE Plots with Hue: A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions.

We will plot KDE plots of continuous variables with hue=’Approved’.

KDE Plot for ‘Age’ Column with Code to plot it

Similarly, we can plot KDE plots for CreditScore and YearsEmployed Columns.

KDE Plots

In the above plots, we can see how the distribution of each variable behaves separately for the ‘Approved’ and ‘Rejected’ cases.

Bivariate Analysis of Categorical Variables vs Categorical Variables:

Now we will try to see the relationship between categorical variables. Again we will keep the ‘Approved’ column fixed and will compare it with other columns.

Countplot with Hue: We will plot count plots of categorical variables with hue=’Approved’.

Gender Countplot with Hue

By looking at the above plot, it does not seem that the Gender of applicants is considered a criterion to approve applications.
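A count plot with hue can be sketched like this, again on made-up toy data; the 0/1 coding of Gender and Approved follows the convention used in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in a notebook

import pandas as pd
import seaborn as sns

# Made-up toy data: Gender and Approved are both coded as 0/1.
toy = pd.DataFrame({
    "Gender":   [1, 1, 1, 1, 0, 0, 0, 1],
    "Approved": [1, 0, 0, 1, 1, 0, 0, 0],
})

# One bar per (Gender, Approved) combination; bar height is the row count.
ax = sns.countplot(data=toy, x="Gender", hue="Approved")
```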

We can also build a contingency table to get the actual numbers.

Contingency Table
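A minimal sketch of building such a table with pd.crosstab, using made-up toy values:

```python
import pandas as pd

# Made-up toy data: Gender and Approved are both coded as 0/1.
toy = pd.DataFrame({
    "Gender":   [1, 1, 1, 0, 0, 1],
    "Approved": [1, 0, 0, 1, 0, 1],
})

# Rows are Gender values, columns are Approved values, cells are counts.
# margins=True appends an 'All' row and column holding the totals.
contingency = pd.crosstab(toy.Gender, toy.Approved, margins=True)
print(contingency)
```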

To see the percentages we can run the code shown below

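# Row totals per gender, taken from the 'All' margin column
row_totals=pd.crosstab(card_approval_data.Gender,card_approval_data.Approved,margins=True)['All']
# Divide each row of counts by its total; dropna() discards the unmatched 'All' row
pd.crosstab(card_approval_data.Gender,card_approval_data.Approved).divide(row_totals,axis=0).dropna()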

This gives:

Percentages

In the above table, we can see that the percentages for the two genders are very close (53% versus 56.4%). Hence, it seems that there wasn’t any discrimination against either gender.
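pandas can also compute row percentages directly through the normalize argument of crosstab. The sketch below uses made-up toy values and checks that both routes give the same numbers.

```python
import pandas as pd

# Made-up toy data: Gender and Approved are both coded as 0/1.
toy = pd.DataFrame({
    "Gender":   [1, 1, 1, 1, 0, 0],
    "Approved": [1, 0, 0, 0, 1, 0],
})

# normalize="index" divides every row by its own total.
row_pct = pd.crosstab(toy.Gender, toy.Approved, normalize="index")

# The same result computed by hand from the raw counts.
counts = pd.crosstab(toy.Gender, toy.Approved)
manual_pct = counts.divide(counts.sum(axis=1), axis=0)

print(row_pct)
```

Each row sums to 1, so the Approved = 1 column can be read as the approval rate for that gender.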

Count Plots with Hue

In the first countplot of the above three, we see that for the ‘Latino’ Ethnicity, most of the applications were rejected. So was there any discrimination against them? Let’s try to find out.

First, we apply a group-by operation on the data.

Group By on continuous features

Then we filter the rows with Ethnicity = Latino and take the mean of the required columns.

card_approval_data[card_approval_data.Ethnicity=='Latino'][['Age','Debt','YearsEmployed','CreditScore','Income']].agg('mean')
Aggregation

Now if we compare the mean CreditScore of the Latino group (1.85) with the mean CreditScore of approved applications overall (4.60), we find that Latino applicants had a lower CreditScore than applicants whose applications were approved. The same holds for the Income column, where Latino applicants have an average income of 434.64 while approved applications have an average income above 2000. From this observation, the higher rejection rate for the Latino group appears to be explained by lower credit scores and incomes, so discrimination against the group seems unlikely.
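To put the two sets of means side by side, something like the sketch below could be used. The toy values and the "Other" ethnicity label are made up, and comparison is a hypothetical helper frame.

```python
import pandas as pd

cols = ["CreditScore", "Income"]

# Made-up toy data, not the real dataset.
toy = pd.DataFrame({
    "Ethnicity": ["Latino", "Latino", "Other", "Other", "Other"],
    "CreditScore": [1, 3, 6, 4, 0],
    "Income": [200, 600, 3000, 1000, 50],
    "Approved": [0, 0, 1, 1, 0],
})

# Means for the subgroup of interest and for all approved applications.
subgroup_means = toy[toy.Ethnicity == "Latino"][cols].mean()
approved_means = toy[toy.Approved == 1][cols].mean()

# One column per group, one row per feature.
comparison = pd.DataFrame({"Latino": subgroup_means, "Approved": approved_means})
print(comparison)
```

A comparison of group means like this is only a first check, since it does not control for the other features.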

Multivariate Analysis

Multivariate analysis is a more complex form of statistical analysis, used when there are more than two variables in the data set.

Here, we will try to see relations between continuous variables and the ‘Approved’ column. To do that, we will plot a pair plot with hue set to ‘Approved’.

Pair Plot

We don’t see any pattern in the pair plot. But, again this can be used to see how two continuous features behave for different classes.

Conclusion

In this article, we looked at the definitions of univariate, bivariate, and multivariate analysis. We also looked at some ways to perform such analysis in Python. We used some plots to identify relations between variables, and we saw how to interpret the results of such analysis. Enough for this article. See you in the next article!
