Data Science

Exploratory Data Analysis

Exploratory data analysis is the first and most important step in analysing any kind of data. Rather than a specific set of procedures, EDA is an approach, or a philosophy, which seeks to explore the most important and often hidden patterns in a data set. In EDA, you explore the data and try to come up with a hypothesis about it, which you can later test using hypothesis testing. Statisticians use it to take a bird’s eye view of the data and try to make some sense of it. The following topics are covered.

  1. Data sourcing
  2. Data cleaning
  3. Univariate analysis
  4. Bivariate analysis
  5. Derived metrics

Data Sourcing

To solve a business problem using analytics, you need to have historical data to come up with actionable insights. Data is the key — the better the data, the more insights you can get out of it.

Typically, data comes from various sources, and your first job as a data analyst is to procure it from them. The various sources of data, and how to source data from public and private platforms, are described below:

Public and Private Data

Data can be classified into two types, depending upon its source:

  1. Public Data
  2. Private Data

A large amount of data collected by the government or other public agencies is made public for the purpose of research. Such data sets do not require special permission for access and are, therefore, called public data.

On the other hand, private data is data that is sensitive to organisations and is, thus, not available in the public domain. Banking, telecom, retail and media are some of the key private sectors that rely heavily on data to make decisions.


Public Data

Public data is available on the internet on various platforms. A lot of data sets are available for direct analysis, whereas some of the data have to be manually extracted and converted into a format that is fit for analysis.

Commodity prices, as the name suggests, are the prices of commodities, such as oil, gold, silver, wheat, cotton, coffee, etc., that are traded in global trade markets. Just as an increase in commodity prices is indicative of economic growth, a fall in the same is indicative of a slowdown.

Optional Exercise: Sourcing Sports Data

If you are interested in sports, you can take a look at the Awesome Public Datasets on GitHub, which contain a directory of sports data from tennis, cricket, football, basketball and other sports. For example, you can find the ball-by-ball data of all the IPL seasons (~600 matches). You may find it interesting to browse through it and do some EDA/modelling as a side project and add it to your resume. A project on IPL could serve as an interesting talking point in any job interview.


Class Opinion

Think about an industry that you would love to analyse. How would you go about finding public data sets for this industry?

Some examples of useful datasets are:

  1. Daily oil and steel price data (India), which can be useful for someone working in the oil/transport/manufacturing industry
  2. Daily air ticket prices, for people working in the air travel/tourism industry
  3. The daily price (conversion rate) of USD, in units of INR per USD

Now, for such data sets, what are some challenges that you would face while trying to convert them into formats that you could analyse (e.g. you would need to scrape the website, the data could be unstructured and ill-formatted, etc.)? Think about how you’d solve these problems and write your answer in the text box below.

Census data for ‘Educational Level By Age And Sex For Population Age 7 And Above – 2011’ is available here. Please download the file in column C-08, first row, which is ‘India’.

Let’s say you are working with a central government body and want to compare the literacy rates of the states in India. 

Your task is to understand the data, i.e., what the rows and attributes represent, and find the number of literates, illiterates, literacy rate, etc. as guided in the questions below. 

Sourcing Data

Suppose you are a consultant handling a government project to revamp the national government school system. What is the public source of data you can think of to start exploring?

How many people are included in the census data set?

1,21,08,54,977

1,29,08,23,010

1,32,34,56,920

1,24,32,51,298

What is the total number of illiterate and literate people in India in 2011?

Illiterate = 44,72,16,165; Literate = 76,36,38,812


Illiterate = 34,27,63,892; Literate = 25,66,38,812

Illiterate = 63,72,92,612; Literate = 35,29,62,923

Illiterate = 21,34,28,639; Literate = 37,62,97,392

Private Data

A large number of organisations seek to leverage data analytics to make crucial decisions. As organisations become customer-centric, they utilise insights from data to enhance customer experience while optimising their daily processes.


  1. Public data: Data collected by the government or other public agencies that is made public for the purposes of research is known as public data.

  2. Private data: This is the data generated by the banking, telecom, retail and media industries, which are some of the key private sectors that rely heavily on data to make decisions.

It is recommended that you keep the following data sources handy when looking for data sets.

Data sourcing is only the first step in the process. After being sourced, the data needs to be cleaned before you can analyse it. In the next session, you will learn how to clean the data for analysis.

Data Cleaning

Comprehension: Cleaning census data – fixing rows and columns

There are various types of quality issues when it comes to data, and that’s why data cleaning is one of the most time-consuming steps of data analysis. For example, there could be formatting errors (e.g. rows and columns are merged), missing values, repeated rows, spelling inconsistencies, etc. These issues could make it difficult to analyse data and could lead to errors or irrelevant results. Thus, these issues need to be corrected before data is analysed.

Though data cleaning is often done in a somewhat haphazard way and is difficult to define as a single structured process, it broadly involves the following steps:

  1. Fix rows and columns
  2. Fix missing values
  3. Standardise values
  4. Fix invalid values
  5. Filter data


Checklist for Fixing Rows:

  • Delete summary rows: Total, subtotal rows
  • Delete incorrect rows: Header rows, footer rows
  • Delete extra rows: Column numbers, indicators, blank rows, page numbers

Checklist for Fixing Columns:

  • Merge columns for creating unique identifiers if needed: e.g. merge State and City into Full address
  • Split columns for more data: Split address to get State and City, to analyse each separately
  • Add column names: Add column names if missing
  • Rename columns consistently: Abbreviations, encoded columns
  • Delete columns: Delete unnecessary columns
  • Align misaligned columns: The data set may have shifted columns
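The course demonstrates these fixes in Excel; for reference, here is a minimal pandas sketch of the same checklist items. The data frame and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data with the kinds of issues listed in the checklists above.
raw = pd.DataFrame({
    "Location": ["Mumbai, Maharashtra", "Pune, Maharashtra", "Total, "],
    "Sales": [120, 80, 200],
})

# Fix rows: delete summary rows such as 'Total'.
df = raw[~raw["Location"].str.startswith("Total")].copy()

# Fix columns: split 'Location' into 'City' and 'State' to analyse each separately.
df[["City", "State"]] = df["Location"].str.split(", ", expand=True)

# Fix columns: merge values back into a unique identifier if needed.
df["Full_Address"] = df["City"] + ", " + df["State"]

# Fix columns: rename consistently and delete unnecessary columns.
df = df.rename(columns={"Sales": "Monthly_Sales"}).drop(columns=["Location"])

print(df)
```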


Census

(This question is in continuation to the previous question.)

Which of the following kinds of information could be obtained from the data set?

Information about individuals such as age, literacy, etc.

Information about an age group at a district level such as the total number of literates in Chennai

Information about an age group in the entire country, such as the total number of literates in India

Invalid Columns

Mark the correct statement about columns in the data set.

There are no merged columns in the data set, i.e., each column represents one variable.

The data set contains merged column name headers, which should be broken down into separate column names.

✓ Correct

There is at least one empty column in the data set.

Feedback:

Ideally, every column should have a separate, unique name. There are merged column name headers, e.g., ‘Total’, ‘Illiterate’, ‘Literate’, etc. These should be broken down into separate column names for analysis, i.e., ‘Total’ should be broken down into ‘Total_Persons’, ‘Total_Males’ and ‘Total_Females’.

Rows

Mark the correct statements about rows in the data set.

(More than one option may be correct.)

There are no empty rows in the data set.


There are header rows in the data set.


There are footer rows in the data set.


There are rows with column numbers.

Use the Courses data set given below to solve the following assessment.

Department with the Maximum Students

The courses dataset contains a column named ‘Course’ which has the data in the format courseID-Department. Find the department with the maximum number of students. 

MA

CSE

MEC

EE

Merge Two Columns

After solving the previous question, the dataset contains four columns: CourseID, Dept, No of Students, and Credits.

Let’s say this modified dataset is given to you and you are asked to create a new column ‘Courses’ to the right of the column ‘Dept’ by merging ‘CourseID’ and ‘Dept’ (the first four rows of the final required dataset are given below). What steps would you follow to do so? Write those steps in the text box given below.

Misaligned Columns

How many misaligned (shifted) columns are present in the census dataset given above? Select the correct option.
After that, make sure to align those misaligned columns and proceed to the remaining questions.


5


6

7

8

Missing Values

In the previous lecture, you learnt how to fix rows and columns. Let’s now study another common data quality issue: how to treat missing values.

In the following lecture, Anand will walk you through some important points that you should keep in mind while treating missing values.

Let’s summarise how to deal with missing values:

  • Set values as missing values: Identify the values that indicate missing data and yet are not recognised by the software, e.g., treat blank strings, ‘NA’, ‘XX’, ’99’, etc. as missing and replace such data indicating missing values with the blank cell in Excel.
  • Adding is good, exaggerating is bad: You should try to get information from reliable external sources as much as possible; but if you cannot, then it is better to keep missing values as such rather than exaggerating the existing rows/columns.
  • Delete rows, columns: Delete rows if the number of missing values is insignificant, as this would not impact the analysis. Columns could be removed if the missing values are quite significant in number.
  • Fill partial missing values using business judgement: Missing time zone, century, etc. These values are easily identifiable.
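The course performs these steps in Excel; the following is a minimal pandas sketch of the same ideas on made-up marks data (the column names and stand-in codes are illustrative assumptions, not the actual ‘Marks’ data set).

```python
import numpy as np
import pandas as pd

# Hypothetical marks data where 'NA', 'XX' and 999 stand in for missing values.
df = pd.DataFrame({
    "Student": ["A", "B", "C", "D"],
    "Midterm": [78, "NA", 91, "XX"],
    "Final":   [88, 999, np.nan, 72],
})

# Set values as missing: replace the stand-in codes with real missing values.
df = df.replace({"NA": np.nan, "XX": np.nan, 999: np.nan})

# Count missing values per row (the same idea as the 'Count' column below).
df["Count"] = df[["Midterm", "Final"]].isna().sum(axis=1)

# Delete rows only when it will not hurt the analysis,
# e.g. drop rows where every mark is missing.
df = df.dropna(subset=["Midterm", "Final"], how="all")

print(df)
```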

Use the ‘Marks’ data set given below to solve the following assessments.

Missing Values Removal

Note: Download the ‘Marks’ data set given above and solve the quiz.

Create a column named ‘Count’ that maintains the number of missing values in each row of the ‘Marks’ data set given above. What is the sum of the cells of the ‘Count’ column?

33

22

44

55

Sum of Count

This question is in continuation of the previous question. Once you are done with the previous question, remove all the rows with five missing values.

You created the Count column in the previous question. Now, what is the sum of the Count column after removing the rows with five missing values? 

44

39

34

29

Average of Columns

What is the average of each of these columns, Assignment, Tutorial, Midterm, TakeHome and Final?
(Note: Precision up to two decimal places is required.)
 

67.62, 80.99, 68.31, 85.38, 90.79

80.99, 68.31, 85.38, 90.79, 67.62

85.38, 90.79, 67.62, 80.99, 68.31


90.79, 67.62, 80.99, 68.31, 85.38

Fill Empty

This question is in continuation of the previous question.

After performing the steps below, answer the following question:

  1. Replace all the empty cells in the data set with the average of their respective columns, which you calculated in the previous question.
  2. Create a column ‘Row Min’ and store the minimum element of each row, excluding the ‘Student ID’ column to calculate the minimum element.

What is the average of the column ‘Row Min’ up to 2 decimal places?

79.12

58.84

95.24

45.98

Standardising Values

You will now learn how to standardise values.

Scaling ensures that the values have a common scale, which makes analysis easier. For example, let’s take a data set containing the grades of students studying in different universities. Some of the universities give grades on a scale of 4, while others give grades on a scale of 10. Therefore, you cannot assume that a GPA of 3 on a scale of 4 is equal to a GPA of 3 on a scale of 10, even though they are quantitatively the same. Thus, for the purpose of analysis, these values need to be brought to a common scale, such as the percentage scale.


One of the concepts that surely caught your attention is outliers. Removing outliers is an important step in data cleaning. An outlier may disproportionately affect the results of your analysis and lead to faulty interpretations. It is also important to understand that there is no fixed definition of an outlier; it is left to the judgement of the analyst to decide the criteria on which data would be categorised as abnormal or an outlier. We will look into one such method in the next session.


Let’s summarise what you learnt about standardising variables. You could use this as a checklist for future data cleaning exercises.

  • Standardise units: Ensure that all observations under a variable have a common and consistent unit, e.g., convert lbs to kgs, miles/hr to km/hr, etc.
  • Scale values if required: Make sure that the observations under a variable have a common scale.
  • Standardise precision: This allows you to present the data better, e.g., 4.5312341 kgs to 4.53 kgs.
  • Remove outliers: Remove high and low values that would disproportionately affect the results of your analysis.
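If you prefer working outside Excel, here is a minimal pandas sketch of the standardising checklist on made-up student data; the 0.453592 lbs-to-kg factor is standard, everything else is illustrative.

```python
import pandas as pd

# Hypothetical student data mixing units and scales.
df = pd.DataFrame({
    "weight_lbs": [150.2312341, 180.5678],
    "gpa":        [3.0, 7.5],
    "gpa_scale":  [4, 10],
})

# Standardise units: convert lbs to kgs.
df["weight_kg"] = df["weight_lbs"] * 0.453592

# Scale values: bring GPAs on different scales to a common percentage scale.
df["gpa_pct"] = df["gpa"] / df["gpa_scale"] * 100

# Standardise precision: keep two decimal places for presentation.
df = df.round({"weight_kg": 2, "gpa_pct": 2})

print(df[["weight_kg", "gpa_pct"]])
```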

Use the ‘Customer’ data set to solve the following assessments:

Removing Extra Characters from a Column

The given data set ‘customer’ has a column ‘Cust_id’, which has values Cust_1, Cust_2 and so on.
Remove the repeated ‘Cust_’ from the column Cust_id so that the column Cust_id has just numbers like 1, 2, 3 and so on. Next, find the sum of all the elements of the column Cust_id.

17,79,028

16,79,028

1,67,902

6,79,028

Capitalise

Which function can be used to capitalise only the first letter of each word in the text and the letter that follows any symbol/character in the text?

PROPER


UPPER

LOWER

All the above

Invalid Values

In the previous segment, you learnt how to standardise values. When standardising values, you do not really pay attention to the validity of the actual values of the variables. This is what we will discuss now as you learn how to fix invalid values.

A data set can contain invalid values in various forms. Some of the values could be truly invalid, e.g., a string ‘tr8ml’ in a column containing mobile numbers would make no sense and, hence, should be removed. Similarly, a height of 11 feet would be an invalid value in a set containing heights of children.

On the other hand, some invalid values can be corrected. For example, a numeric value with a data type of string could be converted to its original numeric type. 

Let’s gain more insights into fixing invalid values.  

Let’s summarise what you learnt about fixing invalid values. You could use this as a checklist for future data cleaning exercises.

  • Convert incorrect data types: Correct the incorrect data types for ease of analysis. For example, if the numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections are: string to number: ‘12,300’ to ‘12300’; string to date: “2013-Aug” to “2013/08”; number to string: “PIN Code 110001” to “110001”; etc.
  • Correct values that go beyond range: If some of the values are beyond the logical range, e.g., temperature less than -273° C (0° K), you would need to correct them as required. A close look would help you check if there is scope for correction or the value needs to be removed.
  • Correct values that are not in the list: Remove values that don’t belong to a list. For example, in a data set containing blood groups of individuals, strings ‘E’ or ‘F’ are invalid values and can be removed.
  • Correct wrong structure: Values that don’t follow a defined structure can be removed. For example, in a data set containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid value.
  • Validate internal rules: If there are internal rules, such as the date of a product’s delivery must definitely be after the date of the order, they should be correct and consistent.
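As a reference, here is a minimal pandas sketch of two items from this checklist, applied to the kind of PIN code and number-stored-as-string issues described above (the data frame is made up for illustration).

```python
import pandas as pd

# Hypothetical address data with invalid PIN codes and numbers stored as strings.
df = pd.DataFrame({
    "Pincode": ["110001", "4216", "021601", "421601"],
    "Amount":  ["12,300", "4,500", "900", "1,250"],
})

# Correct wrong structure: a valid Indian PIN code has exactly 6 digits
# and does not start with 0.
valid = df["Pincode"].str.fullmatch(r"[1-9]\d{5}")
print(df.loc[~valid, "Pincode"])  # the invalid values

# Convert incorrect data types: string '12,300' to number 12300.
df["Amount"] = df["Amount"].str.replace(",", "", regex=False).astype(int)
print(df["Amount"].sum())
```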

Use the ‘Address’ data set given below to solve the following assessments.

Question 1

Note: The number of digits in a pin code should be 6. Any pin code with the number of digits not equal to 6 is considered to be invalid. The first digit of the pin code should not be 0.

Are there any invalid values in column ‘Pincode’ of the ‘Address’ dataset?

No

Yes

Question 2

How many invalid values are present in the ‘Pincode’ column of the ‘Address’ dataset?

4

5

7


8

Question 3

What do you do to the invalid values in the Pincode column to make them valid? Write your approach in 200 words.

Question 4

Correct the invalid values in the Pincode column using the approach suggested in the answer to the previous question.

After correcting the invalid values in the Pincode column, find out the number of rows with the Pincode ‘421601’. Select the correct option.

30

35

40

45

Filtering Data

You have learnt how to fix invalid values. Let’s now talk about how to filter data.

After you have fixed the missing values, standardised the existing values, and fixed the invalid values, you get to the last stage of data cleaning. Though you have a largely accurate data set by now, you might not need the entire data set for your analysis. It is important to understand what you need to infer from the data and then choose the relevant parts of the data set for your analysis. Thus, you need to filter the data to get what you need for your analysis.

Let’s summarise what you learnt about filtering data. You could use this as a checklist for future data cleaning exercises.

  • Deduplicate data: Remove identical rows, and remove rows where some columns are identical.
  • Filter rows: Filter by segment and date period to get only the rows relevant to the analysis.
  • Filter columns: Pick the columns that are relevant to the analysis.
  • Aggregate data: Group the data by the required keys and aggregate the rest.
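The course carries out these steps in Excel; a minimal pandas sketch of the same checklist, on made-up movie data, could look like this.

```python
import pandas as pd

# Hypothetical movies data with a duplicate row.
df = pd.DataFrame({
    "title":    ["A", "A", "B", "C"],
    "language": ["English", "English", "Hindi", "English"],
    "gross":    [100, 100, 50, 75],
})

# Deduplicate data: remove identical rows.
df = df.drop_duplicates()

# Filter rows: keep only the segment relevant to the analysis.
english = df[df["language"] == "English"]

# Filter columns: pick only the columns the analysis needs.
english = english[["title", "gross"]]

# Aggregate data: group by the required key and aggregate the rest.
print(df.groupby("language")["gross"].sum())
```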

Use the Movies data set given below to solve the following assessments.

Remove Duplicate Rows

Delete the duplicate rows in the ‘Movies’ data set given above, and find the number of rows in the final data set including the header row.

1,255

4,999


5,890

3,444

Value to Impute

There are some blanks in the ‘language’ column of the Movie data set given above. What could be the most preferred value to impute in those blanks?

Chinese

English

Hindi

Spanish

How to Fill Blanks

In the language column of the Movies data set, there are some blanks. You have to impute the blanks with ‘English’ in the data set. List the steps to fill the blanks.

Data Cleaning Checklist

  • Standardise units: Ensure all observations under a variable have a common and consistent unit, e.g. convert lbs to kgs, miles/hr to km/hr, etc.
  • Scale values if required: Make sure the observations under a variable have a common scale.
  • Standardise precision for better presentation of data, e.g. 4.5312341 kgs to 4.53 kgs.
  • Remove outliers: Remove high and low values that would disproportionately affect the results of your analysis.

A data set can contain invalid values in various forms. Some of the values could be truly invalid, e.g. a string “tr8ml” in a variable containing mobile numbers would make no sense and hence would be better removed. Similarly, a height of 11 ft would be an invalid value in a set containing heights of children. On the other hand, some invalid values can be corrected. E.g. a numeric value with a data type of string could be converted to its original numeric type.

  • Convert incorrect data types: Correct the incorrect data types for ease of analysis. E.g. if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections are — string to number: “12,300” to “12300”; string to date: “2013-Aug” to “2013/08”; number to string: “PIN Code 110001” to “110001”; etc.
  • Correct values that go beyond range: If some of the values are beyond the logical range, e.g. temperature less than -273° C (0° K), you would need to correct them as required. A close look would help you check if there is scope for correction, or if the value needs to be removed.
  • Correct values not in the list: Remove values that don’t belong to a list. E.g. in a data set containing blood groups of individuals, strings “E” or “F” are invalid values and can be removed.
  • Correct wrong structure: Values that don’t follow a defined structure can be removed. E.g. in a data set containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid value.
  • Validate internal rules: If there are internal rules, such as a date of a product’s delivery must definitely be after the date of the order, they should be correct and consistent.
  • Deduplicate data: Remove identical rows, and remove rows where some columns are identical.
  • Filter rows: Filter by segment and date period to get only the rows relevant to the analysis.
  • Filter columns: Pick the columns relevant to the analysis.
  • Aggregate data: Group by the required keys and aggregate the rest.

In this session, you learnt how to recognise various data quality issues, such as:

  1. Formatting errors (e.g., ill-formatted rows and columns, unclear names, etc.), 
  2. Missing values
  3. Repeated rows
  4. Spelling inconsistencies, etc.

The techniques to clean data are as follows:

  1. Fix rows and columns
  2. Fix missing values
  3. Standardise values
  4. Fix invalid values
  5. Filter data

Election Data: Case Study

The Election Commission of India (ECI) is the constitutional authority responsible for conducting elections in India. The ECI website is a storehouse of information about the candidates who contest the elections and also those who finally come to power.

The ECI publishes statistics reports of every election, containing a large amount of data. Although the data is given in an easy-to-read format, it is not easy to analyse; it has to be extracted and cleaned before it can be analysed. 

Although data cleaning is a time-consuming process, it is equally rewarding in the later stages of the analysis.

Erroneous data will lead to an inaccurate analysis, which could undermine your ability as a business analyst.

Checklist for fixing rows

  • Delete summary rows: Total, subtotal rows
  • Delete incorrect rows: Header rows, footer rows
  • Delete extra rows: Column number, indicators, blank rows, page number

Checklist for fixing columns

  • Merge columns for creating unique identifiers if needed, e.g., merge state and city details into the full address.
  • Split columns for more data. Split the address to get state and city details so that you can analyse each separately.
  • Add missing column names.
  • Rename columns consistently, with abbreviations and encoded columns.
  • Delete unnecessary columns.
  • Align misaligned columns; the data set might have shifted columns.

Data Description

Given a data set, the first step is to understand what it contains. Information about a data set can be gained simply by looking at its metadata. Metadata, in simple terms, is the data that describes each variable in detail. Information such as the size of the data set, how and when the data set was created, what the rows and variables represent, etc. are captured in the metadata.  

Types of Variables

You learnt the difference between ordered and unordered categorical variables.

  • Ordered ones have some kind of ordering in them. Some examples are:
    • Salary = High-Medium-Low
    • Month = Jan-Feb-Mar
  • Unordered ones do not have the notion of high-low, more-less, etc. Examples:
    • Type of loan taken by a person = home, personal, auto, etc.
    • Organisational role of a person = sales, marketing, HR, etc.

Apart from the two types of categorical variables, the other most common type is quantitative variables. These are simply numeric variables, which can be added up, multiplied, divided, etc. For example, salary, number of bank accounts, runs scored by a batsman, the mileage of a car, etc.

So far, we have discussed the following types of variables:

  1. Categorical variables
    • Unordered 
    • Ordered
  2. Quantitative / numeric variables

Categorical Variables

A survey variable called “StateofMind” contains values 1 to 4 denoting the different states of mind of the respondents.

StateofMind    Meaning
1              Confused
2              Happy
3              Depressed
4              Confident

Which of the following categories would the variable “StateofMind” fall into?

Ordered categorical

Unordered categorical

After getting a basic understanding of the variables in the data set, the next step is to gain insights from the data. In the next few lectures, you will learn how to perform univariate analysis on the following types of variables:

  • Categorical variables
    • Unordered categorical
    • Ordered categorical
  • Quantitative variables

Unordered Categorical Variables – Univariate Analysis

Now, let’s move to the most interesting part of EDA: getting useful insights from the data. So far, you have seen two types of variables: categorical (ordered / unordered) and quantitative (or numeric). In this segment, you will learn how to perform univariate analysis on unordered categorical variables. 

Before that, think a bit about how you would perform analysis on such a variable.

Unordered Categorical Variables – Univariate Analysis

You have worked with some unordered categorical variables in the past, for example:

  • The State in the Avalon MegaMart dataset
  • The City in the Avalon MegaMart dataset

Now, imagine someone (say a client) gives you the ‘Countries’ dataset (that is attached above) containing only an unordered categorical variable (and nothing else!), such as a column of size 200 named ‘country_of_person’ with 30 unique countries, and asks you, “Can you extract anything useful from just this one variable?”

Write down how you would analyse just that variable to get some meaningful insights out of the data. Note that you have only one column to analyse.

Categorising Insights

As seen in the previous question, if you are given only an unordered categorical variable (and nothing else), such as a column of size 200 named ‘country_of_person’ with 30 unique countries, you can count the frequency of each category in the column. What kind of pattern would this insight be categorised into?

Abnormal distribution

Significant outliers

Unknown result

Surprising extreme

It is important to note that rank-frequency plots enable you to extract meaning even from seemingly trivial unordered categorical variables such as country, name of an artist, name of a GitHub user, etc.

The objective here is not to put excessive focus on power laws or rank-frequency plots, but rather to understand that non-trivial analysis is possible even on unordered categorical variables and that plots can help you in that process.

Let us now see how a power law distribution is created in Excel.

Download the employment data set used in the lecture here.

Why plotting on a log-log scale helps

The objective of using a log scale is to make the plot readable by changing the scale. For example, the first ranked item had a frequency of 29000; the second-ranked had 3500; the seventh had 700; and most others had very low frequencies such as 100, 80, 21, etc.  The range of frequencies is too large to fit in the plot.

Plotting on a log scale compresses the values to a smaller scale, which makes the plot easy to read.

This happens because log(x) is a much smaller number than x. For example, log(10) = 1, log(100) = 2, log(1000) = 3 and so on. Thus, log(29000) is now approx. 4.5; log(3500) is approx. 3.5; and so on. What was earlier varying from 29000 to 1 is now compressed between 4.5 and 0, making the values easier to read on a plot.
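To make this concrete, here is a minimal sketch of a rank-frequency plot on a log-log scale in Python (the course builds it in Excel); the category counts below are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical unordered categorical column, e.g. 'country_of_person'.
s = pd.Series(["IN"] * 290 + ["US"] * 35 + ["UK"] * 7 + ["DE"] * 3 + ["FR"] * 1)

# Frequency of each category, sorted so that rank 1 is the most frequent.
freq = s.value_counts()
rank = np.arange(1, len(freq) + 1)

# On a log-log scale, a roughly straight line suggests a power-law-like
# distribution; the log compresses 290 vs 1 into a readable range.
plt.loglog(rank, freq.values, marker="o")
plt.xlabel("Rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.title("Rank-frequency plot")
plt.show()
```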

To summarise, the major takeaways from this lecture are:

  • Plots are immensely helpful in identifying hidden patterns in data. 
  • It is possible to extract meaningful insights from unordered categorical variables using rank-frequency plots.
  • Rank-frequency plots of unordered categorical variables, when plotted on a log-log scale, typically result in a power law distribution.

Ordered Categorical Variables – Univariate Analysis

You have learnt how to perform univariate analysis on unordered categorical variables. Let’s now see how to do the same on ordered categorical variables.

Before that, think about the type of analysis you would perform on ordered categorical variables.

Ordered Categorical Variables

You have already worked with ordered categorical variables before: There is a certain order or notion of ‘high-low’, ‘before-after’, etc. among the categories. For example, days of the week (Monday comes before Tuesday), grades of students (A is better than B), number of overs bowled by a bowler (3, 4, 9), etc.

Which of the following are other examples of ordered categorical variables? Choose all the correct options.


Dates in a year, example, 2 January, 15 March, etc.


Star rating of a restaurant on Zomato on a scale of 1-5

Type of education board: ICSE, CBSE, State Board, etc.

Country names

Ordered Categorical Variables – Univariate Analysis

Imagine someone gives you an ordered categorical variable such as the ‘number of centuries scored by batsmen across the world’ and asks you to extract interesting insights. 

How will you approach the task? List down a few steps in 100 words.

Let’s look at a few more examples of univariate analysis revealing hidden patterns.

I’m sure you remember the sleepless nights that exams gave you. For a student, the examiner is, most of the time, an antagonist who prevents you from getting the scores you deserve. You might also have been intrigued by questions such as how many students obtained marks similar to yours, how many students were ahead of you, or how many lagged behind. And everyone has an opinion regarding when and where grace marks are justified.

So, let’s use this typical student experience to help you learn univariate analysis. Let us now look at an interesting analysis conducted on the marks scored by class X students across Tamil Nadu.

Tendulkar Batting – Univariate Analysis

The dataset given below contains Sachin Tendulkar’s ODI batting statistics from about 1989 to 2012. Download the dataset and attempt the following questions.

The variables ‘Runs’ and ‘4s’ respectively represent the number of runs scored and the number of 4s hit by Tendulkar in each match. These are two ordered categorical variables that you’ll analyse in this exercise. 

‘Runs’, which should ideally be a numeric variable, contains some entries such as DNB (Did Not Bat), 8* (Not out at 8 runs), etc. You’ll need to clean these quality issues. 

Similarly, in ‘4s’, there are some missing entries represented by ‘-‘. You will need to clean these as well.
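If you choose to do this cleaning in Python rather than Excel, a minimal pandas sketch of the steps described above might look like this (the five rows below are made up; the real dataset is the one attached).

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the batting data with the issues described above.
df = pd.DataFrame({"Runs": ["8*", "DNB", "100", "TDNB", "45"],
                   "4s":   ["1", "-", "10", "-", "-"]})

# Delete the rows where the batsman did not bat.
df = df[~df["Runs"].isin(["DNB", "TDNB"])].copy()

# Remove the '*' (not out) marker and convert 'Runs' to numbers.
df["Runs"] = df["Runs"].str.replace("*", "", regex=False).astype(int)

# Treat '-' in '4s' as missing, then convert the column to a numeric type.
df["4s"] = pd.to_numeric(df["4s"].replace("-", np.nan))

print(df["Runs"].sum())
```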

Download the Tendulkar ODI Batting data set below to answer the following questions.

Data Quality Issues in ‘Runs’

In the ‘Runs’ column of the given dataset, identify all the data quality issues. Choose all the correct options. 

Some values contain a ‘*’, such as 8*.


Some values are ‘DNB’.


Some values are ‘TDNB’


Some values are extremely large and represent an anomaly (maybe a data entry issue).

Delete the Rows

What is the sum of the Inns column after deleting all the rows with the value as DNB or TDNB?

459

926

867

444

Data Format

This question is a continuation of the previous assessment. After deleting the rows with ‘DNB’ or ‘TDNB’ in the Runs column, some values in the Runs column contain a ‘*’, such as 8*. Remove the * and convert the cell to number format, e.g., convert 8* to 8. After performing these steps, what is the sum of the Runs column?

Note: You can create a new column.

11519

10519

11019

1519

You have seen how to conduct univariate analysis on categorical variables. Next, you will learn to conduct univariate analysis on quantitative (or numeric) variables.

Most Often Runs

Using the data set containing Sachin Tendulkar’s ODI batting statistics from about 1989 to 2012, find the bucket of size 10 in which he scored runs most often. Select the corresponding bucket from the following options.

Hint: Use a histogram

0-10 runs


10-20 runs

20-30 runs

30-40 runs

Quantitative Variables – Univariate Analysis

You have learnt how to conduct univariate analysis on categorical variables. Now, let’s look at quantitative or numeric variables.

Prerequisites

In this segment, Anand will take you through various summary metrics. Knowledge of these concepts is essential for this topic and the forthcoming topics in other modules, so make sure that you familiarise yourself with them before moving ahead.

Before going further, please go through some basics of statistics given below.

Mean: This is the sum of all the data values, divided by the total number of sample values.

  • Suppose you want to compare two cricket batsmen based on the number of runs that they score for their teams in every match. Mean is the best way to measure the central tendency in this case. It is commonly represented by the symbol 𝝁.    
  • Mean can be calculated using the Excel function AVERAGE(A1:A20) if the data is distributed over A1:A20 in the Excel workbook.
  • For instance, consider 5 people whose scores in a Mathematics test are shown below.

    Student name    Score (out of 20 marks)
    Raj             12
    Pawrush         14
    Srijan          19
    Anjali          20
    Anamika         20

In the above example, the mean value would be the sum of all the score values (85) divided by the number of values (5), which is 17.

Mode: In your sample data, the value that has the highest frequency is the mode.

  • In the case of categorical data, it is not possible to measure the mean for a central tendency. This is because mathematical operations cannot be performed on categorical data.
  •  If you consider democratic elections, the decision is made on the basis of who received the maximum number of votes. Essentially, the mode wins in this case.   
  • Note: There can be more than one mode in a sample. For instance, there can be elections in which three parties participate, two of those get 40% of the votes each, and the third party gets 20% of the votes. In this case, there are two modes since two parties have the highest (equal) number of votes.

Median: If you arrange the sample data in ascending order of value, from left to right, the value in the middle is called the median.

  • The reason why it is a good measure of central tendency is that the number of samples is the same on both its sides (left and right).
  • When there are extremes or outliers in a sample of numerical data, the median is a better measure of central tendency.
  • Note: For an odd number of data points, the middle value is the median; for an even number of data points, the median is the average of the two middle values.
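The course computes these metrics in Excel (e.g. with AVERAGE); for reference, here is a minimal Python sketch using the five test scores from the example above.

```python
import pandas as pd

scores = pd.Series([12, 14, 19, 20, 20])

print(scores.mean())    # 17.0 -> sum of values (85) divided by their count (5)
print(scores.median())  # 19   -> middle value once the data is sorted
print(scores.mode())    # 20   -> most frequent value; there can be more than one
```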

Let’s now learn how to analyse quantitative variables.

Mean and median are single values that give a broad representation of the entire data. As Anand clearly stated, it is very important to understand when to use these metrics to avoid inaccurate analysis.


While the mean gives an average of all the values, the median gives a typical value that could be used to represent the entire group. As a simple rule of thumb, always question someone if they use the mean, because the median is almost always a better measure of ‘representativeness’.

Let’s now look at some other summary descriptive statistics such as mode, interquartile distance, standard deviation, etc.

Standard deviation and interquartile difference are both used to represent the spread of the data.

Interquartile difference is a much better metric than standard deviation if there are outliers in the data. This is because the standard deviation will be influenced by outliers, whereas the interquartile difference will simply ignore them.
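A tiny numeric demonstration of this point, with made-up marks: adding a single extreme outlier inflates the standard deviation but leaves the interquartile difference untouched.

```python
import pandas as pd

clean = pd.Series([48, 55, 66, 78, 87])
with_outlier = pd.Series([48, 55, 66, 78, 870])  # one extreme value

for s in (clean, with_outlier):
    iqr = s.quantile(0.75) - s.quantile(0.25)
    print(f"std = {s.std():.1f}, IQR = {iqr:.1f}")
# The standard deviation jumps when the outlier appears,
# while the interquartile difference stays the same.
```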

You also saw how box plots are used to understand the spread of data.

 

Comprehension 1

Consider the following marks for a course exam (out of 100).

Percentile     minimum   10   25   50   75   90   maximum
Final Marks    10        48   55   66   78   87   93

Check whether the statements given in the assessments below are true or false.

Percentile

About 1/4th of the class received a score of 55 or less.

True


False

Percentile

About 3/4 of the class received a score of 78 or less.

True


False

Comprehension 2

Using this chart, answer the following questions.

Question 1

What is the approximate value of the 25th percentile (Bottom Quartile) of Plot 2 in the chart given above?

33


61

18

80

Question 2

What is the approximate value of the 75th percentile (Top Quartile) of Plot 2 in the chart given above?

33

61


18

80

Question 3

What is the approximate value of the median of Plot 2 in the chart given above?

33

51


18

80

Question 4

In the chart given above, which plot has the least variation?

Plot 1

Plot 2

Plot 3

Plot 4

Quantitative Variables – Summary Metrics

Let’s take some time to understand why summary metrics such as mean and standard deviation are not representative of the data in some cases. 

Comprehension:

Let’s say there is a company called ‘The News Company’, and the marketing team of the company is publishing some news articles and finds that the number of times an article is shared is not that high. So, they approach you for a solution to increase the number of times an article is shared.

  1. Let’s say that you performed an adequate amount of research about the company, interviewed the company’s employees, formulated a hypothesis and found that the company published articles on weekdays. You verified this hypothesis through the data given by the company.
  2. Now, you are looking for the solution to the problem and think that the publishing date of the article affects the number of times it is shared, and let’s say you have also formulated the following hypotheses:
    • On average, a higher number of articles are shared on weekdays than weekends.
    • Among the weekdays, articles published on Wednesdays get shared more often than on any other weekday.
    • Articles of the type (or channel) ‘lifestyle’ and ‘social media’ are shared more often than the other types of articles on average.
  3. So, now you want to test whether the hypotheses mentioned above are correct by analysing the data on the news articles published in the past.
    • The dataset named ‘News Popularity’ containing data regarding the past news articles is given below.
    • Use this dataset to answer the following questions.

You can download the EDA News Popularity data set below.

News Popularity Data Set – Reading Box Plots 

You saw that quartiles are a better measure of the spread than the standard deviation. A good way to visualise quartiles is to use a box plot. The 25th percentile is represented by the bottom horizontal line of the box, the 75th percentile is the top line of the box, and the median is the line between them.
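If you want to verify what a box plot shows, a minimal pandas sketch can help: quantile() returns exactly the three lines of the box (the share counts below are made up, not taken from the News Popularity data set).

```python
import pandas as pd

# Hypothetical share counts for a handful of articles.
shares = pd.Series([120, 900, 1400, 1500, 2100, 3300, 52000])

# The three lines of a box plot: 25th percentile, median, 75th percentile.
print(shares.quantile([0.25, 0.50, 0.75]))

# describe() reports the same quartiles along with mean, std, min and max.
print(shares.describe())
```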

The box plots given below show the variation in the number of shares for a news article (on the y-axis) against three variables — weekend, weekday and the channel type. Read the charts carefully and answer the questions that follow.

Note: Right-click on the image and open it in a new tab to see the data on the image more clearly.

1. Shares vs Weekend  


2. Shares vs Weekday


3.  Shares vs Channel Type


Recall some of the metrics you learnt in this session and answer the questions below.

Shares vs Weekend

The variation in the number of shares is larger for:

Weekends


Weekdays

Shares Across Weekdays

Does publishing an article on different weekdays lead to a significant difference in the number of article shares?
Hint: We can decide this by analysing the number of times the articles were shared in the past.

Yes

No

Shares Across Channel Types

Let’s say the marketing team of ‘The News Company’ tells you that they have four types of articles: Social Media, Technology, Lifestyle and World, and that they are allowed to publish only one article at a time. Therefore, they ask you to suggest one article type to publish such that the number of shares increases. Pick your choice from the following options.

Social Media


Technology

Lifestyle

World

Weekend Popularity

Are the articles published over the weekend significantly more popular?
Hint: Answer this by analysing the shares vs weekend box plot given above.

No

Yes

Mode

What is the most frequently occurring number of keywords in the News Popularity dataset?
Hint: You can get this by analysing the ‘num_keywords’ variable in the News Popularity dataset.

7

6

10

8

Average

What is the average number of times the articles in the data set were shared?

2427

7322

3395


5000

Median

What is the median value of the shares?

1,500

1,400


7,322

3,598

Median vs Average

Why do you think there is a huge difference between the mean and median of the shares? Which of these metrics is more representative of the shares? Write your answer in 200 words.

Standard Deviation

What is the standard deviation of shares?

95075

11626

29456

34456

Percentile

What is the 87th percentile of shares?

4000

5000

6000

7000

Summary

You learnt about the various elements of univariate analysis, such as data distribution plots and summary metrics.

Let’s summarise what you learnt:

  1. Metadata description describes the data in a structured manner. You should make a habit of creating a metadata description for whatever data set you are working on. Not only will it serve as a reference point for you but it will also help other people understand the data better and thus save time.
  2. Distribution plots reveal interesting insights about data. You can observe various visible patterns in the plots and try to understand how they came to be.
  3. Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be used everywhere. Thus, it is important to understand the data and then choose what metric to use to summarise the data.

Introduction to Segmented Univariate Analysis

You have already seen some segmented univariate analysis in the previous lectures, such as in the news popularity example, where you wanted to test the following three hypotheses:

  1. On average, a higher number of articles are shared on weekdays than weekends.
  2. Among the weekdays, articles published on Wednesdays get shared more than on any other weekday.
  3. Articles of the type (or channel) ‘lifestyle’ and ‘social media’ are shared more than the other types on average.

In this case, the categorical variables used for grouping were ‘channel type’, ‘day of the week’, etc. Across these categories or groups, you then performed segmented univariate analysis to compare the average number of shares across days, channel types, etc.

Segmented Analysis

How is segmented univariate analysis different from the classic univariate analysis that you learnt in the previous session?

Segmented univariate analysis allows you to compare subsets of data, which is a powerful technique because it helps you understand how a relevant metric varies across different segments.

✓ Correct

Apart from observing summary statistics, segmented univariate analysis also helps you observe the variables visually.

Segmented univariate analysis also considers unordered categorical variables, whereas simple univariate analysis is performed only on continuous/numeric or ordered categorical variables.

Basis of Segmentation

In the previous lecture, you got glimpses of segmented univariate analysis. Now, let’s go deeper into the segmentation process, which can be divided into four parts:

  1. Take raw data
  2. Group it by dimensions
  3. Summarise using a relevant metric such as mean, median, etc.
  4. Compare the aggregated metric across groups/categories

Let’s see how Anand explains the segmentation process.

Note: At 2:42, to get the ‘Count of Maths%’ do the following.

  1. Click on the already created pivot table of ‘Average of Maths%’.
  2. On the right, you will see ‘PivotTable Fields’, drag and drop the field ‘Maths..’ to values.
  3. Go to values, then click on the drop-down box with label ‘Count of Maths..’ (Once you drag ‘Maths..’ field to values, it converts automatically to ‘Count of Maths..’), select ‘Value Field Settings..’ and choose the type of calculation as ‘Max’.

To summarise, the standard process of segmented univariate analysis is as follows:

  • Take raw data
  • Group by dimensions
  • Summarise using a relevant metric like mean, median, etc.
  • Compare the aggregated metric across groups/categories
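The lectures build this with Excel pivot tables; an equivalent minimal pandas sketch of the four steps, on made-up NAS-style data, could be:

```python
import pandas as pd

# 1. Take raw data (hypothetical marks with a grouping dimension).
df = pd.DataFrame({
    "Watch.TV": ["Daily", "Daily", "Never", "Weekly", "Never"],
    "Maths":    [62, 58, 71, 65, 69],
})

# 2. Group it by a dimension, and 3. summarise with relevant metrics.
segmented = df.groupby("Watch.TV")["Maths"].agg(["mean", "median", "count"])

# 4. Compare the aggregated metric across groups/categories.
print(segmented.sort_values("mean", ascending=False))
```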

With this, you have now performed segmented univariate analysis on a few variables, but what if you have a large number of variables in your data set? How would you go about analysing and explaining the results of hundreds of categorical variables to your client? Let’s see what such a table would look like in the next lecture.

Comprehension: National Achievement Survey

Let’s say you are working under India’s Ministry of Human Resource Development (MHRD) and are given the dataset containing the results of the National Achievement Survey (NAS) conducted by NCERT. Your job is to analyse the data and get some good insights. The NAS dataset is given below; answer the following questions by analysing the dataset.
In the examples you just saw, the variables used for grouping were gender, the number of siblings, handicap status, etc. The metric for comparison was the average marks in mathematics. Using this process, by choosing various grouping variables and summary metrics, you can answer the following questions.

Note: This data set is subsetted only for the state of Maharashtra. You can download the data from the link below:

Analyse the NAS data set to find answers to the following questions.

Segmented Analysis

What is the impact of the variable ‘Watch.TV’ on Science marks?

Students who watch TV daily get significantly better marks than others.

Students who watch TV once a week get significantly better marks than others.

The Watch.TV variable does not distinguish the students significantly in terms of Science marks.


Students who never watch TV get significantly better marks than others.

Segmented Analysis

What is the impact of the father’s education on the student’s Maths marks?

Students whose fathers have a degree and above get better marks on average.

Students whose fathers are uneducated get better marks on average.

The father’s education does not have a significant impact on the student’s average maths marks.

Segmented Analysis

What is the impact of ‘Play.Games’ on Reading marks?

Students who play games daily score high in Reading on average.

Students who play games once a week score high in Reading on average.

Students who never play games score high in Reading on average.

Students who play games once a month score high in Reading on average.

Quick Way of Segmentation

In the last lecture, you learnt the process of segmentation. But what if you have a large number of variables in your dataset? It would be very repetitive to perform the same analysis on a large bunch of variables. One way of solving this problem is to make a table with the categorical variables on one axis and the numeric variables (or measures/facts) on the other. Let’s see what such a table would look like.

Comprehension: National Achievement Survey Analysis

You saw how various factors have an impact on a student’s performance and marks. Let’s analyse how segmented univariate analysis can help you understand the factors affecting class VIII marks in different states. You are required to go to the website given below, play around with the filters (gender, poverty, state, siblings, etc.) and answer the following questions.

In case the data below doesn’t work, you can access it here:

https://gramener.com/nas

Note that the data used for developing this exercise is different from the dataset provided to you in the previous lectures.

National Achievement Survey

In Madhya Pradesh, which level of the father’s occupation is most likely to improve percentage in Maths?

Teacher/Lecturer

Farmer

Skilled worker

Professional

National Achievement Survey

In Madhya Pradesh, can you say that the mother’s occupation plays a significant role in improving the students’ total marks in class VIII?

Yes


No

National Achievement Survey

Among the students across India, which factor has the highest impact on the total marks?

Father occupation

Father edu

Mother edu

Mother occupation

Once you are done with segmentation, the next step is to compare your results within the category. You can either compare the means, or you can go for other descriptive statistics such as median, max, min, etc. We will discuss this in the next lecture.

Comparison of Averages


By now, you know how to group the data by categorical variables and compare the averages. But you should be careful while comparing averages, especially if the difference in average values is small. Let’s see why this is important.

You would have noticed that the two data sets created by Anand have different distributions of the scores of boys and girls. In the first data set, every girl scored higher marks than every boy. The difference in averages is still 1, but in this case, you can say that girls get higher scores than boys.

Fig-1: Averages

On the other hand, in the second data set, the difference in averages was 1 again, but it is difficult to conclude that girls score higher than boys since the range of scores is much wider. Now, the difference is not as significant as in the previous case, since the variation in scores may cause small differences in the mean due to randomness as well.

Averages

Tip-1

“Don’t blindly believe in the averages of the buckets — you need to observe the distribution of each bucket closely and ask yourself if the difference in means is significant enough to draw a conclusion. If the difference in means is small, you may not be able to draw inferences. In such cases, a technique called hypothesis testing is used to ascertain whether the difference in means is significant or due to randomness.” Don’t worry if you do not fully grasp the concept of hypothesis testing; it will be dealt with separately in the hypothesis testing module.

In case you have not downloaded the National Achievement Survey dataset, you can download it from the link below.

Segmented Analysis

You are supposed to use the given sample data set of class VIII students of Maharashtra state to attempt the following questions.

The variable ‘Solve.Maths’ has three levels: agree, disagree, neither. These levels indicate whether the child solves maths problems regularly.

Is there any significant difference in the mean percentage maths marks between the students who agree and those who disagree on ‘Solve.Maths’? Write your answer in 200 words.

Comparison of Other Metrics

Once you have identified the variables based on the business problem for analysing the segments, the next step is to know the distribution of segments and compare the average of each segment. But this is not the only way of comparing segments. There are various metrics that you can use to understand and explain your data easily.

Let’s see how Anand compares the other metrics.

So far in this session, you have learnt new metrics, concepts and techniques for performing segmented univariate analysis on the variables based on business understanding. In conclusion, the three steps of segmented univariate analysis are as follows:

  • Basis of segmentation
  • Comparison of averages
  • Comparison of other metrics

Besides finding the segments and comparing the metrics, your primary focus should be on understanding the results arising from the segments.

You can go through this link to understand how to read a box plot.

Comprehension: VIII Class Scores

In continuation of the earlier example of class VIII marks, let’s say now your task is to analyse the variation in ‘Reading’ percentage on the basis of Mother’s education level. The boxplot representing the variation in ‘Reading’ percentage with Mother’s education level is given below. By analysing the dataset and the chart given below, answer the following assessments.

The first boxplot in the image given below represents the variation in ‘Reading’ percentage with ‘Mother.edu’, which is blank in the dataset.

Note: For a clear view of the image given below, please open the image given below in the new tab.

Fig-1: Mothers_edu

Boxplot

Across categories of mother’s education level: ‘Degree & Above’ and ‘Illiterate’, the 75th percentile varies more than the 25th percentile. Is this true or false?
Hint: Observe the boxplot given above.

True

False

Boxplot

Children whose mothers have a degree or above are less likely to get better marks in reading than other children.
Hint: Observe the boxplot given above

True


False

Class Opinion

List down three observations from the box plot. Consider comparing the 25th percentile, the median and the 75th percentile across the mother’s education level. Write your answer in the text box below.

Bivariate Analysis on Continuous Variables

In the last session, you learnt how to perform segmented univariate analysis, e.g., how gender or father’s education impacts a student’s percentage in science, maths and reading. But you can also use data for several other purposes. Often, you can get interesting insights by analysing pairs of continuous variables at a time. For example, how the sales figures depend on marketing spends, or, for that matter, how any two continuous variables depend on each other. But is there a way or concept to identify the relationship between two variables?

Let’s start with the bivariate analysis on pairs of continuous variables to answer these questions.

To summarise, correlation is a number between -1 and 1, which quantifies the extent to which two variables ‘correlate’ with each other.

  • If one variable increases as the other increases, the correlation is positive.
  • If one variable decreases as the other increases, the correlation is negative.
  • If one variable stays constant as the other varies, the correlation is zero.

In general, a positive correlation means that two variables will increase together and decrease together, for example, an increase in rain is accompanied by an increase in humidity. A negative correlation means that if one variable increases the other decreases, for example, in some cases, as the price of a commodity decreases, its demand increases.

A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one variable moves up or down, the other moves in the same direction in a fixed proportion. Similarly, a perfect negative correlation means that the two variables move in opposite directions in a fixed proportion, while a zero correlation implies no relationship at all.

Comprehension: Correlation

Let us take an example of gold and silver prices in Indian currency per troy ounce (= 31.1 grams). Do you expect gold and silver prices to be correlated? Let’s find out. The dataset provided below contains the gold and silver prices across various years and months.
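
As a rough sketch of how you might compute these correlations in pandas (the file name and column names below are assumptions; adjust them to match the actual file):

    # A minimal sketch, assuming columns named 'Year', 'Gold' and 'Silver'.
    import pandas as pd

    prices = pd.read_csv("gold_silver_prices.csv")  # hypothetical file name

    # Correlation across all years and months
    print(round(prices["Gold"].corr(prices["Silver"]), 2))

    # Correlation within a single year, e.g. 2008
    p08 = prices[prices["Year"] == 2008]
    print(round(p08["Gold"].corr(p08["Silver"]), 2))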

Bivariate Analysis

What is the correlation between Gold and Silver prices (round off the answer to two decimal places)? Also, enter a fraction not a percentage (e.g. 0.34, not 34.00%).

0.95

Bivariate Analysis

What is the correlation between Gold and Silver prices for the year 2008 (rounded to two decimal places)?

0.01

Bivariate Analysis

After performing bivariate analysis on gold and silver prices in India, do you think that the prices are correlated? Why or why not?

Enter your answer in 200 words.

You can observe this from the correlation between the gold and silver prices found in the first question of the quiz: 0.95, which is close to 1.00.


Fig-1 Correlation Plots

Plot -1

Plot -2

Plot -1 and Plot -2 seem to have the same correlation

None

Business Problems Involving Correlation

There are two things that you would have noticed in the lecture:

  1. The correlation matrix of stock prices of different countries gives a real sense of the relationship between many variables.
  2. The correlated variables are grouped by similarities, and correlation can also be calculated for ‘groups of variables’. This is called ‘clustering’, which you will study in detail in the next course; the idea is to form a hierarchy of similar groups of variables.

Let’s take one example to learn how clustering helps in understanding customer behaviour of ordering food at restaurants.

Comprehension: Correlation Matrix

You saw the correlation matrix of food items in the above lecture. Using the image given below, attempt the following questions.

Correlation Matrix

Correlation Matrix

Let’s say you are working for the improvement of the sales and profit of the restaurant that Anand discussed in the previous video. The restaurant’s management approached you and asked which category would be considered an evergreen one, i.e., the one most likely to be ordered by people who ordered items from the other categories.

Tiffin

Meals

Dessert

Breads

Correlation Matrix

The restaurant has asked you to recommend four items that it should sell together as a pre-packaged meal. Which four items would you recommend that the restaurant should package together, given the data in the correlation matrix? Why did you choose these four items?

Class Opinion

Let’s say you are working for the improvement of the sales and profit of the restaurant that Anand discussed in the previous video. Using the insight you got after solving the first question (desserts are the most likely to be ordered by people who ordered items from the other categories), what suggestions can you give the restaurant to increase its sales? State at least one in the text box given below.

Bivariate Analysis on Categorical Variables

The categorical bivariate analysis is essentially an extension of the segmented univariate analysis to another categorical variable. In segmented univariate analysis, you compare metrics such as ‘mean of X’ across various segments of a categorical variable, e.g., mean marks are higher for students whose mothers’ education is of ‘degree and above’ level; or the median income of educated parents is higher than that of uneducated parents, etc.

In the categorical bivariate analysis, you extend this comparison to other categorical variables and ask — is this true for all categories of another variable, say, men and women? Take another categorical variable, such as state, and ask — is the median income of educated parents higher than that of uneducated ones in all states?

Thus, you are drilling down into another categorical variable and getting closer to the true patterns in data. In fact, you may also go to the next level and ask — is the median income of educated parents higher than that of uneducated ones (variable 1) in all states (variable 2) for all age groups (variable 3)? This is what you may call ‘trivariate analysis’, and though it gives you a more granular version of the truth, it gets a bit complex to make sense of and explain to others (and hence it is not usually done in EDA).

Thus, remember that performing segmented univariate analysis may deceive you into thinking that a certain phenomenon is true without asking the question — is it true for all sub-populations or is it true only when you aggregate information across the entire population?

So, in general, there are two fundamental aspects of analysing categorical variables:

  1. To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who play games, you can make a ‘cross table’ as given below:
Fig-1: Cross Table
Fig-1: Cross Table

From this table, first, you can compare boys and girls across a fixed parameter of ‘play games’, e.g., a higher number of boys play games every day than girls, a higher number of girls never play games compared with boys, etc. And second, you can compare the parameter of ‘play games’ across a fixed value of gender, e.g., most boys play every day and very few play once a month or never.

  2. To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s percentage in science is distributed based on the father’s occupation (categorical variable 1) and the income level (categorical variable 2).

Usually, you do not analyse more than two variables at a time, though there are ways to do that. Machine learning models are essentially a way to do that, some of which you will learn in the next course.
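
As a minimal sketch of the ‘cross table’ described in point 1 above, pandas provides pd.crosstab; the toy data and column names below are illustrative assumptions.

    # Building a cross table of two categorical variables with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "Gender": ["Boy", "Boy", "Girl", "Girl", "Boy", "Girl"],
        "Play.Games": ["Every day", "Never", "Once a month",
                       "Every day", "Every day", "Never"],
    })

    # Counts of each (gender, play-games) combination
    print(pd.crosstab(df["Gender"], df["Play.Games"]))

    # For point 2 (a continuous variable across two categorical ones),
    # pd.pivot_table(df, values="Science", index="Gender",
    #                columns="Play.Games", aggfunc="mean")
    # would give the mean of a 'Science' column per cell, if it existed.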

Practice Questions

The data set attached below contains the exchange prices of various currencies in the month of July 2016. We’ve taken this data set from the International Monetary Fund website.

You can download the dataset from the currencies link.

Comprehension:

Let’s say you work for a trading company and want to analyse the correlations between currencies using this data. The currencies you want to analyse are Euro, Japanese Yen, UK Pound Sterling, US Dollar, Australian Dollar and Indian Rupee. Therefore, analyse the dataset and answer the following questions.
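
One possible way to approach these questions in pandas is sketched below; the file name and column names are assumptions and should be matched to the downloaded file.

    # A minimal sketch of a correlation matrix over the chosen currencies.
    import pandas as pd

    rates = pd.read_csv("currencies_july_2016.csv")  # hypothetical file name
    cols = ["Euro", "Japanese Yen", "UK Pound Sterling",
            "US Dollar", "Australian Dollar", "Indian Rupee"]

    corr = rates[cols].corr()
    print(corr.round(2))

    # Currency most correlated with the Indian rupee (excluding itself)
    print(corr["Indian Rupee"].drop("Indian Rupee").idxmax())

    # Most negatively correlated pair, relevant to risk diversification
    print(corr.unstack().idxmin())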

Correlation

Video Solution

The Indian rupee is the most correlated with:

Australian dollar

US dollar

Japanese Yen

Euro

Correlation

Let’s say you have to invest in two currencies so that if you lose in one currency, you will gain in the other. For this to happen, you should invest in currencies that are ___________ correlated.

Positively

Negatively

Correlation Matrix

A good risk mitigation practice in investing is ‘risk diversification’. In this context, it means that you should invest in two currencies such that if one goes up the other is likely to go down, thereby minimising the risk of losing all your money.

Which two currencies, among the given seven currencies, would you invest in?

Japanese Yen and US dollar

Kuwaiti Dinar and Botswana Pula

Japanese Yen and UK Pound Sterling


Chinese Yuan and Euro

Summary

In bivariate analysis, you learnt broadly two types of analyses:  

  1. Bivariate analysis for continuous variables
    • Additional Readings: You can read about correlation in detail here.
  2. Bivariate analysis for categorical variables

Class Opinion

Now you’ve learnt about bivariate analysis in more depth and applied this analysis technique to multiple scenarios and data sets.

Which two variables from your industry would you like to analyse, or have you already analysed, using this technique?

What are Derived Metrics?

Sometimes, you would not get the most valuable insights by analysing the data available to you. You often need to create new variables using the existing ones to get meaningful insights.


New variables can be created based on your business understanding, or they can be suggested by your clients. Let’s understand how business understanding plays an important role in deriving new variables.

To summarise, you saw two examples of different problems — analysing restaurant sales and understanding the marks distribution of class VIII students.

In the first example, the ‘day of the week’ was a derived variable which was not provided in the original data set (only the date was provided). Using this new variable (day of the week), you can ‘drill down’ to compare the total sales on each day of the week and present the results on the calendar, as shown in figure 1 below.

Fig-1: Restaurant Sales

In the second example, by plotting the marks against the ‘month of birth’ (a derived variable), it was observed that the children born after June would have missed the admission cutoff by a few days and, therefore, got admission almost a year later, at an older age. The ones born after June (e.g., in July, August, etc.) were intellectually and emotionally more mature than their peers because of their older age, resulting in better performance.

Fig-2: Marks Distribution

This is a classic example of how derived metrics can help you discover unexpected insights.
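
A minimal sketch of deriving both variables with pandas is shown below; the column names and the mm/dd/yyyy format are assumptions.

    # Deriving 'day of the week' and 'month of birth' from date columns.
    import pandas as pd

    sales = pd.DataFrame({"Order.Date": ["12/31/2014", "01/01/2015"]})
    sales["Order.Date"] = pd.to_datetime(sales["Order.Date"], format="%m/%d/%Y")
    sales["Day.Of.Week"] = sales["Order.Date"].dt.day_name()  # e.g. 'Wednesday'

    students = pd.DataFrame({"DOB": ["07/15/2003", "03/02/2004"]})
    students["DOB"] = pd.to_datetime(students["DOB"], format="%m/%d/%Y")
    students["Birth.Month"] = students["DOB"].dt.month_name()  # e.g. 'July'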

So far, we have discussed only how to derive a new variable from the date variable. Are there other types of derived metrics as well? And, is there a process of deriving new metrics?


In the next lecture, you will learn some new types of derived metrics.

Types of Derived Metrics: Type Driven Metrics

In the last lecture, you saw some examples of derived metrics. Now, let’s understand the various types of derived metrics. Watch the following lecture to get an idea of the different types of derived metrics.

Broadly, there are three different types of derived metrics:

  1. Type-driven metrics

  2. Business-driven metrics

  3. Data-driven metrics

Type-Driven Metrics

These metrics can be derived by understanding the variable’s typology. You have already learnt one simple way of classifying variables/attributes — categorical (ordered, unordered) and quantitative or numeric. Similarly, there are various other ways of classification, one of which is Steven’s typology.

Steven’s typology classifies variables into four types — nominal, ordinal, interval and ratio.

  • Nominal variables: Categorical variables, where the categories differ only by their names; there is no order among categories, e.g. colour (red, blue, green), gender (male, female), department (HR, analytics, sales)
    • These are the most basic form of categorical variables.
  • Ordinal variables: Categories follow a certain order, but the mathematical difference between categories is not meaningful, e.g., educational level (primary school, high school, college), height (high, medium, low), performance (bad, good, excellent), etc.
    • Ordinal variables are nominal as well.
  • Interval variables: Categories follow a certain order, and the mathematical difference between categories is meaningful but division or multiplication is not, e.g., temperature in degrees Celsius (the difference between 40 and 30 degrees Celsius is meaningful, but 30 degrees x 40 degrees is not), dates (the difference between two dates is the number of days between them, but 25th May / 5th June is meaningless), etc.
    • Interval variables are both nominal and ordinal.
  • Ratio variables: Apart from the mathematical difference, the ratio (division/multiplication) is possible, e.g., sales in dollars ($100 is twice $50), marks of students (50 is half of 100), etc.
    • Ratio variables are nominal, ordinal and interval type.

Type of Variable

Which type of variable follows a certain order, where the mathematical difference between values is meaningful but division or multiplication is not?

Nominal variables

Ordinal variables

Interval variables


Ratio variables

Ordinal Variables

Which of the following is an ordinal variable (but not interval or ratio)?

The name of a party e.g. BJP, INC, SP

The number of centuries scored by a batsman e.g. 10, 14, 19

The grades of children e.g. A+, A, B+, B, C+, C

The type of loan taken by a person e.g. home, auto, personal, education

Interval Type Variables

Which of the following is an interval type variable, but not ratio?

The name of a module in an UpGrad program e.g. module 1, 2, 3, 4, 5

The speed of cars in kilometres per hour

The grades of children e.g. A+, A, B+, B, C+, C

The designation of employees in a team e.g. junior analyst, senior analyst, team lead, partner

Types of Attributes

Which of the following data would come under the nominal category? Only one option is correct.

Ratings e.g. VERY_SATISFIED, SATISFIED, NOT_SATISFIED

Length expressed in centimetres

Date of birth

Educational degree names such as M.Tech, M.Com, MBA, MSc

Interval Variables

Interval variables can be ordered, and their difference (or the distance between them) is meaningful. Which of the following is an ordinal variable but NOT an interval variable?

Heights of people

Amount of pain experienced by 3 patients — low, bearable, severe


Pressures of 3 gases: 4 bars, 5 bars and 5.4 bars

Class Opinion

Ordinal variables have a notion of order, i.e. you can say that one occurs after the other. One example is the number of stars given in a review on TripAdvisor or Zomato. Can you think of another ordinal attribute? Write your answer in 200 words.

Type Driven Metric/s

Which of the following is/are example(s) of type-driven metric(s)?

Extracting the hour, minutes, AM or PM from the Time

Extracting First name or last name from the Name

Extracting Day, Month, or Year from the Date

All of the above

Types of Derived Metrics: Business Driven Metrics

To summarise, there are three types of derived metrics:

  1. Type-driven
  2. Business-driven
  3. Data-driven

Deriving metrics from the business perspective is not an easy task. It requires decent domain experience. Without understanding the domain correctly, deriving insights becomes difficult and prone to errors.

Let’s try out an exercise. You have been using the National Achievement Survey data set since the beginning of this module. In case you don’t have the data set yet, you can download the National Achievement Survey from the link below to answer the question that follows.

Class Opinion

In the class VIII marks (National Achievement Survey) case study that you saw previously, what new data or metrics can you derive from the data set using the business-driven method? State at least two in the text box given below.

So you learnt how type-driven and business-driven metrics are created and used for deriving useful insights from existing variables. In the next lecture, you will learn about the third type of derived metrics – data-driven metrics.

Question 1

Which of the following is/are the business-driven metric/s?

Extracting Pass/Fail from the Students marks

In the banking sector, let’s say the minimum balance to be maintained in an account is Rs 1,000. Therefore, extracting whether the minimum balance is maintained or not (Yes or No).

All of the above
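
Both options above reduce to a simple business-rule condition; a minimal sketch, assuming a 40-mark pass threshold and hypothetical column names:

    # Business-driven metrics as simple business-rule conditions.
    import numpy as np
    import pandas as pd

    marks = pd.DataFrame({"Maths": [35, 72, 58]})
    marks["Result"] = np.where(marks["Maths"] >= 40, "Pass", "Fail")

    accounts = pd.DataFrame({"Balance": [450, 1500, 999]})
    accounts["Min.Balance.Maintained"] = np.where(
        accounts["Balance"] >= 1000, "Yes", "No")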

Types of Derived Metrics: Data Driven Metrics

To summarise, data-driven metrics can be created based on the variables present in the existing data set. For example, if you have two variables in your data set, such as ‘weight’ and ‘height’, which show a high correlation, then instead of analysing these variables separately, you can think of deriving a new metric, ‘Body Mass Index (BMI)’. Once you have the BMI, you can easily categorise people based on their fitness: by standard norms, a BMI below 18.5 is considered underweight, whereas a BMI above 30.0 is considered obese. This is how data-driven metrics can help you discover hidden patterns in the data.
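
A minimal sketch of this derivation, assuming ‘weight’ in kilograms and ‘height’ in metres (column names, units and toy values are assumptions):

    # Deriving BMI (weight / height^2) and bucketing it into categories.
    import pandas as pd

    people = pd.DataFrame({"weight": [55, 85, 102],
                           "height": [1.60, 1.75, 1.68]})
    people["BMI"] = people["weight"] / people["height"] ** 2

    people["Category"] = pd.cut(
        people["BMI"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["Underweight", "Normal", "Overweight", "Obese"])
    print(people)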

Comprehension:

The ‘Ratings’ data set given below contains various ratings for the items in a restaurant. Use the data set to solve the following assessments.

Derived Metrics

Which of the following is/are Data-driven metric/s?

Deriving the total marks from the given various subject marks

Deriving the run rate from the given total balls and total runs

Deriving profit from the given cost price and selling price

All of the above

Derived Metrics

The ‘Ratings’ dataset given above has 5 columns, among which 3 are ‘rating’, ‘food_rating’ and ‘service_rating’. What new variable can you derive from the data set? State at least one.

Derived Metrics

The ‘Ratings’ dataset given above has 5 columns, among which 3 are ‘rating’, ‘food_rating’ and ‘service_rating’. One of the columns that can be derived from the dataset is ‘avg_rating’, which stores the average of the 3 columns mentioned. Create the ‘avg_rating’ column, round its values to two decimals, and answer the following.
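
One possible sketch of the derivation, assuming the dataset is available as a CSV file (the file name is an assumption):

    # Deriving 'avg_rating' as the rounded row-wise mean of three columns.
    import pandas as pd

    ratings = pd.read_csv("ratings.csv")  # hypothetical file name
    ratings["avg_rating"] = (
        ratings[["rating", "food_rating", "service_rating"]]
        .mean(axis=1)
        .round(2))
    print(ratings["avg_rating"].sum())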

What is the nearest value to the sum of the ‘avg_rating’ column?

4641

1464

3123

2846

Derived Metrics

On which day (Monday, Tuesday, etc.) are the maximum number of orders placed? Select the appropriate option.

Wednesday

Thursday


Friday

Saturday

Derived Metrics

On which date (MM/DD/YYYY) are the maximum number of orders placed?

7/23/2017


3/9/2018

7/19/2018

7/26/2018

Practice Questions

Let’s say you are working for the ICC, and you are given an ODI cricket data set (the link to the same is given below), which contains the performance information for each player who played between 1971 and 2011. Using this data, your task is to find out the player who scored the highest number of centuries, the year in which Indian players scored the maximum number of centuries, and so on. Therefore, analyse the odi_batting data set and complete the following assessments.
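
A minimal sketch of the first two analyses, assuming per-innings rows with columns such as ‘Player’, ‘Runs’, ‘Country’ and ‘MatchDate’ (all names and the file name are assumptions):

    # Counting centuries per player, and per year for Indian players.
    import pandas as pd

    odi = pd.read_csv("odi_batting.csv")  # hypothetical file name
    centuries = odi[odi["Runs"] >= 100].copy()

    # Player with the highest number of centuries
    print(centuries["Player"].value_counts().head(1))

    # Year in which Indian players scored the most centuries
    centuries["Year"] = pd.to_datetime(centuries["MatchDate"]).dt.year
    indian = centuries[centuries["Country"] == "India"]
    print(indian["Year"].value_counts().head(1))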

Cricket Analysis

Choose the correct option/s.

Business-driven metrics can be created from the ‘Runs’ column

Type-driven metrics can be created from the ‘Match Date’ column

Data-driven metrics can be created from the combination of ‘Runs’ and ‘Balls’ columns

Cricket Analysis

Video Solution

Which player scored the highest number of centuries?

Sachin R Tendulkar – 48 centuries

Sachin R Tendulkar – 49 centuries

Ricky T Ponting – 30 centuries

Ricky T Ponting – 35 centuries

Cricket Analysis

Video Solution

In which year were the maximum number of centuries scored by Indian players?

1992

2011

1999

1998

Cricket Analysis

Video Solution

The ‘strike rate’ of a batsman is defined as the average number of runs scored per 100 balls faced. It is a measure of ‘how fast a batsman has played’. For example, if a batsman scores 40 runs in 50 balls, the strike rate is 80 runs/100 balls.

In the given dataset, among all the batsmen who scored a century, who has the highest strike rate?

Shahid Afridi – 255 runs / 100 balls

Mark V Boucher – 221.74 runs / 100 balls

Sanath T Jayasuriya – 206 runs / 100 balls

Shane R Watson – 192 / 100 balls
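
A sketch of the strike-rate comparison, under the same assumed columns as the previous snippet plus a ‘Balls’ column:

    # Strike rate = runs per 100 balls, computed per century innings.
    import pandas as pd

    odi = pd.read_csv("odi_batting.csv")  # hypothetical file name
    tons = odi[odi["Runs"] >= 100].copy()
    tons["StrikeRate"] = tons["Runs"] / tons["Balls"] * 100
    print(tons.sort_values("StrikeRate", ascending=False)
              [["Player", "StrikeRate"]].head(1))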

Let’s say you are analysing the sales of a restaurant (daily, monthly and yearly sales). Your data set contains a date column from which you want to extract a value in order to aggregate the sales based on that column (a day, a month or a year). The date column is in the format “mm/dd/yyyy”, for example, “12/31/2014”.

Now, your task is to check whether there is any significant difference between the number of orders placed on various weekdays (e.g., Monday, Tuesday, etc.), including Saturday and Sunday. Therefore, analyse the orders data set given below and complete the following assessment.
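
The two questions below can be approached roughly as follows (the file name and column names are assumptions):

    # Aggregating orders by date (total cost) and by weekday (order count).
    import pandas as pd

    orders = pd.read_csv("orders.csv")  # hypothetical file name
    orders["Date"] = pd.to_datetime(orders["Date"], format="%m/%d/%Y")

    # Date with the maximum total cost of orders
    print(orders.groupby("Date")["Cost"].sum().idxmax())

    # Weekdays with the maximum and minimum number of orders
    by_weekday = orders["Date"].dt.day_name().value_counts()
    print(by_weekday.idxmax(), by_weekday.max())
    print(by_weekday.idxmin(), by_weekday.min())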

Maximum Sum of Cost

On which date (MM/DD/YYYY) is the sum of the cost of all the orders the maximum?

7/4/2010

12/11/2012

9/19/2010

12/31/2011

No of Orders

On which weekday (Monday, Tuesday, etc., including Saturday and Sunday) are the maximum and the minimum number of orders placed, respectively?

Saturday(806) and Tuesday(761)


Sunday(791) and Wednesday(764)

Sunday(806) and Wednesday(761)

Saturday(791) and Tuesday(764)

So far in this session, you have learnt three types of derived metrics:

  1. Type-driven metrics
  2. Business-driven metrics
  3. Data-driven metrics


Module Summary

Exploratory data analysis (EDA) helps you, as a business analyst, to look beyond data. It is a never-ending process: the more you explore the data, the more insights you derive. As a data analyst, you would spend almost 80% of your time understanding the data and solving various business problems through EDA. Therefore, if you have a good understanding of EDA, half the battle is won.

So far in this module, you have learnt the five most crucial aspects of any kind of analysis, that is:

  1. Understanding domain
  2. Understanding the data and preparing it for analysis
  3. Univariate analysis and segmented univariate analysis
  4. Bivariate analysis
  5. Deriving new metrics from the existing data

Univariate Analysis

As the term “univariate” suggests, this session deals with analysing variables one at a time. It is important to separately understand each variable before moving on to analysing multiple variables together. The agenda of univariate analysis is to understand:

  • Metadata description
  • Data distribution plots
  • Summary metrics

Given a data set, the first step is to understand what it contains. Information about a data set can be gained simply by looking at its metadata. Metadata, in simple terms, is data that describes each variable in detail. Information such as the size of the data set, how and when the data set was created, and what the rows and variables represent is captured in the metadata.

Types of Variables

  • Ordered and unordered categorical variables
    • Ordered ones have some kind of ordering, e.g.:
      • Salary = high, medium, low
      • Month = Jan, Feb, Mar, etc.
    • Unordered ones do not have the notion of high-low, more-less, etc., e.g.:
      • Type of loan taken by a person = home, personal, auto, etc.
      • Organisation of a person = sales, marketing, HR, etc.

Apart from the two types of categorical variables, the other most common type is quantitative variables. These are simply numeric variables which can be added up, multiplied, divided etc. For example, salary, number of bank accounts, runs scored by a batsman, the mileage of a car etc.

Distribution plots reveal interesting insights about the data. You can observe various visible patterns in the plots and try to understand how they came to be.

Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be used everywhere. Thus, it is important to understand the data and then choose what metric to use to summarise the data.

Segmented Univariate Analysis

The broad agenda of “Segmented Univariate Analysis” is as follows:

Basis of segmentation:

The entire segmentation process can be divided into four parts:

  • Take the raw data
  • Group it by dimensions
  • Summarise each group using a relevant metric, such as mean, median, etc.
  • Compare the aggregated metric across groups/categories
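
In pandas, these four steps typically reduce to a single group-summarise-compare operation; a minimal sketch with hypothetical data:

    # Segmented univariate analysis as group-summarise-compare.
    import pandas as pd

    df = pd.DataFrame({
        "Gender": ["M", "F", "M", "F", "M"],
        "Maths": [62, 71, 55, 68, 74],
    })

    # Group by the dimension, summarise with a metric, compare across groups
    print(df.groupby("Gender")["Maths"].agg(["mean", "median"]))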

Comparison of Averages

Don’t blindly believe in the averages of the buckets — you need to observe the distribution of each bucket closely and ask yourself if the difference in means is significant enough to draw a conclusion. If the difference in means is small, you may not be able to draw inferences. In such cases, a technique called hypothesis testing is used to ascertain whether the difference in means is significant or due to randomness. Don’t worry if you do not get the concept of hypothesis testing yet; it will be dealt with separately in the hypothesis testing module.

Bivariate Analysis

  • Bivariate analysis on continuous variables

Correlation is a metric to find the relationship between the variables. It is a number between -1 and 1 which quantifies the extent to which two variables ‘correlate’ with each other.

  • If one increases as the other increases, the correlation is positive.
  • If one decreases as the other increases, the correlation is negative.
  • If one stays constant as the other varies, the correlation is zero.

In general, a positive correlation means that two variables will increase together and decrease together, e.g. an increase in rain is accompanied by an increase in humidity. A negative correlation means that if one variable increases the other decreases, e.g. in some cases, as the price of a commodity decreases its demand increases.

A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one variable moves, either up or down, the other one moves in the same direction. A perfect negative correlation means that two variables move in opposite directions, while a zero correlation implies no relationship at all.

  • Bivariate analysis on categorical variables

The categorical bivariate analysis is essentially an extension of the segmented univariate analysis to another categorical variable. In segmented univariate analysis, you compare metrics such as ‘mean of X’ across various segments of a categorical variable, e.g. mean marks of a student are higher for ‘degree and above’ than for other levels of the mother’s education; or the median income of educated parents is higher than that of uneducated ones, etc.

In the categorical bivariate analysis, you extend this comparison to other categorical variables and ask — is this true for all categories of another variable, say, men and women? Take another categorical variable, such as state, and ask — is the median income of educated parents higher than that of uneducated ones in all states?

Thus, you are drilling down into another categorical variable and getting closer to the true patterns in the data. In fact, you may also go to the next level and ask — is the median income of educated parents higher than that of uneducated ones (variable 1) in all states (variable 2) for all age groups (variable 3)? This is what you may call ‘trivariate analysis’, and though it gives you a more granular version of the truth, it gets a bit complex to make sense of and explain to others (and hence it is not usually done in EDA).

Thus, remember that conducting only segmented univariate analysis may deceive you into thinking that a certain phenomenon is true without asking the question — is it true for all subpopulations or is it true only when you aggregate information across the entire population?

So in general, there are two fundamental aspects of analysing categorical variables:

  1. To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who play games, you can make a ‘cross table’ as given below:

From this table, firstly, you can compare boys and girls across a fixed level of ‘play games’, e.g. a higher number of boys play games every day than girls, a higher number of girls never play games than boys, etc. And secondly, you can compare the levels of ‘play games’ across a fixed value of gender, e.g. most boys play every day and very few play once a month or never.

  2. To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s percentage in science is distributed based on the father’s occupation (categorical variable 1) and the poverty level (categorical variable 2).

Derived Metrics

There are three different types of derived metrics:

  • Type-driven metrics
  • Business-driven metrics
  • Data-driven metrics

Type-driven metrics

These metrics can be derived by understanding the variable’s typology. You have already learnt one simple way of classifying variables/attributes — categorical (ordered, unordered) and quantitative or numeric. Similarly, there are various other ways of classification, one of which is Steven’s typology.

Steven’s typology classifies variables into four types — nominal, ordinal, interval and ratio:

  • Nominal variables: Categorical variables, where the categories differ only by their names; there is no order among categories, e.g. colour (red, blue, green), gender (male, female), department (HR, analytics, sales)
    • These are the most basic form of categorical variables.
  • Ordinal variables: Categories follow a certain order, but the mathematical difference between categories is not meaningful, e.g. education level (primary school, high school, college), height (high, medium, low), performance (bad, good, excellent), etc.
    • Ordinal variables are nominal as well.
  • Interval variables: Categories follow a certain order, and the mathematical difference between categories is meaningful, e.g. temperature in degrees Celsius (the difference between 40 and 30 degrees Celsius is meaningful), dates (the difference between two dates is the number of days between them), etc.
    • Interval variables are both nominal and ordinal.
  • Ratio variables: Apart from the mathematical difference, the ratio (division/multiplication) is possible, e.g. sales in dollars ($100 is twice $50), marks of students (50 is half of 100), etc.
    • Ratio variables are nominal, ordinal and interval type.

Understanding types of variables enables you to derive new metrics of types different from the same column. For example, age in years is a ratio attribute, but you can convert it into an ordinal type by binning it into categories such as children (< 13 years), teenagers (13-19 years), young adults (20-25 years), etc. This enables you to ask questions, e.g. do teenagers do X better than children, are young adults more likely to do X than the other two types, etc. Here, X is an action you are interested in measuring.
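
A minimal sketch of this ratio-to-ordinal conversion using pd.cut (the toy ages are illustrative):

    # Binning a ratio attribute (age in years) into ordinal categories.
    import pandas as pd

    ages = pd.DataFrame({"Age": [9, 15, 22, 12, 19]})
    ages["Age.Group"] = pd.cut(
        ages["Age"],
        bins=[0, 12, 19, 25],  # children < 13, teenagers 13-19, young adults 20-25
        labels=["Child", "Teenager", "Young adult"])
    print(ages)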


Let’s look at some more examples of type-driven derived metrics.

Deriving Metrics from Text

You can derive several types of columns from text. Let’s say you have a column containing the ‘customer feedback’ given on websites such as Amazon or Zomato.

Each value in the column is a piece of feedback given in English by a customer or a user. As such, this is a nominal type variable.

What other variables can you derive from this column? Also think about the type of the variable you are deriving – is it possible to derive ordinal, interval and ratio type variables from this column?

Using a certain logic, such as the number of ‘positive or negative words’ in a feedback, you can derive another metric – the sentiment of the customer, with categories such as negative, neutral and positive. This is a variable of type:

Nominal but not ordinal

Ordinal but not interval

Interval but not ratio

Ratio

One metric you can derive from this column is the ‘length of the feedback’, measured in the number of words. The length of feedback is a variable of type:

Nominal

Ordinal but not interval

Interval but not ratio

Ratio
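
A minimal sketch of both text-derived metrics discussed above; the word lists and the sentiment logic are toy assumptions, not a real sentiment model.

    # Deriving a ratio-type metric (length in words) and an ordinal-type
    # metric (crude sentiment) from a feedback column. Toy assumptions only.
    import pandas as pd

    reviews = pd.DataFrame({"feedback": ["great food, loved it",
                                         "terrible service",
                                         "okay experience"]})

    # Length of the feedback in words (ratio type)
    reviews["length"] = reviews["feedback"].str.split().str.len()

    # Crude sentiment from counts of positive/negative words (ordinal type)
    positive = {"great", "loved", "good"}
    negative = {"terrible", "bad"}

    def sentiment(text):
        words = set(text.lower().replace(",", " ").split())
        score = len(words & positive) - len(words & negative)
        return "positive" if score > 0 else ("negative" if score < 0 else "neutral")

    reviews["sentiment"] = reviews["feedback"].apply(sentiment)
    print(reviews)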
