Data Analytics Using MS Excel
We know MS Excel or workbook in general is widely people of various walks of life ranging from students at school to professionals at various institutions. Assuming you haven’t taken any professional excel training, majority of us during our schooling days have been taught the basics of excel, like copying a formula by dragging the fill handle or using the inbuilt SUM function to sum the rows of a column and other easy tabular data functionalities.
However, so far we have only touched the surface and need to know that there are numerous business institutions out there which use the advanced features of excel for their daily tasks. For few of these institutions, excel usage can comprise less than 10% for their total business while few institutions may use excel for more 80% of their business tasks. What is known is that more that 1 billion users use excel, so much so that we can say that out of 720 million users on LinkedIn, more than 75 million list excel as one of their primary skills, which is nearly 10% of the users.
We will have a glimpse of on how these institutions use some of the advance excel features to answer their business problems using a real world data set of the Indian Premier League (IPL) which is a globally dominant Cricket league from India. The data is obtained from Kaggle.com. You will see ahead how a question can be answered in different ways using basic as well as advanced MS Excel features.
The Dataset
As we see the dataset, we have one workbook and 2 sheets within it. One sheet contains the data for IPL matches held for the years from 2008 to 2020. While, the other sheet contains ball by ball data of all these matches. (you can skip the below attribute descriptions if you are well versed with cricket terms)
The attributes or columns present in this sheet are as follows:
- id : Unique identifier for each match held
- city : City where the match took place
- date : Date when the match was held
- player_of_match : most valuable player of that match
- venue : the stadium where the match was held
- neutral venue : if the venue was from a city different to the cities where the two playing teams reside.
- team 1, team 2 : the two playing teams
- toss_winner : the team which was the toss
- toss_decision : decision (bat/bowl) made by the toss winner
- winner : team which won the match
- result : If team batting second wins than ‘wickets’ else if team bowling second wins than ‘runs’
- margin : result margin by which the team won
- elimination : was the match an eliminator (Y/N)
- method : if Duckworth lewis stern method was used due to weather conditions.
- umpire 1, umpire 2 : the two umpires for the match
The attributes or columns present in this sheet are as follows:
- id : Unique identifier of the match (same as present in sheet 1)
- inning : Team batting first — 1, Team batting second — 2
- Over : Bowling over for than inning
- ball : which ball of that over
- batsmen : the person batting at the strikers end
- non-striker : the person at non strikers end
- bowler : the person bowling
- batsmen_runs : runs scored by the batsmen of that ball
- extra_runs : all runs not scored from boundary or running between the wickets or from any legal ball deliveries. e.g. wide balls
- total_runs : batsmen_runs + extra_runs
- non_boundary : If there were overthrows (if fielder throws and the ball goes outside the boundary by mistake)
- is_wicket : is the batsmen out/dismissed (fall of wicket) (1/0)
- dismissal_kind : type of dismissal e.g. caught, bowled, run out.
- player_dismissed : name of the batsmen dismissed
- fielder : name of the fielder if batsmen dismissed via catch
- extras_type : type of extra runs/balls
- batting_team, bowling_team : the two teams
The Problem
Let us take the IPL final of the year 2020 which was between the Mumbai Indians (MI) and the Delhi Capitals (DC) which the Delhi Capitals lost. Let’s says Delhi Capitals want you to analyse this data and know the reasons for them loosing this match.
Let’s answer the first basic question, how many runs were scored by MI and DC in the final ?
For this, we will first need to filter the data for MI and DC for the year 2020 for the final match. The easy way is to filter data from Sheet 1 for year 2020 and the last month i.e. November. Check the last row which is the final match and get the id.
As we can see the final match for year 2020 was held on 10th November 2020 in Dubai between the two teams with Mumbai Indians winning and its match id is 1237181. (Find more information here on filtering data)
Once you got the id, apply filter to the second sheet and search the id in the id column and filter the ball by ball data for the final match.
Just by these few steps we found the ball by ball data for the final match. From the above filtered data image, we can see at the bottom left — says 235 which is number of rows, or over here its number of balls bowled in the final. Also, in the footer bar you can see some arithmetic values like count which is 236. This 236 also is count of total rows including the header row.
Now to get the total runs for the two team we need to filter this data again for the batting teams separately using the batting_team column and then use the predefined SUM function on the total runs.
We can also just filter it click on the total_runs column and in the footer we will get the sum.
We found Delhi Capital score 156 and we can do similar process for Mumbai Indians and we find that Mumbai Indians scored 157.
Now although this was the easiest approach, we had to filter the data a lot to get out answer. We should know that there are plenty of other ways in excel to get this answer, one of which we are going to explore ahead.
For this, we will first need to format the raw data from the two sheets as Tables. This simply done by selecting the data from the sheet, going to the Insert tab, selecting Table option followed by ok and your table is ready. Once your table is ready, can name the table from the Table.
Note that while selecting the entire data to convert to table, an easy option is to click on the first A1 cell, select cmd + Shift + right arrow, and then cmd + Shift + down arrow. This will select your entire data (windows users use control instead of cmd).
Now we have 2 tables MatchTable (sheet 1) and BallTable (sheet 2), the two team which played the final match the team which won, and the date when the final match was held. For this I have created a solutions sheet and I will be cross referencing the sheets. Since I have my table names, provided being unique, I can directly use them as input for my excel formulas.
Let us see how we can get the match id for the finals. For this, we will using INDEX MATCH formula. The INDEX MATCH Formula is the combination of two functions in excel: INDEX and MATCH. INDEX() returns the value of a cell in a table based on the column and row number. MATCH() returns the position of a cell in a row or column. Combined, the two formulas can look up and return the value of a cell in a table based on vertical and horizontal criteria.
With this, again we have 2 approaches to get the id and for both, we will be using out MatchTable.
For the first approach,
- we will first find the position of the row by date when the match was held by using the MATCH function.
- MATCH(C5, MatchTable[date], 0) where C5 is the date cell (10/11/2020). The 0 is for exact match.
- Followed by position of the id column
- MATCH(“id”, MatchTable[#Headers], 0). Here we find the id column position from the header row.
- Once we have to row-column positions/indices, we use the index column to get the value at the intersection of the row and column from the MatchTable.
- INDEX(MatchTable, MATCH(C5,MatchTable[date], 0), MATCH(“id”, MatchTable[#Headers], 0))
Although we get the correct match id, there is a catch here. The method we used is a bit ambiguous. It works here correctly because we had only one match during that date. What if we had multiple matches held during that date which normally happen in IPL. In this case, the MATCH function will return the row position of the first match only, which might not be the one we wanted. Hence we need an alternate approach to check multiple conditions when finding the position. For this, we use MATCH with multiple conditions.
- We check the row position for the match date and the winning team.
- MATCH(1, (MatchTable[date]=C5)*(MatchTable[winner]=C4), 0)
- The 1 argument for MATCH basically means look through the rows in the data and return the row number where all of the criteria are TRUE and all the criteria are mentioned as the second argument separated by ‘*.’ The rest is similar procedure.
Now once we get the ID we can get the total run scored by the two teams using SUMIFS function. (Note: we use SUMIFS instead of SUMIF as we need to check multiple conditions).
We use the BallTable for this purpose. We sum the total runs, two conditions — a) check for matching batting team and b) check for matching id which we calculated earlier.
- SUMIFS(BallTable[total_runs], BallTable[batting_team], C18, BallTable[id], C15)
After using this formula for the two teams we get the total runs score by them in that final match.
Although so far we have only answered the most basic question in our analysis, we got to know how one problem can be solved using different ways and how formulas using excel functions makes our work easier, cleaner and efficient. With the above used same formulae for different teams rather than individually filtering the dataset each time and that’s what makes use of excel formulas better for analysis.
So far we have only seen the basic use of excel features for our analysis, in days ahead, I will update this article with more data analysis using more advanced excel functions. In the mean time do feel free to try some yourself using this IPL data from Kaggle.com.
Stay Tuned…