With a spreadsheet opened in front of you, you stare at mountains of raw data without a clue what to do, feeling like you’re drowning in the data.
You’ve heard marketers talking about data-driven marketing and “Big Data,” and having learned that many companies such as Facebook are using third party data, you sent out surveys and collected tons of data in order to do some of that “data-driven marketing” you’ve heard so much about. However, data is useless if you don’t know how to analyze it correctly and effectively.
Since market research plays a big part at iAcquire, there is always a bunch of data collected to learn about the market. As an analyst, I won’t let myself “drown” in the data. There are two secrets for this. First, I use the “backward market research method.” I have clear objectives before collecting the data and collect the data accordingly. After the data is collected, I think about what possible findings and conclusions I can get and analyze the data based on the possible outcomes. Also, I get familiar with the data analysis techniques available. In this blog post, I will introduce to you the seven most common and useful data analysis techniques for survey analysis, and then walk you through their processes in Excel. Note: The following examples will be shown in Excel 2010.
Data Analysis Technique 1: Frequency Distribution (Histogram in Excel)
Frequency distribution is a simple data analysis technique which allows you to get a big picture of the data. From frequency distribution, you can see how frequently the specific values are observed and what their percentages are for the same variable. For example: for variable of “age,” you can use frequency distribution to figure out how many people in the survey are aged 18 to 25, and how many are aged 26 to 33, etc… Histogram is a great tool in Excel to recognize frequency distribution in data like this.
How to use the Histogram feature in Excel:
1. Analysis Toolpak is a hidden add-in. So you need to unhide it. Here is how to unhide the analysis Toolpak in different versions of MS Excel: http://www.add-ins.com/Analysis_ToolPak.htm
2. Click “Data” tab – “Data Analysis” – Histogram
3. Enter the “Input Range” data in the dialog box
4. Specify the “bin” range and enter “bin” range in the dialog box (“bin” range is the upper boundary category labels on the X-axis of the histogram.)
See example below:
5. Click the “Labels” box
6. Click the “Chart output” box
The output will be like this:
In the chart, “more” indicates that the value is higher than the biggest value in the “bin.” The histogram is a great tool for computing the frequencies for numeric variables, such as the weight, height, income and age, etc. For nominal data, such as “gender” and “marital status” you need to recode the variables into numbers, such as “male = 1” and “female = 0”, etc. You can also use Pivot Tables to compute the frequencies of the nominal data which will be easier. This technique will be introduced later.
Data Analysis Technique 2: Descriptive Statistics
From the frequency distribution we can figure out the frequency of the values observed, as shown in the “age example” above. We can use the measures of central tendency and dispersion to learn more about the data for “age.”
Mean, median and mode are the three measures of central tendency. “Mean” is the average value. It equals to the sum of all the values divided by the numbers of observations. It is the most popular measure of central tendency, especially when the data set does not have an outlier. “Median” is the value in the middle when all the values are lined in order (assuming there is an odd number of values). If there are even numbers of values, the median is the average of the two numbers in the middle. It is useful when the data set has an outlier and values distribute very unevenly. “Mode” is the value which is observed most often. It is useful when the data is non-numeric or when asked to find the most popular item.
Range and standard deviation are the basics measures of dispersion. The bigger the range and bigger the standard deviation, the more dispersed the values are. “Range” is the difference of the maximum value and the minimum value for the variable. For the “age example”, the maximum value is 54 and the minimum value is 19. So the range is 35. “Standard deviation” shows how much variation the value exits from the mean. Variance is the average of squared difference from the mean. Standard deviation is the square root of variance.
How to measure central tendency and dispersion in Excel
You can use the function wizard. For example, you want to know the maximum value among all the values:
1. Double click on an empty cell
2. Type in “= MAX.” The function wizard will be activated by “=” sign
3. Select all the values for the variable
For mean, type in “= AVERAGE”; for minimum value, type in “= MIN”; for median, type in “= MEDIAN”; for mode, type in “= MODE”, for standard deviation, type in “STDEV.” You may find out that you cannot get “range” from the function wizard. That’s right. But there is an easier way to compute all these in Excel:
1. Click Data – Data Analysis – Descriptive Statistics
2. Specify the cell range in the Input Box
3. Click “Labels” box
4. Click “Summary Statistics” box
The output will look like this:
From this, you can easily learn about the central tendency and dispersion of the values for the variable.
Data Analysis Technique 3: Comparing Means – Statistical Testing
Heads up! This requires some knowledge of statistics! If you are not familiar with complicated statistics, you can start your free course at Udacity.
T-Tests are used to test if the difference of means is statistically significant. It tests if the sample is representative of the populations. For example, if the mean for variable 1 is 20 and the mean for variable 2 is 28, you may say the means are different. T-Tests may show you that they are not significantly different, however, and you can’t base your conclusion on the means’ difference since the difference in the sample is not representative for the population.
In Excel, there are three types of t-Tests: t-Test: Paired Two Sample for Means, tTest: TwoSample Assuming Equal Variances and tTest: TwoSample Assuming Unequal Variances. Usually, we only need to use t-Test: Paired Two Sample for Means and tTest: TwoSample Assuming Unequal Variances.
Paired t-Test, also called dependent t test, is used when the data of variable 1 and data of variable 2 were collected in parallel from each individual, such as “before versus after” cases. Suppose the marketer collected the ratings data before changing the product packaging and after changing it. The data is shown as below:
How to use paired t-Test in Excel:
1. Click “Data” tab – “Data Analysis” – t-Test: Paired Two Sample for Means
2. Specify variable 1 range
3. Specify variable 2 range
4. Click “Labels”
5. Specify the Alpha value as 0.05, which indicates the confidence level is 95%, meaning you are 95% confident about the statistical test results
The output will look like this:
For these statistics, we only need to look into the P-value, “P (T<=t) two-tail.” If it is bigger than 0.05, the means are not significantly different.
A Two-Sample t-Test, also called an independent t test, is used to compare the means of two independent groups. The way to do a Two-Sample t-Test is similar to the paired t-Test, except that you need to choose “TTest: TwoSample Assuming Unequal Variances” in the Tool Box.
Data Analysis Technique 4: Cross-Tabulation (Pivot Table in Excel)
Cross-tabulation, also called Pivot Table in Excel, is one of the most popular techniques for data analysis. It helps you to understand what the relations are among different variables. For example, you can see the ratings from male respondents and the ratings from female respondents.
See the example below:
How to do cross-tabulation (Pivot Table) in Excel：
1. Click Insert – PivotTable
2. Highlight the range data
3. Drag the variables fields to the “Legend Fields” box and the “Axis Fields” box as needed. In this example, I dragged “Gender” in the “Legend Fields” box and Ratings in the “Axis Fields” box so I can compare each specific rating value between male and female
4. Drag the Gender in the “Values” box. It usually automatically appears as “count” but if not, or you want to change to “sum” or others, you can click the arrow next to the “Count of Gender” and click “Value Field Settings” and choose the type of calculation under “Summarize Values By” tab
The output will look like this:
5. To make the output easier to understand, it is always to good idea to visualize it with charts. Select the table and click the “Insert” tab and insert the chart type as needed. Here is what the column chart looks like:
You can easily infer that males is more likely to rate the XXX product as 3 points and females are more likely to rate it as 2 and 4 points. You can also test the statistical significance using Chi-test for the cross-tabulation variables.
Note: if you edit the data after the pivot table and the graph are generated, you need to refresh it. Click Data — Click “Refresh All”
When there are too many variations among the value for a variable for which you want to do cross-tabulation, you can recode it or filter the values. You can recode the values into smaller categories using the “IF” function. For example, for the values that are ranged from 1 to 100, you can recode the values in a new column and use “IF” function to break the values into “low” and “high.” The “IF formulas” may look like this, “=IF(B2:B17>50,1,0)” or “=IF(B2:B17>50,”High”,”Low”)”. Remember to add quotation marks to the text if you indicate the new categories as text such as “high” and “low.”
Data Analysis Technique 5: Correlations
Correlations are used when you want to know about the relationship between two variables. For example, you want to know consumers’ willingness to pay and their ratings for the product quality. If the correlation is 1, meaning the willingness to pay and the ratings for the product quality are completely positively correlated and if the correlation is 0, meaning there is no correlation between these two variables. If the correlation is -1, it shows they are completely negatively correlated, meaning the higher one variable, the lower the other variable. If the absolute value of the variables is bigger than 0.5, they are usually significant.
How to use correlations in Excel:
Note: you can only use correlations for numeric data in Excel.
1. Click on any empty cell.
2. Type = sign in the empty cell and type in “CORREL”
3. Click in the box next to “Array 1” and highlight the first column of numbers.
4. Click in the box next to “Array 2” and highlight the second column of numbers.
Data Analysis Technique 6: Linear Regressions
Regression is a more accurate way to test the relationship between the variables compared with correlations since it shows the goodness of fit (Adjusted R Square) and the statistical testing for the variables. The formulas for one-variable regressions is y = ax + b and for multiple regressions is y = ax12 + bx2 + c.
For y = ax + b, y is the dependent variable, x is the causal variable and the intercept is a, indicating the correlation between x and y. If “a” is 0.2 for example, it means when x variable increases 1 unit, y increases 0.2 units. If “a” is negative, meaning y decreases as x increases.
For y = ax12 + bx2 + c, y is the dependent variable, x1 is causal variable 1 and x2 is causal variable 2. “a” is the intercept for variable 1 and “b” for variable 2. For example, if y = 0.6 x12 – 0.4 x2 + 0.23, it means when x1 increases 1 unit, y increases 0.6 units and when x2 increases 1 unit, y decreases 0.4 units. (Given the variables are statistically significant.)
How to do regression analysis in Excel:
I use the example of a multiple regression of ratings for product quality and ratings for packaging on the willingness to pay. I am trying to find out how the willingness to pay is correlated to these two variables.
1. Click “Data” tab – “Data Analysis Tools” – Regression
2. Specify the cell range in the Input Y Range box as the dependent variable
3. Specify the cell range in the Input X Range box as the dependent variable or choose more than one column of variables if you are doing multiple regressions
4. Check “Labels” box
5. Check “Confidence Level,” we usually set it as 95%
The output is shown as below:
In the output, the Adjusted R Square measures the proportion of the variation in the dependent variable accounted for by the explanatory variables. In the example, it means 37% can be explained by the regression. If the adjusted R square is too low, the regression is not significant thus invalid. The coefficient for product quality is 0.89, and the P-value is 0.003 which is less than 0.05, indicating it is statistically significant. (Since we set the confidence level as 95%). So we can infer that the product quality is highly positively correlated to the willingness to pay. For the coefficient for ratings for packaging, it is = 0.015 and the P-value is 0.96 which is more than 0.05, indicating it is not statistically significant.(Confidence level as 95%). So we can’t infer that the packaging has anything to do with the willingness to pay. Therefore, when making marketing decision, marketers should focus on the product quality according to this survey result.
Data Analysis Technique 7: Text Analytics
In the survey, there are always some open questions which will allow respondents to fill in their own answers. When the sample is big, it is difficult to analyze each respondent’s answers. It seems that there are no ways to do open questions analysis in Excel. However, I will introduce a very useful way to do the text analytics.
How to do text analytics:
1. Go to Wordle
2. Click “Create” tab
3. Copy and paste the column of answers to the open questions. (Once for each question).
4. Click “Go”
An example of job title of the respondents will look like this:
Bigger font of a word indicates higher frequency of this word in the answers. In this example, it indicates the respondents’ job titles are related to Marketing, Manager, SEO and Director, etc.
After learning these techniques for data analysis, I bet you won’t feel like drowning any more when looking into the spreadsheet with tons of data.