Chapter 3

Types of Data, Distributions, Graphs

Up to this point we have been working with continuous data. Continuous data can take any value within a range. Good examples of continuous data are blood pressure, height, and weight. Categorical data, on the other hand, can only take specific values that define certain categories. Consider fruit, for example. In SPSS it is possible to have a variable called fruit with 1=apple, 2=banana, 3=orange, etc. In this context a value of 1.5 would make no sense. It makes more sense to think of these numbers as labels for separate categories that a fruit (or something else) can fall into. Contrast this with height. A person who is 65 inches tall is just a bit shorter than a person who is 65.5 inches tall, who is in turn just a bit shorter than a person who is 66 inches tall. This type of data changes gradually without abrupt breaks, so it is called continuous data. How we handle data depends heavily on whether it is continuous or categorical. In this next section we will learn how to show continuous data graphically.

One of the most commonly used ways to graphically show continuous data is through histograms.

Figure 3.1 shows the histogram of how positive the climate was in a kindergarten classroom, as recorded by an outside observer. It is produced by SPSS.

Figure 3.1

Positive Climate histogram

The histogram shows the distribution of scores. A distribution is defined as all the possible values a variable can take on and how often those values occur. This histogram has the possible values (and more) that positive climate can take on along the x-axis, and the height of each bar, read from the y-axis, shows how often those values occur. The bars only appear for values from 3 through 7, with most of the values appearing in the ‘6’ bar. That means that no recorded score fell at 2 or at 8. These bars, also called bins, are ranges of values that SPSS combines for easier presentation. Thus the 6 bar or bin represents all the values from 6.0 up to, but not including, 7.0. SPSS also provides helpful values in the top right-hand corner of the histogram graphic: the mean, the standard deviation, and n, which refers to the number of data points in your study.
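The idea of a bin can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are made up (they are not the data behind Figure 3.1), `bin_counts` is a hypothetical helper, and the bin edges SPSS actually chooses may differ from the whole-number edges assumed here.

```python
import math
from collections import Counter

# Hypothetical positive-climate ratings; NOT the data behind Figure 3.1.
ratings = [3.2, 4.5, 5.1, 5.8, 6.0, 6.3, 6.4, 6.9, 6.99, 7.0]


def bin_counts(values, width=1.0):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter()
    for value in values:
        # The lower edge of the bin identifies the bin, e.g. 6.99 -> 6.0.
        lower_edge = math.floor(value / width) * width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


counts = bin_counts(ratings)

# A crude text histogram: one '#' per observation in the bin.
for lower_edge, count in counts.items():
    print(f"{lower_edge:.0f} to <{lower_edge + 1:.0f}: {'#' * count}")
```

In this made-up data the ‘6’ bin collects every rating from 6.0 up to, but not including, 7.0, so it ends up as the tallest bar, while bins with no observations (such as 2) simply do not appear.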

The histogram in Figure 3.1 is just one example of a shape a distribution can take on. Another example is found in Figure 3.2.

Figure 3.2

Example of a left skewed distribution

A left skewed distribution like the one seen in Figure 3.2 means most of the data is clustered towards high values of the variable, with a ‘tail’ to the left. The downward slope of the data on the left-hand side resembles a tail and is often referred to this way. Remember, left skewed means the tail is on the left. Notice how the Mean is drawn towards the tail more than the Median. This is similar to the behavior of the Mean and Median in the presence of outliers, as we discussed in the previous chapter. As in that case, the Median is considered a more robust measure of center than the Mean. The Mode is the value that is repeated the most in the dataset and is generally not used as a measure of center.
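The pull of the tail on the Mean can be checked with a small Python sketch. The scores below are hypothetical and chosen only to be left skewed; they are not the data shown in Figure 3.2.

```python
from statistics import mean, median, mode

# Hypothetical left-skewed scores: most values are high,
# and a few low values form the tail on the left.
scores = [1, 3, 5, 6, 6, 7, 7, 7, 7]

score_mean = mean(scores)      # dragged towards the low tail
score_median = median(scores)  # the middle value, less affected by the tail
score_mode = mode(scores)      # the most frequently repeated value

print(f"mean={score_mean:.2f} median={score_median} mode={score_mode}")
```

With these values the mean falls below the median, which in turn falls below the mode, matching the ordering described above for a left skewed distribution.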

Figure 3.3

Example of right skewed distribution

Figure 3.4

Example of symmetric distribution with no skew

Figures 3.3 and 3.4 show other examples of shapes of a distribution. Figure 3.3 shows a right skewed distribution with its tail on the right. Figure 3.4 shows a symmetric distribution with no skew, where the mean, median, and mode are all equal. Figure 3.4 shows a special distribution called a ‘normal’ or ‘Gaussian’ distribution; it is symmetrical and bell-shaped. This type of distribution occurs frequently.

Click on this link to download this .sav file (a .sav file is an SPSS data file).

Once you have downloaded this dataset, open it in SPSS and go to the Variable View. You should see the following, as shown in Figure 3.5.

Figure 3.5

Screenshot of example dataset 3.1

Now go to Analyze→Descriptive Statistics→Frequencies and move the third variable, ‘wk continuous’, into the Variables box. Click on Charts, select Histograms, and click Continue.

Figure 3.6

Graphical User Interface (GUI) for Frequencies in SPSS.

Figure 3.7

Graphical User Interface (GUI) for the Charts submenu in SPSS.

Now uncheck the ‘Display frequency tables’ box on the lower left side and click ‘Ok.’ This will produce the output seen in Figure 3.8.

Figure 3.8

Histogram from example dataset.

This variable is the children’s ‘SES’, or socioeconomic status, which is a measure of wealth. How would you describe the shape of this distribution? Would you use the mean or median to describe its center?

Now you will produce a ‘bar chart’, which is a way to describe categorical data. Just like the histogram we saw in Figure 3.1, a bar chart is a representation of how often different values occur in your data. This is a great tool for telling the general story of the data at a glance. Go back to example dataset 3.1 and repeat the process of Analyze→Descriptive Statistics→Frequencies, but this time, instead of the third variable, pick the second variable, whose name starts with ‘Child Composite Race’. Depending on the width of the variable list, you may only see the words “Child Composite.” Go to Charts, but instead of selecting Histograms select Bar charts. You will produce the output seen in Figure 3.9.

Figure 3.9

Selected output of Bar Chart produced from Example data 3.1 of student race.

An important distinction between a bar chart and a histogram can be seen in the x-axis. Instead of numbers put into bins you have categories. The order does not matter in a bar chart while it does matter in a histogram. In this case we can tell at a glance that the vast majority of the students in this dataset are ‘White, non-hispanic’ followed by a distant second of ‘Black or African American, Non-hispanic’. The smallest category is ‘Native Hawaiian, other Pacific Islander’.
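What a bar chart summarizes can be sketched in Python by counting how often each category occurs. The category labels below are taken from the text, but the counts are made up for illustration and do not come from example dataset 3.1.

```python
from collections import Counter

# Labels come from the text; the number of students per label is HYPOTHETICAL.
race = (
    ["White, non-hispanic"] * 6
    + ["Black or African American, Non-hispanic"] * 3
    + ["Asian"] * 2
    + ["Native Hawaiian, other Pacific Islander"] * 1
)

# One count per category: this is the height of each bar.
race_counts = Counter(race)

# A crude text bar chart, largest category first.
# Any other ordering of the categories would be equally valid.
for label, count in race_counts.most_common():
    print(f"{label:<42}{'#' * count} ({count})")
```

Sorting by count is only a presentation choice here. Because the categories have no inherent order, rearranging the bars does not change what the chart says, which is not true of the bins in a histogram.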

Now we will combine both categorical and continuous data to produce boxplots. A boxplot is a bird’s-eye view of a histogram, and boxplots let you compare different categories side-by-side. For example, refer to Figure 3.10, which is the SES data shown in Figure 3.8, but seen as if the viewer were above it.

Figure 3.10

SES data as seen in a boxplot rather than a histogram (Figure 3.8).

The bottom line of the data is at -1.00. This horizontal line is called a ‘whisker’. The line that connects the whisker to the blue box represents the lower 25% of the data. The blue box shows where the middle 50% of the data lie. The line in the middle of the blue box is the median. The line from the top of the box up to the upper whisker represents the upper 25% of the data, apart from any outliers. Some of the data in the tail are flagged as outliers according to a conventional rule that SPSS uses: values more than 1.5 box lengths beyond the edge of the box. These are represented by dots. The numbers associated with those dots are the row numbers of those datapoints in the SPSS spreadsheet. This boxplot shows you that SES is right skewed.

Go to our example dataset 3.1 and go to Graphs→Legacy Dialogs→Boxplot. You will see the dialog shown in Figure 3.11.

Figure 3.11

Graphical User Interface (GUI) for boxplots in SPSS, part one.

Don’t change the defaults and click on ‘Define’. In the next GUI put the second variable, “Child Composite Race”, in ‘Category Axis’ and the third variable, “WK Continuous SES”, in ‘Variable’, as seen in Figure 3.12.

Figure 3.12

Graphical User Interface (GUI) for boxplots in SPSS, part two.

You will produce the output shown in Figure 3.13.

Figure 3.13

Boxplot of SES vs. Race from example dataset 3.1.

As mentioned previously, the boxplot is a bird’s-eye view of histograms set side-by-side according to categories, in this case SES (wealth) vs. race. From Figure 3.13 we can tell that the ‘Asian’ group has the highest median wealth, followed by ‘White, non-hispanic’, and that the SES distributions for all the races are right skewed.
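The comparison a side-by-side boxplot invites, the median of a continuous variable within each category, can be sketched in Python. The group labels ‘A’, ‘B’, and ‘C’ and all the values are hypothetical placeholders, not rows from example dataset 3.1.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (category, value) rows; NOT taken from example dataset 3.1.
rows = [
    ("A", 0.2), ("A", 0.9), ("A", 0.4),
    ("B", -0.3), ("B", 0.1), ("B", 1.5),
    ("C", -0.5), ("C", -0.2),
]

# Collect the continuous values separately for each category.
groups = defaultdict(list)
for category, value in rows:
    groups[category].append(value)

# One median per category: the line inside each box of the boxplot.
medians = {category: median(values) for category, values in groups.items()}

# Categories ordered from highest to lowest median.
ranked = sorted(medians, key=medians.get, reverse=True)
print(ranked)
```

Reading the medians off Figure 3.13 by eye is the graphical version of this calculation: one box per category, with the line inside each box marking that category's median.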

This content is provided to you freely by EdTech Books.

Access it online or download it at https://edtechbooks.org/sem/data_distributions_graphs.