## Bivariate Data: Examples, Definition and Analysis

On this page:

- What is bivariate data? Definition.
- Examples of bivariate data: with table.
- Bivariate data analysis examples: including linear regression analysis, correlation (relationship), distribution, and scatter plot.

Let’s define bivariate data:

We have bivariate data when we study two variables at once. Both variables vary across the observations, and we compare them to find out whether they are related.

For example, if you are studying a group of students to find out their average math score and their age, you have two variables (math score and age).

If you are studying only one variable, for example only the math scores of these students, then you have univariate data.

When we examine bivariate data, the two variables may depend on each other: one variable may influence the other. In this case, we say that the bivariate data has:

- an independent variable and
- a dependent variable.

A classic example of an independent and a dependent variable is the age and height of babies and toddlers. As age (the independent variable) increases, height (the dependent variable) also increases.

Let’s move on to some real-life and practical bivariate data examples.

Look at the following bivariate data table. It shows the age and average height of a group of babies and young children.

Age | Average height |
---|---|
3 months | 58.5 |
6 months | 64 |
9 months | 68.5 |
1 year | 74 |
2 years | 81.2 |
3 years | 89.1 |
4 years | 95 |
5 years | 102.5 |

Commonly, bivariate data is stored in a table with two columns.

There are two types of relationship between the dependent and the independent variable:

- A positive relationship (also called positive correlation): when the independent variable increases, the dependent variable also increases, and vice versa. The example above about children’s age and height is a classic positive relationship.
- A negative relationship (negative correlation): when the independent variable increases, the dependent variable decreases, and vice versa. Example: as a car’s age increases, its price decreases.

So, we use bivariate data to compare two sets of data and to discover any relationships between them.
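As a rough illustration, the age and height table above can be stored as a list of (age, height) pairs and checked for the direction of the relationship. This is only a sketch: the conversion of ages to months and the helper name `direction` are assumptions made for the example, and the check only works for data that rises or falls at every step, which real data rarely does.

```python
# Age/height table from above; ages converted to months (1 year = 12 months).
age_height = [
    (3, 58.5), (6, 64.0), (9, 68.5), (12, 74.0),
    (24, 81.2), (36, 89.1), (48, 95.0), (60, 102.5),
]


def direction(pairs):
    """Return 'positive', 'negative', or 'mixed' after sorting the pairs by x."""
    ordered = sorted(pairs)
    ys = [y for _, y in ordered]
    # Change in y between each pair of neighbouring observations.
    steps = [b - a for a, b in zip(ys, ys[1:])]
    if all(step > 0 for step in steps):
        return "positive"
    if all(step < 0 for step in steps):
        return "negative"
    return "mixed"


trend = direction(age_height)
print(trend)  # positive
```

For noisy data, a correlation coefficient is a better tool than this step-by-step check.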

## Bivariate Data Analysis

Bivariate analysis lets you study the relationship between two variables and has many practical uses in real life. It aims to find out whether an association exists between the variables and how strong that association is.

Bivariate analysis also lets you test hypotheses about association. On its own it cannot prove causality, because a correlation may be driven by other factors, but it is often the first step toward a causal study. It also helps you predict the values of a dependent variable from changes in an independent variable.

Let’s see how bivariate data works with linear regression models.

Let’s say you have to study the relationship between age and systolic blood pressure in a company. You have a sample of 10 workers aged 37 to 55. The results are presented in the following bivariate data table.

Worker | Age (years) | Systolic blood pressure |
---|---|---|
1 | 37 | 130 |
2 | 38 | 140 |
3 | 40 | 132 |
4 | 42 | 149 |
5 | 45 | 144 |
6 | 48 | 157 |
7 | 50 | 161 |
8 | 52 | 145 |
9 | 53 | 165 |
10 | 55 | 162 |

Now we need to display this table graphically so that we can draw some conclusions.

Bivariate data is most often displayed using a scatter plot. This is a plot of y (on the y-axis) against x (on the x-axis), with one point per observation, and it shows how the two data sets behave together.

A scatter plot is one of the most popular types of graph because it gives a much clearer picture of a possible relationship between the variables.

Plotting the table above as a scatter plot, with age on the x-axis and systolic blood pressure on the y-axis, shows that the values group around a straight line, i.e. there is a possible linear relationship between age and systolic blood pressure.

You can create scatter plots very easily with a variety of free graphing software available online.
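If you prefer code to an online tool, one possible way to draw this scatter plot is with Python's matplotlib library. The article does not use matplotlib, so treat this as a sketch; the axis labels and title are chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to open a window instead
import matplotlib.pyplot as plt

# Worker data from the table above.
ages = [37, 38, 40, 42, 45, 48, 50, 52, 53, 55]
systolic = [130, 140, 132, 149, 144, 157, 161, 145, 165, 162]

fig, ax = plt.subplots()
points = ax.scatter(ages, systolic)  # one point per worker
ax.set_xlabel("Age (years)")
ax.set_ylabel("Systolic blood pressure")
ax.set_title("Age vs. systolic blood pressure")
```

Call `fig.savefig("age_vs_bp.png")` to write the figure to a file, or remove the `matplotlib.use("Agg")` line and call `plt.show()` to view it interactively.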

What does this graph show us?

The graph suggests a relationship between age and blood pressure, and this relationship is positive (i.e. we have a positive correlation): the older the worker, the higher the systolic blood pressure tends to be.

The line drawn through the points of a scatter plot is called the “line of best fit” (or the regression line). It summarizes the trend in the data and helps you judge whether the two variables are correlated.

Furthermore, the line of best fit helps illustrate the strength of the correlation: the closer the points lie to the line, the stronger the linear relationship.

Let’s investigate further.

We established that in our example there is a positive linear relationship between age and blood pressure. But how strong is that relationship?

This is the question the correlation coefficient answers.

The correlation coefficient (R) is a numerical value between -1 and 1. It indicates the strength and direction of the linear relationship between two variables. The version used for linear relationships is called Pearson’s correlation coefficient.

When the correlation coefficient is close to 1, it shows a strong positive relationship. When it is close to -1, there is a strong negative relationship. A value of 0 tells us that there is no linear relationship.

We need to calculate the correlation coefficient between age and blood pressure. There is a long formula for Pearson’s correlation coefficient, but you don’t need to remember it.

All you need to do is use a free or premium calculator, such as those on www.socscistatistics.com. When we entered our bivariate data into this calculator, we got the following result:

The value of the correlation coefficient (R) is 0.8435. It shows a strong positive correlation.
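To see where this number comes from, here is a minimal Python sketch that applies the standard Pearson formula (the sum of products of deviations, divided by the square root of the product of the summed squared deviations) to the worker table. The function name `pearson_r` is chosen here for illustration; the online calculator may compute the value differently internally.

```python
from math import sqrt

# Worker data from the table above.
ages = [37, 38, 40, 42, 45, 48, 50, 52, 53, 55]
systolic = [130, 140, 132, 149, 144, 157, 161, 145, 165, 162]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(ages, systolic)
print(round(r, 4))  # 0.8435
```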

Now, let’s calculate the equation of the regression line (the line of best fit) to find its slope.

For that purpose, let’s recall the simple linear regression equation:

$$Y = \beta_0 + \beta_1 X$$

where:

- $X$ is the value of the independent variable,
- $Y$ is the value of the dependent variable,
- $\beta_0$ is a constant (the value of $Y$ when $X = 0$),
- $\beta_1$ is the regression coefficient (how much $Y$ changes for each unit change in $X$).

Again, we will use the same online software (socscistatistics.com) to calculate the linear regression equation. The result is:

Y = 1.612*X + 74.35
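The same equation can be reproduced by hand. The sketch below assumes the line is fitted by ordinary least squares, which is the usual method for simple linear regression; `fit_line` is a name chosen for this example.

```python
# Worker data from the table above.
ages = [37, 38, 40, 42, 45, 48, 50, 52, 53, 55]
systolic = [130, 140, 132, 149, 144, 157, 161, 145, 165, 162]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, systolic)
print(round(slope, 3), round(intercept, 2))  # 1.612 74.35

# Predicted systolic pressure for a hypothetical 46-year-old worker.
# 46 is the mean age of the sample, so the prediction equals the
# sample's mean pressure, 148.5 (up to floating-point rounding).
predicted = intercept + slope * 46
```

The slope of about 1.61 means that, in this sample, each additional year of age is associated with an average rise of about 1.6 units in systolic blood pressure.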

You can find more about the linear regression equation, with explanations, in our post on linear regression examples.

So, from the above bivariate data analysis example of the company’s workers, we can say that blood pressure tended to increase with age. This suggests that age is associated with blood pressure in this sample, although ten observations are not enough to prove that age alone causes the change.

Other popular examples of positive correlation in bivariate data are: temperature and ice cream sales, alcohol consumption and cholesterol levels, and the weights and heights of college students.

Let’s now look at a bivariate data analysis example of a negative correlation.

The bivariate data table below shows the number of absences and the final grades of students in a class.

Student | Absences | Final grade |
---|---|---|
1 | 0 | 90 |
2 | 1 | 85 |
3 | 1 | 88 |
4 | 2 | 84 |
5 | 3 | 82 |
6 | 3 | 80 |
7 | 4 | 75 |
8 | 5 | 60 |
9 | 6 | 72 |
10 | 7 | 64 |

Even from the table, these two variables appear to be negatively correlated: when the number of absences increases, the final grade tends to decrease.

If we plot the bivariate data from the table on a scatter plot and add the line of best fit, the regression line has a downward slope.

This downward slope indicates a negative linear association.

We can calculate the correlation coefficient and linear regression equation. Here are the results:

- The value of the correlation coefficient (R) is -0.9061. This is a strong negative correlation.
- The linear regression equation is Y = -3.971*X + 90.71.

We can conclude that the fewer lessons students skip, the higher their final grades tend to be.

Conclusion:

The above bivariate data examples aim to help you better understand how bivariate analysis works.

Analyzing two variables is a common task in both descriptive and inferential statistics. Many business and scientific investigations come down to the relationship between just two continuous variables.

The main questions that bivariate analysis has to answer are:

- Is there a correlation between 2 given variables?
- Is the relationship positive or negative?
- What is the degree of the correlation? Is it strong or weak?
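As a small illustration, the three questions above can be answered from a correlation coefficient with a helper like the following. The 0.7 and 0.3 cut-offs are assumptions picked for this sketch; there is no universal threshold for "strong" or "weak", and conventions differ between fields.

```python
def describe_correlation(r, strong=0.7, weak=0.3):
    """Label the direction and strength of a correlation coefficient.

    The default cut-offs are illustrative assumptions, not fixed rules.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    size = abs(r)
    if size >= strong:
        strength = "strong"
    elif size >= weak:
        strength = "moderate"
    else:
        strength = "weak"
    return direction, strength


print(describe_correlation(0.8435))   # ('positive', 'strong')
print(describe_correlation(-0.9061))  # ('negative', 'strong')
```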

If you need other practical examples in the area of management and analysis, our posts Venn diagram examples and decision tree examples might be helpful for you.

## About The Author

## Silvia Valcheva

Silvia Valcheva is a digital marketer with over a decade of experience creating content for the tech industry. She has a strong passion for writing about emerging software and technologies such as big data, AI (Artificial Intelligence), IoT (Internet of Things), process automation, etc.



## How to Write a Research Proposal | Examples & Templates

Published on October 12, 2022 by Shona McCombes and Tegan George. Revised on November 21, 2023.

A research proposal describes what you will investigate, why it’s important, and how you will conduct your research.

The format of a research proposal varies between fields, but most proposals will contain at least these elements:

- Introduction
- Literature review
- Research design
- Reference list

While the sections may vary, the overall objective is always the same. A research proposal serves as a blueprint and guide for your research plan, helping you get organized and feel confident in the path forward you choose to take.

## Table of contents

- Research proposal purpose
- Research proposal examples
- Research design and methods
- Contribution to knowledge
- Research schedule
- Other interesting articles
- Frequently asked questions about research proposals

Academics often have to write research proposals to get funding for their projects. As a student, you might have to write a research proposal as part of a grad school application, or prior to starting your thesis or dissertation.

In addition to helping you figure out what your research can look like, a proposal can also serve to demonstrate why your project is worth pursuing to a funder, educational institution, or supervisor.

- Show your reader why your project is interesting, original, and important.
- Demonstrate your comfort and familiarity with your field. Show that you understand the current state of research on your topic.
- Make a case for your methodology. Demonstrate that you have carefully thought about the data, tools, and procedures necessary to conduct your research.
- Confirm that your project is feasible within the timeline of your program or funding deadline.

## Research proposal length

The length of a research proposal can vary quite a bit. A bachelor’s or master’s thesis proposal can be just a few pages, while proposals for PhD dissertations or research funding are usually much longer and more detailed. Your supervisor can help you determine the best length for your work.

One trick to get started is to think of your proposal’s structure as a shorter version of your thesis or dissertation, only without the results, conclusion, and discussion sections.


Writing a research proposal can be quite challenging, but a good starting point could be to look at some examples. We’ve included a few for you below.

- Example research proposal #1: “A Conceptual Framework for Scheduling Constraint Management”
- Example research proposal #2: “Medical Students as Mediators of Change in Tobacco Use”

Like your dissertation or thesis, the proposal will usually have a title page that includes:

- The proposed title of your project
- Your supervisor’s name
- Your institution and department

The first part of your proposal is the initial pitch for your project. Make sure it succinctly explains what you want to do and why.

Your introduction should:

- Introduce your topic
- Give necessary background and context
- Outline your problem statement and research questions

To guide your introduction, include information about:

- Who could have an interest in the topic (e.g., scientists, policymakers)
- How much is already known about the topic
- What is missing from this current knowledge
- What new insights your research will contribute
- Why you believe this research is worth doing

As you get started, it’s important to demonstrate that you’re familiar with the most important research on your topic. A strong literature review shows your reader that your project has a solid foundation in existing knowledge or theory. It also shows that you’re not simply repeating what other people have already done or said, but rather using existing research as a jumping-off point for your own.

In this section, share exactly how your project will contribute to ongoing conversations in the field by:

- Comparing and contrasting the main theories, methods, and debates
- Examining the strengths and weaknesses of different approaches
- Explaining how you will build on, challenge, or synthesize prior scholarship

Following the literature review, restate your main objectives. This brings the focus back to your own project. Next, your research design or methodology section will describe your overall approach, and the practical steps you will take to answer your research questions.


To finish your proposal on a strong note, explore the potential implications of your research for your field. Emphasize again what you aim to contribute and why it matters.

For example, your results might have implications for:

- Improving best practices
- Informing policymaking decisions
- Strengthening a theory or model
- Challenging popular or scientific beliefs
- Creating a basis for future research

Last but not least, your research proposal must include correct citations for every source you have used, compiled in a reference list. To create citations quickly and easily, you can use our free APA citation generator.

Some institutions or funders require a detailed timeline of the project, asking you to forecast what you will do at each stage and how long it may take. While not always required, be sure to check the requirements of your project.

Here’s an example schedule to help you get started.

Research phase | Objectives | Deadline |
---|---|---|
1. Background research and literature review | | 20th January |
2. Research design planning | … and data analysis methods | 13th February |
3. Data collection and preparation | … with selected participants and code interviews | 24th March |
4. Data analysis | … of interview transcripts | 22nd April |
5. Writing | | 17th June |
6. Revision | … final work | 28th July |

If you are applying for research funding, chances are you will have to include a detailed budget. This shows your estimates of how much each part of your project will cost.

Make sure to check what type of costs the funding body will agree to cover. For each item, include:

- Cost : exactly how much money do you need?
- Justification : why is this cost necessary to complete the research?
- Source : how did you calculate the amount?

To determine your budget, think about:

- Travel costs : do you need to go somewhere to collect your data? How will you get there, and how much time will you need? What will you do there (e.g., interviews, archival research)?
- Materials : do you need access to any tools or technologies?
- Help : do you need to hire any research assistants for the project? What will they do, and how much will you pay them?

If you want to know more about the research process, methodology, research bias, or statistics, make sure to check out some of our other articles with explanations and examples.

Methodology

- Sampling methods
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Likert scales
- Reproducibility

Statistics

- Null hypothesis
- Statistical power
- Probability distribution
- Effect size
- Poisson distribution

Research bias

- Optimism bias
- Cognitive bias
- Implicit bias
- Hawthorne effect
- Anchoring bias
- Explicit bias

Once you’ve decided on your research objectives, you need to explain them in your paper, at the end of your problem statement.

Keep your research objectives clear and concise, and use appropriate verbs to accurately convey the work that you will carry out for each one.

I will compare …

A research aim is a broad statement indicating the general purpose of your research project. It should appear in your introduction at the end of your problem statement, before your research objectives.

Research objectives are more specific than your research aim. They indicate the specific ways you’ll address the overarching aim.

A PhD, which is short for philosophiae doctor (doctor of philosophy in Latin), is the highest university degree that can be obtained. In a PhD, students spend 3–5 years writing a dissertation, which aims to make a significant, original contribution to current knowledge.

A PhD is intended to prepare students for a career as a researcher, whether that be in academia, the public sector, or the private sector.

A master’s is a 1- or 2-year graduate degree that can prepare you for a variety of careers.

All master’s involve graduate-level coursework. Some are research-intensive and intend to prepare students for further study in a PhD; these usually require their students to write a master’s thesis. Others focus on professional training for a specific career.

Critical thinking refers to the ability to evaluate information and to be aware of biases or assumptions, including your own.

Like information literacy, it involves evaluating arguments, identifying and solving problems in an objective and systematic way, and clearly communicating your ideas.

The best way to remember the difference between a research plan and a research proposal is that they have fundamentally different audiences. A research plan helps you, the researcher, organize your thoughts. On the other hand, a dissertation proposal or research proposal aims to convince others (e.g., a supervisor, a funding body, or a dissertation committee) that your research topic is relevant and worthy of being conducted.

## Cite this Scribbr article


McCombes, S. & George, T. (2023, November 21). How to Write a Research Proposal | Examples & Templates. Scribbr. Retrieved July 11, 2024, from https://www.scribbr.com/research-process/research-proposal/


## Bivariate Analysis in Data Science: Theory, Tools and Practical Use Cases

Bivariate analysis is a fundamental technique in data science. It involves analyzing the relationship between two variables. Through bivariate analysis, data scientists can uncover patterns, correlations, and associations between variables, providing valuable insights into various fields, including biology, healthcare, genomics, the environment, and clinical research.

In this article, we explore the concept behind bivariate analysis, why it is important in data science, the software and programming languages used to perform it, and worked examples from data science in biology.

## What is Bivariate Regression Analysis?

Bivariate regression analysis is a specific type of regression analysis that examines the relationship between two variables: one independent variable and one dependent variable. It seeks to determine how changes in the independent variable are associated with changes in the dependent variable. Bivariate regression analysis is particularly useful for describing simple relationships between two variables, although a fitted regression on its own does not prove cause and effect.


## What is the Theory of Bivariate Analysis?

In bivariate regression analysis, the relationship between the independent variable X and the dependent variable Y is modeled using a straight-line equation. In simple terms, the analysis seeks the best-fitting line that describes the relationship between the two variables. This line is typically written in the form

$$y = mx + b$$

where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.
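The text does not say how the best-fitting line is chosen. A common choice, assumed here, is ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances between the points and the line. For $n$ data points $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, this gives:

$$m = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x}$$

The second formula shows that the least-squares line always passes through the point of means $(\bar{x}, \bar{y})$.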

When interpreting the results of bivariate analysis, the type of line or pattern observed provides insights into the relationship between the two variables:

- A straight line suggests a linear relationship, where changes in one variable are associated with proportional changes in the other variable.
- A positive slope indicates a positive relationship, where increases in one variable correspond to increases in the other variable.
- A negative slope indicates a negative relationship, where increases in one variable correspond to decreases in the other variable.
- A horizontal line suggests no change in the dependent variable as the independent variable changes.
- Clustered points or no apparent pattern suggest little to no relationship between the variables.

## What Are the Applications of Bivariate Analysis in Data Science?

Bivariate analysis is crucial in data science for several reasons:

- Identifying Relationships: By examining the relationship between two variables, data scientists can identify patterns and correlations, enabling them to make informed decisions.
- Predictive Modeling: Bivariate regression analysis forms the basis for predictive modeling, where the relationship between variables is used to predict future outcomes.
- Variable Selection: Understanding the relationship between variables helps in selecting the most relevant features for building predictive models, leading to more accurate results.
- Hypothesis Testing: Bivariate analysis allows researchers to test hypotheses and determine the significance of relationships between variables.


## How Is Bivariate Analysis Done?

For bivariate regression analysis, several software packages, tools, and programming languages are available to researchers and analysts. Most of them are general-purpose regression tools that handle models with many predictors; bivariate regression is simply the case with a single predictor. Here are some of the commonly used ones:

R: R provides built-in functions such as lm for linear models and glm for generalized linear models, which can handle bivariate regression analysis along with more complex regression models.

Python: Python offers libraries like statsmodels and scikit-learn, which provide functionality for regression analysis with one or more predictors, including bivariate analysis.


MATLAB: MATLAB’s Statistics and Machine Learning Toolbox includes functions for fitting regression models, allowing users to perform bivariate regression analysis.

SPSS: IBM SPSS Statistics offers regression procedures that cover bivariate analysis as well as models with several predictors.

SAS: SAS provides procedures like PROC REG and PROC GLM for fitting regression models, enabling users to conduct bivariate analysis as well.

STATA: STATA offers commands such as regress for fitting regression models, making it suitable for bivariate regression analysis and larger models.

JMP: JMP statistical software provides tools for fitting regression models, allowing users to perform bivariate regression analysis and explore relationships between two variables.

SPSS Modeler: SPSS Modeler offers a graphical interface for building and deploying predictive models, including regression models, making it suitable for bivariate analysis tasks.

Minitab: Minitab statistical software includes features for fitting regression models and conducting bivariate regression analysis, providing tools for data analysis and interpretation.

Excel: While Excel is not specifically designed for advanced statistical analysis, it can fit regression models, including bivariate ones, through built-in functions and add-ins.

Together, these tools let researchers and analysts conduct bivariate regression analysis efficiently, explore relationships between two variables, and derive insights from their data.

## Examples of Bivariate Analysis in Biological Data Science

## Bio-data Science:

Example: Studying the relationship between age and blood pressure in a population of individuals.

- Data Collection: Gather data on the age and blood pressure of individuals.
- Data Visualization: Create scatter plots to visualize the relationship between age and blood pressure.
- Bivariate Regression: Use statistical software like R to perform linear regression analysis to quantify the relationship between age and blood pressure.
- Interpretation: Analyze the regression coefficients to understand how age influences blood pressure.

## Healthcare Data Science:

Example: Investigating the association between smoking status and lung cancer risk.

- Data Collection: Collect data on smoking status (smoker/non-smoker) and the incidence of lung cancer in a population.
- Data Analysis: Conduct chi-square tests or logistic regression to assess the relationship between smoking status and lung cancer risk.
- Interpretation: Examine odds ratios or p-values to determine the strength and significance of the association.
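The steps above can be sketched for a 2×2 table as follows. The counts are hypothetical and invented purely for illustration; they are not real smoking or cancer data. The sketch uses the plain Pearson chi-square statistic (no continuity correction) and the fact that, with one degree of freedom, the p-value equals `erfc(sqrt(statistic / 2))`. The function names are also chosen for this example.

```python
from math import erfc, sqrt

# HYPOTHETICAL counts, invented for illustration only.
# Each tuple is (cases with lung cancer, cases without lung cancer).
smokers = (30, 70)
non_smokers = (10, 90)


def chi_square_2x2(row1, row2):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 d.f.)."""
    a, b = row1
    c, d = row2
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


def odds_ratio(row1, row2):
    """Odds of the outcome in row1 divided by the odds in row2."""
    a, b = row1
    c, d = row2
    return (a * d) / (b * c)


chi2, p = chi_square_2x2(smokers, non_smokers)
ratio = odds_ratio(smokers, non_smokers)
print(chi2, round(ratio, 2))  # 12.5 3.86
```

With these made-up counts the p-value is well below 0.001, and the odds ratio says the odds of the outcome are almost four times higher in the first group. Real analyses would also report a confidence interval and consider confounding variables.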

## Genomic Data Science:

Example: Exploring the correlation between gene expression levels and disease susceptibility.

- Data Collection: Obtain data on gene expression levels and disease status from genomic databases.
- Data Processing: Preprocess the data to remove noise and normalize gene expression values.
- Correlation Analysis: Use Pearson correlation or Spearman rank correlation to quantify the relationship between gene expression and disease susceptibility.
- Visualization: Create heatmaps or scatter plots to visualize the correlation patterns.
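To make the difference between the two coefficients concrete, here is a minimal sketch that computes both from scratch. The values are hypothetical, not real expression data, and the function names are chosen for this example. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks of the data, with tied values sharing their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# HYPOTHETICAL values: the score rises with expression, but not in a straight line.
expression = [1, 2, 3, 4, 5]
score = [1, 4, 9, 16, 25]

rho = spearman(expression, score)
r = pearson(expression, score)
```

Here `rho` is 1 because the relationship is perfectly monotonic, while `r` is slightly below 1 because it is not perfectly linear. That is the usual reason to prefer Spearman when a relationship is expected to be monotonic but curved.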

## Environmental Data Science:

Example: Assessing the relationship between pollution levels and respiratory diseases in urban areas.

- Data Collection: Gather data on pollution levels (e.g., PM2.5 concentrations) and respiratory disease cases in different urban areas.
- Data Analysis: Perform regression analysis to examine the impact of pollution levels on respiratory disease incidence.
- Spatial Analysis: Use geographic information systems (GIS) to map pollution hotspots and disease clusters.
- Policy Implications: Provide insights for policymakers to implement measures for reducing pollution and protecting public health.

## Clinical Data Science:

Example: Investigating the relationship between drug dosage and treatment outcomes in patients with a specific medical condition.

- Data Collection: Collect data on drug dosage, patient characteristics, and treatment outcomes from clinical trials or electronic health records.
- Data Analysis: Conduct bivariate analysis to determine the relationship between drug dosage and treatment response.
- Stratified Analysis: Perform subgroup analysis to assess whether the relationship varies based on patient demographics or disease severity.
- Clinical Decision Making: Use the findings to optimize drug dosing strategies and improve treatment outcomes for patients.

## Pharmaceutical Data Science:

Example: Examining the association between drug dosage and therapeutic response in patients with a specific medical condition.

- Data Collection: Collect data on drug dosage (independent variable) and treatment outcomes (dependent variable) from clinical trials or electronic health records.
- Regression Analysis: Perform linear regression to evaluate how changes in drug dosage affect treatment response.
- Dose Optimization: Use bivariate analysis to determine the optimal drug dosage for maximizing therapeutic efficacy while minimizing adverse effects.

Bivariate analysis is a versatile tool that plays a vital role in data science across various domains. By examining the relationship between two variables, researchers can gain valuable insights into complex phenomena and make data-driven decisions.

Whether it’s understanding the impact of environmental factors on health or uncovering genetic associations with diseases, bivariate analysis serves as a cornerstone in unlocking the mysteries hidden within vast datasets. Through careful analysis and interpretation, data scientists can harness the power of bivariate analysis to drive innovation and advance knowledge in their respective fields.


## Tanzeela Arshad


## A Quick Introduction to Bivariate Analysis

The term bivariate analysis refers to the analysis of two variables. You can remember this because the prefix “bi” means “two.”

The purpose of bivariate analysis is to understand the relationship between two variables. You can contrast this type of analysis with the following:

- Univariate Analysis: The analysis of one variable.
- Multivariate Analysis: The analysis of two or more variables.

There are three common ways to perform bivariate analysis:

1. Scatterplots.

2. Correlation Coefficients.

3. Simple Linear Regression.

This tutorial provides an example of each of these types of bivariate analysis using the following dataset that contains information about two variables: (1) Hours spent studying and (2) Exam score received by 20 different students:

## 1. Scatterplots

A scatterplot offers a visual way to perform bivariate analysis. It allows us to visualize the relationship between two variables by placing the value of one variable on the x-axis and the value of the other variable on the y-axis.

In the scatterplot below, we place hours studied on the x-axis and exam score on the y-axis:

We can clearly see that there is a positive relationship between the two variables: As hours studied increases, exam score tends to increase as well.

## 2. Correlation Coefficients

A correlation coefficient offers another way to perform bivariate analysis. The most common type of correlation coefficient is the Pearson Correlation Coefficient, which is a measure of the linear association between two variables. It has a value between -1 and 1 where:

- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables

This simple metric gives us a good idea of how two variables are related. In practice, we often use scatterplots and correlation coefficients to understand the relationship between two variables so we can visualize and quantify their relationship.
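As a rough illustration of how the coefficient is computed, the sketch below implements the Pearson formula in plain Python. The tutorial's 20-student dataset is not reproduced in the text, so the `hours` and `scores` lists are made-up values, and the function name `pearson_r` is simply an illustrative choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the two means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data (NOT the tutorial's dataset): hours studied and exam scores
hours = [1, 2, 3, 4, 5, 6]
scores = [68, 74, 77, 85, 86, 94]

print(round(pearson_r(hours, scores), 3))
```

A value close to 1 for these made-up numbers would correspond to the kind of upward-sloping scatterplot described earlier.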

## 3. Simple Linear Regression

A third way to perform bivariate analysis is with simple linear regression.

Using this method, we choose one variable to be an explanatory variable and the other variable to be a response variable. We then find the line that best “fits” the dataset, which we can then use to understand the exact relationship between the two variables.

For example, the line of best fit for the dataset above is:

Exam score = 69.07 + 3.85*(hours studied)

This means that each additional hour studied is associated with an average exam score increase of 3.85. By fitting this linear regression model, we can quantify the exact relationship between hours studied and exam score received.
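To make the idea of a line of best fit concrete, here is a minimal least-squares sketch in plain Python. It uses made-up data rather than the tutorial's dataset, so it will not reproduce the 69.07 and 3.85 reported above; the name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of the explanatory variable around its mean
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("the explanatory variable must not be constant")
    # Co-variation of the explanatory and response variables
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (NOT the tutorial's dataset)
hours = [1, 2, 3, 4, 5, 6]    # explanatory variable
scores = [68, 74, 77, 85, 86, 94]  # response variable

intercept, slope = fit_line(hours, scores)
print(f"Exam score = {intercept:.2f} + {slope:.2f}*(hours studied)")
```

The slope is read the same way as in the text: the average change in exam score associated with one additional hour studied.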


Bivariate analysis is one of the most common types of analysis used in statistics because we’re often interested in understanding the relationship between two variables.

By using scatterplots, correlation coefficients, and simple linear regression, we can visualize and quantify the relationship between two variables.

Often these three methods are all used together in an analysis to gain a full picture of how two variables are related, so it’s a good idea to familiarize yourself with each method.


Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike. My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.


Quantitative Data Analysis

## 4 Bivariate Analyses: Crosstabulation


Roger Clark

In most research projects involving variables, researchers do indeed investigate the central tendency and variation of important variables, and such investigations can be very revealing. But the typical researcher, using quantitative data analysis, is interested in testing hypotheses or answering research questions that involve at least two variables. A relationship is said to exist between two variables when certain categories of one variable are associated with, or go together with, certain categories of the other variable. Thus, for example, one might expect that in any given sample of men and women (assume, for the purposes of this discussion, that the sample leaves out nonbinary folks), men would tend to be taller than women. If this turned out to be true, one would have shown that there is a relationship between gender and height.

But before we go further, we need to make a couple of distinctions. One crucial distinction is that between an independent variable and a dependent variable. An independent variable is a variable a researcher suspects may affect or influence another variable. A dependent variable, on the other hand, is a variable that a researcher suspects may be affected or influenced by (or dependent upon) another variable. In the example of the previous paragraph, gender is the variable that is expected to affect or influence height and is therefore the independent variable. Height is the variable that is expected to be affected or influenced by gender and is therefore the dependent variable. Any time one states an expected relationship between two (or more) variables, one is stating a hypothesis. The hypothesis stated in the second-to-last sentence of the previous paragraph is that men will tend to be taller than women. We can map two-variable hypotheses in the following way (Figure 3.1):

When mapping a hypothesis, we normally put the variable we think to be affecting the other variable on the left and the variable we expect to be affected on the right and then draw arrows between the categories of the first variable and the categories of the second that we expect to be connected.

Quiz at the End of The Paragraph

Read the following report by Annie Lowrey about a study done by two researchers, Kearney and Levine. What is the main hypothesis, or at least the main finding, of Kearney and Levine’s study on the effects of watching 16 and Pregnant on adolescent women? How might you map this hypothesis (or finding)?

https://www.nytimes.com/2014/01/13/business/media/mtvs-16-and-pregnant-derided-by-some-may-resonate-as-a-cautionary-tale.html

We’d like to say a couple of things about what we think Kearney and Levine’s major hypothesis was and then introduce you to a way you might analyze data collected to test the hypothesis. Kearney and Levine’s basic hypothesis is that adolescent women who watched 16 and Pregnant were less likely to become pregnant than women who did not watch it. They find some evidence not only to support this basic hypothesis but also to support the idea that the ones who watched the show were less likely to get pregnant because they were more likely to seek information about contraception (and presumably to use it) than others. Your map of the basic hypothesis, at least as it applied to individual adolescent women, might look like this:

Let’s look at a way of showing a relationship between two nominal level variables: crosstabulation. Crosstabulation is the process of making a bivariate table for nominal level variables to show their relationship. But how does crosstabulation work?

Suppose you collected data from 8 adolescent women and the data looked like this:

Table 1: Data from Hypothetical Sample A

Person | Watched 16 and Pregnant | Got Pregnant |
---|---|---|
Person 1 | yes | no |
Person 2 | yes | no |
Person 3 | yes | no |
Person 4 | yes | yes |
Person 5 | no | yes |
Person 6 | no | yes |
Person 7 | no | yes |
Person 8 | no | no |

Quick Check : What percentage of those who have watched 16 and Pregnant in the sample have become pregnant? What percentage of those who have NOT watched 16 and Pregnant have become pregnant?

If you found that 25 percent of those who had watched the show became pregnant, while 75 percent of those who had not watched it did so, you have essentially done a crosstabulation in your head. But here’s how you can do it more formally and more generally.

First you need to take note of the number of categories in your independent variable (for “Watched 16 and Pregnant” it was 2: Yes and No). Then note the number of categories in your dependent variable (for “Got Pregnant” it was also 2: again, Yes and No). Now you prepare a “2 by 2” table like the one in Table 2, [1] labeling the columns with the categories of the independent variable and the rows with the categories of the dependent variable. Then decide where the first case should be put, as we’ve done, by determining the cell where its appropriate row and column “cross.” We’ve “crosstabulated” Person 1’s data by putting a mark in the box where the “Yes” for “watched” and the “No” for “got pregnant” cross.

Table 2. Crosstabulating Person 1’s Data from Table 1 Above

Got Pregnant? | Watched 16 and Pregnant: Yes | Watched 16 and Pregnant: No |
---|---|---|
Yes | | |
No | I | |

We’ve “crosstabulated” the first case for you. Can you crosstabulate the other seven cases? We’re going to call the cell in the upper left corner of the table cell “A,” the one in the upper right, cell “B,” the one in the lower left, cell “C,” and the one in the lower right, cell “D.” If you’ve finished your crosstabulation and had one case in cell A, 3 in cell B, 3 in cell C, and 1 in cell D, you’ve done great!

In order to interpret and understand the meaning of your crosstabulation, you need to take one more step, and that is converting those tally marks to percentages. To do this, you add up all the tally marks in each column, and then you determine what percentage of the column total is found in each cell in that column. You’ll see what that looks like in Table 3 below.

## Direction of the Relationship

Now, there are three characteristics of a crosstabulated relationship that researchers are often interested in: its direction, its strength, and its generalizability. We’ll define each of these in turn, as we come to it. The direction of a relationship refers to how categories of the independent variable are related to categories of the dependent variable. There are two steps involved in working out the direction of a crosstabulated relationship… and these are almost indecipherable until you’ve seen them done:

1. Percentage in the direction of the independent variable.

2. Compare percentages in one category of the dependent variable.

The first step actually involves three substeps. First you change the tally marks to numbers. Thus, in the example above, cell A would get a 1, B, a 3, C, a 3, and D, a 1. Second, you’d add up all the numbers in each category of the independent variable and put the total at the bottom of that column. Third, you would calculate the percentage of that total that falls into each cell along that column (as noted above). Once you’d done all that with the data we gave you above, you should get a table that looks like this (Table 3):

Table 3. Crosstabulation of Our Imaginary Data from a 16 and Pregnant Study

Got Pregnant? | Watched 16 and Pregnant: Yes | Watched 16 and Pregnant: No |
---|---|---|
Yes | 1 (25%) | 3 (75%) |
No | 3 (75%) | 1 (25%) |
Total | 4 (100%) | 4 (100%) |

Step 2 in determining the direction of a crosstabulated relationship involves comparing percentages in one category of the dependent variable. When we look at the “yes” category, we find that 25% of those who watched the show got pregnant, while 75% of those who did NOT watch the show got pregnant. Turning this from a percentage comparison to plain English, this crosstabulation would have shown us that those who did watch the show were less likely to get pregnant than those who did not. And that is the direction of the relationship.
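The two steps can also be sketched in a few lines of Python. This is only an illustration using the eight cases of Sample A; the variable names are our own, and the layout (independent variable in the columns, percentages computed down each column) follows the convention described above.

```python
from collections import Counter

# Sample A: (watched 16 and Pregnant, got pregnant) for Persons 1-8
cases = [
    ("yes", "no"), ("yes", "no"), ("yes", "no"), ("yes", "yes"),
    ("no", "yes"), ("no", "yes"), ("no", "yes"), ("no", "no"),
]

# Tally each case into the cell where its column (watched) and row (got pregnant) cross
counts = Counter(cases)

# Column totals: number of cases in each category of the independent variable
column_totals = Counter(watched for watched, _ in cases)

# Step 1: percentage in the direction of the independent variable (down each column)
for pregnant in ("yes", "no"):
    cells = []
    for watched in ("yes", "no"):
        n = counts[(watched, pregnant)]
        pct = 100 * n / column_totals[watched]
        cells.append(f"{n} ({pct:.0f}%)")
    print(f"Got pregnant = {pregnant}: " + " | ".join(cells))

# Prints:
# Got pregnant = yes: 1 (25%) | 3 (75%)
# Got pregnant = no: 3 (75%) | 1 (25%)
```

Step 2 is then a matter of reading across one printed row, for example comparing 25% with 75% in the "yes" row.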

Note: because we are designing our crosstabulations to have the independent variable in the columns, one of the simplest ways to look at the direction or nature of the relationship is to compare the percentages across the rows. Whenever you look at a crosstabulation, start by making sure you know which is the independent and which is the dependent variable and comparing the percentages accordingly.

## Strength of the Relationship

When we deal with the strength of a relationship, we’re dealing with the question of how reliably we can predict a sample member’s value or category of the dependent variable based on knowledge of that member’s value or category on the independent variable, just knowing the direction of the relationship. Thus, for the table above, it’s clear that if you knew that a person had watched 16 and Pregnant and you guessed she’d not gotten pregnant, you’d have a 75% (3 out of 4) chance of being correct; if you knew she hadn’t watched, and you guessed she had gotten pregnant, you’d have a 75% (3 out of 4) chance of being correct. Knowing the direction of this relationship would greatly improve your chances of making good guesses…but they wouldn’t necessarily be perfect all the time.

There are several measures of the strength of association and, if they’ve been designed for nominal level variables, they all vary between 0 and 1. When one of the measures is 0.00, it indicates that knowing a value of the independent variable won’t help you at all in guessing what a value of the dependent variable will be. When one of these measures is 1.00, it indicates that knowing a value of the independent variable and the direction of the relationship, you could make perfect guesses all the time. One of the simplest of these measures of strength, which can only be used when you have 2 categories in both the independent and dependent variables, is the absolute value of Yule’s Q. Because the “absolute value of Yule’s Q” is so relatively easy to compute, we will be using it a lot from now on, and it is the one formula in this book we would like you to learn by heart. We will be referring to it simply as |Yule’s Q|. Note that the “|” symbols on both sides of the ‘Yule’s Q’ are asking us to take whatever Yule’s Q computes to be and turn it into a positive number (its absolute value). So here’s the formula for Yule’s Q:

[latex]|\mbox{Yule's Q}| = \frac{|(A \times D) - (B \times C)|}{|(A \times D) + (B \times C)|} \\ \text{Where }\\ \text{$A$ is the number of cases in cell $A$ }\\ \text{$B$ is the number of cases in cell $B$ }\\ \text{$C$ is the number of cases in cell $C$ }\\ \text{$D$ is the number of cases in cell $D$}[/latex]

For the crosstabulation of Table 3,

[latex]|{\mbox{Yule's Q}}| = \frac{|(1 \times 1) - (3 \times 3)|}{|(1 \times 1) + (3 \times 3)|} = \frac{|1-9|}{|1+9|} = \frac{|-8|}{|10|} = \frac{8}{10} = .80[/latex]

In other words, the Yule’s Q is .80, much closer to the upper limit of Yule’s Q (1.00) than it is to its lower limit (0.00). So the relationship is very strong, indicating, as we already knew, that, given knowledge of the direction of the relationship, we could make a pretty good guess about what value on the dependent variable a case would have if we knew what value on the independent variable it had.
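For readers who like to check such arithmetic by machine, here is a small Python sketch of the same calculation. It uses the cross-products A × D and B × C, as in the worked numbers for Table 3; the function name `abs_yules_q` is our own invention.

```python
def abs_yules_q(a, b, c, d):
    """|Yule's Q| for a 2 x 2 table with cells A (upper left), B (upper right),
    C (lower left) and D (lower right)."""
    numerator = abs(a * d - b * c)
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("Yule's Q is undefined when A*D + B*C is zero")
    return numerator / denominator


# Cell counts from Table 3: A = 1, B = 3, C = 3, D = 1
print(abs_yules_q(1, 3, 3, 1))  # 0.8
```

The explicit check on the denominator is a design choice of this sketch: a table whose cross-products are both zero gives no basis for computing the measure at all.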

Practice Exercise

Suppose you took three samples of four adolescent women apiece and obtained the following data on the 16 and Pregnant topic:

First sample: Watched | First sample: Got Pregnant | Second sample: Watched | Second sample: Got Pregnant | Third sample: Watched | Third sample: Got Pregnant |
---|---|---|---|---|---|
Yes | No | Yes | No | Yes | Yes |
Yes | No | Yes | Yes | Yes | Yes |
No | Yes | No | Yes | No | No |
No | Yes | No | No | No | No |

See if you can determine both the direction and strength of the relationship between having watched “16 and Pregnant” and becoming pregnant in each of these imaginary samples. In what ways does each sample, other than sample size, differ from Sample A above? Answers to be found in the footnote. [2]

Roger now wants to share with you a discovery he made, using crosstabulation, after analyzing some data collected by two now post-graduate students of his, Angela Leonardo and Alyssa Pollard. At the time of this writing, they had just coded their first night of TV commercials, looking for the gender of the authoritative “voice-over,” the disembodied voice that tells viewers key stuff about the product. It’s been generally found in gender studies that these voice-overs are overwhelmingly male (e.g., O’Donnell and O’Donnell 1978; Lovdal 1989; Bartsch et al. 2000), even though the percentage of such voice-overs that were male had dropped from just over 90 percent in the 1970s and 1980s to just over 70 percent in 1998. We will be looking at considerably more data, but so far things are so interesting that Roger wants to share them with you…and you’re now sophisticated enough about crosstabs (shorthand for crosstabulations) to appreciate them. Thus, Table 3.4 suggests that things have changed a great deal. In fact the direction of the relationship between the time period of the commercials and the gender of the voice-over is clearly that more recent commercials are much more likely to have a female voice-over than older ones. While only 29 percent of commercials in 1998 had a female voice-over, 71 percent in 2020 did so. And a Yule’s Q of .72 indicates that the relationship is very strong.

Table 3.4 Crosstabulation of Year of Commercial and Gender of the Voice-Over

Gender of Voice-Over | 1998 | 2020 |
---|---|---|
Male | 432 (71%) | 14 (29%) |
Female | 177 (29%) | 35 (71%) |

Notes: |Yule’s Q| = 0.72; 1998 data from Bartsch et al., 2001.
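As a check on the reported value, the counts in Table 3.4 can be plugged into the Yule’s Q calculation. This assumes the same cell layout used earlier in the chapter (A upper left, B upper right, C lower left, D lower right), so that A = 432, B = 14, C = 177 and D = 35:

[latex]|\mbox{Yule's Q}| = \frac{|(432 \times 35) - (14 \times 177)|}{|(432 \times 35) + (14 \times 177)|} = \frac{|15120 - 2478|}{|15120 + 2478|} = \frac{12642}{17598} \approx .72[/latex]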

Yule’s Q, while relatively easy to calculate, has a couple of notable limitations. One is that if one of the four cells in a 2 x 2 table (a table based on an independent variable with 2 categories and a dependent variable with 2 categories) has no cases, the calculated Yule’s Q will be 1.00, even if the relationship isn’t anywhere near that strong. (Why don’t you try it with a sample that has 5 cases on cell A, 5 in cell B, 5 in cell C, and 0 in cell D?)

Another problem with Yule’s Q is that it can only be used to describe 2 x 2 tables. But not all variables have just 2 categories. As a consequence, there are several other measures of strength of association for nominal level variables that can handle bigger tables. (One that we recommend for sheep farmers is lambda. Bahhh!) But, we most typically use one called Cramer’s V, which shares with Yule’s Q (and lambda) the property of varying between 0 and 1. Roger normally advises students that values of Cramer’s V between 0.00 and 0.10 suggest that the relationship is weak; between 0.11 and 0.30, that the relationship is moderately strong; between 0.31 and 0.59, that the relationship is strong; and between 0.60 and 1.00, that the relationship is very strong. Associations (a fancy word for the strength of the relationship) above 0.59 are not so common in social science research.

An example of the use of Cramer’s V? Roger used statistical software called the Statistical Package for the Social Sciences (SPSS) to analyze the data Angela, Alyssa and he collected about commercials (on one night) to see whether men or women, both or neither, were more likely to appear as the main characters in commercials focused on domestic goods (goods used inside the home) and non-domestic goods (goods used outside the home). Who (men or women or both) would you expect to be such (main) characters in commercials involving domestic products? Non-domestic products? If you guessed that females might be the major characters in commercials for domestic products (e.g., food, laundry detergent, and home remedies) and males might be major characters in commercials for non-domestic products (e.g., cars, trucks, cameras), your guesses would be consistent with findings of previous researchers (e.g., O’Donnell and O’Donnell, 1978; Lovdal, 1989; Bartsch et al., 2001). The data we collected on our first night of data collection suggest some support for these findings (and your expectations), but also some support for another viewpoint. Table 3.5, for instance, shows that women were, in fact, the main characters in about 47 percent of commercials for domestic products, while they were the main characters in only about 13 percent of commercials for non-domestic products. So far, so good. But males, too, were more likely to be main characters in commercials for domestic products (they were these characters about 24 percent of the time) than they were in commercials for non-domestic products (for which they were the main character only about 4 percent of the time). So who were the main product “representatives” for non-domestic commercials? We found that in these commercials at least one man and one woman were together the main characters about 50 percent of the time, while men and women together were the main characters only about 18 percent of the time in commercials for domestic products.

But the analysis involving gender of main character and whether products were domestic or non-domestic involved more than a 2 x 2 table. In fact, it involved a 2 x 4 table because our dependent variable, gender of main character, had four categories: female, male, both, and neither. Consequently, we couldn’t use Yule’s Q as a measure of strength of association. But we could ask, and did ask (using SPSS), for Cramer’s V, which turned out to be about 0.53, suggesting (if you re-examine Roger’s advice above) that the relationship is a strong one.

Table 3.5 Crosstabulation of Type of Commercial and Gender of Main Character

Gender of Main Character | Domestic Products | Non-Domestic Products |
---|---|---|
Female | 18 (47.4%) | 3 (12.5%) |
Male | 9 (23.7%) | 1 (4.2%) |
Both | 7 (18.4%) | 12 (50%) |
Neither | 4 (10.5%) | 8 (33.3%) |

Notes: Cramer’s V = 0.53
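The chapter obtains Cramer’s V from SPSS and does not give its formula. As a hedged sketch, the code below uses the standard textbook definition, V = √(χ² / (N × (k − 1))), where χ² is Pearson’s chi-square (a statistic taken up in the section on generalizability), N is the number of cases and k is the smaller of the number of rows and columns. The function names are our own, and the sketch assumes every row and column contains at least one case.

```python
from math import sqrt


def chi_square(table):
    """Pearson's chi-square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def cramers_v(table):
    """Cramer's V: sqrt(chi-square / (N * (k - 1))), k = min(rows, columns)."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / (n * (k - 1)))


# Table 3.5 counts: rows = female, male, both, neither;
# columns = domestic, non-domestic
table_3_5 = [[18, 3], [9, 1], [7, 12], [4, 8]]

print(round(cramers_v(table_3_5), 2))  # 0.53
```

With these counts the sketch agrees with the value of about 0.53 reported from SPSS.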

## Generalizability of the Relationship

When we speak of the generalizability of a relationship, we’re dealing with the question of whether something like the relationship (in direction, if not strength) that is found in the sample can be safely generalized to the larger population from which the sample was drawn. If, for instance, we drew a probability sample of eight adolescent women like the ones we pretended to draw in the first example above, we’d know we have a sample in which a strong relationship existed between watching “16 and Pregnant” and not becoming pregnant. But how could one tell that this sample relationship was likely to be representative of the true relationship in the larger population?

If you recall the distinction we drew between descriptive and inferential statistics in the chapter on Univariate Analysis, you won’t be surprised to learn that we are now entering the realm of inferential statistics for bivariate relationships. When we use percentage comparisons within one category of the dependent variable to determine the direction of a relationship and measures like Yule’s Q and Cramer’s V to get at its strength, we’re using descriptive statistics, ones that describe the relationship in the sample. But when we talk about Pearson’s chi-square (or χ²), we’re referring to an inferential statistic, one that can help us determine whether we can generalize that something like the relationship in the sample exists in the larger population from which the sample was drawn.

But, before we learn how to calculate and interpret Pearson’s chi-square, let’s get a feel for the logic of this inferential statistic first. Scientists generally, and social scientists in particular, are very nervous about inferring that a relationship exists in the larger population when it really doesn’t exist there. This kind of error (the one you’d make if you inferred that a relationship existed in the larger population when it didn’t really exist there) has a special name: a Type 1 error. Social scientists are so anxious about making Type 1 errors that they want to keep the chances of making them very low, but not impossibly low. If they made them impossibly low, then they’d risk making the opposite of a Type 1 error: a Type 2 error, the kind of error you’d make when you failed to infer that a relationship existed in the larger population when it really did exist there. The chances, or probability, of something happening can vary from 0.00 (when there’s no chance at all of it happening) to 1.00, when there’s a perfect chance that it will happen. In general, social scientists aim to keep the chances of making a Type 1 error below .05, or below a 1 in 20 chance. They thus aim for a very small, but not impossibly small, chance of making the inference that a relationship exists in the larger population when it doesn’t really exist there.

Karl Pearson, the statistician whose name is associated with Pearson’s chi-square, studied the statistic’s properties in about 1900. He found, among other things, that crosstabulations of different sizes (i.e., different numbers of cells) required a different chi-square to be associated with a .05 chance, or probability (p), of making a Type 1 error or less. As the number of cells increases, the required chi-square increases as well. For a 2 x 2 table, the critical chi-square is 3.84 (that is, the computed chi-square value should be 3.84 or more for you to infer that a relationship exists in the larger population with only a .05 chance, or less, of being wrong); for a 2 x 3 table, the critical chi-square is 5.99, and so on. Before we were able to use statistical processing software like SPSS, statistical researchers relied on tables that outlined the critical values of chi-square for different size tables (degrees of freedom, to be discussed below) and different probabilities of making a Type 1 error. A truncated (shortened) version of such a table can be seen in Table 6.

Table 6: Table of Critical Values of the Chi-Square Distribution

Degrees of Freedom | p = 0.10 | p = 0.05 | p = 0.025 | p = 0.001 |
---|---|---|---|---|
1 | 2.706 | 3.841 | 5.024 | 10.828 |
2 | 4.605 | 5.991 | 7.378 | 13.816 |
3 | 6.251 | 7.815 | 9.348 | 16.266 |
4 | 7.779 | 9.488 | 11.143 | 18.467 |
5 | 9.236 | 11.070 | 12.833 | 20.515 |
6 | 10.645 | 12.592 | 14.449 | 22.458 |
7 | 12.017 | 14.067 | 16.013 | 24.322 |

And so on…

Now you’re ready to see how to calculate chi-square. The formula for chi-square (Χ²) is:

[latex]\chi^2 = \sum\frac{(O-E)^2}{E}\\ \text{where}\\ \text{$\sum$ means "the sum of"}\\ \text{$O =$ the observed number of cases in each cell in the sample}\\ \text{$E =$ the expected number in each cell, if there were no relationship between the two variables}[/latex]

Let’s see how this would work with the example of the imaginary data in Table 3. This table, if you recall, looked (mostly) like this:

Table 7 (Slightly Revised) Crosstabulation of Our Imaginary Data from a “16 and Pregnant” Study

Got Pregnant? | Watched 16 and Pregnant: Yes | Watched 16 and Pregnant: No | Row Marginals |
---|---|---|---|
Yes | 1 | 3 | 4 |
No | 3 | 1 | 4 |
Column Marginals | 4 | 4 | |

How do you figure out what the expected number of cases would be in each cell? You use the following formula:

[latex]E = \frac{M_r \times M_c}{N}\\ \text{Where}\\ \text{$M_r$ is the row marginal for the cell}\\ \text{$M_c$ is the column marginal for the cell}\\ \text{$N$ is the total number of cases in the sample}[/latex]

A row marginal is the total number of cases in a given row of a table. A column marginal is the total number of cases in a given column of a table. For this table, the N is 8, the total number of cases involved in the crosstabulation. For cell A, the row marginal is 4 and the column marginal is 4, which means its expected number of cases would be (4 x 4)/8 = 16/8 = 2. In this particular table, all the cells would have had an expected frequency (or number of cases) of 2. So now all we have to do to compute χ² is to make a series of calculation columns:

Cell | Observed Number of Cases in Cell (O) | Expected Number of Cases in Cell (E) | (O-E) | (O-E)² | (O-E)²/E |
---|---|---|---|---|---|
A | 1 | 2 | -1 | 1 | ½ |
B | 3 | 2 | 1 | 1 | ½ |
C | 3 | 2 | 1 | 1 | ½ |
D | 1 | 2 | -1 | 1 | ½ |

And the sum of all the numbers in the (O-E)²/E column is 2.00. This is less than the 3.84 that χ² needs to be for us to conclude that the chances of making a Type 1 error are less than .05 (see Table 6), so we cannot safely generalize that something like the relationship in this small sample exists in the larger population. Aren’t you glad that these days programs like SPSS can do these calculations for us? Even though they can, it’s important to go through the process a few times on your own so that you understand what it is that the computer is doing.
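Written out as a single sum over cells A, B, C and D, using the observed counts from Table 7 (1, 3, 3 and 1) and the expected count of 2 in every cell, the calculation is:

[latex]\chi^2 = \frac{(1-2)^2}{2} + \frac{(3-2)^2}{2} + \frac{(3-2)^2}{2} + \frac{(1-2)^2}{2} = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = 2.00[/latex]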

Chi-square varies based on three characteristics of the sample relationship. The first of these is the number of cells. Higher chi-squares are more easily achieved in tables with more cells; hence the 3.84 standard for 2 x 2 tables and the 5.99 standard for 2 x 3 tables. You’ll recall from Table 6 that we used the term degrees of freedom to refer to the calculation of table size. To figure out the degrees of freedom for a crosstabulation, you simply count the number of columns in the table (only the columns with data in them, not columns with category names) and subtract one. Then you count the number of rows in the table, again only those with data in them, and subtract one. Finally, you multiply the two numbers you have computed. Therefore, the degrees of freedom for a 2×2 table will be 1 [(2-1)*(2-1)], while the degrees of freedom for a 4×6 table will be 15 [(4-1)*(6-1)].
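The degrees-of-freedom rule is simple enough to express as a one-line function. The sketch below pairs it with the .05 critical values for 1 to 7 degrees of freedom from the standard chi-square table; the names `degrees_of_freedom` and `CRITICAL_05` are illustrative only.

```python
def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) * (columns - 1), counting only rows and columns that hold data."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("a crosstabulation needs at least 2 rows and 2 columns")
    return (n_rows - 1) * (n_cols - 1)


# Critical chi-square values at the .05 level for 1 to 7 degrees of freedom
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
               5: 11.070, 6: 12.592, 7: 14.067}

for rows, cols in [(2, 2), (2, 3), (4, 2)]:
    df = degrees_of_freedom(rows, cols)
    print(f"{rows} x {cols} table: df = {df}, critical chi-square = {CRITICAL_05[df]}")
```

A computed chi-square at or above the critical value for the table's degrees of freedom corresponds to a chance of a Type 1 error of .05 or less.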

Higher chi-squares will also be achieved when the relationship is stronger. If, instead of the 1, 3, 3, 1 pattern in the four cells above (a relationship that yields a Yule’s Q of 0.80), one had a 0, 4, 4, 0 pattern (a relationship that yields a Yule’s Q of 1.00), the chi-square would be 8.00, [3] considerably greater than the 3.84 standard, and one could then generalize that something like the relationship in the sample also existed in the larger population.

But chi-square also varies with the size of the sample. Thus, if instead of the 1, 3, 3, 1 pattern above, one had a 10, 30, 30, 10 pattern (both of which would yield a Yule’s Q of 0.80 and are therefore of the same strength, and both of which have the same number of cells, 4), the chi-square would compute to be 20, instead of 2, and give pretty clear guidance to infer that a relationship exists in the larger population. The message of this last co-variant of chi-square (that it grows as the sample grows) implies that researchers who want to find generalizable results do well to increase sample size. A sample relationship that can be generalized to the larger population is said to be significant, sometimes a desirable and often an interesting thing. [4] Incidentally, SPSS computed the chi-square for the crosstabulation in Table 3.5, the one that showed the relationship between type of product advertised (domestic or non-domestic) and the gender of the product representative, to be 17.5. Even for a 2 x 4 table like that one, this is high enough to infer that a relationship exists in the larger population, with less than a .05 chance of being wrong. In fact, SPSS went even further, telling us that the chances of making a Type 1 error were less than .001. (Aren’t computers great?)
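The effect of strength and sample size on chi-square can be seen by running the same calculation on the three cell patterns just discussed. This is a minimal sketch for 2 x 2 tables only, with a function name of our own choosing; it assumes no row or column total is zero.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square for a 2 x 2 table with cells A, B (top row)
    and C, D (bottom row)."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count = row marginal * column marginal / N
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Same direction in all three tables; strength and sample size vary
for cells in [(1, 3, 3, 1), (0, 4, 4, 0), (10, 30, 30, 10)]:
    print(cells, chi_square_2x2(*cells))

# Prints:
# (1, 3, 3, 1) 2.0
# (0, 4, 4, 0) 8.0
# (10, 30, 30, 10) 20.0
```

Only the second and third patterns exceed the 3.84 needed for a 2 x 2 table at the .05 level.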

## Crosstabulation with Two Ordinal Level Variables

We’ve introduced crosstabulation as a technique designed for the analysis of the relationship between two nominal level variables. But because all variables are at least nominal level, one could theoretically use crosstabulation to analyze the relation between variables of any scale. [5] In the case of two interval level variables, however, there are much more elegant techniques for doing so and we’ll be looking at those in the chapter on correlation and regression . If one were looking into the relationship between a nominal level variable (say, gender, with the categories male and female) [6] and an ordinal level variable (say, happiness with marriage with the three categories: very happy, happy, not so happy), one could simply use all the same techniques for determining the direction, strength, and generalizability we’ve discussed above.

If we chose to analyze the relationship between two ordinal level variables, however, we could still use crosstabulation, but we might want to use a more elegant way of determining direction and strength of relationship than by comparing percentages and seeing what Cramer’s V tells us. One very cool statistic used for determining the direction and strength of a relationship between two ordinal level variables is gamma . Unlike Cramer’s V and Yule’s Q, whose values only vary between 0.00 and 1.00, and therefore can only speak to the strength of a relationship, gamma’s possible values are between -1.00 and 1.00. This one statistic can tell us about both the direction and the strength of the relationship. Thus, a gamma of zero still means there is no relationship between the two variables. But a gamma with a positive sign not only reveals strength (a gamma of 0.30 indicates a stronger relationship than one of 0.10), but it also says that as values of the independent variable increase, so do values of the dependent variable. And a gamma with a negative sign not only reveals strength (a gamma of -0.30 indicates a stronger relationship than one of -0.10), but also says that as values of the independent variable increase, values of the dependent variable decrease. But what exactly do we mean by “values,” here?

Let’s explore a couple of examples from the GSS (via the Social Data Archive, or SDA). Table 8 shows the relationship between the happiness of GSS respondents’ marriages (HAPMAR) and their general happiness (HAPPY) over the years. Using our earlier way of determining direction, we can see that 90 percent of those who are “very happy” generally are also very happy in their marriages, while only 19.5 percent of those who are “not too happy” generally are pretty happy in their marriages. Pretty clear that marital happiness and general happiness are related, right?

Table 8. Crosstabulation of Marital Happiness and General Happiness, GSS data from SDA

Frequency Distribution (cells contain N of cases)

HAPMAR (rows) by HAPPY (columns) | 1 very happy | 2 pretty happy | 3 not too happy | Total |
---|---|---|---|---|
1: very happy | 11,666 | 7,938 | 894 | |
2: pretty happy | 1,237 | 8,617 | 1,120 | |
3: not too happy | 51 | 433 | 503 | |
Means | 1.10 | 1.56 | 1.84 | 1.40 |
Std Devs | .32 | .54 | .72 | .55 |
Unweighted N | 12,954 | 16,988 | 2,517 | 32,459 |


Summary Statistics | | | | | | |
---|---|---|---|---|---|---|
Eta* = | .46 | Gamma = | .75 | Rao-Scott-P: F(4,2360) = | 1,807.32 | (p= 0.00) |
R = | .46 | Tau-b = | .45 | Rao-Scott-LR: F(4,2360) = | 1,709.73 | (p= 0.00) |
Somers’ d* = | .42 | Tau-c = | .35 | Chisq-P(4) = | 8,994.28 | |
 | | | | Chisq-LR(4) = | 8,508.63 | |

*Row variable treated as the dependent variable.

The more elegant way is to look at the statistics at the bottom of the table. Most of these statistics aren’t helpful to us now. But one, gamma, certainly is. You’ll note that gamma is 0.75. There are two important attributes of this statistic: its sign (positive) and its magnitude (0.75). The former tells you that as coded values of marital happiness (1=very happy; 2=pretty happy; 3=not too happy) go up, values of general happiness (1=very happy; 2=pretty happy; 3=not too happy) tend to go up as well. We can interpret this by saying that respondents who are less happy with their marriages are likely to be less happy generally than others. (Notice that this also means that people who are happy in their marriages are also likely to be more generally happy than others.) But the 0.75, independent of the sign, means that this relationship is very strong.

By the way, you might also notice that there is a little parenthetical expression at the end of the row gamma is on in the statistics box: (p= 0.00). The “p” stands for the chances (probability) of making a Type 1 error, and is sometimes called the “p value” or the significance level. The fact that the “p value” here is 0.00 does NOT mean that there is zero chance of making an error if you infer that there is a relationship between marital happiness and general happiness in the larger population. There will always be such a chance. But the SDA printouts of such values give up after two digits to the right of the decimal point. All one can really say, then, is that the chances of making a Type 1 error are less than 0.01 (which itself is less than 0.05), and so researchers would conclude that they could reasonably generalize.
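To see why a printed value of 0.00 is not a true zero, consider the following minimal Python sketch. The p-values in it are hypothetical, chosen only to illustrate two-decimal rounding; they are not taken from any SDA output.

```python
# Hypothetical p-values (not from SDA), used only to illustrate rounding.
hypothetical_p_values = [0.0004, 0.004, 0.012, 0.07]

# Format each one the way a two-decimal printout would display it.
printed = [f"{p:.2f}" for p in hypothetical_p_values]

for p, shown in zip(hypothetical_p_values, printed):
    print(f"actual p = {p}  ->  printed as (p= {shown})")
```

The first two values differ by a factor of ten, yet both display as 0.00, which is why the safest reading of such a printout is "smaller than 0.01" rather than "zero."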

To emphasize the importance of the sign of gamma (+ or -), let’s have a look at Table 9, which displays the relationship between job satisfaction, whose coded values are 1=very dissatisfied; 2=a little dissatisfied; 3=moderately satisfied; 4=very satisfied, and general happiness, whose codes are the same as they were in Table 8. You can probably tell from looking at the internal percentages of the table that as job satisfaction increases so does general happiness, as one might expect. But the sign of the gamma of -0.43 might at first persuade you that there is a negative association between job satisfaction and happiness, until you remember that what it’s really telling you is that when the coded values of job satisfaction go up, from 1 (very dissatisfied) to 4 (very satisfied), the coded values of happiness go down, from 3 (not too happy) to 1 (very happy). Which really means that as job satisfaction goes up, happiness goes up as well, right? Note, however, that if we reversed the coding for the job satisfaction variable, so that 1 represented being very satisfied with your job while 4 represented being very dissatisfied, the direction of gamma would reverse. Thus, it is essential that data analysts do not stop at looking at whether gamma is positive or negative, but rather also ensure they understand the way the variable is coded (its attributes).
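The effect of reverse coding can be checked directly. The sketch below is a minimal illustration in Python, not SDA’s own procedure: it computes gamma from the weighted cell counts in Table 9 using the standard concordant and discordant pairs definition, then recomputes it after flipping the order of the job satisfaction columns. The function name `gamma` and the variable names are assumptions made for this example.

```python
def gamma(table):
    """Goodman and Kruskal's gamma for a crosstab whose rows and columns
    are listed in ascending order of their coded values."""
    concordant = discordant = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:                   # cells with a higher row code
                concordant += count * sum(lower_row[j + 1:])  # and a higher column code
                discordant += count * sum(lower_row[:j])      # but a lower column code
    return (concordant - discordant) / (concordant + discordant)


# Weighted N from Table 9. Rows: happy (1 = very happy ... 3 = not too happy).
# Columns: satjob2 (1 = very dissatisfied ... 4 = very satisfied).
table_9 = [
    [283.0, 722.3, 4317.4, 10134.3],
    [955.7, 2877.8, 11716.0, 10982.8],
    [631.3, 1034.4, 2032.6, 1448.8],
]

# Reverse coding: column 1 now means very satisfied, column 4 very dissatisfied.
table_9_recoded = [row[::-1] for row in table_9]

gamma_original = gamma(table_9)
gamma_recoded = gamma(table_9_recoded)

print(f"original coding: gamma = {gamma_original:.2f}")
print(f"reversed coding: gamma = {gamma_recoded:.2f}")
```

The magnitude stays the same under both codings and only the sign changes, so the strength of the relationship is unaffected by how the categories happen to be numbered.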

Also note here that the 0.43 portion of the gamma tells you how strong this relationship is—it’s strong, but not as strong as the relationship between marital happiness and general happiness (which had a gamma of 0.75). The “p value” here again is .00, which means that it’s less than .01, which of course is less than .05, and we can infer that there’s very probably a relationship between job satisfaction and general happiness in the larger population from which this sample was drawn.

Table 9. Crosstabulation of Job Satisfaction and General Happiness, GSS data from SDA

Frequency Distribution (cells contain Weighted N)

happy (rows) by satjob2 (columns) | 1 Very Dissatisfied | 2 A Little Dissatisfied | 3 Moderately Satisfied | 4 Very Satisfied | Total |
---|---|---|---|---|---|
1: very happy | 283.0 | 722.3 | 4,317.4 | 10,134.3 | |
2: pretty happy | 955.7 | 2,877.8 | 11,716.0 | 10,982.8 | |
3: not too happy | 631.3 | 1,034.4 | 2,032.6 | 1,448.8 | |
Means | 2.19 | 2.07 | 1.87 | 1.62 | 1.78 |
Std Devs | .67 | .61 | .58 | .60 | .62 |
Unweighted N | 1,907 | 4,539 | 17,514 | 22,091 | 46,051 |

Summary Statistics | | | | | | |
---|---|---|---|---|---|---|
Eta* = | .28 | Gamma = | -.43 | Rao-Scott-P: F(6,3396) = | 584.48 | (p= 0.00) |
R = | -.28 | Tau-b = | -.26 | Rao-Scott-LR: F(6,3396) = | 545.83 | (p= 0.00) |
Somers’ d* = | -.25 | Tau-c = | -.23 | Chisq-P(6) = | 4,310.95 | |
 | | | | Chisq-LR(6) = | 4,025.87 | |

*Row variable treated as the dependent variable.

We haven’t shown you the formula for gamma, but it’s not that difficult to compute. In fact, when you have a 2 x 2 table gamma is the same as Yule’s Q, except that it can take on both positive and negative values. Obviously, Yule’s Q could do that as well, if it weren’t for the absolute value symbols surrounding it. As a consequence, you can use gamma as a substitute for Yule’s Q for 2 x 2 tables when using the SDA interface to access GSS data—as long as you remember to take the absolute value of gamma that is calculated for you. Thus, in Table 10, showing the relationship between gender and whether or not a respondent was married, the absolute value of the reported gamma—that is, |-0.11|=0.11—is the Yule’s Q for the relationship. And it is clearly weak. By the way, the p value here, 0.07, indicates that we cannot safely infer that a similar relationship existed in the larger population in 2010.
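For readers curious about the computation mentioned above, gamma is conventionally defined as (C - D) / (C + D), where C is the number of concordant pairs of cases (one case ranks higher than the other on both variables) and D is the number of discordant pairs (higher on one variable, lower on the other). The following Python sketch applies that definition to the cell counts printed in Tables 8 and 10. It is an illustration under that standard definition, not a description of SDA’s internals, and the names `gamma`, `table_8`, and `table_10` are assumptions made for this example.

```python
def gamma(table):
    """Goodman and Kruskal's gamma, (C - D) / (C + D), for a crosstab whose
    rows and columns are listed in ascending order of their coded values."""
    concordant = discordant = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:                   # cells with a higher row code
                concordant += count * sum(lower_row[j + 1:])  # and a higher column code
                discordant += count * sum(lower_row[:j])      # but a lower column code
    return (concordant - discordant) / (concordant + discordant)


# Cell counts copied from Table 8 (rows: HAPMAR 1-3, columns: HAPPY 1-3).
table_8 = [
    [11666, 7938, 894],
    [1237, 8617, 1120],
    [51, 433, 503],
]

# Weighted cell counts copied from Table 10 (rows: married 0-1, columns: sex 1-2).
table_10 = [
    [420.9, 565.9],
    [506.1, 549.5],
]

gamma_8 = gamma(table_8)
gamma_10 = gamma(table_10)
yules_q_10 = abs(gamma_10)  # Yule's Q as described in this chapter: |gamma| for a 2 x 2 table

print(f"Table 8 gamma:     {gamma_8:.2f}")
print(f"Table 10 gamma:    {gamma_10:.2f}")
print(f"Table 10 Yule's Q: {yules_q_10:.2f}")
```

For a 2 x 2 table the loops reduce to a single concordant product and a single discordant product, which is why gamma and Yule’s Q coincide there apart from the sign.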

Table 10. Crosstabulation of Gender and Marital Status in 2010, GSS data from SDA

Frequency Distribution (cells contain Weighted N)

married (rows) by sex (columns) | 1 male | 2 female | Total |
---|---|---|---|
0: not married | 420.9 | 565.9 | |
1: married | 506.1 | 549.5 | |
Means | .55 | .49 | .52 |
Std Devs | .50 | .50 | .50 |
Unweighted N | 891 | 1,152 | 2,043 |


Summary Statistics | | | | | | |
---|---|---|---|---|---|---|
Eta* = | .05 | Gamma = | -.11 | Rao-Scott-P: F(1,78) = | 3.29 | (p= 0.07) |
R = | -.05 | Tau-b = | -.05 | Rao-Scott-LR: F(1,78) = | 3.29 | (p= 0.07) |
Somers’ d* = | -.05 | Tau-c = | -.05 | Chisq-P(1) = | 5.75 | |
 | | | | Chisq-LR(1) = | 5.76 | |

*Row variable treated as the dependent variable.

One problem with an SDA output is that none of the statistics reported (not the Eta, the R, the Tau-b, etc.) are actually designed to measure the strength of relationship between two purely nominal level variables—Cramer’s V and Yule’s Q, for instance, are not provided in the output. All of the measures that are provided, however, do have important uses. To learn more about these and other measures of association and the circumstances in which they should be used, see the chapter focusing on measures of association .

- independent variable
- dependent variable
- crosstabulation
- direction of a relationship
- strength of a relationship
- generalizability of relationship
- Type 1 error
- Type 2 error
- Pearson’s chi-square
- null hypothesis

Case | Gender | Height |
---|---|---|
Person 1 | Child | Short |
Person 2 | Adult | Tall |
Person 3 | Child | Short |
Person 4 | Adult | Tall |
Person 5 | Child | Short |
Person 6 | Adult | Tall |
Person 7 | Child | Short |
Person 8 | Adult | Tall |

Return to the Social Data Archive we’ve explored before. The data, again, are available at https://sda.berkeley.edu/ . Go down to the second full paragraph and click on the “SDA Archive” link you’ll find there. Then scroll down to the section labeled “General Social Surveys” and click on the first link there: General Social Survey (GSS) Cumulative Datafile 1972-2021 release.

Now type “hapmar” in the row box and “satjob” in the column box. Hit “output options” and find the “percentaging” options and make sure “column” is clicked. (Satjob will be our independent variable here, so we want column percentages.) Now click on “summary statistics,” under “other options.” Hit on “run the table,” examine the resulting printout and write a short paragraph in which you use gamma and the p-value to evaluate the hypothesis that people who are more satisfied with their jobs are more likely to be happily married than those who are less satisfied with their jobs. Your paragraph should mention the direction, strength and generalizability of the relationship as well as what determinations you can make in terms of the null and research hypotheses.

## Media Attributions

- A Mapping of the Hypothesis that Men Will Tend to be Taller than Women © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
- A Mapping of Kearney and Levine’s Hypothesis © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license

- [1] If one of your variables had three categories, it might be a “2 by 3” table. If both variables had 3 categories, you’d want a 3 by 3 table, etc.
- [2] Answers: In Sample 1, the direction of the relationship is the same as it was in Sample A (those who watched the show were less likely than those who didn’t), but its strength is greater (Yule’s Q = 1.00, rather than 0.80). In Sample 2, there is no direction of the relationship (those who watched the show were just as likely to get pregnant as those who didn’t) and its strength is as weak as it could be (Yule’s Q = 0.00). In Sample 3, the direction of the relationship is the opposite of what it was in Sample A. In this case, those who watched the show were more likely to get pregnant than those who didn’t. And the strength of the relationship was as strong as it could be (Yule’s Q = 1.00).
- [3] Can you double-check Roger’s calculation of chi-square for this arrangement to make sure he’s right? He’d appreciate the help.
- [4] Of course, with very large samples, like the entire General Social Survey (GSS) since it was begun, it is sometimes possible to uncover significant relationships (i.e., ones that almost surely exist in the larger population) that aren’t all that strong. Does that make sense?
- [5] You would generate some pretty gnarly tables that would be very hard to interpret, though.
- [6] While there are clearly more than two genders, we are at the mercy of the way the General Social Survey asked its questions in any given year, and thus for the examples presented in this text only data for males and females is available. While this is unfortunate, it’s also an important lesson about the limits of existing survey data and the importance of ensuring proper survey question design.

When certain categories of one variable are associated, or go together, with certain categories of the other variable(s).

A variable that may affect or influence another variable; the cause in a causal relationship.

A variable that is affected or influenced by (or depends on) another variable; the effect in a causal relationship.

A statement of the expected or predicted relationship between two or more variables.

An analytical method in which a bivariate table is created using discrete variables to show their relationship.

How categories of an independent variable are related to categories of a dependent variable.

A measure of how well we can predict the value or category of the dependent variable for any given unit in our sample based on knowing the value or category of the independent variable(s).

A measure of the strength of association used with binary variables.

The degree to which a finding based on data from a sample can be assumed to be true for the larger population from which the sample was drawn.

A measure of statistical significance used in crosstabulation to determine the generalizability of results.

The error made if one infers that a relationship exists in a larger population when it does not really exist; in other words, a false positive error.

The error you make when you do not infer a relationship exists in the larger population when it actually does exist; in other words, a false negative conclusion.

The total number of cases in a given row of a table.

The total number of cases in a given column of a table.

The number of cells in a table that can vary if we know something about the row and column totals of that table, calculated according to the formula (# of columns-1)*(# of rows-1).

A statistical measure that suggests that sample results can be generalized to the larger population, based on a low probability of having made a Type 1 error.

A measure of the direction and strength of a crosstabulated relationship between two ordinal-level variables.

The measure of statistical significance typically used in quantitative analysis. The lower the p value, the more likely you are to reject the null hypothesis.

The possible levels or response choices of a given variable.

Social Data Analysis Copyright © 2021 by Roger Clark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


## How to describe bivariate data

Affiliations.

- 1 Department for the Treatment and Study of Cardiothoracic Diseases and Cardiothoracic Transplantation, Division of Thoracic Surgery and Lung Transplantation, IRCCS ISMETT - UPMC, Palermo, Italy.
- 2 Office of Research, IRCCS ISMETT, Palermo, Italy.
- PMID: 29607192
- PMCID: PMC5864614
- DOI: 10.21037/jtd.2018.01.134

The role of scientific research is not limited to the description and analysis of single phenomena occurring independently of each other (univariate analysis). Even though univariate analysis has a pivotal role in statistical analysis, and is useful for finding errors inside datasets, for becoming familiar with and aggregating data, and for describing and gathering basic information on simple phenomena, it has a limited cognitive impact. Therefore, research also, and mostly, focuses on the relationship that single phenomena may have with each other. More specifically, bivariate analysis explores how the dependent ("outcome") variable depends on or is explained by the independent ("explanatory") variable (asymmetrical analysis), or it explores the association between two variables without any cause and effect relationship (symmetrical analysis). In this paper we will introduce the concept of "causation" and the dependent ("outcome") and independent ("explanatory") variables. Also, some statistical techniques used for the analysis of the relationship between the two variables will be presented, based on the type of variable (categorical or continuous).

Keywords: Bivariate data; causality; covariation.


## Conflict of interest statement

Conflicts of Interest: The authors have no conflicts of interest to declare.

Example of a bar chart.

Example of a scatterplot box.

Examples of linear correlation.


## 15. Bivariate analysis

Chapter outline.

- What is bivariate data analysis? (5 minute read time)
- Chi-square (4 minute read time)
- Correlations (5 minute read time)
- T-tests (5-minute read time)
- ANOVA (6-minute read time)

Content warning: examples include discussions of anxiety symptoms.

So now we get to the math! Just kidding. Mostly. In this chapter, you are going to learn more about bivariate analysis , or analyzing the relationship between two variables. I don’t expect you to finish this chapter and be able to execute everything you just read about – instead, the big goal here is for you to be able to understand what bivariate analysis is, what kinds of analyses are available, and how you can use them in your research.

Take a deep breath, and let’s look at some numbers!

## 15.1 What is bivariate analysis?

Learning objectives.

Learners will be able to…

- Define bivariate analysis
- Explain when we might use bivariate analysis in social work research

Did you know that ice cream causes shark attacks? It’s true! When ice cream sales go up in the summer, so does the rate of shark attacks. So you’d better put down that ice cream cone, unless you want to make yourself look more delicious to a shark.

Ok, so it’s quite obviously not true that ice cream causes shark attacks. But if you looked at these two variables and how they’re related, you’d notice that during times of the year with high ice cream sales, there are also the most shark attacks. Despite the fact that the conclusion we drew about the relationship was wrong, it’s nonetheless true that these two variables appear related, and researchers figured that out through the use of bivariate analysis. (For a refresher on correlation versus causation, head back to Chapter 8 .)

Bivariate analysis consists of a group of statistical techniques that examine the relationship between two variables. We could look at how anti-depressant medications and appetite are related, whether there is a relationship between having a pet and emotional well-being, or if a policy-maker’s level of education is related to how they vote on bills related to environmental issues.

Bivariate analysis forms the foundation of multivariate analysis, which we don’t get to in this book. All you really need to know here is that there are steps beyond bivariate analysis, which you’ve undoubtedly seen in scholarly literature already! But before we can move forward with multivariate analysis, we need to understand whether there are any relationships between our variables that are worth testing.

A study from Kwate, Loh, White, and Saldana (2012) illustrates this point. These researchers were interested in whether the lack of retail stores in predominantly Black neighborhoods in New York City could be attributed to the racial differences of those neighborhoods. Their hypothesis was that race had a significant effect on the presence of retail stores in a neighborhood, and that Black neighborhoods experience “retail redlining” – when a retailer decides not to put a store somewhere because the area is predominantly Black.

The researchers needed to know if the predominant race of a neighborhood’s residents was even related to the number of retail stores. With bivariate analysis, they found that “predominantly Black areas faced greater distances to retail outlets; percent Black was positively associated with distance to nearest store for 65 % (13 out of 20) stores” (p. 640). With this information in hand, the researchers moved on to multivariate analysis to complete their research.

## Statistical significance

Before we dive into analyses, let’s talk about statistical significance. Statistical significance is the extent to which our statistical analysis has produced a result that is likely to represent a real relationship instead of some random occurrence. But just because a relationship isn’t random doesn’t mean it’s useful for drawing a sound conclusion.

We went into detail about statistical significance in Chapter 5 . You’ll hopefully remember that there, we laid out some key principles from the American Statistical Association for understanding and using p-values in social science:

- P-values can indicate how incompatible the data are with a specified statistical model. P-values can provide evidence against the null hypothesis or the underlying assumptions of the statistical model the researchers used.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Both are inaccurate, though common, misconceptions about statistical significance.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. More nuance is needed to interpret scientific findings, as a conclusion does not become true or false when it passes from p=0.051 to p=0.049.
- Proper inference requires full reporting and transparency, rather than cherry-picking promising findings or conducting multiple analyses and only reporting those with significant findings. For the authors of this textbook, we believe the best response to this issue is for researchers to make their data openly available to reviewers and the general public and to register their hypotheses in a public database prior to conducting analyses.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. In our culture, to call something significant is to say it is larger or more important, but any effect, no matter how tiny, can produce a small p-value if the study is rigorous enough. Statistical significance is not equivalent to scientific, human, or economic significance.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. (adapted from Wasserstein & Lazar, 2016, p. 131-132). [1]

A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. The word significant can cause people to interpret these differences as strong and important, to the extent that they might even affect someone’s behavior. As we have seen however, these statistically significant differences are actually quite weak—perhaps even “trivial.” The correlation between ice cream sales and shark attacks is statistically significant, but practically speaking, it’s meaningless.
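The point that sample size alone can push a weak relationship past the significance threshold can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the two tables are hypothetical, the Chi-square statistic is the one introduced later in this chapter, 3.841 is the standard Chi-square cutoff for p < 0.05 with one degree of freedom, and phi (the square root of Chi-square divided by N) is a common strength measure for 2 x 2 tables that this chapter does not otherwise cover.

```python
import math


def pearson_chi_square(observed):
    """Pearson Chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


CRITICAL_VALUE_05 = 3.841  # Chi-square cutoff for p < 0.05 with 1 degree of freedom

# Hypothetical 2 x 2 tables with identical proportions (51% vs. 49%),
# differing only in sample size.
small_sample = [[51, 49], [49, 51]]            # N = 200
large_sample = [[5100, 4900], [4900, 5100]]    # N = 20,000

results = {}
for name, table in (("small", small_sample), ("large", large_sample)):
    n = sum(sum(row) for row in table)
    chi2 = pearson_chi_square(table)
    phi = math.sqrt(chi2 / n)                  # strength of association, from 0 to 1
    significant = chi2 > CRITICAL_VALUE_05
    results[name] = (chi2, phi, significant)
    print(f"{name}: N={n}, chi-square={chi2:.2f}, phi={phi:.2f}, significant={significant}")
```

Both tables describe the same very weak relationship, so phi is identical for them, but only the larger sample crosses the cutoff. That is the sense in which a statistically significant result need not be a strong one.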

There is debate about acceptable p-values in some disciplines. In medical sciences, a p-value even smaller than 0.05 is often favored, given the stakes of biomedical research. Some researchers in social sciences and economics argue that a higher p-value of up to 0.10 still constitutes strong evidence. Other researchers think that p-values are entirely overemphasized and that there are better measures of statistical significance. At this point in your research career, it’s probably best to stick with 0.05 because you’re learning a lot at once, but it’s important to know that there is some debate about p-values and that you shouldn’t automatically discount relationships with a p-value of 0.06.

## A note about “assumptions”

For certain types of bivariate, and in general for multivariate, analysis, we assume a few things about our data and the way it’s distributed. The characteristics we assume about our data that make it suitable for certain types of statistical tests are called assumptions. For instance, some tests assume that our data has a normal distribution. While I’m not going to go into detail about these assumptions because it’s beyond the scope of the book, I want to point out that it is important to check these assumptions before your analysis.

Something else that’s important to note is that going through this chapter, the data analyses presented are merely for illustrative purposes – the necessary assumptions have not been checked. So don’t draw any conclusions based on the results shared.

For this chapter, I’m going to use a data set from IPUMS USA , where you can get individual-level, de-identified U.S. Census and American Community Survey data. The data are clean and the data sets are large, so it can be a good place to get data you can use for practice.

## Key Takeaways

- Bivariate analysis is a group of statistical techniques that examine the relationship between two variables.
- You need to conduct bivariate analyses before you can begin to draw conclusions from your data, including in future multivariate analyses.
- Statistical significance and p-values help us understand the extent to which the relationships we see in our analyses are real relationships, and not just random or spurious.
- Find a study from your literature review that uses quantitative analyses. What kind of bivariate analyses did the authors use? You don’t have to understand everything about these analyses yet!
- What do the p -values of their analyses tell you?

## 15.2 Chi-square

- Explain the uses of Chi-square test for independence
- Explain what kind of variables are appropriate for a Chi-square test
- Interpret results of a Chi-square test and draw a conclusion about a hypothesis from the results

The first test we’re going to introduce you to is known as a Chi-square test (sometimes denoted as χ²) and is foundational to analyzing relationships between nominal or ordinal variables. A Chi-square test for independence (Chi-square for short) is a statistical test to determine whether there is a significant relationship between two nominal or ordinal variables. The “test for independence” refers to the null hypothesis of our comparison – that the two variables are independent and have no relationship.

A Chi-square can only be used for the relationship between two nominal or ordinal variables – there are other tests for relationships between other types of variables that we’ll talk about later in this chapter. For instance, you could use a Chi-square to determine whether there is a significant relationship between a person’s self-reported race and whether they have health insurance through their employer. (We will actually take a look at this a little later.)

Chi-square tests the hypothesis that there is a relationship between two categorical variables by comparing the values we actually observed with the values we would expect to occur if our null hypothesis were true. The expected value is a calculation based on your data when it’s in a summarized form called a contingency table, which is a visual representation of a cross-tabulation of categorical variables to demonstrate all the possible occurrences of your categories. I know that sounds complex, so let’s look at an example.

Earlier, we talked about looking at the relationship between a person’s race and whether they have health insurance through an employer. Based on 2017 American Community Survey data from IPUMS, this is what a contingency table for these two variables would look like.

1,037,071 | 1,401,453 | 2,438,524 | |

177,648 | 177,648 | 317,308 | |

24,123 | 12,142 | 36,265 | |

71,155 | 105,596 | 176,751 | |

75,117 | 46,699 | 121,816 | |

46,107 | 53,269 | 87,384 | |

So now we know what our observed values for these categories are. Next, let’s think about our expected values. We don’t need to get so far into it as to put actual numbers to it, but we can come up with a hypothesis based on some common knowledge about racial differences in employment. (We’re going to be making some generalizations here, so remember that there can be exceptions.)
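For readers who do want to see the arithmetic, here is a minimal sketch of how expected values and the Pearson Chi-square statistic are computed from a contingency table. The counts are made up for illustration (they are not the ACS figures above), and the function names are my own; a statistical package like SPSS does all of this for you and also reports the p-value.

```python
def expected_counts(observed):
    """Expected cell counts if the two variables were independent:
    (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Pearson Chi-square: sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table: rows are two groups of people, columns are
# "has employer-provided insurance" and "does not".
observed = [[30, 10], [20, 40]]

expected = expected_counts(observed)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom

print(expected)            # counts we would see if there were no relationship
print(round(chi2, 2), df)  # the statistic and its degrees of freedom
```

The further the observed counts sit from the expected ones, the larger the statistic becomes, and the less plausible the null hypothesis of independence looks.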

## An applied example

Let’s say research shows that people who identify as Black, Indigenous, and people of color (BIPOC) tend to hold multiple part-time jobs and have a higher unemployment rate in general. Given that, our hypothesis based on this data could be that BIPOC people are less likely to have employer-provided health insurance. Before we can assess a likelihood, we need to know if these two variables are even significantly related. Here’s where our Chi-square test comes in!

I’ve used SPSS to run these tests, so depending on what statistical program you use, your outputs might look a little different.

There are a number of different statistics reported here. What I want you to focus on is the first line, the Pearson Chi-Square, which is the most commonly used statistic for larger samples that have more than two categories each. (The other two lines are alternatives to Pearson that SPSS puts out automatically, but they are appropriate for data that is different from ours, so you can ignore them. You can also ignore the “df” column for now, as it’s a little advanced for what’s in this chapter.)

The last column gives us our statistical significance level, which in this case is 0.00. (SPSS rounds very small p-values, so this means the p-value is below 0.001, not that it is exactly zero.) So what conclusion can we draw here? The significant Chi-square statistic means we can reject the null hypothesis (which is that our two variables are not related). There is likely a relationship between our two variables that is probably not due to chance, meaning that we should further explore the relationship between a person’s race and whether they have employer-provided health insurance. Note that the test tells us a relationship probably exists, not how strong it is. Are there other factors that affect the relationship between these two variables? That seems likely. (One thing to keep in mind is that this is a large data set, which can inflate statistical significance levels. However, for the purposes of our exercises, we’ll ignore that for now.)

What we cannot conclude is that these two variables are causally related. That is, someone’s race doesn’t cause them to have employer-provided health insurance or not. It just appears to be a contributing factor, but we are not accounting for the effect of other variables on the relationship we observe (yet).

- The Chi-square test is designed to test the null hypothesis that our two variables are not related to each other.
- The Chi-square test is only appropriate for nominal and/or ordinal variables.
- A statistically significant Chi-square statistic means we can reject the null hypothesis and assume our two variables are, in fact, related.
- A Chi-square test doesn’t let us draw any conclusions about causality because it does not account for the influence of other variables on the relationship we observe.
- Which two variables would you most like to use in the analysis?
- What about the relationship between these two variables interests you in light of what your literature review has shown so far?

## 15.3 Correlations

- Define correlation and understand how to use it in quantitative analysis
- Explain what kind of variables are appropriate for a correlation
- Interpret a correlation coefficient
- Define the different types of correlation – positive and negative
- Interpret results of a correlation and draw a conclusion about a hypothesis from the results

A correlation is a relationship between two variables in which their values change together. For instance, we might expect education and income to be correlated – as a person’s educational attainment (how much schooling they have completed) goes up, so does their income. What about minutes of exercise each week and blood pressure? We would probably expect those who exercise more to have lower blood pressure than those who don’t. We can test these relationships using correlation analyses. Correlations are appropriate only for two interval/ratio variables.

It’s very important to understand that correlations can tell you about relationships, but not causes – as you’ve probably already heard, correlation is not causation! Go back to our example about shark attacks and ice cream sales from the beginning of the chapter. Clearly, ice cream sales don’t cause shark attacks, but the two are strongly correlated (most likely because both increase in the summer for other reasons). This relationship is an example of a spurious relationship, or a relationship that appears to exist between two variables, but in fact does not and is caused by other factors. We hear about these all the time in the news, and correlation analyses are often misrepresented. As we talked about in Chapter 4 when discussing critical information literacy, your job as a researcher and informed social worker is to make sure people aren’t misstating what these analyses actually mean, especially when they are being used to harm vulnerable populations.

Let’s say we’re looking at the relationship between age and income among indigenous people in the United States. In the data set we’ve been using so far, these folks generally fall into the racial category of American Indian/Alaska native, so we’ll use that category because it’s the best we can do. Using SPSS, this is the output you’d get with these two variables for this group. We’ll also limit the analysis to people age 18 and over since children are unlikely to report an individual income.

Here’s Pearson again, but don’t be confused – this is not the same test as the Chi-square, it just happens to be named after the same person. First, let’s talk about the number next to Pearson Correlation, which is the correlation coefficient. The correlation coefficient is a statistically derived value between -1 and 1 that tells us the magnitude and direction of the relationship between two variables. A statistically significant correlation coefficient like the one in this table (denoted by a p-value of 0.01) means the relationship is unlikely to be due to chance.

The magnitude of the relationship is how strong the relationship is and can be determined by the absolute value of the coefficient. In the case of our analysis in the table above, the correlation coefficient is 0.108, which denotes a pretty weak relationship. This means that, among the population in our sample, age and income are only weakly related to each other. (If the correlation coefficient were -0.108, the conclusion about its strength would be the same.)

In general, you can say that a correlation coefficient with an absolute value below 0.5 represents a weak correlation. Between 0.5 and 0.75 represents a moderate correlation, and above 0.75 represents a strong correlation. Although the relationship between age and income in our population is statistically significant, it’s also very weak.
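To make the coefficient less of a black box, here is a small sketch of how Pearson’s correlation coefficient is calculated and how the rule-of-thumb cut-offs above could be applied. The data are invented and the function names are my own; the cut-offs of 0.5 and 0.75 are simply the ones used in this chapter, and other texts draw the lines elsewhere.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How much the two variables vary together...
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # ...relative to how much each varies on its own.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(ss_x * ss_y)


def describe(r):
    """Label strength by absolute value (chapter's cut-offs) and direction by sign."""
    size = abs(r)
    if size < 0.5:
        strength = "weak"
    elif size <= 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return strength, direction


# Hypothetical data in which one variable rises in lockstep with the other.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [2, 4, 6, 8, 10]

r = pearson_r(hours_studied, exam_score)
print(r, describe(r))
print(describe(0.108))   # the coefficient from our age and income example
print(describe(-0.108))  # same strength, opposite direction
```

Notice that `describe` looks at the absolute value for strength and only at the sign for direction, which is exactly the distinction between magnitude and direction drawn in this section.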

The sign on your correlation coefficient tells you the direction of your relationship. A positive correlation or direct relationship occurs when two variables move together in the same direction – as one increases, so does the other, or, as one decreases, so does the other. The correlation coefficient will be positive, so the correlation we calculated is a positive correlation and the two variables have a direct, though very weak, relationship. For instance, in our example about shark attacks and ice cream, the number of both shark attacks and pints of ice cream sold would go up, meaning there is a direct relationship between the two.

A negative correlation or inverse relationship occurs when two variables change in opposite directions – one goes up, the other goes down and vice versa. The correlation coefficient will be negative. For example, if you were studying social media use and found that time spent on social media corresponded to lower scores on self-esteem scales, this would represent an inverse relationship.

Correlations are important to run at the outset of your analyses so you can start thinking about how variables relate to each other and whether you might want to include them in future multivariate analyses. For instance, if you’re trying to understand the relationship between receipt of an intervention and a particular outcome, you might want to test whether client characteristics like race or gender are correlated with your outcome; if they are, they should be plugged into subsequent multivariate models. If not, you might want to consider whether to include them in multivariate models.

## A final note

Just because the correlation between your dependent variable and your primary independent variable is weak or not statistically significant doesn’t mean you should stop your work. For one thing, disproving your hypothesis is important for knowledge-building. For another, the relationship can change when you consider other variables in multivariate analysis, as they could mediate or moderate the relationships.

- Correlations are a basic measure of the strength of the relationship between two interval/ratio variables.
- A correlation between two variables does not mean one variable causes the other one to change. Drawing conclusions about causality from a simple correlation is likely to lead you to describe a spurious relationship, or one that exists at face value, but doesn’t hold up when more factors are considered.
- Correlations are a useful starting point for almost all data analysis projects.
- The magnitude of a correlation describes its strength and is indicated by the correlation coefficient, which can range from -1 to 1.
- A positive correlation, or direct relationship, occurs when the values of two variables move together in the same direction.
- A negative correlation, or inverse relationship, occurs when the value of one variable moves one direction, while the value of the other variable moves the opposite direction.

## 15.4 T-tests

- Describe the three different types of t-tests and when to use them.
- Explain what kind of variables are appropriate for t-tests.

At a very basic level, t-tests compare the means between two groups, the same group at two points in time, or a group and a hypothetical mean. These analyses tell you whether the differences you see reflect a real relationship or not (that is, whether they are statistically significant).

Say you’ve got a data set that includes information about marital status and personal income (which we do!). You want to know if married people have higher personal (not family) incomes than non-married people, and whether the difference is statistically significant. Essentially, you want to see if the difference in average income between these two groups is down to chance or if it warrants further exploration. What analysis would you run to find this information? A t-test!

A lot of social work research focuses on the effect of interventions and programs, so t-tests can be particularly useful. Say you were studying the effect of a smoking cessation hotline on the number of days participants went without smoking a cigarette. You might want to compare the effect for men and women, in which case you’d use an independent samples t-test. If you wanted to compare the effect of your smoking cessation hotline to others in the country and knew the results of those, you would use a one-sample t-test. And if you wanted to compare the average number of cigarettes per day for your participants before they started a tobacco education group and then again when they finished, you’d use a paired-samples t-test. Don’t worry – we’re going into each of these in detail below.

So why are they called t-tests? Basically, when you conduct a t-test, you’re comparing your data to a theoretical distribution known as the t distribution to get the t statistic. The t distribution is bell-shaped like the normal distribution but has heavier tails, and it gets closer to the normal distribution as your sample size grows, which makes it useful for testing hypotheses about means even with smaller samples. (Remember our discussion of assumptions in section 15.1 – one of them is that data be normally distributed.) Ultimately, the t statistic that the test produces allows you to determine if any differences are statistically significant.

For t-tests, you need to have an interval/ratio dependent variable and a nominal or ordinal independent variable. Basically, you need an average (using an interval or ratio variable) to compare across mutually exclusive groups (using a nominal or ordinal variable).

Let’s jump into the three different types of t-tests.

## Paired samples t-test

The paired samples t-test is used to compare two means for the same sample tested at two different times or under two different conditions. This comparison is appropriate for pretest-post-test designs or within-subjects experiments. The null hypothesis is that the means at the two times or under the two conditions are the same in the population. The alternative hypothesis is that they are not the same.

For example, say you are testing the effect of pet ownership on anxiety symptoms. You have access to a group of people who have the same diagnosis involving anxiety who do not have pets, and you give them a standardized anxiety inventory questionnaire. Then, each of these participants gets some kind of pet and after 6 months, you give them the same standardized anxiety questionnaire.

To compare their scores on the questionnaire at the beginning of the study and after 6 months of pet ownership, you would use a paired samples t-test. Since the sample includes the same people, the samples are “paired” (hence the name of the test). If the t-statistic is statistically significant, there is evidence that owning a pet has an effect on scores on your anxiety questionnaire.
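If you are curious what the software is doing, the paired samples t statistic can be sketched in a few lines. The anxiety scores below are invented for five hypothetical participants, and the function name is my own; the sketch computes only the t statistic and degrees of freedom, leaving the p-value to your statistical program.

```python
import math
import statistics


def paired_t(before, after):
    """Paired samples t statistic: mean of the within-person differences
    divided by the standard error of those differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error, n - 1  # t statistic, degrees of freedom


# Hypothetical anxiety scores for the same five people,
# before getting a pet and six months afterwards.
before = [20, 22, 19, 24, 25]
after = [18, 20, 18, 21, 22]

t_stat, df = paired_t(before, after)
print(round(t_stat, 2), df)
```

Because each person serves as their own comparison, the test works on the differences within each pair rather than on the two sets of scores separately.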

## Independent samples/two samples t-test

An independent/two samples t-test is used to compare the means of two separate samples. The two samples might have been tested under different conditions in a between-subjects experiment, or they could be pre-existing groups in a cross-sectional design (e.g., women and men, extroverts and introverts). The null hypothesis is that the means of the two populations are the same. The alternative hypothesis is that they are not the same.

Let’s go back to our example related to anxiety diagnoses and pet ownership. Say you want to know if people who own pets have different scores on certain elements of your standard anxiety questionnaire than people who don’t own pets.

You have access to two groups of participants: pet owners and non-pet owners. These groups both fit your other study criteria. You give both groups the same questionnaire at one point in time. You are interested in two questions, one about self-worth and one about feelings of loneliness. You can calculate mean scores for the questions you’re interested in and then compare them across two groups. If the t-statistic is statistically significant, then there is evidence of a difference in these scores that may be due to pet ownership.
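As a rough sketch of the calculation, the example below computes a t statistic for two separate groups. I have used the Welch version of the test, which does not assume the two groups have equal variances; that is my choice for illustration, and your software may report a pooled-variance version alongside it. The loneliness scores and the function name are made up.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Independent samples t statistic (Welch's version): difference in
    group means divided by the standard error of that difference."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical loneliness scores for two separate groups of participants.
non_pet_owners = [4, 5, 6, 7, 8]
pet_owners = [1, 2, 3, 4, 5]

t_stat = welch_t(non_pet_owners, pet_owners)
print(t_stat)
```

Unlike the paired test, nothing links an individual in one group to an individual in the other, so the two groups do not even need to be the same size.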

## One-sample t-test

Finally, let’s talk about a one-sample t-test. This t-test is appropriate when there is an external benchmark to use for your comparison mean, either known or hypothesized. The null hypothesis for this kind of test is that the mean in your sample is the same as the mean of the population. The alternative hypothesis is that the means are different.

Let’s say you know the average years of post-high school education for Black women, and you’re interested in learning whether the Black women in your study are on par with the average. You could use a one-sample t-test to determine how your sample’s average years of post-high school education compares to the known value in the population. This kind of t-test is useful when a phenomenon or intervention has already been studied, or to see how your sample compares to your larger population.
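A one-sample t statistic follows the same pattern, with the benchmark standing in for the second group. The sketch below uses invented numbers: the sample values and the benchmark of 3 years are purely hypothetical, and the function name is my own.

```python
import math
import statistics


def one_sample_t(sample, benchmark):
    """One-sample t statistic: how far the sample mean sits from the
    benchmark, measured in standard errors."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - benchmark) / standard_error


# Hypothetical years of post-high school education for five participants,
# compared against an assumed population average of 3 years.
years_of_education = [2, 4, 4, 6, 4]
population_average = 3

t_stat = one_sample_t(years_of_education, population_average)
print(round(t_stat, 2))
```

A t statistic near zero means the sample mean is close to the benchmark; the further it is from zero in either direction, the stronger the evidence that your sample differs from the population value.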

- There are three types of t-tests that are each appropriate for different situations. T-tests can only be used with an interval/ratio dependent variable and a nominal/ordinal independent variable.
- T-tests in general compare the means of one variable between either two points in time or conditions for one group, two different groups, or one group to an external benchmark variable.
- In a paired-samples t-test, you are comparing the means of one variable in your data for the same group, either at two different times or under two different conditions, and testing whether the difference is statistically significant.
- In an independent samples t-test, you are comparing the means of one variable in your data for two different groups to determine if any difference is statistically significant.
- In a one-sample t-test, you are comparing the mean of one variable in your data to an external benchmark, either observed or hypothetical.
- Which t-test makes the most sense for your data and research design? Why?
- Which variable would be an appropriate dependent variable? Why?
- Which variable would be an interesting independent variable? Why?

## 15.5 ANOVA (ANalysis Of VAriance)

- Explain what kind of variables are appropriate for ANOVA
- Explain the difference between one-way and two-way ANOVA
- Come up with an example of when each type of ANOVA is appropriate

Analysis of variance, generally abbreviated to ANOVA, is a statistical method to examine how a dependent variable changes as the value of a categorical independent variable changes. It serves the same purpose as the t-tests we learned in 15.4: it tests for differences in group means. ANOVA is more flexible in that it can handle any number of groups, unlike t-tests, which are limited to two groups (independent samples) or two time points (dependent samples). Thus, the purpose and interpretation of ANOVA will be the same as it was for t-tests.

There are two types of ANOVA: a one-way ANOVA and a two-way ANOVA. One-way ANOVAs are far more common than two-way ANOVAs.

## One-way ANOVA

The most common type of ANOVA that researchers use is the one-way ANOVA, which is a statistical procedure to compare the means of a variable across three or more groups of an independent variable. Let’s take a look at some data about income of different racial and ethnic groups in the United States. The data in Table 15.2 below comes from the US Census Bureau’s 2018 American Community Survey [2]. The racial and ethnic designations in the table reflect what’s reported by the Census Bureau, which is not fully representative of how people identify racially.

| Race or ethnicity | Average income |
| --- | --- |
| American Indian and Alaska Native | $20,709 |
| Asian | $40,878 |
| Black/African American | $23,303 |
| Native Hawaiian or Other Pacific Islander | $25,304 |
| White | $36,962 |
| Two or more races | $19,162 |
| Another race | $20,482 |

Off the bat, of course, we can see a difference in the average income between these groups. Now, we want to know if the difference between average income of these racial and ethnic groups is statistically significant, which is the perfect situation to use one-way ANOVA. To conduct this analysis, we need the person-level data that underlies this table, which I was able to download from IPUMS. For this analysis, race is the independent variable (nominal) and total income is the dependent variable (interval/ratio). Let’s assume for this exercise that we have no other data about the people in our data set besides their race and income. (If we did, we’d probably do another type of analysis.)

I used SPSS to run a one-way ANOVA using this data. With the basic analysis, the first table in the output was the following.

Without going deep into the statistics, the column labeled “F” represents our F statistic, which is similar to the t statistic in a t-test in that it gives a statistical point of comparison for our analysis. The important thing to notice here, however, is our significance level, which is .000. Sounds great! But we actually get very little information here – all we know is that the between-group differences are statistically significant as a whole, but not anything about the individual groups.
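For a sense of where the F statistic comes from, here is a minimal sketch using three tiny, made-up groups (not the IPUMS data). It compares how much the group means differ from one another with how much individuals vary within their own groups; the function name is my own.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group variability
    divided by within-group variability."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Variation of each group's mean around the overall mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of individuals around their own group's mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2
        for group in groups
        for value in group
    )

    mean_square_between = ss_between / (k - 1)
    mean_square_within = ss_within / (n_total - k)
    return mean_square_between / mean_square_within


# Three hypothetical groups with three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(round(f_stat, 2))
```

A large F means the groups differ from each other by more than you would expect given the spread inside each group, but, as with the SPSS output, it says nothing about which groups differ.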

This is where post hoc tests come into the picture. Because we are comparing each race to each other race, that adds up to a lot of comparisons, and statistically, this increases the likelihood of a type I error. A post hoc test in ANOVA is a way to correct and reduce this error after the fact (hence “post hoc”). I’m only going to talk about one type – the Bonferroni correction – because it’s commonly used. However, there are other types of post hoc tests you may encounter.
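The logic of the Bonferroni correction is simple enough to sketch: divide your significance threshold by the number of comparisons you are making. The example below assumes a conventional threshold of 0.05 and the seven groups in Table 15.2; the p-values passed to the helper function are invented, and the function name is my own.

```python
import math

n_groups = 7   # the seven racial and ethnic groups in Table 15.2
alpha = 0.05   # assumed overall significance threshold

# Every group is compared with every other group once.
n_comparisons = math.comb(n_groups, 2)

# Bonferroni: each individual comparison must clear a stricter threshold.
adjusted_alpha = alpha / n_comparisons
print(n_comparisons, round(adjusted_alpha, 5))


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which comparisons remain significant once the threshold
    is divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Three hypothetical pairwise p-values: only the first survives correction.
print(significant_after_bonferroni([0.001, 0.02, 0.04]))
```

Some programs apply the same idea the other way around, multiplying each p-value by the number of comparisons and leaving the threshold alone, so check which version your output reports.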

When I tell SPSS to run the ANOVA with a Bonferroni correction, in addition to the table above, I get a very large table that runs through every single comparison I asked it to make among the groups in my independent variable – in this case, the different races. Figure 15.4 below is the first grouping in that table – they will all give the same conceptual information, though some of the signs on the mean difference and, consequently, the confidence intervals will vary.

Now we see some points of interest. As you’d expect knowing what we know from prior research, race seems to have a pretty strong influence on a person’s income. (Notice I didn’t say “effect” – we don’t have enough information to establish causality!) The significance levels for the mean of White people’s incomes compared to the mean of several races are .000. Interestingly, for Asian people in the US, race appears to have no influence on their income compared to White people in the US. The significance level for Native Hawaiians and Pacific Islanders is also relatively high.

So what does this mean? We can say with some confidence that, overall, race seems to influence a person’s income. In our hypothetical data set, since we only have race and income, this is a great analysis to conduct. But do we think that’s the only thing that influences a person’s income? Probably not. To look at other factors if we have them, we can use a two-way ANOVA.

## Two-way ANOVA and n-way ANOVA

A two-way ANOVA is a statistical procedure to compare the means of a variable across groups using multiple independent variables to distinguish among groups. For instance, we might want to examine income by both race and gender, in which case, we would use a two-way ANOVA. Fundamentally, the procedures and outputs for two-way ANOVA are almost identical to one-way ANOVA, just with more cross-group comparisons, so I am not going to run through an example in SPSS for you.

You may also see textbooks or scholarly articles refer to n-way ANOVAs. Essentially, just like you’ve seen throughout this book, the n can equal just about any number. However, going far beyond a two-way ANOVA increases your likelihood of a type I error, for the reasons discussed in the previous section.

You may notice that this book doesn’t get into multivariate analysis at all. Regression analysis, which you’ve no doubt seen in many academic articles you’ve read, is an incredibly complex topic. There are entire courses and textbooks on the multiple different types of regression analysis, and we did not think we could adequately cover regression analysis at this level. Don’t let that scare you away from learning about it – just understand that we don’t expect you to know about it at this point in your research learning.

- One-way ANOVA is a statistical procedure to compare the means of a variable across three or more categories of an independent variable. This analysis can help you understand whether there are meaningful differences in your sample based on different categories like race, geography, gender, or many others.
- Two-way ANOVA is almost identical to one-way ANOVA, except that you can compare the means of a variable across multiple independent variables.
- Would you want to conduct a two-way or n-way ANOVA? If so, what other independent variables would you use, and why?
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70, 129–133.
- Ruggles, S., Flood, S., Goeken, R., Grover, J., Meyer, E., Pacas, J., & Sobek, M. (2020). IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS. https://doi.org/10.18128/D010.V10.0

a group of statistical techniques that examines the relationship between two variables

"Assuming that the null hypothesis is true and the study is repeated an infinite number times by drawing random samples from the same populations(s), less than 5% of these results will be more extreme than the current result" (Cassidy et al., 2019, p. 233).

The characteristics we assume about our data, like that it is normally distributed, that makes it suitable for certain types of statistical tests

A relationship where it appears that two variables are related BUT they aren't. Another variable is actually influencing the relationship.

a statistical test to determine whether there is a significant relationship between two categorical variables

variables whose values are organized into mutually exclusive groups but whose numerical values cannot be used in mathematical operations.

a visual representation of a cross-tabulation of categorical variables to demonstrate all the possible occurrences of categories

a relationship between two variables in which their values change together.

when a relationship between two variables appears to be causal but can in fact be explained by influence of a third variable

Graduate research methods in social work Copyright © 2020 by Matthew DeCarlo, Cory Cummings, Kate Agnelli is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

## How to Create a Data Analysis Plan: A Detailed Guide

by Barche Blaise | Aug 12, 2020 | Writing

If a good research question equates to a story, then a roadmap is vital for good storytelling. We advise every student/researcher to personally write his/her data analysis plan before seeking any advice. In this blog article, we will explore how to create a data analysis plan: the content and structure.

This data analysis plan serves as a roadmap to how data collected will be organised and analysed. It includes the following aspects:

- Clearly states the research objectives and hypothesis
- Identifies the dataset to be used
- Inclusion and exclusion criteria
- Clearly states the research variables
- States statistical test hypotheses and the software for statistical analysis
- Creating shell tables

## 1. Stating research question(s), objectives and hypotheses:

All research objectives or goals must be clearly stated. They must be Specific, Measurable, Attainable, Realistic and Time-bound (SMART). Hypotheses are theories obtained from personal experience or previous literature and they lay a foundation for the statistical methods that will be applied to extrapolate results to the entire population.

## 2. The dataset:

The dataset that will be used for statistical analysis must be described and its important aspects outlined. These include the owner of the dataset, how to get access to it, how it was checked for quality control, and the program in which it is stored (Excel, Epi Info, SQL, Microsoft Access, etc.).

## 3. The inclusion and exclusion criteria:

They guide the aspects of the dataset that will be used for data analysis. These criteria will also guide the choice of variables included in the main analysis.

## 4. Variables:

Every variable collected in the study should be clearly stated. They should be presented based on the level of measurement (ordinal/nominal or ratio/interval levels), or the role the variable plays in the study (independent/predictors or dependent/outcome variables). The variable types should also be outlined. The variable type in conjunction with the research hypothesis forms the basis for selecting the appropriate statistical tests for inferential statistics. A good data analysis plan should summarize the variables as demonstrated in Figure 1 below.

## 5. Statistical software

There are tons of software packages for data analysis; some common examples are SPSS, Epi Info, SAS, STATA and Microsoft Excel. Include the version number, year of release and author/manufacturer. Beginners tend to try different software packages and end up mastering none. It is better to select one and master it, because almost all statistical software performs equally well for the basic and most of the advanced analyses needed for a student thesis. This is what we recommend to all our students at CRENC before they begin writing their results section.

## 6. Selecting the appropriate statistical method to test hypotheses

Depending on the research question, hypothesis and type of variable, several statistical methods can be used to answer the research question appropriately. This aspect of the data analysis plan outlines clearly why each statistical method will be used to test hypotheses. The level of statistical significance (p-value), which is often but not always <0.05, should also be stated. Figures 2a and 2b present decision trees for some common statistical tests based on the variable type and research question.
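As a rough illustration of what such a decision tree encodes, the sketch below maps variable types to a candidate bivariate test. The branches are common rules of thumb that we have assumed for illustration; they are not a reproduction of Figures 2a and 2b, and the function name and labels are hypothetical.

```python
def suggest_test(dependent_level, independent_level, n_groups=None):
    """Suggest a bivariate test from the levels of measurement.

    Levels are "nominal", "ordinal", "interval" or "ratio".
    n_groups is the number of categories of a categorical independent variable.
    """
    categorical = {"nominal", "ordinal"}
    continuous = {"interval", "ratio"}

    if dependent_level in categorical and independent_level in categorical:
        return "Chi-square test for independence"
    if dependent_level in continuous and independent_level in continuous:
        return "Pearson correlation"
    if dependent_level in continuous and independent_level in categorical:
        if n_groups == 2:
            return "independent samples t-test"
        if n_groups is not None and n_groups >= 3:
            return "one-way ANOVA"
    # Anything else needs a fuller decision tree than this sketch.
    return "not covered by this sketch"


print(suggest_test("nominal", "nominal"))
print(suggest_test("ratio", "interval"))
print(suggest_test("ratio", "nominal", n_groups=2))
print(suggest_test("ratio", "nominal", n_groups=7))
```

Writing the choices down this explicitly in the analysis plan, whether as a table, a tree or a few lines of code, makes it easy for a supervisor to check that each hypothesis is paired with a suitable test.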

A good analysis plan should clearly describe how missing data will be analysed.

## 7. Creating shell tables

Data analysis involves three levels of analysis, in increasing order of complexity: univariable, bivariable and multivariable. Shell tables should be created in anticipation of the results that will be obtained from these different levels of analysis. Read our blog article on how to present tables and figures for more details. Suppose you carry out a study to investigate the prevalence and associated factors of a certain disease “X” in a population; the shell tables can then be represented as in Table 1, Table 2 and Table 3 below.

Table 1: Example of a shell table from univariate analysis

Table 2: Example of a shell table from bivariate analysis

Table 3: Example of a shell table from multivariate analysis

aOR = adjusted odds ratio

Now that you have learned how to create a data analysis plan, these are the takeaway points. It should clearly state the:

- Research question, objectives, and hypotheses
- Dataset to be used
- Variable types and their role
- Statistical software and statistical methods
- Shell tables for univariate, bivariate and multivariate analysis

## Further readings

Creating a Data Analysis Plan: What to Consider When Choosing Statistics for a Study https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4552232/pdf/cjhp-68-311.pdf

Creating an Analysis Plan: https://www.cdc.gov/globalhealth/healthprotection/fetp/training_modules/9/creating-analysis-plan_pw_final_09242013.pdf

Data Analysis Plan: https://www.statisticssolutions.com/dissertation-consulting-services/data-analysis-plan-2/

Dr Barche is a physician and holds a Masters in Public Health. He is a senior fellow at CRENC with interests in Data Science and Data Analysis.

## Bivariate Analysis: Associations, Hypotheses, and Causal Stories

- Open Access
- First Online: 04 October 2022

- Mark Tessler

Part of the book series: SpringerBriefs in Sociology (BRIEFSSOCY)

Every day, we encounter various phenomena that make us question how, why, and with what implications they vary. In responding to these questions, we often begin by considering bivariate relationships, meaning the way that two variables relate to one another. Such relationships are the focus of this chapter.

## 3.1 Description, Explanation, and Causal Stories

There are many reasons why we might be interested in the relationship between two variables. Suppose we observe that some of the respondents interviewed in Arab Barometer surveys and other surveys report that they have thought about emigrating, and we are interested in this variable. We may want to know how individuals’ consideration of emigration varies in relation to certain attributes or attitudes. In this case, our goal would be descriptive, sometimes described as the mapping of variance. Our goal may also or instead be explanation, such as when we want to know why individuals have thought about emigrating.

## Description

Description means that we seek to increase our knowledge and refine our understanding of a single variable by looking at whether and how it varies in relation to one or more other variables. Descriptive information makes a valuable contribution when the structure and variance of an important phenomenon are not well known, or not well known in relation to other important variables.

Returning to the example about emigration, suppose you notice that among Jordanians interviewed in 2018, 39.5 percent of the 2400 men and women interviewed reported that they have considered the possibility of emigrating.

Our objective may be to discover what these might-be migrants look like and what they are thinking. We do this by mapping the variance of emigration across attributes and orientations that provide some of this descriptive information, with the descriptions themselves each expressed as bivariate relationships. These relationships are also sometimes labeled “associations” or “correlations” since they are not considered causal relationships and are not concerned with explanation.

Of the 39.5 percent of Jordanians who told interviewers that they have considered emigrating, 57.3 percent are men and 42.7 percent are women. With respect to age, 34 percent are age 29 or younger and 19.2 percent are age 50 or older. It might have been expected that respondents age 29 or younger would make up a larger share of those who have considered emigrating. Viewed from the other direction, however, 56 percent of the 575 men and women in this age category have considered emigrating, compared with 39.5 percent of the sample as a whole. With respect to destination, the Arab country most frequently mentioned by those who have considered emigration is the UAE, named by 17 percent, followed by Qatar at 10 percent and Saudi Arabia at 9.8 percent. Non-Arab destinations were mentioned more frequently, with Turkey named by 18.1 percent, Canada by 21.1 percent, and the U.S. by 24.2 percent.
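The two age figures above answer different questions, and the arithmetic makes the distinction concrete. The sketch below uses only the percentages quoted in this section; the counts are reconstructed from rounded percentages, so they are approximate, and the variable names are illustrative.

```python
n_total = 2400            # Jordanians interviewed in 2018
share_considered = 0.395  # share who have considered emigrating
n_young = 575             # respondents age 29 or younger

n_considered = share_considered * n_total       # about 948 respondents

# Young respondents as a share of would-be emigrants (34 percent in the text).
young_among_considered = 0.34 * n_considered     # about 322 respondents

# Would-be emigrants as a share of young respondents (56 percent in the text).
considered_among_young = 0.56 * n_young          # about 322 respondents

print(round(n_considered), round(young_among_considered), round(considered_among_young))
```

Both percentages describe the same group of roughly 322 young respondents who have considered emigrating. They differ only in the denominator: all would-be emigrants in one case, all young respondents in the other.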

With the variables sex, age, and prospective destination added to the original variable, which is consideration of emigration, there are clearly more than two variables under consideration. But the variables are described two at a time and so each relationship is bivariate.

These bivariate relationships, between having considered emigration on the one hand and sex, age, and prospective destination on the other, provide descriptive information that is likely to be useful to analysts, policymakers, and others concerned with emigration. They tell, or begin to tell, as noted above, what might-be migrants look like and what they are thinking. Further insight may be gained by adding descriptive bivariate relationships for Jordanians interviewed in a different year to those interviewed in 2018. In addition, of course, still more information, and possibly a more refined understanding, may be gained by examining the attributes and orientations of prospective emigrants who are citizens of other Arab (and perhaps also non-Arab) countries.

With a focus on description, these bivariate relationships are not constructed to shed light on explanation, that is to contribute to causal stories that seek to account for variance and tell why some individuals but not others have considered the possibility of emigrating. In fact, however, as useful as bivariate relationships that provide descriptive information may be, researchers usually are interested as much if not more in bivariate relationships that express causal stories and purport to provide explanations.

## Explanation and Causal Stories

There is a difference in the origins of bivariate relationships that seek to provide descriptive information and those that seek to provide explanatory information. The former can be thought to be responding to what questions: What characterizes potential emigrants? What do they look like? What are their thoughts about this or that subject? If the objective is description, a researcher collects and uses her data to investigate the relationship between two variables without a specific and firm prediction about the relationship between them. Rather, she simply wonders about the “what” questions listed above and believes that finding out the answers will be instructive. In this case, therefore, she selects the bivariate relationships to be considered based on what she thinks it will be useful to know, and not based on assessing the accuracy of a previously articulated causal story that specifies the strength and structure of the effect that one variable has on the other.

A researcher is often interested in causal stories and explanation, however, and this does usually begin with thinking about the relationship between two variables, one of which is the presumed cause and the other the presumed effect. The presumed cause is the independent variable, and the presumed effect is the dependent variable. Offering evidence that there is a strong relationship between two variables is not sufficient to demonstrate that the variables are likely to be causally related, but it is a necessary first step. In this respect it is a point of departure for the fuller, probably multivariate analysis, required to persuasively argue that a relationship is likely to be causal. In addition, as discussed in Chap. 4, multivariate analysis often not only strengthens the case for inferring that a relationship is causal, but also provides a more elaborate and more instructive causal story. The foundation, however, on which a multivariate analysis aimed at causal inference is built, is a bivariate relationship composed of a presumed independent variable and a presumed dependent variable.

A hypothesis that posits a causal relationship between two variables is not the same as a causal story, although the two are of course closely connected. The former specifies a presumed cause, a presumed determinant of variance on the dependent variable. It probably also specifies the structure of the relationship, such as linear as opposed to non-linear, or positive (also called direct) as opposed to negative (also called inverse).

On the other hand, a causal story describes in more detail what the researcher believes is actually taking place in the relationship between the variables in her hypothesis; and accordingly, why she thinks this involves causality. A causal story provides a fuller account of operative processes, processes that the hypothesis references but does not spell out. These processes may, for example, involve a pathway or a mechanism that tells how it is that the independent variable causes and thus accounts for some of the variance on the dependent variable. Expressed yet another way, the causal story describes the researcher’s understandings, or best guesses, about the real world, understandings that have led her to believe, and then propose for testing, that there is a causal connection between her variables that deserves investigation. The hypothesis itself does not tell this story; it is rather a short formulation that references and calls attention to the existence, or hypothesized existence, of a causal story. Research reports present the causal story as well as the hypothesis, as the hypothesis is often of limited interpretability without the causal story.

A causal story is necessary for causal inference. It enables the researcher to formulate propositions that purport to explain rather than merely describe or predict. There may be a strong relationship between two variables, and if this is the case, it will be possible to predict with relative accuracy the value, or score, of one variable from knowledge of the value, or score, of the other variable. Prediction is not explanation, however. To explain, or attribute causality, there must be a causal story to which a hypothesized causal relationship is calling attention.

An instructive illustration is provided by a recent study of Palestinian participation in protest activities that express opposition to Israeli occupation (Footnote 1). There is plenty of variance on the dependent variable: There are many young Palestinians who take part in these activities, and there are many others who do not take part. Education is one of the independent variables that the researcher thought would be an important determinant of participation, and so she hypothesized that individuals with more education would be more likely to participate in protest activities than individuals with less education.

But why would the researcher think this? The answer is provided by the causal story. To the extent that this as yet untested story is plausible, or preferably, persuasive, at least in the eyes of the investigator, it gives the researcher a reason to believe that education is indeed a determinant of participation in protest activities in Palestine. By spelling out in some detail how and why the hypothesized independent variable, education in this case, very likely impacts a person’s decision about whether or not to protest, the causal story provides a rationale for the researcher’s hypothesis.

In the case of Palestinian participation in protest activities, another investigator offered an insightful causal story about the ways that education pushes toward greater participation, with emphasis on its role in communication and coordination (Footnote 2). Schooling, as the researcher theorizes and subsequently tests, integrates young Palestinians into a broader institutional environment that facilitates mass mobilizations and lowers informational and organizational barriers to collective action. More specifically, she proposes that those individuals who have had at least a middle school education, compared to those who have not finished middle school, have access to better and more reliable sources of information, which, among other things, enables would-be protesters to assess risks. More schooling also makes would-be protesters better able to forge inter-personal relationships and establish networks that share information about needs, opportunities, and risks, and that in this way facilitate engaging in protest activities in groups, rather than on an individual basis. This study offers some additional insights to be discussed later.

The variance motivating the investigation of a causal story may be thought of as the “variable of interest,” and it may be either an independent variable or a dependent variable. It is a variable of interest because the way that it varies poses a question, or puzzle, that a researcher seeks to investigate. It is the dependent variable in a bivariate relationship if the researcher seeks to know why this variable behaves, or varies, as it does, and in pursuit of this objective, she will seek to identify the determinants and drivers that account for this variance. The variable of interest is an independent variable in a particular research project if the researcher seeks to know what difference it makes—on what does its variance have an impact, of what other variable or variables is it a driver or determinant.

The variable in which a researcher is initially interested, that is to say the variable of interest, can also be both a dependent variable and an independent variable. Returning to the variable pertaining to consideration of emigration, but this time with country as the unit of analysis, the variance depicted in Table 3.1 provides an instructive example. The data are based on Arab Barometer surveys conducted in 2018–2019, and the table shows that there is substantial variation across twelve countries. Taking the countries together, the mean percentage of citizens that have thought about relocating to another country is 30.25 percent. But in fact, there is very substantial variation around this mean. Kuwait is an outlier, with only 8 percent having considered emigration. There are also countries in which only 21 percent or 22 percent of the adult population have thought about this, figures that may be high in absolute terms but are low relative to other Arab countries. At the other end of the spectrum are countries in which 45 percent or even 50 percent of the citizens report having considered leaving their country and relocating elsewhere.

The very substantial variance shown in Table 3.1 invites reflection on both the causes and the consequences of this country-level variable, aggregate thinking about emigration. As a dependent variable, the cross-country variance brings the question of why the proportion of citizens that have thought about emigrating is higher in some countries than in others; and the search for an answer begins with the specification of one or more bivariate relationships, each of which links this dependent variable to a possible cause or determinant. As an independent variable, the cross-country variance brings the question of what difference does it make—of what is it a determinant or driver and what are the consequences for a country if more of its citizens, rather than fewer, have thought about moving to another country.

## 3.2 Hypotheses and Formulating Hypotheses

Hypotheses emerge from the research questions to which a study is devoted. Accordingly, a researcher interested in explanation will have something specific in mind when she decides to hypothesize and then evaluate a bivariate relationship in order to determine whether, and if so how, her variable of interest is related to another variable. For example, if the researcher’s variable of interest is attitude toward gender equality and one of her research questions asks why some people support gender equality and others do not, she might formulate the hypothesis below to see if education provides part of the answer.

Hypothesis 1. Individuals who are better educated are more likely to support gender equality than are individuals who are less well-educated.

The usual case, and the preferred case, is for an investigator to be specific about the research questions she seeks to answer, and then to formulate hypotheses that propose for testing part of the answer to one or more of these questions. Sometimes, however, a researcher will proceed without formulating specific hypotheses based on her research questions. Sometimes she will simply look at whatever relationships between her variable of interest and a second variable her data permit her to identify and examine, and she will then follow up and incorporate into her study any findings that turn out to be significant and potentially instructive. This is sometimes described as allowing the data to “speak.” When this hit or miss strategy of trial and error is used in bivariate and multivariate analysis, findings that are significant and potentially instructive are sometimes described as “grounded theory.” Some researchers also describe the latter process as “inductive” and the former as “deductive.”

Although the inductive, atheoretical approach to data analysis might yield some worthwhile findings that would otherwise have been missed, it can sometimes prove misleading, as you may discover relationships between variables that happened by pure chance and are not instructive about the variable of interest or research question. Data analysis in research aimed at explanation should be, in most cases, preceded by the formulation of one or more hypotheses. In this context, when the focus is on bivariate relationships and the objective is explanation rather than description, each hypothesis will include a dependent variable and an independent variable and make explicit the way the researcher thinks the two are, or probably are, related. As discussed, the dependent variable is the presumed effect; its variance is what a hypothesis seeks to explain. The independent variable is the presumed cause; its impact on the variance of another variable is what the hypothesis seeks to determine.
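The risk of chance findings can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the chapter's analysis: the data are random numbers with no real relationships, the names are illustrative, and the cutoff of 1.984 is the approximate two-sided 5 percent critical value of the t distribution with 98 degrees of freedom.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
n_respondents = 100
n_candidates = 200
T_CRITICAL = 1.984  # assumed: approx. two-sided 5% cutoff, 98 degrees of freedom

# A "variable of interest" and 200 candidate variables, all pure noise.
outcome = [random.gauss(0, 1) for _ in range(n_respondents)]
chance_hits = 0
for _ in range(n_candidates):
    candidate = [random.gauss(0, 1) for _ in range(n_respondents)]
    r = pearson_r(candidate, outcome)
    # Convert the correlation to a t statistic with n - 2 degrees of freedom.
    t = r * math.sqrt((n_respondents - 2) / (1 - r * r))
    if abs(t) > T_CRITICAL:
        chance_hits += 1

print(f"{chance_hits} of {n_candidates} unrelated variables look 'significant'")
```

Roughly 5 percent of the unrelated variables would be expected to pass the cutoff by chance alone, which is why relationships found by scanning many variables call for a prior hypothesis or independent confirmation.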

Hypotheses are usually in the form of if-then, or cause-and-effect, propositions. They posit that if there is variance on the independent variable, the presumed cause, there will then be variance on the dependent variable, the presumed effect. This is because the former impacts the latter and causes it to vary.

An illustration of formulating hypotheses is provided by a study of voting behavior in seven Arab countries: Algeria, Bahrain, Jordan, Lebanon, Morocco, Palestine, and Yemen (Footnote 3). The variable of interest in this individual-level study is electoral turnout, and prominent among the research questions is why some citizens vote and others do not. The dependent variable in the hypotheses proposed in response to this question is whether a person did or did not vote in the country’s most recent parliamentary election. The study initially proposed a number of hypotheses, which include the two listed here and which would later be tested with data from Arab Barometer surveys in the seven countries in 2006–2007. We will return to this illustration later in this chapter.

Hypothesis 1: Individuals who have used clientelist networks in the past are more likely to turn out to vote than are individuals who have not used clientelist networks in the past.

Hypothesis 2: Individuals with a positive evaluation of the economy are more likely to vote than are individuals with a negative evaluation of the economy.

Another example pertaining to voting, which this time is hypothetical but might be instructively tested with Arab Barometer data, considers the relationship between perceived corruption and turning out to vote at the individual level of analysis.

The normal expectation in this case would be that perceptions of corruption influence the likelihood of voting. Even here, however, competing causal relationships are plausible. More perceived corruption might increase the likelihood of voting, presumably to register discontent with those in power. But greater perceived corruption might also reduce the likelihood of voting, presumably because the would-be voter sees no chance that her vote will make a difference. In this hypothetical case, moreover, even the direction of the causal connection might be ambiguous. If voting is complicated, cumbersome, and overly bureaucratic, it might be that the experience of voting plays a role in shaping perceptions of corruption. In cases like this, certain variables might be both independent and dependent variables, with causal influence pushing in both directions (often called “endogeneity”), and the researcher will need to carefully think through and be particularly clear about the causal story to which her hypothesis is designed to call attention.

The need to assess the accuracy of these hypotheses, or any others proposed to account for variance on a dependent variable, will guide and shape the researcher’s subsequent decisions about data collection and data analysis. Moreover, in most cases, the finding produced by data analysis is not a statement that the hypothesis is true or that the hypothesis is false. It is rather a statement that the hypothesis is probably true or it is probably false. And more specifically still, when testing a hypothesis with quantitative data, it is often a statement about the odds, or probability, that the researcher will be wrong if she concludes that the hypothesis is correct—if she concludes that the independent variable in the hypothesis is indeed a significant determinant of the variance on the dependent variable. The lower the probability of being wrong, of course, the more confident a researcher can be in concluding, and reporting, that her data and analysis confirm her hypothesis.
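To make the idea of a probability of being wrong more concrete, the sketch below runs a chi-square test of independence on a 2x2 table of age group by consideration of emigration. The counts are approximations reconstructed from the rounded Jordanian percentages quoted in Sect. 3.1 (about 322 of 575 respondents age 29 or younger, and about 948 of 2400 overall). The test ignores survey weights and design effects, so it is an illustration rather than a result reported in this chapter.

```python
import math

# Rows: age 29 or younger, age 30 or older.
# Columns: considered emigrating, did not consider emigrating.
# Counts are reconstructed from rounded percentages and are approximate.
observed = [
    [322, 575 - 322],
    [948 - 322, (2400 - 575) - (948 - 322)],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        # Expected count if age group and consideration were unrelated.
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

# With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-square = {chi2:.1f}, p = {p_value:.2g}")
```

A very small p-value means that, if age group and consideration of emigration were unrelated in the population, a difference this large would rarely arise by chance. It does not by itself show that the relationship is causal.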

## Exercise 3.1

Hypotheses emerge from the research questions to which a study is devoted. Thinking about one or more countries with which you are familiar: (a) Identify the independent and dependent variables in each of the example research questions below. (b) Formulate at least one hypothesis for each question. Make sure to include your expectations about the directionality of the relationship between the two variables; is it positive/direct or negative/inverse? (c) In two or three sentences, describe a plausible causal story to which each of your hypotheses might call attention.

- Does religiosity affect people’s preference for democracy?
- Does preference for democracy affect the likelihood that a person will vote? (Footnote 4)

## Exercise 3.2

Since its establishment in 2006, the Arab Barometer has, as of spring 2022, conducted 68 social and political attitude surveys in the Middle East and North Africa. It has conducted one or more surveys in 16 different Arab countries, and it has recorded the attitudes, values, and preferences of more than 100,000 ordinary citizens.

The Arab Barometer website (arabbarometer.org) provides detailed information about the Barometer itself and about the scope, methodology, and conduct of its surveys. Data from the Barometer’s surveys can be downloaded in SPSS, Stata, or CSV format. The website also contains numerous reports, articles, and summaries of findings.

In addition, the Arab Barometer website contains an Online Data Analysis Tool that makes it possible, without downloading any data, to find the distribution of responses to any question asked in any country in any wave. The tool is found in the “Survey Data” menu. After selecting the country and wave of interest, click the “See Results” tab to select the question(s) for which you want to see the response distributions. Click the “Cross by” tab to see the distributions of respondents who differ on one of the available demographic attributes.

The charts below present, in percentages, the response distributions of Jordanians interviewed in 2018 to two questions about gender equality. Below the charts are questions that you are asked to answer. These questions pertain to formulating hypotheses and to the relationship between hypotheses and causal stories.

1. For each of the two distributions, do you think (hypothesize) that the attitudes of Jordanian women are:
   - About the same as those of Jordanian men
   - More favorable toward gender equality than those of Jordanian men
   - Less favorable toward gender equality than those of Jordanian men
2. For each of the two distributions, do you think (hypothesize) that the attitudes of younger Jordanians are:
   - About the same as those of older Jordanians
   - More favorable toward gender equality than those of older Jordanians
   - Less favorable toward gender equality than those of older Jordanians
3. Restate your answers to Questions 1 and 2 as hypotheses.
4. Give the reasons for your answers to Questions 1 and 2. In two or three sentences, make explicit the presumed causal story on which your hypotheses are based.
5. Using the Arab Barometer’s Online Analysis Tool, check to see whether your answers to Questions 1 and 2 are correct. For those instances in which an answer is incorrect, suggest in a sentence or two a causal story on which the correct relationship might be based.
6. In which other country surveyed by the Arab Barometer in 2018 do you think the distributions of responses to the questions about gender equality are very similar to the distributions in Jordan? What attributes of Jordan and the other country informed your selection of the other country?
7. In which other country surveyed by the Arab Barometer in 2018 do you think the distributions of responses to the questions about gender equality are very different from the distributions in Jordan? What attributes of Jordan and the other country informed your selection of the other country?
8. Using the Arab Barometer’s Online Analysis Tool, check to see whether your answers to Questions 6 and 7 are correct. For those instances in which an answer is incorrect, suggest in a sentence or two a causal story on which the correct relationship might be based.

We will shortly return to and expand the discussion of probabilities and of hypothesis testing more generally. First, however, some additional discussion of hypothesis formulation is in order. Three important topics will be briefly considered. The first concerns the origins of hypotheses; the second concerns the criteria by which the value of a particular hypothesis or set of hypotheses should be evaluated; and the third, requiring a bit more discussion, concerns the structure of the hypothesized relationship between an independent variable and a dependent variable, or between any two variables that are hypothesized to be related.

## Origins of Hypotheses

Where do hypotheses come from? How should an investigator identify independent variables that may account for much, or at least some, of the variance on a dependent variable that she has observed and in which she is interested? Or, how should an investigator identify dependent variables whose variance has been determined, presumably only in part, by an independent variable whose impact she deems it important to assess?

Previous research is one place the investigator may look for ideas that will shape her hypotheses and the associated causal stories. This may include previous hypothesis-testing research, and this can be particularly instructive, but it may also include less systematic and structured observations, reports, and testimonies. The point, very simply, is that the investigator almost certainly is not the first person to think about, and offer information and insight about, the topic and questions in which the researcher herself is interested. Accordingly, attention to what is already known will very likely give the researcher some guidance and ideas as she strives for originality and significance in delineating the relationship between the variables in which she is interested.

Consulting previous research will also enable the researcher to determine what her study will add to what is already known—what it will contribute to the collective and cumulative work of researchers and others who seek to reduce uncertainty about a topic in which they share an interest. Perhaps the researcher’s study will fill an important gap in the scientific literature. Perhaps it will challenge and refine, or perhaps even place in doubt, distributions and explanations of variance that have thus far been accepted. Or perhaps her study will produce findings that shed light on the generalizability or scope conditions of previously accepted variable relationships. It need not do any of these things, but that will be for the researcher to decide, and her decision will be informed by knowledge of what is already known and reflection on whether and in what ways her study should seek to add to that body of knowledge.

Personal experience will also inform the researcher’s search for meaningful and informative hypotheses. It is almost certainly the case that a researcher’s interest in a topic in general, and in questions pertaining to this topic in particular, has been shaped by her own experience. The experience itself may involve many different kinds of connections or interactions, some more professional and work-related and some flowing simply and perhaps unintentionally from lived experience. The hypotheses about voting mentioned earlier, for example, might be informed by elections the researcher has witnessed and/or discussions with friends and colleagues about elections, their turnout, and their fairness. Or perhaps the researcher’s experience in her home country has planted questions about the generalizability of what she has witnessed at home.

All of this is to some extent obvious. But the take-away is that an investigator should not endeavor to set aside what she has learned about a topic in the name of objectivity, but rather, she should embrace whatever personal experience has taught her as she selects and refines the puzzles and propositions she will investigate. Should it happen that her experience leads her to incorrect or perhaps distorted understandings, this will be brought to light when her hypotheses are tested. It is in the testing that objectivity is paramount. In hypothesis formation, by contrast, subjectivity is permissible, and, in fact, it may often be unavoidable.

A final arena in which an investigator may look for ideas that will shape her hypotheses overlaps with personal experience and is also to some extent obvious. This is referenced by terms like creativity and originality and is perhaps best captured by the term “sociological imagination.” The take-away here is that hypotheses that deserve attention and, if confirmed, will provide important insights, may not all be somewhere out in the environment waiting to be found, either in the relevant scholarly literature or in recollections about relevant personal experience. They can and sometimes will be the product of imagination and wondering, of discernments that a researcher may come upon during moments of reflection and deliberation.

As in the case of personal experience, the point to be retained is that hypothesis formation may not only be a process of discovery, of finding the previous research that contains the right information. Hypothesis formation may also be a creative process, a process whereby new insights and proposed original understandings are the product of an investigator’s intellect and sociological imagination.

## Crafting Valuable Hypotheses

What are the criteria by which the value of a hypothesis or set of hypotheses should be evaluated? What elements define a good hypothesis? Some of the answers to these questions that come immediately to mind pertain to hypothesis testing rather than hypothesis formation. A good hypothesis, it might be argued, is one that is subsequently confirmed. But whether or not a confirmed hypothesis makes a positive contribution depends on the nature of the hypothesis and goals of the research. It is possible that a researcher will learn as much, and possibly even more, from findings that lead to rejection of a hypothesis. In any event, findings, whatever they may be, are valuable only to the extent that the hypothesis being tested is itself worthy of study.

Two important considerations, albeit somewhat obvious ones, are that a hypothesis should be non-trivial and non-obvious. If a proposition is trivial, suggesting a variable relationship with little or no significance, discovering whether and how the variables it brings together are related will not make a meaningful contribution to knowledge about the determinants and/or impact of the variance at the heart of the researcher’s concern. Few will be interested in findings, however rigorously derived, about a trivial proposition. The same is true of an obvious hypothesis, obvious being an attribute that makes a proposition trivial. As stated, these considerations are themselves somewhat obvious, barely deserving mention. Nevertheless, an investigator should self-consciously reflect on these criteria when formulating hypotheses. She should be sure that she is proposing variable relationships that are neither trivial nor obvious.

A third criterion, also somewhat obvious but nonetheless essential, has to do with the significance and salience of the variables being considered. Will findings from research about these variables be important and valuable, and perhaps also useful? If the primary variable of interest is a dependent variable, meaning that the primary goal of the research is to account for variance, then the significance and salience of the dependent variable will determine the value of the research. Similarly, if the primary variable of interest is an independent variable, meaning that the primary goal of the research is to determine and assess impact, then the significance and salience of the independent variable will determine the value of the research.

These three criteria—non-trivial, non-obvious, and variable importance and salience—are not very different from one another. They collectively mean that the researcher must be able to specify why and how the testing of her hypothesis, or hypotheses, will make a contribution of value. Perhaps her propositions are original or innovative; perhaps knowing whether they are true or false makes a difference or will be of practical benefit; perhaps her findings add something specific and identifiable to the body of existing scholarly literature on the subject. While calling attention to these three connected and overlapping criteria might seem unnecessary since they are indeed somewhat obvious, it remains the case that the value of a hypothesis, regardless of whether or not it is eventually confirmed, is itself important to consider, and an investigator should, therefore, know and be able to articulate the reasons and ways that consideration of her hypothesis, or hypotheses, will indeed be of value.

## Hypothesizing the Structure of a Relationship

Relevant in the process of hypothesis formation are, as discussed, questions about the origins of hypotheses and the criteria by which the value of any particular hypothesis or set of hypotheses will be evaluated. Relevant, too, is consideration of the structure of a hypothesized variable relationship and the causal story to which that relationship is believed to call attention.

The point of departure in considering the structure of a hypothesized variable relationship is an understanding that such a relationship may or may not be linear. In a direct, or positive, linear relationship, each one-unit increase in the independent variable brings a constant increase in the dependent variable. In an inverse, or negative, linear relationship, each one-unit increase in the independent variable brings a constant decrease in the dependent variable. But these are only two of the many ways that an independent variable and a dependent variable may be related, or hypothesized to be related. This is easily illustrated by hypotheses in which level of education or age is the independent variable. It is relevant in hypothesis formation because the investigator must be alert to the possibility that the variables in which she is interested are in fact related in a non-linear way.

Consider, for example, the relationship between age and support for gender equality, the latter measured by an index based on several questions about the rights and behavior of women that are asked in Arab Barometer surveys. A researcher might expect, and might therefore want to hypothesize, that an increase in age brings increased support for, or alternatively increased opposition to, gender equality. But these are not the only possibilities. Likely, perhaps, is the possibility of a curvilinear relationship, in which case increases in age bring increases in support for gender equality until a person reaches a certain age, maybe 40, 45, or 50, after which additional increases in age bring decreases in support for gender equality. Or the researcher might hypothesize that the curve is in the opposite direction, that support for gender equality initially decreases as a function of age until a particular age is reached, after which additional increases in age bring an increase in support.

Of course, there are also other possibilities. In the case of education and gender equality, for example, increased education may initially have no impact on attitudes toward gender equality. Individuals who have not finished primary school, those who have finished primary school, and those who have gone somewhat beyond primary school and completed a middle school program may all have roughly the same attitudes toward gender equality. Thus, increases in education, within a certain range of educational levels, are not expected to bring an increase or a decrease in support for gender equality. But the level of support for gender equality among high school graduates may be higher and among university graduates may be higher still. Accordingly, in this hypothetical illustration, an increase in education does bring increased support for gender equality but only beginning after middle school.

A middle school level of education is a “floor” in this example. Education does not begin to make a difference until this floor is reached, and thereafter it does make a difference, with increases in education beyond middle school bringing increases in support for gender equality. Another possibility might be for middle school to be a “ceiling.” This would mean that increases in education through middle school would bring increases in support for gender equality, but the trend would not continue beyond middle school. In other words, level of education makes a difference and appears to have explanatory power only until, and so not after, this ceiling is reached. This latter pattern was found in the study of education and Palestinian protest activity discussed earlier. Increases in education through middle school brought increases in the likelihood that an individual would participate in demonstrations and protests of Israeli occupation. However, additional education beyond middle school was not associated with greater likelihood of taking part in protest activities.
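The structures described above can be written as simple functions, which may help when thinking about which shape a hypothesis actually posits. The following is a minimal sketch rather than a model of any real data: the function names, the turning point of 45 years, and the use of 9 years of schooling to stand for completion of middle school are all illustrative assumptions.

```python
def linear(x, slope=1.0, intercept=0.0):
    """Linear: every one-unit increase in x changes y by the same amount."""
    return intercept + slope * x


def curvilinear(x, peak=45.0, height=10.0, curvature=0.01):
    """Inverted U: y rises until x reaches `peak` (assumed 45), then falls."""
    return height - curvature * (x - peak) ** 2


def floor_effect(x, floor=9.0, slope=1.0):
    """Floor: x makes no difference until `floor` is reached, then y rises."""
    return slope * max(0.0, x - floor)


def ceiling_effect(x, ceiling=9.0, slope=1.0):
    """Ceiling: y rises with x until `ceiling` is reached, then stays flat."""
    return slope * min(x, ceiling)


if __name__ == "__main__":
    # Hypothetical years of schooling; 9 stands in for middle school.
    for years in (3, 6, 9, 12, 16):
        print(years, floor_effect(years), ceiling_effect(years))
```

Under the floor pattern, the values for 3, 6, and 9 years of schooling are identical and only later years differ. Under the ceiling pattern, the values stop changing once 9 years is reached.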

This discussion of variation in the structure of a hypothesized relationship between two variables is certainly not exhaustive, and the examples themselves are straightforward and not very complicated. The purpose of the discussion is to emphasize that an investigator must be open to and think through the possibility and plausibility of different kinds of relationships between her two variables, that is to say, relationships with different structures. Bivariate relationships with several different kinds of structures are depicted visually by the scatter plots in Fig. 3.4.

These possibilities with respect to structure do not determine the value of a proposed hypothesis. As discussed earlier, the value of a proposed relationship depends first and foremost on the importance and salience of the variable of interest. Accordingly, a researcher should not assume that the value of a hypothesis varies as a function of the degree to which it posits a complicated variable relationship. More complicated hypotheses are not necessarily better or more correct. But while she should not strive for or give preference to variable relationships that are more complicated simply because they are more complicated, she should, again, be alert to the possibility that a more complicated pattern does a better job of describing the causal connection between the two variables in the place and time in which she is interested.

This brings the discussion of formulating hypotheses back to our earlier account of causal stories. In research concerned with explanation and causality, a hypothesis for the most part is a simplified stand-in for a causal story. It represents the causal story, as it were. Expressing this differently, the hypothesis states the causal story’s “bottom line;” it posits that the independent variable is a determinant of variance on the dependent variable, and it identifies the structure of the presumed relationship between the independent variable and the dependent variable. But it does not describe the interaction between the two variables in a way that tells consumers of the study why the researcher believes that the relationship involves causality rather than an association with no causal implications. This is left to the causal story, which will offer a fuller account of the way the presumed cause impacts the presumed effect.

## 3.3 Describing and Visually Representing Bivariate Relationships

Once a researcher has collected or otherwise obtained data on the variables in a bivariate relationship she wishes to examine, her first step will be to describe the variance on each of the variables using the univariate statistics described in Chap. 2. She will need to understand the distribution on each variable before she can understand how these variables vary in relation to one another. This is important whether she is interested in description or wishes to explore a bivariate causal story.

Once she has described each one of the variables, she can turn to the relationship between them. She can prepare and present a visual representation of this relationship, which is the subject of the present section. She can also use bivariate statistical tests to assess the strength and significance of the relationship, which is the subject of the next section of this chapter.

## Contingency Tables

Contingency tables are used to display the relationship between two categorical variables. They are similar to the univariate frequency distributions described in Chap. 2, the difference being that they juxtapose the two univariate distributions and display the interaction between them. Contingency tables are also called cross-tabulation tables. The cells of the table may present frequencies, row percentages, column percentages, and/or total percentages. Total frequencies and/or percentages are displayed in a total row and a total column, each one of which is the same as the univariate distribution of one of the variables taken alone.

Table 3.2, based on Palestinian data from Wave V of the Arab Barometer, crosses gender and the average number of hours watching television each day. Frequencies are presented in the cells of the table. In the cell showing the number of Palestinian men who do not watch television at all, row percentage, column percentage, and total percentage are also presented. Note that total percentage is based on the 10 cells showing the two variables taken together, which are summed in the lower right-hand cell. Thus, total percent for this cell is 342/2488 = 13.7 percent. Only frequencies are given in the other cells of the table, but in a full table these four figures (frequency, row percent, column percent, and total percent) would be given in every cell.
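The four figures that can appear in a cell differ only in the denominator used. The short sketch below shows the arithmetic on a small made-up table. The counts and category labels are hypothetical and are not the values in Table 3.2; only the final check, 342/2488, comes from the text.

```python
# Hypothetical frequencies: rows are gender, columns are daily TV hours.
counts = {
    "men": {"none": 30, "up to 2 hours": 50, "more than 2 hours": 20},
    "women": {"none": 20, "up to 2 hours": 40, "more than 2 hours": 40},
}


def cell_percentages(table, row, col):
    """Return frequency, row %, column %, and total % for one cell."""
    frequency = table[row][col]
    row_total = sum(table[row].values())
    col_total = sum(cells[col] for cells in table.values())
    grand_total = sum(sum(cells.values()) for cells in table.values())
    return {
        "frequency": frequency,
        "row_pct": 100 * frequency / row_total,      # share of this row
        "col_pct": 100 * frequency / col_total,      # share of this column
        "total_pct": 100 * frequency / grand_total,  # share of all cases
    }


result = cell_percentages(counts, "men", "none")
print(result)

# The total percentage reported in the text for the men / no television cell.
print(round(100 * 342 / 2488, 1))
```

In this made-up table, the 30 men who watch no television are 30 percent of all men (row percentage), 60 percent of all people who watch no television (column percentage), and 15 percent of all respondents (total percentage).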

## Exercise 3.3

Compute the row percentage, the column percentage, and the total percentage in the cell showing the number of Palestinian women who do not watch television at all.

Describe the relationship between gender and watching television among Palestinians that is shown in the table. Do the television watching habits of Palestinian men and women appear to be generally similar or fairly different? You might find it helpful to convert the frequencies in other cells to row or column percentages.

## Stacked Column Charts and Grouped Bar Charts

Stacked column charts and grouped bar charts are used to visually describe how two categorical variables, or one categorical and one continuous variable, relate to one another. Much like contingency tables, they show the percentage or count of each category of one variable within each category of the second variable. This information is presented in columns stacked on each other or next to each other. The charts below show the number of male Palestinians and the number of female Palestinians who watch television for a given number of hours each day. Each chart presents the same information as the other chart and as the contingency table shown above (Fig. 3.1).

Stacked column charts and grouped bar charts comparing Palestinian men and Palestinian women on hours watching television

## Box Plots and Box and Whisker Plots

Box plots, box and whisker plots, and other types of plots can also be used to show the relationship between one categorical variable and one continuous variable. They are particularly useful for showing how spread out the data are. Box plots show five important numbers in a variable’s distribution: the minimum value; the median; the maximum value; and the first and third quartiles (Q1 and Q3), which represent, respectively, the number below which are 25 percent of the distribution’s values and the number below which are 75 percent of the distribution’s values. The minimum value is sometimes called the lower extreme, the lower bound, or the lower hinge. The maximum value is sometimes called the upper extreme, the upper bound, or the upper hinge. The middle 50 percent of the distribution, the range between Q1 and Q3 that represents the “box,” constitutes the interquartile range (IQR). In box and whisker plots, the “whiskers” are the short perpendicular lines extending outside the upper and lower quartiles. They are included to indicate variability below Q1 and above Q3. Values are usually categorized as outliers if they are less than Q1 − IQR*1.5 or greater than Q3 + IQR*1.5. A visual explanation of a box and whisker plot is shown in Fig. 3.2a and an example of a box plot that uses actual data is shown in Fig. 3.2b.
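The five numbers and the outlier rule can be computed directly. The sketch below uses Python's standard `statistics` module and a made-up list of ages. Note that software packages use slightly different conventions for computing quartiles; the "inclusive" method chosen here is an assumption, and other conventions may give somewhat different values for Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical ages, not taken from any Arab Barometer survey.
ages = [18, 22, 25, 29, 31, 34, 38, 41, 45, 95]

print(five_number_summary(ages))
print(iqr_outliers(ages))
```

With these made-up ages, Q1 is 26 and Q3 is 40.25, so the IQR is 14.25 and any age above 61.625 is flagged. Only the value 95 falls outside the fences.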

The box plot in Fig. 3.2b uses Wave V Arab Barometer data from Tunisia and shows the relationship between age, a continuous variable, and interpersonal trust, a dichotomous categorical variable. The line representing the median value is shown in bold. Interpersonal trust, sometimes known as generalized trust, is an important personal value. Previous research has shown that social harmony and prospects for democracy are greater in societies in which most citizens believe that their fellow citizens for the most part are trustworthy. Although the interpersonal trust variable is dichotomous in Fig. 3.2b, the variance in interpersonal trust can also be measured by a set of ordered categories or a scale that yields a continuous measure, the latter not being suitable for presentation by a box plot. Figure 3.2b shows that the median age of Tunisians who are trusting is slightly higher than the median age of Tunisians who are mistrustful of other people. Notice also that the box plot for the mistrustful group has an outlier.

(a) A box and whisker plot. (b) Box plot comparing the ages of trusting and mistrustful Tunisians in 2018

## Line Plots

Line plots may be used to visualize the relationship between two continuous variables or a continuous variable and a categorical variable. They are often used when time, or a variable related to time, is one of the two variables. If a researcher wants to show whether and how a variable changes over time for more than one subgroup of the units about which she has data (looking at men and women separately, for example), she can include multiple lines on the same plot, with each line showing the pattern over time for a different subgroup. These lines will generally be distinguished from each other by color or pattern, with a legend provided for readers.

Line plots are a particularly good way to visualize a relationship if an investigator thinks that important events over time may have had a significant impact. The line plot in Fig. 3.3 shows the average support for gender equality among men and among women in Tunisia from 2013 to 2018. Support for gender equality is a scale based on four questions related to gender equality in the three waves of the Arab Barometer. An answer supportive of gender equality on a question adds +.5 to the scale and an answer unfavorable to gender equality adds −.5 to the scale. Accordingly, a scale score of 2 indicates maximum support for gender equality and a scale score of −2 indicates maximum opposition to gender equality.
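The scoring rule for the scale can be expressed in a few lines. This is a sketch of the rule as stated in the text only. The function name and the coding of each answer as `True` (supportive) or `False` (unfavorable) are assumptions, and the text does not say how missing or neutral answers are treated, so they are not handled here.

```python
def gender_equality_score(answers):
    """Score four answers: +0.5 if supportive (True), -0.5 if not (False)."""
    if len(answers) != 4:
        raise ValueError("the scale described in the text uses four questions")
    return sum(0.5 if supportive else -0.5 for supportive in answers)


print(gender_equality_score([True, True, True, True]))      # maximum support
print(gender_equality_score([True, True, False, False]))    # mixed answers
print(gender_equality_score([False, False, False, False]))  # maximum opposition
```

Four supportive answers give 4 × 0.5 = 2 and four unfavorable answers give 4 × (−0.5) = −2, which is why the scale runs from −2 to 2.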

Line plot showing level of support for gender equality among Tunisian women and men in 2013, 2016, and 2018

## Scatter Plots

Scatter plots are used to visualize a bivariate relationship when both variables are numerical. The independent variable is put on the x-axis, the horizontal axis, and the dependent variable is put on the y-axis, the vertical axis. Each data point becomes a dot in the scatter plot’s two-dimensional field, with its precise location being the point at which its value on the x-axis intersects with its value on the y-axis. The scatter plot shows how the variables are related to one another, including with respect to linearity, direction, and other aspects of structure. The scatter plots in Fig. 3.4 illustrate a strong positive linear relationship, a moderately strong negative linear relationship, a strong non-linear relationship, and a pattern showing no relationship (see Footnote 5). If the scatter plot displays no visible and clear pattern, as in the lower left-hand plot shown in Fig. 3.4, the scatter plot would indicate that the independent variable, by itself, has no meaningful impact on the dependent variable.

Scatter plots showing bivariate relationships with different structures

Scatter plots are also a good way to identify outliers—data points that do not follow a pattern that characterizes most of the data. These are also called non-scalar types. Figure 3.5 shows a scatter plot with outliers.

Outliers can be informative, making it possible, for example, to identify the attributes of cases for which the measures of one or both variables are unreliable and/or invalid. Nevertheless, the inclusion of outliers may not only distort the assessment of measures, raising unwarranted doubts about measures that are actually reliable and valid for the vast majority of cases, but may also bias bivariate statistics and make relationships seem weaker than they really are for most cases. For this reason, researchers sometimes remove outliers prior to testing a hypothesis. If one does this, it is important to have a clear definition of what is an outlier and to justify the removal of the outlier, both using the definition and perhaps through substantive analysis. There are several mathematical formulas for identifying outliers, and researchers should be aware of these formulas and their pros and cons if they plan to remove outliers.
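A small numerical illustration may make the point about bias concrete. The sketch below uses Pearson's correlation coefficient as one example of a bivariate statistic; that choice, the hand-written `pearson_r` function, and all of the data values are assumptions made for illustration and are not drawn from the text or from any survey.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Ten hypothetical cases that follow a nearly perfect positive linear pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1]

# The same cases plus one hypothetical outlier far below the pattern.
x_with_outlier = x + [11]
y_with_outlier = y + [1.0]

r_clean = pearson_r(x, y)
r_outlier = pearson_r(x_with_outlier, y_with_outlier)
print(round(r_clean, 3), round(r_outlier, 3))
```

A single aberrant case out of eleven is enough to pull the coefficient well below the value that describes the other ten cases, which is the kind of distortion the paragraph above warns about.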

If there are relatively few outliers, perhaps no more than 5–10 percent of the cases, it may be justifiable to remove them in order to better discern the relationship between the independent variable and the dependent variable. If outliers are much more numerous, however, it is probably because there is not a significant relationship between the two variables being considered. The researcher might in this case find it instructive to introduce a third variable and disaggregate the data. Disaggregation will be discussed in Chap. 4.

A scatter plot with outliers marked in red

## Exercise 3.4 Exploring Hypotheses through Visualizing Data: Exercise with the Arab Barometer Online Analysis Tool

Go to the Arab Barometer Online Analysis Tool (https://www.arabbarometer.org/survey-data/data-analysis-tool/)

Select Wave V and a country that interests you

Select “See Results”

Select “Social, Cultural and Religious topics”

Select “Religion: frequency: pray”

Questions: What does the distribution of this variable look like? How would you describe the variance?

Click on “Cross by,” then

Select “Show all variables”

Select “Kind of government preferable” and click

Select “Options,” then “Show % over Row total,” then “Apply”

Questions: Does there seem to be a relationship between religiosity and preference for democracy? If so, what might explain the relationship you observe—what is a plausible causal story? Is it consistent with the hypothesis you wrote for Exercise 3.1?

What other variables could be used to measure religiosity and preference for democracy? Explore your hypothesis using different items from the list of Arab Barometer variables.

Do these distributions support the previous results you found? Do you learn anything additional about the relationship between religiosity and preference for democracy?

Now it is your turn to explore variables and variable relationships that interest you!

Pick two variables that interest you from the list of Arab Barometer variables. Are they continuous or categorical? Ordinal or nominal? (Hint: Most Arab Barometer variables are categorical, even if you might be tempted to think of them as continuous. For example, age is divided into the ordinal categories 18–29, 30–49, and 50 and more.)

Do you expect there to be a relationship between the two variables? If so, what do you think will be the structure of that relationship, and why?

Select the wave (year) and the country that interest you

Select one of your two variables of interest

Click on “Cross by,” and then select your second variable of interest.

On the left side of the page, you’ll see a contingency table. On the right side at the top, you’ll see several options to graphically display the relationship between your two variables. Which type of graph best represents the relationship between your two variables of interest?

Do the two variables seem to be independent of each other, or do you think there might be a relationship between them? Is the relationship you see similar to what you had expected?

## 3.4 Probabilities and Type I and Type II Errors

As in visual presentations of bivariate relationships, selecting the appropriate measure of association or bivariate statistical test depends on the types of the two variables. The data on both variables may be categorical; the data on both may be continuous; or the data may be categorical on one variable and continuous on the other variable. These characteristics of the data will guide the way in which our presentation of these measures and tests is organized. Before briefly describing some specific measures of association and bivariate statistical tests, however, it is necessary to lay a foundation by introducing a number of terms and concepts. Relevant here are the distinction between population and sample and the notions of the null hypothesis, of Type I and Type II errors, and of probabilities and confidence intervals. As concepts, or abstractions, these notions may influence the way a researcher thinks about drawing conclusions about a hypothesis from qualitative data, as was discussed in Chap. 2. In their precise meaning and application, however, these terms and concepts come into play when hypothesis testing involves the statistical analysis of quantitative data.

To begin, it is important to distinguish between, on the one hand, the population of units—individuals, countries, ethnic groups, political movements, or any other unit of analysis—in which the researcher is interested and about which she aspires to advance conclusions and, on the other hand, the units on which she has actually acquired the data to be analyzed. The latter, the units on which she actually has data, is her sample. In cases where the researcher has collected or obtained data on all of the units in which she is interested, there is no difference between the sample and the population, and drawing conclusions about the population based on the sample is straightforward. Most often, however, a researcher does not possess data on all of the units that make up the population in which she is interested, and so the possibility of error when making inferences about the population based on the analysis of data in the sample requires careful and deliberate consideration.

This concern for error is present regardless of the size of the sample and the way it was constructed. The likelihood of error declines as the size of the sample increases and thus comes closer to representing the full population. It also declines if the sample was constructed in accordance with random or other sampling procedures designed to maximize representation. It is useful to keep these criteria in mind when looking at, and perhaps downloading and using, Arab Barometer data. The Barometer’s website gives information about the construction of each sample. But while it is possible to reduce the likelihood of error when characterizing the population from findings based on the sample, it is not possible to eliminate entirely the possibility of erroneous inference. Accordingly, a researcher must endeavor to make the likelihood of this kind of error as small as possible and then decide if it is small enough to advance conclusions that apply to the population as well as the sample.
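The way in which the likelihood of error declines with sample size can be illustrated with the standard error of a sample proportion. The sketch below assumes simple random sampling, which is a simplification; actual survey samples, including those described on the Arab Barometer website, may use more complex designs for which this formula is only an approximation. The proportion of 0.5 and the sample sizes are arbitrary.

```python
import math


def standard_error(proportion, sample_size):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(proportion * (1 - proportion) / sample_size)


# Arbitrary sample sizes; each one is four times the previous one.
for n in (100, 400, 1600, 6400):
    print(n, standard_error(0.5, n))
```

Each fourfold increase in the sample size cuts the standard error in half. The error shrinks steadily as the sample grows but never reaches zero, which is consistent with the point that erroneous inference can be made less likely but not eliminated.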

The null hypothesis, frequently designated as H0, is a statement to the effect that there is no meaningful and significant relationship between the independent variable and the dependent variable in a hypothesis, or indeed between two variables even if the relationship between them has not been formally specified in a hypothesis and does not purport to be causal or explanatory. The null hypothesis may or may not be stated explicitly by an investigator, but it is nonetheless present in her thinking; it stands in opposition to the hypothesized variable relationship. In a point and counterpoint fashion, the hypothesis, H1, posits that the variables are significantly related, and the null hypothesis, H0, replies and says no, they are not significantly related. It further says that they are not related in any meaningful way, neither in the way proposed in H1 nor in any other way that could be proposed.

Based on her analysis, the researcher needs to determine whether her findings permit rejecting the null hypothesis and concluding that there is indeed a significant relationship between the variables in her hypothesis, concluding in effect that the research hypothesis, H1, has been confirmed. This is most relevant and important when the investigator is basing her analysis on some but not all of the units to which her hypothesis purports to apply, that is, when she is analyzing the data in her sample but seeks to advance conclusions that apply to the population in which she is interested. The logic here is that the findings produced by an analysis of some of the data, the data she actually possesses, may be different from the findings her analysis would hypothetically produce were she able to use data from very many more, or ideally even all, of the units that make up her population of interest.

This means, of course, that there will be uncertainty as the researcher adjudicates between H0 and H1 on the basis of her data. An analysis of these data may suggest that there is a strong and significant relationship between the variables in H1. And the stronger the relationship, the more unlikely it is that the researcher’s sample is a subset of a population characterized by H0, and the more justified the researcher is, therefore, in considering H1 to have been confirmed. Yet, it remains at least possible that the researcher’s sample, although it provides strong support for H1, is actually a subset of a population characterized by the null hypothesis. This may be unlikely, but it is not impossible, and so to consider H1 to have been confirmed is to run the risk, at least a small risk, of what is known as a Type I error. A Type I error is made when a researcher accepts a research hypothesis that is actually false, when she judges to be true a hypothesis that does not characterize the population of which her sample is a subset. Because of the possibility of a Type I error, even if quite unlikely, researchers will often write something like “We can reject the null hypothesis,” rather than “We can confirm our hypothesis.”

Another analysis related to voter turnout provides a ready illustration. In the Arab Barometer Wave V surveys in 12 Arab countries (see Footnote 6), 13,899 respondents answered a question about voting in the most recent parliamentary election. Of these, 46.6 percent said they had voted, and the remainder, 53.4 percent, said they had not voted in the last parliamentary election (see Footnote 7). Seeking to identify some of the determinants of voting, that is, the attitudes and experiences of an individual that increase the likelihood that she will vote, the researcher might hypothesize that a judgment that the country is going in the right direction will push toward voting. More formally:

H1. An individual who believes that her country is going in the right direction is more likely to vote in a national election than is an individual who believes her country is going in the wrong direction.

Arab Barometer surveys provide data with which to test this proposition, and in fact there is a difference associated with views about the direction in which the country is going. Among those who judged that their country is going in the right direction, 52.4 percent voted in the last parliamentary election. By contrast, among those who judged that their country is going in the wrong direction, only 43.8 percent voted in the last parliamentary election.

This illustrates the choice a researcher faces when deciding what to conclude from a study. Does the analysis of her data from a subset of her population of interest confirm or not confirm her hypothesis? In this example, based on Arab Barometer data, the findings are in the direction of her hypothesis, and differences in voting associated with views about the direction the country is going do not appear to be trivial. But are these differences big enough to justify the conclusion that judgements about the country’s path going forward are a determinant of voting, one among others of course, in the population from which her sample was drawn? In other words, although this relationship clearly characterizes the sample, it is unclear whether it characterizes the researcher’s population of interest, the population from which the sample was drawn.

Unless the researcher can gather data on the entire population of eligible voters, or at least almost all of this population, it is not possible to entirely eliminate uncertainty when the researcher makes inferences about the population of voters based on findings from the subset, or sample, of voters on which she has data. She can either conclude that her findings are sufficiently strong and clear to propose that the pattern she has observed characterizes the population as well, and that H1 is therefore confirmed; or she can conclude that her findings are not strong enough to make such an inference about the population, and that H1, therefore, is not confirmed. Either conclusion could be wrong, and so there is a chance of error no matter which conclusion the researcher advances.

The terms Type I error and Type II error are often used to designate the possible error associated with each of these inferences about the population based on the sample. Type I error refers to the rejection of a true null hypothesis. This means, in other words, that the investigator could be wrong if she concludes that her finding of a strong, or at least fairly strong, relationship between her variables characterizes Arab voters in the 12 countries in general, and if she thus judges H1 to have been confirmed when the population from which her sample was drawn is in fact characterized by H0. Type II error refers to acceptance of a false null hypothesis. This means, in other words, that the investigator could be wrong if she concludes that her finding of a somewhat weak relationship, or no relationship at all, between her variables characterizes Arab voters in the 12 countries in general, and that she thus judges H0 to be true when the population from which her sample was drawn is in fact characterized by H1.

In statistical analyses of quantitative data, decisions about whether to risk a Type I error or a Type II error are usually based on probabilities. More specifically, they are based on the probability of a researcher being wrong if she concludes that the variable relationship—or hypothesis in most cases—that characterizes her data, meaning her sample, also characterizes the population on which the researcher hopes her sample and data will shed light. To say this in yet another way, she computes the odds that her sample does not represent the population of which it is a subset; or more specifically still, she computes the odds that from a population that is characterized by the null hypothesis she could have obtained, by chance alone, a subset of the population, her sample, that is not characterized by the null hypothesis. The lower the odds, or probability, the more willing the researcher will be to risk a Type I error.

There are numerous statistical tests that are used to compute such probabilities. The nature of the data and the goals of the analysis will determine the specific test to be used in a particular situation. Most of these tests, frequently called tests of significance or tests of statistical significance, provide output in the form of probabilities, which always range from 0 to 1. The lower the value, meaning the closer to 0, the less likely it is that a researcher has collected and is working with data that produce findings that differ from what she would find were she to somehow have data on the entire population. Another way to think about this is the following:

- If the researcher provisionally assumes that the population is characterized by the null hypothesis with respect to the variable relationship under study, what is the probability of obtaining from that population, by chance alone, a subset or sample that is not characterized by the null hypothesis but instead shows a strong relationship between the two variables?
- The lower the probability value, meaning the closer to 0, the less likely it is that the researcher’s data, which support H1, have come from a population that is characterized by H0.
- The lower the probability that her sample could have come from a population characterized by H0, the lower the possibility that the researcher will be wrong, that she will make a Type I error, if she rejects the null hypothesis and accepts that the population, as well as her sample, is characterized by H1.
- When the probability value is low, the chance of actually making a Type I error is small. But while small, the possibility of an error cannot be entirely eliminated.

If it helps you to think about probability and Type I and Type II error, imagine that you will be flipping a coin 100 times and your goal is to determine whether the coin is unbiased, H0, or biased in favor of either heads or tails, H1. How many times more than 50 would heads have to come up before you would be comfortable concluding that the coin is in fact biased in favor of heads? Would 60 be enough? What about 65? To begin to answer these questions, you would want to know the odds of getting 60 or 65 heads from a coin that is actually unbiased, a coin that would come up heads and come up tails roughly the same number of times if it were flipped many more than 100 times, maybe 1000 times, maybe 10,000. With this many flips, the ratio of heads to tails would even out. The lower the odds, the less likely it is that the coin is unbiased. In this analogy, you can think of the mathematical calculations about an unbiased coin’s odds of getting heads as the population, and your actual flips of the coin as the sample.
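The odds mentioned in the coin example can be computed directly from the binomial distribution. The following is a minimal sketch, assuming a fair coin (probability of heads 0.5) and 100 independent flips; the function name `prob_at_least` is introduced here only for illustration.

```python
from math import comb

def prob_at_least(k, n=100, p=0.5):
    """Probability of k or more heads in n flips of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# How surprising are 55, 60, or 65 heads if the coin is actually unbiased (H0)?
for heads in (55, 60, 65):
    print(f"P(at least {heads} heads in 100 fair flips) = {prob_at_least(heads):.4f}")
```

Under these assumptions, 60 or more heads turns up in roughly 3 percent of 100-flip experiments with a fair coin, and 65 or more in roughly 0.2 percent. These are one-sided probabilities; if bias toward either heads or tails counts as evidence against H0, the values would be doubled.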

But exactly how low does the probability of a Type I error have to be for a researcher to run the risk of rejecting H0 and accepting that her variables are indeed related? This depends, of course, on the implications of being wrong. If there are serious and harmful consequences of being wrong, of accepting a research hypothesis that is actually false, the researcher will reject H0 and accept H1 only if the odds of being wrong, of making a Type I error, are very low.

There are some widely used probability values, known as “significance levels,” that help researchers and those who read their reports to think about the likelihood that a Type I error is being made. In the social sciences, rejecting H0 and running the risk of a Type I error is usually thought to require a probability value of less than .05, written as p < .05. The less stringent value of p < .10 is sometimes accepted as sufficient for rejecting H0, although such a conclusion would be advanced with caution and only when the consequences of a Type I error are not very harmful. Frequently considered safer, meaning that the likelihood of accepting a false hypothesis is lower, are p < .01 and p < .001. The next section introduces and briefly describes some of the bivariate statistics that may be used to calculate these probabilities.

## 3.5 Measures of Association and Bivariate Statistical Tests

The following section introduces some of the bivariate statistical tests that can be used to compute probabilities and test hypotheses. The accounts are not very detailed. They will provide only a general overview and refresher for readers who are already fairly familiar with bivariate statistics. Readers without this familiarity are encouraged to consult a statistics textbook, for which the accounts presented here will provide a useful guide. While the account below will emphasize calculating these test statistics by hand, it is also important to remember that they can be calculated with the assistance of statistical software as well. A discussion of statistical software is available in Appendix 4.

## Parametric and Nonparametric Statistics

Parametric and nonparametric are two broad classifications of statistical procedures. A parameter in statistics refers to an attribute of a population. For example, the mean of a population is a parameter. Parametric statistical tests make certain assumptions about the shape of the distribution of values in a population from which a sample is drawn, generally that it is normally distributed, and about its parameters, that is to say the means and standard deviations of the assumed distributions. Nonparametric statistical procedures rely on no or very few assumptions about the shape or parameters of the distribution of the population from which the sample was drawn. Of the procedures described below, chi-squared, Spearman’s rank-order correlation, and the Wilcoxon-Mann-Whitney test are nonparametric; the others are parametric.

## Degrees of Freedom

Degrees of freedom (df) is the number of values in the calculation of a statistic that are free to vary. Statistical software programs usually give degrees of freedom in the output, so it is generally unnecessary to know the number of the degrees of freedom in advance. It is nonetheless useful to understand what degrees of freedom represent. Consistent with the definition above, it is the number of values that are not predetermined, and thus are free to vary, within the variables used in a statistical test.

This is illustrated by the contingency tables below, which are constructed to examine the relationship between two categorical variables. The marginal row and column totals are known since these are just the univariate distributions of each variable. df = 1 for Table 3.3a, which is a 4-cell table. You can enter any one value in any one cell, but thereafter the values of all the other three cells are determined. Only one number is free to vary and thus not predetermined. df = 2 for Table 3.3b, which is a 6-cell table. You can enter any two values in any two cells, but thereafter the values of all the other cells are determined. Only two numbers are free to vary and thus not predetermined. For contingency tables, the formula for calculating df is:

$$df = (r - 1)(c - 1)$$

where r is the number of rows and c is the number of columns in the table, not counting the marginal totals.
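The idea that one free cell determines a 4-cell table can be sketched in a few lines. This is an illustrative sketch only: it uses the standard contingency-table rule df = (rows − 1) × (columns − 1), and the marginal totals and cell value are hypothetical numbers, not taken from Tables 3.3a or 3.3b.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of a contingency table with fixed marginal totals."""
    return (n_rows - 1) * (n_cols - 1)

def complete_2x2(top_left, row_totals, col_totals):
    """Fill in a 2x2 table from one freely chosen cell and the marginals."""
    a = top_left
    b = row_totals[0] - a   # rest of the first row
    c = col_totals[0] - a   # rest of the first column
    d = row_totals[1] - c   # the last cell is then forced as well
    return [[a, b], [c, d]]

print(degrees_of_freedom(2, 2))                 # 4-cell table
print(degrees_of_freedom(2, 3))                 # 6-cell table
print(complete_2x2(30, (50, 50), (40, 60)))     # hypothetical marginals
```

Once the single value 30 is chosen, the other three cells follow from the row and column totals, which is what df = 1 expresses.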

## Chi-Squared

Chi-squared, frequently written χ², is a statistical test used to determine whether two categorical variables are significantly related. As noted, it is a nonparametric test. The most common version of the chi-squared test is the Pearson chi-squared test, which gives a value for the chi-squared statistic and permits determining as well a probability value, or p-value. For a given number of degrees of freedom, the magnitude of the statistic and of the probability value are inversely related; the higher the value of the chi-squared statistic, the lower the probability value, and thus the lower the risk of making a Type I error, that is, of rejecting a true null hypothesis, when asserting that the two variables are significantly related.

The simplicity of the chi-squared statistic permits giving a little more detail in order to illustrate several points that apply to bivariate statistical tests in general. The formula for computing chi-squared is given below, with O being the observed (actual) frequency in each cell of a contingency table for two categorical variables and E being the frequency that would be expected in each cell if the two variables are not related. Put differently, the distribution of E values across the cells of the two-variable table constitutes the null hypothesis, and chi-squared provides a number that expresses the magnitude of the difference between an investigator’s actual observed values and the values of E.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

The computation of chi-squared involves the following procedures, which are illustrated using the data in Table 3.4.

The values of O in the cells of the table are based on the data collected by the investigator. For example, Table 3.4 shows that of the 200 women on whom she collected information, 85 are majoring in social science.

The value of E for each cell is computed by multiplying the marginal total of the column in which the cell is located by the marginal total of the row in which the cell is located and dividing by N, N being the total number of cases. For the female students majoring in social science in Table 3.4, this is: 200 * 150/400 = 30,000/400 = 75. For the female students majoring in math and natural science in Table 3.4, this is: 200 * 100/400 = 20,000/400 = 50.

The difference between the value of O and the value of E is computed for each cell using the formula for chi-squared. For the female students majoring in social science in Table 3.4, this is: (85 − 75)²/75 = 10²/75 = 100/75 = 1.33. For the female students majoring in math and natural science, the value resulting from the application of the chi-squared formula is: (45 − 50)²/50 = (−5)²/50 = 25/50 = .50.

The values in each cell of the table resulting from the application of the chi-squared formula are summed (Σ). This chi-squared value expresses the magnitude of the difference between a distribution of values indicative of the null hypothesis and what the investigator actually found about the relationship between gender and field of study. In Table 3.4, the cell for female students majoring in social science adds 1.33 to the sum of the values in the eight cells, the cell for female students majoring in math and natural science adds .50 to the sum, and so forth for the remaining six cells.
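The procedure just described can be sketched in code. The sketch below is illustrative: the function names are introduced here, and only the two cells of Table 3.4 that the text spells out are used, since the rest of the table is not reproduced. The general `chi_squared` function accepts any table of observed counts given as a list of rows.

```python
def expected_count(row_total, col_total, n):
    """Expected frequency E of a cell if the two variables are unrelated."""
    return row_total * col_total / n

def cell_contribution(observed, expected):
    """One cell's term in the chi-squared sum: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        cell_contribution(obs, expected_count(row_totals[i], col_totals[j], n))
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

# The two cells described in the text: 200 women, N = 400,
# column totals of 150 (social science) and 100 (math and natural science).
e_social = expected_count(200, 150, 400)
e_math = expected_count(200, 100, 400)
print(e_social, round(cell_contribution(85, e_social), 2))
print(e_math, round(cell_contribution(45, e_math), 2))
```

For the second cell the contribution is 25/50 = 0.50, because the denominator is that cell's own expected value of 50.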

A final point to be noted, which applies to many other statistical tests as well, is that the application of chi-squared and other bivariate (and multivariate) statistical tests yields a value from which a probability can be computed: the probability that a pattern like the one observed would arise by chance if the null hypothesis were true, and thus that a Type I error will be made if the null hypothesis is rejected and the research hypothesis is judged to be true. The lower the probability, of course, the lower the likelihood of an error if the null hypothesis is rejected.

Prior to the advent of computer assisted statistical analysis, the value of the statistic and the number of degrees of freedom were used to find the probability value in a table of probability values in an appendix in most statistics books. At present, however, the probability value, or p-value, and also the degrees of freedom, are routinely given as part of the output when analysis is done by one of the available statistical software packages.
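What the printed tables once supplied can be sketched as a small function that converts a chi-squared value and its degrees of freedom into a probability value. This is a minimal sketch using the standard closed-form expressions for the upper tail of the chi-squared distribution with whole-number degrees of freedom; the function name `chi2_p_value` is an assumption introduced here. In practice a statistical package would be used, for example `scipy.stats.chi2.sf` in Python.

```python
from math import erfc, exp, pi, sqrt

def chi2_p_value(statistic, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if statistic <= 0:
        return 1.0
    x = statistic
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    # Odd df: a normal-tail term plus a finite correction series
    tail = erfc(sqrt(x / 2))
    term = sqrt(2 * x / pi) * exp(-x / 2)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total

# Familiar critical values from textbook tables for p = .05
for stat, df in ((3.841, 1), (5.991, 2), (7.815, 3)):
    print(df, round(chi2_p_value(stat, df), 3))
```

The three values used in the loop are the usual textbook critical values for p = .05 at one, two, and three degrees of freedom, so each call should return approximately .05.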

Table 3.5 shows the relationship between economic circumstance and trust in the government among 400 ordinary citizens in a hypothetical country, presenting three different patterns that the observed data might take. The observed data were collected to test the hypothesis that greater wealth pushes people toward greater trust and less wealth pushes people toward lesser trust. In the case of all three patterns, the probability of obtaining such data if the null hypothesis were true is very low. All three patterns have the same high chi-squared value and low probability value. Thus, the chi-squared and p-values show only that the patterns all differ significantly from what would be expected were the null hypothesis true. They do not show whether the data support the hypothesized variable relationship or any other particular relationship.

As the three patterns in Table 3.5 show, variable relationships with very different structures can yield similar or even identical statistical test and probability values, and thus these tests provide only some of the information a researcher needs to draw conclusions about her hypothesis. To draw the right conclusion, it may also be necessary for the investigator to “look at” her data. For example, as Table 3.5 suggests, looking at a tabular or visual presentation of the data may also be needed to draw the proper conclusion about how two variables are related.
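The point that differently structured relationships can produce identical chi-squared values can be checked with a short sketch. The three tables below are hypothetical illustrations made up for this purpose, not the contents of Table 3.5; rows stand for low, middle, and high wealth, and columns for low, medium, and high trust.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total

# Hypothetical counts (N = 400): more wealth goes with more trust.
positive = [[70, 40, 20],
            [40, 60, 40],
            [20, 40, 70]]
# Same rows in reverse order: more wealth goes with less trust.
negative = positive[::-1]
# Same rows in a third order: trust is highest in the middle wealth group.
non_monotonic = [positive[0], positive[2], positive[1]]

for name, table in (("positive", positive), ("negative", negative),
                    ("non-monotonic", non_monotonic)):
    print(name, round(chi_squared(table), 2))
```

Because the statistic sums the same cell contributions regardless of the order of the rows, all three tables yield the same value, even though only the first is consistent with the research hypothesis.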

How would you describe the three patterns shown in the table, each of which differs significantly from the null hypothesis? Which pattern is consistent with the research hypothesis? How would you describe the other two patterns? Try to visualize a plot of each pattern.

## Pearson Correlation Coefficient

The Pearson correlation coefficient, more formally known as the Pearson product-moment correlation, is a parametric measure of linear association. It gives a numerical representation of the strength and direction of the relationship between two continuous numerical variables. The coefficient, which is commonly represented as r, will have a value between −1 and 1. A value of 1 means that there is a perfect positive, or direct, linear relationship between the two variables; as one variable increases, the other variable consistently increases by some amount. A value of −1 means that there is a perfect negative, or inverse, linear relationship; as one variable increases, the other variable consistently decreases by some amount. A value of 0 means that there is no linear relationship; as one variable increases, the other variable neither consistently increases nor consistently decreases.

It is easy to think of relationships that might be assessed by a Pearson correlation coefficient. Consider, for example, the relationship between age and income and the proposition that as age increases, income consistently increases or consistently decreases as well. The closer a coefficient is to 1 or −1, the greater the likelihood that the data on which it is based are not the subset of a population in which age and income are unrelated, meaning that the population of interest is not characterized by the null hypothesis. Coefficients very close to 1 or −1 are rare. Although it depends on the number of units on which the researcher has data and also on the nature of the variables, coefficients higher than .3 or lower than −.3 are frequently high enough, in absolute terms, to yield a low probability value and justify rejecting the null hypothesis. The relationship in this case would be described as “statistically significant.”
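The coefficient itself can be computed with a few lines of code. The sketch below uses the standard product-moment formula; the age and income figures are hypothetical values chosen only to illustrate the calculation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: age in years and monthly income in arbitrary units.
age = [22, 30, 38, 45, 53, 60]
income = [310, 420, 400, 520, 610, 580]
print(round(pearson_r(age, income), 2))

# A U-shaped pattern has no linear relationship, so r is 0
# even though the two variables are clearly related.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The second call is a reminder that r measures linear association only: a strong curved relationship can still produce a coefficient of 0.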

## Exercise 3.5

Estimating Correlation Coefficients from scatter plots

Look at the scatter plots in Fig. 3.4 and estimate the correlation coefficient that the bivariate relationship shown in each scatter plot would yield.

Explain the basis for each of your estimates of the correlation coefficient.

## Spearman’s Rank-Order Correlation Coefficient

Spearman’s rank-order correlation coefficient is a nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient (ρ, also signified by r_s) measures the strength and direction of the association between two ranked variables.
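One common way to compute Spearman’s coefficient is to replace each value with its rank and then apply the Pearson formula to the ranks. The sketch below takes that approach, assigning tied values the average of their ranks; the function names and the example data are hypothetical. Statistical packages offer the same calculation, for example `scipy.stats.spearmanr`.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A relationship that always increases but is not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Because y always rises with x, the rank-order coefficient is 1 even though the Pearson coefficient falls short of 1 for this curved relationship.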

## Bivariate Regression

Bivariate regression is a parametric measure of association that, like correlation analysis, assesses the strength and direction of the relationship between two variables. Also, like correlation analysis, regression assumes linearity. It may give misleading results if used with variable relationships that are not linear.

Regression is a powerful statistic that is widely used in multivariate analyses. This includes ordinary least squares (OLS) regression, which requires that the dependent variable be continuous and assumes linearity; binary logistic regression, which may be used when the dependent variable is dichotomous; and ordinal logistic regression, which is used with ordinal dependent variables. The use of regression in multivariate analysis will be discussed in the next chapter. In bivariate analysis, regression analysis yields coefficients that indicate the strength and direction of the relationship between two variables. Researchers may opt to “standardize” these coefficients. Standardized coefficients from a bivariate regression are the same as the coefficients produced by Pearson product-moment correlation analysis.
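The link between standardized regression coefficients and the Pearson coefficient can be checked numerically. The following is a minimal sketch of a bivariate ordinary least squares fit, assuming one continuous independent variable and one continuous dependent variable; the data and function names are hypothetical.

```python
from math import sqrt

def ols_fit(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def standardized_slope(x, y):
    """Slope rescaled by the standard deviations of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    slope, _ = ols_fit(x, y)
    return slope * sd_x / sd_y

def pearson_r(x, y):
    """Pearson product-moment correlation, for comparison."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mean_x) ** 2 for a in x) *
                      sum((b - mean_y) ** 2 for b in y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = ols_fit(x, y)
print(round(slope, 2), round(intercept, 2))
print(round(standardized_slope(x, y), 4), round(pearson_r(x, y), 4))
```

The unstandardized slope is expressed in the units of the variables, while the standardized slope matches the Pearson coefficient for the same data.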

## T-Test

The t-test, also sometimes called a “difference of means” test, is a parametric statistical test that compares the means of a dependent variable in two groups and determines whether they are different enough from each other to reject the null hypothesis and risk a Type I error. The dependent variable in a t-test must be continuous or ordinal; otherwise the investigator cannot calculate a mean. The independent variable must be categorical since t-tests are used to compare two groups.

An example, drawing again on Arab Barometer data, tests the relationship between voting and support for democracy. The hypothesis might be that men and women who voted in the last parliamentary election are more likely than men and women who did not vote to believe that democracy is suitable for their country. Whether a person did or did not vote would be the categorical independent variable, and the dependent variable would be the response to a question like, “To what extent do you think democracy is suitable for your country?” The question about democracy asked respondents to situate their views on an 11-point scale, with 0 indicating completely unsuitable and 10 indicating completely suitable.

Focusing on Tunisia in 2018, Arab Barometer Wave V data show that the mean response on the 11-point suitability question is 5.11 for those who voted and 4.77 for those who did not vote. Is this difference of .34 large enough to be statistically significant? A t-test will determine the probability of getting a difference of this magnitude from a population of interest, most likely all Tunisians of voting age, in which there is no difference between voters and non-voters in views about the suitability of democracy for Tunisia. In this example, the t-test showed p < .086. With this p-value, which is higher than the generally accepted standard of .05, a researcher cannot with confidence reject the null hypothesis, and she is unable, therefore, to assert that the proposed relationship has been confirmed.
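The statistic behind such a comparison can be sketched as follows. This is an illustration only: it uses Welch’s version of the t-test, which does not assume equal variances in the two groups (the text does not say which version was used), and the responses are hypothetical values on the 0 to 10 scale, not Arab Barometer data. Converting t and its degrees of freedom to a p-value is normally left to software, for example `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(a) / len(a)      # squared standard error of the mean of a
    vb = variance(b) / len(b)      # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical responses on the 11-point suitability scale
voters = [6, 5, 7, 4, 8, 5, 6, 3, 7, 5]
non_voters = [5, 4, 6, 3, 5, 4, 7, 2, 5, 4]
t, df = welch_t(voters, non_voters)
print(round(t, 2), round(df, 1))
```

The larger the absolute value of t for a given number of degrees of freedom, the lower the probability of seeing such a difference in means if the two groups do not differ in the population.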

This question can also be explored at the country level of analysis with, for example, regime type as the independent variable. In this illustration, the hypothesis is that citizens of monarchies are more likely than citizens of republics to believe that democracy is suitable for their country. Of course, a researcher proposing this hypothesis would also advance an associated causal story that provides the rationale for the hypothesis and specifies what is really being tested. To test this proposition, an investigator might merge data from surveys in, say, three monarchies, perhaps Morocco, Jordan, and Kuwait, and then also merge data from surveys in three republics, perhaps Algeria, Egypt, and Iraq. A t-test would then be used to compare the means of people in republics and people in monarchies and give the p-value.

A similar test, the Wilcoxon-Mann-Whitney test, is a nonparametric test that does not require that the dependent variable be normally distributed.

## ANOVA

Analysis of variance, or ANOVA, is closely related to the t-test. It may be used when the dependent variable is continuous and the independent variable is categorical. A one-way ANOVA compares the mean and variance values of a continuous dependent variable in two or more categories of a categorical independent variable in order to determine if the latter affects the former.

ANOVA calculates the F-ratio based on the variance between the groups and the variance within each group. The F-ratio can then be used to calculate a p-value. However, if there are more than two categories of the independent variable, the ANOVA test will not indicate which pairs of categories differ enough to be statistically significant, making it necessary, again, to look at the data in order to draw correct conclusions about the structure of the bivariate relationships. Two-way ANOVA is used when an investigator has two categorical independent variables and one continuous dependent variable.
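The F-ratio for a one-way ANOVA can be computed directly from the group data. The sketch below follows the usual definition, the between-group mean square divided by the within-group mean square; the three groups are hypothetical. A package routine such as `scipy.stats.f_oneway` would also return the p-value.

```python
def one_way_f(groups):
    """F-ratio with its between-group and within-group degrees of freedom."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores for three categories of an independent variable
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_f(groups))
```

A large F-ratio indicates that the group means differ by more than the spread within groups would lead one to expect, but, as noted above, it does not say which groups differ from which.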

Table 3.6 presents a summary list of the visual representations and bivariate statistical tests that have been discussed. It reminds readers of the procedures that can be used when both variables are categorical, when both variables are numerical/continuous, and when one variable is categorical and one variable is numerical/continuous.

## Bivariate Statistics and Causal Inference

It is important to remember that bivariate statistical tests only assess the association or correlation between two variables. The tests described above can help a researcher estimate how much confidence her hypothesis deserves and, more specifically, the probability that any significant variable relationships she has found characterize the larger population from which her data were drawn and about which she seeks to offer information and insight.

The finding that two variables in a hypothesized relationship are related to a statistically significant degree is not evidence that the relationship is causal, only that the independent variable is related to the dependent variable. The finding is consistent with the causal story that the hypothesis represents, and to that extent, it offers support for this story. Nevertheless, there are many reasons why an observed statistically significant relationship might be spurious. The correlation might, for example, reflect the influence of one or more other and uncontrolled variables. This will be discussed more fully in the next chapter. The point here is simply that bivariate statistics do not, by themselves, address the question of whether a statistically significant relationship between two variables is or is not a causal relationship.

## Only an Introductory Overview

As has been emphasized throughout, this chapter seeks only to offer an introductory overview of the bivariate statistical tests that may be employed when an investigator seeks to assess the relationship between two variables. Additional information will be presented in Chap. 4. The focus in Chap. 4 will be on multivariate analysis, on analyses involving three or more variables. In this case again, however, the chapter will provide only an introductory overview. The overviews in the present chapter and the next provide a foundation for understanding social statistics, for understanding what statistical analyses involve and what they seek to accomplish. This is important and valuable in and of itself. Nevertheless, researchers and would-be researchers who intend to incorporate statistical analyses into their investigations, perhaps to test hypotheses and decide whether to risk a Type I error or a Type II error, will need to build on this foundation and become familiar with the contents of texts on social statistics. If this guide offers a bird’s eye view, researchers who implement these techniques will also need to expose themselves to the view of the worm at least once.

Chapter 2 makes clear that the concept of variance is central and foundational for much and probably most data-based and quantitative social science research. Bivariate relationships, which are the focus of the present chapter, are building blocks that rest on this foundation. The goal of this kind of research is very often the discovery of causal relationships, relationships that explain rather than merely describe or predict. Such relationships are also frequently described as accounting for variance. This is the focus of Chap. 4, and it means that there will be, first, a dependent variable, a variable that expresses and captures the variance to be explained, and then, second, an independent variable, and possibly more than one independent variable, that impacts the dependent variable and causes it to vary.

Bivariate relationships are at the center of this enterprise, establishing the empirical pathway leading from the variance discussed in Chap. 2 to the causality discussed in Chap. 4. Finding that there is a significant relationship between two variables, a statistically significant relationship, is not sufficient to establish causality, to conclude with confidence that one of the variables impacts the other and causes it to vary. But such a finding is necessary.

The goal of social science inquiry that investigates the relationship between two variables is not always explanation. It might be simply to describe and map the way two variables interact with one another. And there is no reason to question the value of such research. But the goal of data-based social science research is very often explanation; and while the inter-relationships between more than two variables will almost always be needed to establish that a relationship is very likely to be causal, these inter-relationships can only be examined by empirics that begin with consideration of a bivariate relationship, a relationship with one variable that is a presumed cause and one variable that is a presumed effect.

Against this background, with the importance of two-variable relationships in mind, the present chapter offers a comprehensive overview of bivariate relationships, including but not only those that are hypothesized to be causally related. The chapter considers the origin and nature of hypotheses that posit a particular relationship between two variables, a causal relationship if the larger goal of the research is explanation and the delineation of a causal story to which the hypothesis calls attention. This chapter then considers how a bivariate relationship might be described and visually represented, and thereafter it discusses how to think about and determine whether the two variables actually are related.

Presenting tables and graphs to show how two variables are related and using bivariate statistics to assess the likelihood that an observed relationship differs significantly from the null hypothesis, the hypothesis of no relationship, will be sufficient if the goal of the research is to learn as much as possible about whether and how two variables are related. And there is plenty of excellent research that has this kind of description as its primary objective, that makes use for purposes of description of the concepts and procedures introduced in this chapter. But there is also plenty of research that seeks to explain, to account for variance, and for this research, use of these concepts and procedures is necessary but not sufficient. For this research, consideration of a two-variable relationship, the focus of the present chapter, is a necessary intermediate step on a pathway that leads from the observation of variance to explaining how and why that variance looks and behaves as it does.

Dana El Kurd. 2019. “Who Protests in Palestine? Mobilization Across Class Under the Palestinian Authority.” In Alaa Tartir and Timothy Seidel, eds. Palestine and Rule of Power: Local Dissent vs. International Governance. New York: Palgrave Macmillan.

Yael Zeira. 2019. The Revolution Within: State Institutions and Unarmed Resistance in Palestine. New York: Cambridge University Press.

Carolina de Miguel, Amaney A. Jamal, and Mark Tessler. 2015. “Elections in the Arab World: Why do citizens turn out?” Comparative Political Studies 48 (11): 1355–1388.

Question 1: Independent variable is religiosity; dependent variable is preference for democracy. Example of hypothesis for Question 1: H1. More religious individuals are more likely than less religious individuals to prefer democracy to other political systems. Question 2: Independent variable is preference for democracy; dependent variable is turning out to vote. Example of hypothesis for Question 2: H2. Individuals who prefer democracy to other political systems are more likely than individuals who do not prefer democracy to other political systems to turn out to vote.

Mike Yi. “A complete Guide to Scatter Plots,” posted October 16, 2019 and seen at https://chartio.com/learn/charts/what-is-a-scatter-plot/

The countries are Algeria, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Palestine, Sudan, Tunisia, and Yemen. The Wave V surveys were conducted in 2018–2019.

Not considered in this illustration are the substantial cross-country differences in voter turnout. For example, 63.6 percent of the Lebanese respondents reported voting, whereas in Algeria the proportion who reported voting was only 20.3 percent. In addition to testing hypotheses about voting in which the individual is the unit of analysis, country could also be the unit of analysis, and hypotheses seeking to account for country-level variance in voting could be formulated and tested.

## Author information

Authors and affiliations.

Department of Political Science, University of Michigan, Ann Arbor, MI, USA

Mark Tessler

## Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Copyright information

© 2023 The Author(s)

## About this chapter

Tessler, M. (2023). Bivariate Analysis: Associations, Hypotheses, and Causal Stories. In: Social Science Research in the Arab World and Beyond. SpringerBriefs in Sociology. Springer, Cham. https://doi.org/10.1007/978-3-031-13838-6_3

## Download citation

DOI : https://doi.org/10.1007/978-3-031-13838-6_3

Published : 04 October 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-13837-9

Online ISBN : 978-3-031-13838-6

eBook Packages : Social Sciences Social Sciences (R0)



## Research Proposal – Types, Template and Example


## Research Proposal

A research proposal is a document that outlines a proposed research project. It is typically written by researchers, scholars, or students who intend to conduct research to address a specific research question or problem.

## Types of Research Proposal

Research proposals can vary depending on the nature of the research project and the specific requirements of the funding agency, academic institution, or research program. Here are some common types of research proposals:

## Academic Research Proposal

This is the most common type of research proposal, which is prepared by students, scholars, or researchers to seek approval and funding for an academic research project. It includes the standard components of a proposal, such as the introduction, literature review, methodology, and expected outcomes.

## Grant Proposal

A grant proposal is specifically designed to secure funding from external sources, such as government agencies, foundations, or private organizations. It typically includes additional sections, such as a detailed budget, project timeline, evaluation plan, and a description of the project’s alignment with the funding agency’s priorities and objectives.

## Dissertation or Thesis Proposal

Students pursuing a master’s or doctoral degree often need to submit a proposal outlining their intended research for their dissertation or thesis. These proposals are usually more extensive and comprehensive, including an in-depth literature review, theoretical framework, research questions or hypotheses, and a detailed methodology.

## Research Project Proposal

This type of proposal is often prepared by researchers or research teams within an organization or institution. It outlines a specific research project that aims to address a particular problem, explore a specific area of interest, or provide insights for decision-making. Research project proposals may include sections on project management, collaboration, and dissemination of results.

## Research Fellowship Proposal

Researchers or scholars applying for research fellowships may be required to submit a proposal outlining their proposed research project. These proposals often emphasize the novelty and significance of the research and its alignment with the goals and objectives of the fellowship program.

## Collaborative Research Proposal

In cases where researchers from multiple institutions or disciplines collaborate on a research project, a collaborative research proposal is prepared. This proposal highlights the objectives, responsibilities, and contributions of each collaborator, as well as the overall research plan and coordination mechanisms.

## Research Proposal Outline

A research proposal typically follows a standard outline that helps structure the document and ensure all essential components are included. While the specific headings and subheadings may vary slightly depending on the requirements of your institution or funding agency, the following outline provides a general structure for a research proposal:

- Title page
  - Title of the research proposal
  - Name of the researcher(s) or principal investigator(s)
  - Affiliation or institution
  - Date of submission
- Abstract
  - A concise summary of the research proposal, typically limited to 200-300 words.
  - Briefly introduce the research problem or question, state the objectives, summarize the methodology, and highlight the expected outcomes or significance of the research.
- Introduction
  - Provide an overview of the subject area and the specific research problem or question.
  - Present relevant background information, theories, or concepts to establish the need for the research.
  - Clearly state the research objectives or research questions that the study aims to address.
  - Indicate the significance or potential contributions of the research.
- Literature review
  - Summarize and analyze relevant studies, theories, or scholarly works.
  - Identify research gaps or unresolved issues that your study intends to address.
  - Highlight the novelty or uniqueness of your research.
- Methodology: research design
  - Describe the overall approach or research design that will be used (e.g., experimental, qualitative, quantitative).
  - Justify the chosen approach based on the research objectives and question.
- Methodology: data collection
  - Explain how data will be collected (e.g., surveys, interviews, experiments).
  - Describe the sampling strategy and sample size, if applicable.
  - Address any ethical considerations related to data collection.
- Methodology: data analysis
  - Outline the data analysis techniques or statistical methods that will be applied.
  - Explain how the data will be interpreted and analyzed to answer the research question(s).
- Timeline
  - Provide a detailed schedule or timeline that outlines the various stages of the research project.
  - Specify the estimated duration for each stage, including data collection, analysis, and report writing.
- Expected outcomes and significance
  - State the potential outcomes or results of the research.
  - Discuss the potential significance or contributions of the study to the field.
  - Address any potential limitations or challenges that may be encountered.
- Resources
  - Identify the resources required to conduct the research, such as funding, equipment, or access to data.
  - Specify any collaborations or partnerships necessary for the successful completion of the study.
- References
  - Include a list of cited references in the appropriate citation style (e.g., APA, MLA).


## Research Proposal Example Template

Here’s an example of a research proposal to give you an idea of how it can be structured:

Title: The Impact of Social Media on Adolescent Well-being: A Mixed-Methods Study

Abstract:

This research proposal aims to investigate the impact of social media on the well-being of adolescents. The study will employ a mixed-methods approach, combining quantitative surveys and qualitative interviews to gather comprehensive data. The research objectives include examining the relationship between social media use and mental health, exploring the role of peer influence in shaping online behaviors, and identifying strategies for promoting healthy social media use among adolescents. The findings of this study will contribute to the understanding of the effects of social media on adolescent well-being and inform the development of targeted interventions.

1. Introduction

1.1 Background and Context:

Adolescents today are immersed in social media platforms, which have become integral to their daily lives. However, concerns have been raised about the potential negative impact of social media on their well-being, including increased rates of depression, anxiety, and body dissatisfaction. It is crucial to investigate this phenomenon further and understand the underlying mechanisms to develop effective strategies for promoting healthy social media use among adolescents.

1.2 Research Objectives:

The main objectives of this study are:

- To examine the association between social media use and mental health outcomes among adolescents.
- To explore the influence of peer relationships and social comparison on online behaviors.
- To identify strategies and interventions to foster positive social media use and enhance adolescent well-being.

2. Literature Review

Extensive research has been conducted on the impact of social media on adolescents. Existing literature suggests that excessive social media use can contribute to negative outcomes, such as low self-esteem, cyberbullying, and addictive behaviors. However, some studies have also highlighted the positive aspects of social media, such as providing opportunities for self-expression and social support. This study will build upon this literature by incorporating both quantitative and qualitative approaches to gain a more nuanced understanding of the relationship between social media and adolescent well-being.

3. Methodology

3.1 Research Design:

This study will adopt a mixed-methods approach, combining quantitative surveys and qualitative interviews. The quantitative phase will involve administering standardized questionnaires to a representative sample of adolescents to assess their social media use, mental health indicators, and perceived social support. The qualitative phase will include in-depth interviews with a subset of participants to explore their experiences, motivations, and perceptions related to social media use.

3.2 Data Collection Methods:

Quantitative data will be collected through an online survey distributed to schools in the target region. The survey will include validated scales to measure social media use, mental health outcomes, and perceived social support. Qualitative data will be collected through semi-structured interviews with a purposive sample of participants. The interviews will be audio-recorded and transcribed for thematic analysis.

3.3 Data Analysis:

Quantitative data will be analyzed using descriptive statistics and regression analysis to examine the relationships between variables. Qualitative data will be analyzed thematically to identify common themes and patterns within participants’ narratives. Integration of quantitative and qualitative findings will provide a comprehensive understanding of the research questions.

4. Timeline

The research project will be conducted over a period of 12 months, divided into specific phases, including literature review, study design, data collection, analysis, and report writing. A detailed timeline outlining the key milestones and activities is provided in Appendix A.

5. Expected Outcomes and Significance

This study aims to contribute to the existing literature on the impact of social media on adolescent well-being by employing a mixed-methods approach. The findings will inform the development of evidence-based interventions and guidelines to promote healthy social media use among adolescents. This research has the potential to benefit adolescents, parents, educators, and policymakers by providing insights into the complex relationship between social media and well-being and offering strategies for fostering positive online experiences.

6. Resources

The resources required for this research include access to a representative sample of adolescents, research assistants for data collection, statistical software for data analysis, and funding to cover survey administration and participant incentives. Ethical considerations will be taken into account, ensuring participant confidentiality and obtaining informed consent.

7. References

## Research Proposal Writing Guide

Writing a research proposal can be a complex task, but with proper guidance and organization, you can create a compelling and well-structured proposal. Here’s a step-by-step guide to help you through the process:

- Understand the requirements: Familiarize yourself with the guidelines and requirements provided by your institution, funding agency, or program. Pay attention to formatting, page limits, specific sections or headings, and any other instructions.
- Identify your research topic: Choose a research topic that aligns with your interests, expertise, and the goals of your program or funding opportunity. Ensure that your topic is specific, focused, and relevant to the field of study.
- Conduct a literature review: Review existing literature and research relevant to your topic. Identify key theories, concepts, methodologies, and findings related to your research question. This will help you establish the context, identify research gaps, and demonstrate the significance of your proposed study.
- Define your research objectives and research question(s): Clearly state the objectives you aim to achieve with your research. Formulate research questions that address the gaps identified in the literature review. Your research objectives and questions should be specific, measurable, achievable, relevant, and time-bound (SMART).
- Develop a research methodology: Determine the most appropriate research design and methodology for your study. Consider whether quantitative, qualitative, or mixed-methods approaches will best address your research question(s). Describe the data collection methods, sampling strategy, data analysis techniques, and any ethical considerations associated with your research.
- Create a research plan and timeline: Outline the various stages of your research project, including tasks, milestones, and deadlines. Develop a realistic timeline that considers factors such as data collection, analysis, and report writing. This plan will help you stay organized and manage your time effectively throughout the research process.
- Write the proposal: Organize the document into the following sections:
  - A. Introduction: Provide background information on the research problem, highlight its significance, and introduce your research objectives and questions.
  - B. Literature review: Summarize relevant literature, identify gaps, and justify the need for your proposed research.
  - C. Methodology: Describe your research design, data collection methods, sampling strategy, data analysis techniques, and any ethical considerations.
  - D. Expected outcomes and significance: Explain the potential outcomes, contributions, and implications of your research.
  - E. Resources: Identify the resources required to conduct your research, such as funding, equipment, or access to data.
  - F. References: Include a list of cited references in the appropriate citation style.
- Revise and proofread: Review your proposal for clarity, coherence, and logical flow. Check for grammar and spelling errors. Seek feedback from mentors, colleagues, or advisors to refine and improve your proposal.
- Finalize and submit: Make any necessary revisions based on feedback and finalize your research proposal. Ensure that you have met all the requirements and formatting guidelines. Submit your proposal within the specified deadline.

## Research Proposal Length

The length of a research proposal can vary depending on the specific guidelines provided by your institution or funding agency. However, research proposals typically range from 1,500 to 3,000 words, excluding references and any additional supporting documents.

## Purpose of Research Proposal

The purpose of a research proposal is to outline and communicate your research project to others, such as academic institutions, funding agencies, or potential collaborators. It serves several important purposes:

- Demonstrate the significance of the research: A research proposal explains the importance and relevance of your research project. It outlines the research problem or question, highlights the gaps in existing knowledge, and explains how your study will contribute to the field. By clearly articulating the significance of your research, you can convince others of its value and potential impact.
- Provide a clear research plan: A research proposal outlines the methodology, design, and approach you will use to conduct your study. It describes the research objectives, data collection methods, data analysis techniques, and potential outcomes. By presenting a clear research plan, you demonstrate that your study is well-thought-out, feasible, and likely to produce meaningful results.
- Secure funding or support: For researchers seeking funding or support for their projects, a research proposal is essential. It allows you to make a persuasive case for why your research is deserving of financial resources or institutional backing. The proposal explains the budgetary requirements, resources needed, and potential benefits of the research, helping you secure the necessary funding or support.
- Seek feedback and guidance: Presenting a research proposal provides an opportunity to receive feedback and guidance from experts in your field. It allows you to engage in discussions and receive suggestions for refining your research plan, improving the methodology, or addressing any potential limitations. This feedback can enhance the quality of your study and increase its chances of success.
- Establish ethical considerations: A research proposal also addresses ethical considerations associated with your study. It outlines how you will ensure participant confidentiality, obtain informed consent, and adhere to ethical guidelines and regulations. By demonstrating your awareness and commitment to ethical research practices, you build trust and credibility in your proposed study.

## Importance of Research Proposal

The research proposal holds significant importance in the research process. Here are some key reasons why research proposals are important:

- Planning and organization: A research proposal requires careful planning and organization of your research project. It forces you to think through the research objectives, research questions, methodology, and potential outcomes before embarking on the actual study. This planning phase helps you establish a clear direction and framework for your research, ensuring that your efforts are focused and purposeful.
- Demonstrating the significance of the research: A research proposal allows you to articulate the significance and relevance of your study. By providing a thorough literature review and clearly defining the research problem or question, you can showcase the gaps in existing knowledge that your research aims to address. This demonstrates to others, such as funding agencies or academic institutions, why your research is important and deserving of support.
- Obtaining funding and resources: Research proposals are often required to secure funding for your research project. Funding agencies and organizations need to evaluate the feasibility and potential impact of the proposed research before allocating resources. A well-crafted research proposal helps convince funders of the value of your research and increases the likelihood of securing financial support, grants, or scholarships.
- Receiving feedback and guidance: Presenting a research proposal provides an opportunity to seek feedback and guidance from experts in your field. By sharing your research plan and objectives with others, you can benefit from their insights and suggestions. This feedback can help refine your research design, strengthen your methodology, and ensure that your study is rigorous and well-informed.
- Ethical considerations: A research proposal addresses ethical considerations associated with your study. It outlines how you will protect the rights and welfare of participants, maintain confidentiality, obtain informed consent, and adhere to ethical guidelines and regulations. This emphasis on ethical practices ensures that your research is conducted responsibly and with integrity.
- Enhancing collaboration and partnerships: A research proposal can facilitate collaborations and partnerships with other researchers, institutions, or organizations. When presenting your research plan, you may attract the interest of potential collaborators who share similar research interests or possess complementary expertise. Collaborative partnerships can enrich your study, expand your resources, and foster knowledge exchange.
- Establishing a research trajectory: A research proposal serves as a foundation for your research project. Once approved, it becomes a roadmap that guides your study’s implementation, data collection, analysis, and reporting. It helps maintain focus and ensures that your research stays on track and aligned with the initial objectives.

## When to Write Research Proposal

The timing of when to write a research proposal can vary depending on the specific requirements and circumstances. However, here are a few common situations when it is appropriate to write a research proposal:

- Academic research: If you are a student pursuing a research degree, such as a Ph.D. or Master’s by research, you will typically be required to write a research proposal as part of the application process. This is usually done before starting the research program to outline your proposed study and seek approval from the academic institution.
- Funding applications: When applying for research grants, scholarships, or funding from organizations or institutions, you will often need to submit a research proposal. Funding agencies require a detailed description of your research project, including its objectives, methodology, and expected outcomes. Writing a research proposal in this context is necessary to secure financial support for your study.
- Research collaborations: When collaborating with other researchers, institutions, or organizations on a research project, it is common to prepare a research proposal. This helps outline the research objectives, roles and responsibilities, and expected contributions from each party. Writing a research proposal in this case allows all collaborators to align their efforts and ensure a shared understanding of the project.
- Research project within an organization: If you are conducting research within an organization, such as a company or government agency, you may be required to write a research proposal to gain approval and support for your study. This proposal outlines the research objectives, methodology, resources needed, and expected outcomes, ensuring that the project aligns with the organization’s goals and objectives.
- Independent research projects: Even if you are not required to write a research proposal, it can still be beneficial to develop one for your independent research projects. Writing a research proposal helps you plan and structure your study, clarify your research objectives, and anticipate potential challenges or limitations. It also allows you to communicate your research plans effectively to supervisors, mentors, or collaborators.

## About the author

## Muhammad Hassan

Researcher, Academic Writer, Web developer




## 7 - Bivariate Hypothesis Testing

Once we have set up a hypothesis test and collected data, how do we evaluate what we have found? In this chapter we provide hands-on discussions of the basic building blocks used to make statistical inferences about the relationship between two variables. We deal with the often-misunderstood topic of “statistical significance” – focusing both on what it is and what it is not – as well as the nature of statistical uncertainty. We introduce three ways to examine relationships between two variables: tabular analysis, difference of means tests, and correlation coefficients. (We will introduce a fourth technique, bivariate regression analysis, in Chapter 8.)

## Bivariate Hypothesis Tests and Establishing Causal Relationships

In the preceding chapters we introduced the core concepts of hypothesis testing. In this chapter we discuss the basic mechanics of hypothesis testing with three different examples of bivariate hypothesis testing. It is worth noting that, although this type of analysis was the main form of hypothesis testing in the professional journals up through the 1970s, it is seldom used as the primary means of hypothesis testing in the professional journals today. This is the case because these techniques are good at helping us with only the first principle for establishing causal relationships. Namely, bivariate hypothesis tests help us to answer the question, “Are X and Y related?” By definition – “bivariate” means “two variables” – these tests cannot help us with the important question, “Have we controlled for all confounding variables Z that might make the observed association between X and Y spurious?”
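The limitation described above can be illustrated with a small simulation. The sketch below is not taken from the book: it uses made-up data in which a third variable Z drives both X and Y, and the helper names `pearson_r` and `partial_r` are hypothetical. It compares the bivariate correlation between X and Y with the first-order partial correlation that holds Z constant.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, holding Z constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated data (an assumption for illustration only): Z influences both
# X and Y, while X has no effect of its own on Y.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
r_xy_given_z = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))

print(f"bivariate r(X, Y)   = {r_xy:.2f}")
print(f"partial r(X, Y | Z) = {r_xy_given_z:.2f}")
```

Under this setup the bivariate correlation should come out clearly positive while the partial correlation should sit close to zero. A bivariate test alone would answer "Are X and Y related?" with yes, even though the association here is entirely due to Z.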


- Bivariate Hypothesis Testing
- Paul M. Kellstedt, Texas A&M University; Guy D. Whitten, Texas A&M University
- Book: The Fundamentals of Political Science Research
- Online publication: 05 May 2013
- Chapter DOI: https://doi.org/10.1017/CBO9781139104258.008


## 5 Examples of Bivariate Data in Real Life

Bivariate data refers to a dataset that contains exactly two variables.

Bivariate data is common in real-world settings, and it is typically analyzed with the following methods:

- Scatterplots
- Correlation Coefficients
- Simple Linear Regression

The following examples show different scenarios where bivariate data appears in real life.

## Example 1: Business

Businesses often collect bivariate data about total money spent on advertising and total revenue.

For example, a business may collect the following data for 12 consecutive sales quarters:

This is an example of bivariate data because it contains information on exactly two variables: advertising spend and total revenue.

The business may decide to fit a simple linear regression model to this dataset and find the following fitted model:

Total Revenue = 14,942.75 + 2.70*(Advertising Spend)

This tells the business that for each additional dollar spent on advertising, total revenue increases by an average of $2.70.
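As a rough sketch of how such a model could be fitted, the code below estimates the intercept and slope by ordinary least squares. The five data points are hypothetical values chosen for illustration (the original 12 quarters of data are not reproduced here), and `fit_line` is an illustrative helper name, so the estimates differ slightly from the fitted model quoted above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sum of cross-deviations divided by the sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical quarterly figures in dollars (not the data from the example).
ad_spend = [1000, 2000, 3000, 4000, 5000]
revenue = [17500, 20500, 23000, 25800, 28400]

intercept, slope = fit_line(ad_spend, revenue)
print(f"Total Revenue = {intercept:,.2f} + {slope:.2f}*(Advertising Spend)")
```

For these made-up numbers the slope works out to 2.71 and the intercept to 14,910, which reads the same way as the model above: each additional advertising dollar is associated with about $2.71 more revenue on average.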

## Example 2: Medical

Medical researchers often collect bivariate data to gain a better understanding of the relationship between variables related to health.

For example, a researcher may collect the following data about age and resting heart rate for 15 individuals:

The researcher may then decide to calculate the correlation between the two variables and find it to be 0.812.

This indicates that there is a strong positive correlation between the two variables. That is, as age increases resting heart rate tends to increase in a predictable manner as well.
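A minimal sketch of the calculation follows, assuming the Pearson correlation coefficient is the measure being used. The ages and heart rates are hypothetical values for eight individuals, not the data from the example, so the result will not equal 0.812.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements (illustration only).
ages = [25, 30, 35, 40, 45, 50, 55, 60]
resting_heart_rates = [62, 64, 63, 67, 70, 69, 73, 76]

r = pearson_r(ages, resting_heart_rates)
print(f"correlation between age and resting heart rate: {r:.3f}")
```

The coefficient always falls between -1 and 1. Values near 1 indicate that the points lie close to an upward-sloping line, and values near -1 indicate a downward-sloping one.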


## Example 3: Academics

Researchers often collect bivariate data to understand what variables affect the performance of university students.

For example, a researcher may collect data on the number of hours studied per week and the corresponding GPA for students in a certain class:

She may then create a simple scatterplot to visualize the relationship between these two variables:

Clearly there is a positive association between the two variables: As the number of hours studied per week increases, the GPA of the student tends to increase as well.

## Example 4: Economics

Economists often collect bivariate data to understand the relationship between two socioeconomic variables.

For example, an economist may collect data on the total years of schooling and total annual income among individuals in a certain city:

He may then decide to fit the following simple linear regression model:

Annual Income = -45,353 + 7,120*(Years of Schooling)

This tells the economist that for each additional year of schooling, annual income increases by $7,120 on average.
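To make the interpretation concrete, the fitted model can be used to compute predictions. The sketch below plugs the coefficients quoted above into a small function; the function name and the schooling values of 12 and 16 years are illustrative choices, not part of the original example.

```python
INTERCEPT = -45_353  # dollars, from the fitted model above
SLOPE = 7_120        # dollars per additional year of schooling


def predicted_income(years_of_schooling):
    """Predicted annual income in dollars for a given number of years of schooling."""
    return INTERCEPT + SLOPE * years_of_schooling


for years in (12, 16):
    print(f"{years} years of schooling -> ${predicted_income(years):,}")

# Four extra years of schooling shift the prediction by 4 * SLOPE.
difference = predicted_income(16) - predicted_income(12)
print(f"difference: ${difference:,}")

# Below this many years the model predicts a negative income.
break_even_years = -INTERCEPT / SLOPE
print(f"prediction crosses zero at about {break_even_years:.1f} years")
```

The model predicts $40,087 for 12 years of schooling and $68,567 for 16 years, a gap of $28,480. Because the intercept is negative, predictions turn negative below roughly 6.4 years of schooling, a reminder that a fitted line should be read only within the range of the data used to estimate it.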

## Example 5: Biology

Biologists often collect bivariate data to understand how two variables are related among plants or animals.

For example, a biologist may collect data on total rainfall and total number of plants in different regions:

The biologist may then decide to calculate the correlation between the two variables and find it to be 0.926.

This indicates that there is a strong positive correlation between the two variables.

That is, higher rainfall is closely associated with an increased number of plants in a region.



Indian J Anaesth. 2016 Sep; 60(9).

## Basic statistical tools in research and data analysis

Zulfiqar Ali and S Bala Bhaskar¹

Department of Anaesthesiology, Division of Neuroanaesthesiology, Sheri Kashmir Institute of Medical Sciences, Soura, Srinagar, Jammu and Kashmir, India

¹ Department of Anaesthesiology and Critical Care, Vijayanagar Institute of Medical Sciences, Bellary, Karnataka, India

Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

## INTRODUCTION

Statistics is a branch of science that deals with the collection, organisation, analysis of data and drawing of inferences from the samples to the whole population.[ 1 ] This requires a proper design of the study, an appropriate selection of the study sample and choice of a suitable statistical test. An adequate knowledge of statistics is necessary for proper designing of an epidemiological study or a clinical trial. Improper statistical methods may result in erroneous conclusions which may lead to unethical practice.[ 2 ]

A variable is a characteristic that varies from one individual member of a population to another.[ 3 ] Variables such as height and weight are measured on some type of scale, convey quantitative information and are called quantitative variables. Sex and eye colour give qualitative information and are called qualitative variables[ 3 ] [ Figure 1 ].

Figure 1: Classification of variables

## Quantitative variables

Quantitative or numerical data are subdivided into discrete and continuous measurements. Discrete numerical data are recorded as a whole number such as 0, 1, 2, 3,… (integer), whereas continuous data can assume any value. Observations that can be counted constitute the discrete data and observations that can be measured constitute the continuous data. Examples of discrete data are number of episodes of respiratory arrests or the number of re-intubations in an intensive care unit. Similarly, examples of continuous data are the serial serum glucose levels, partial pressure of oxygen in arterial blood and the oesophageal temperature.

A hierarchical scale of increasing precision can be used for observing and recording the data which is based on categorical, ordinal, interval and ratio scales [ Figure 1 ].

Categorical or nominal variables are unordered. The data are merely classified into categories and cannot be arranged in any particular order. If only two categories exist (as in gender: male and female), the data are called dichotomous (or binary). The various causes of re-intubation in an intensive care unit, such as upper airway obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary oedema and neurological impairment, are examples of categorical variables.

Ordinal variables have a clear ordering between their categories. However, the ordered data may not have equal intervals. Examples are the American Society of Anesthesiologists status or the Richmond agitation-sedation scale.

Interval variables are similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. A good example of an interval scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit scale, the difference between 70° and 75° is equal to the difference between 80° and 85°: The units of measurement are equal throughout the full range of the scale.

Ratio scales are similar to interval scales, in that equal differences between scale values have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property. For example, the system of centimetres is an example of a ratio scale. There is a true zero point and the value of 0 cm means a complete absence of length. The thyromental distance of 6 cm in an adult may be twice that of a child in whom it may be 3 cm.
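The practical difference between interval and ratio scales can be checked numerically. The following is a minimal sketch with illustrative helper names: ratios of lengths survive a change of unit because length has a true zero, whereas ratios of Fahrenheit temperatures do not survive conversion to Celsius.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


def cm_to_inches(cm: float) -> float:
    """Convert centimetres to inches."""
    return cm / 2.54


# Ratio scale: 6 cm is twice 3 cm, and it stays twice as long in inches.
length_ratio_cm = 6 / 3
length_ratio_in = cm_to_inches(6) / cm_to_inches(3)

# Interval scale: 80 F looks like "twice" 40 F, but the ratio changes in Celsius.
temp_ratio_f = 80 / 40
temp_ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

# Equal differences, however, stay equal on an interval scale.
diff_low = fahrenheit_to_celsius(75) - fahrenheit_to_celsius(70)
diff_high = fahrenheit_to_celsius(85) - fahrenheit_to_celsius(80)

print(length_ratio_cm, length_ratio_in, temp_ratio_f, temp_ratio_c)
```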

## STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics[ 4 ] try to describe the relationship between variables in a sample or population. Descriptive statistics provide a summary of data in the form of mean, median and mode. Inferential statistics[ 4 ] use a random sample of data taken from a population to describe and make inferences about the whole population. They are valuable when it is not possible to examine each member of an entire population. Examples of descriptive and inferential statistics are illustrated in Table 1.

Table 1: Example of descriptive and inferential statistics

## Descriptive statistics

The extent to which the observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.

## Measures of central tendency

The measures of central tendency are mean, median and mode.[ 6 ] Mean (or the arithmetic average) is the sum of all the scores divided by the number of scores. The mean may be influenced profoundly by extreme values. For example, the average stay of organophosphorus poisoning patients in ICU may be influenced by a single patient who stays in ICU for around 5 months because of septicaemia. Such extreme values are called outliers. The formula for the mean is

$$\bar{x} = \frac{\sum x}{n}$$

where $x$ = each observation and $n$ = number of observations. Median[ 6 ] is defined as the middle of a distribution in ranked data (with half of the values in the sample above and half below the median value), while mode is the most frequently occurring value in a distribution.

Range defines the spread, or variability, of a sample.[ 7 ] It is described by the minimum and maximum values of the variables. If we rank the data and, after ranking, group the observations into percentiles, we get better information about the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe the 25th, 50th, 75th or any other percentile. The median is the 50th percentile. The interquartile range covers the middle 50% of the observations about the median (25th-75th percentile).

Variance[ 7 ] is a measure of how spread out the distribution is. It indicates how closely the individual observations cluster about the mean value. The variance of a population is defined by the following formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

where $\sigma^2$ is the population variance, $\bar{X}$ is the population mean, $X_i$ is the $i$th element from the population and $N$ is the number of elements in the population. The variance of a sample is defined by a slightly different formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the $i$th element from the sample and $n$ is the number of elements in the sample. The formula for the variance of a population has $N$ as the denominator, whereas the sample formula uses $n - 1$. The expression $n - 1$ is known as the degrees of freedom and is one less than the number of observations. Each observation is free to vary, except the last one, which is fixed once the sample mean is known. The variance is measured in squared units. To make the interpretation of the data simple and to retain the basic unit of observation, the square root of variance is used. The square root of the variance is the standard deviation (SD).[ 8 ] The SD of a population is defined by the following formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

where $\sigma$ is the population SD, $\bar{X}$ is the population mean, $X_i$ is the $i$th element from the population and $N$ is the number of elements in the population. The SD of a sample is defined by a slightly different formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

where $s$ is the sample SD, $\bar{x}$ is the sample mean, $x_i$ is the $i$th element from the sample and $n$ is the number of elements in the sample. An example of the calculation of variance and SD is illustrated in Table 2.

Table 2: Example of mean, variance, standard deviation
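The measures described in this section can be computed with Python's standard `statistics` module. The sketch below uses made-up numbers (the ICU stays and the `values` list are hypothetical, chosen only to show the effect of an outlier and the difference between the population and sample formulas).

```python
import statistics

# Hypothetical ICU stays in days; the last value mimics a long-stay outlier.
stay_days = [3, 4, 4, 5, 6, 7, 150]

mean_stay = statistics.mean(stay_days)      # pulled upwards by the outlier
median_stay = statistics.median(stay_days)  # middle value of the ranked data
mode_stay = statistics.mode(stay_days)      # most frequently occurring value

# Hypothetical observations for the measures of dispersion.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(values)  # denominator N
sample_variance = statistics.variance(values)       # denominator n - 1
population_sd = statistics.pstdev(values)           # square root of population variance
sample_sd = statistics.stdev(values)                # square root of sample variance

print(mean_stay, median_stay, mode_stay)
print(population_variance, sample_variance, population_sd, sample_sd)
```

In this made-up sample the mean stay is far larger than the median, which is the outlier effect described above.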

## Normal distribution or Gaussian distribution

Most biological variables cluster around a central value, with symmetrical positive and negative deviations about this point.[ 1 ] The standard normal distribution curve is symmetrical and bell-shaped. In a normal distribution curve, about 68% of the scores are within 1 SD of the mean, around 95% are within 2 SDs and about 99.7% are within 3 SDs of the mean [ Figure 2 ].

Figure 2: Normal distribution curve
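The coverage percentages quoted for the normal distribution can be verified from the error function: for a standard normal variable Z, P(|Z| < k) = erf(k/√2). A minimal sketch (the function name `coverage_within` is illustrative):

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```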

## Skewed distribution

A skewed distribution is asymmetric about its mean. In a negatively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the right of the figure, leading to a longer left tail. In a positively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the left of the figure, leading to a longer right tail.

Figure 3: Curves showing negatively skewed and positively skewed distribution
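The direction of skew can also be quantified. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); this particular coefficient is one common choice and is an assumption here, since the article does not name a formula. The data are hypothetical.

```python
def sample_skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (positive means a longer right tail)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in data) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data with a long right tail, and its mirror image.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [11 - v for v in right_tailed]

skew_right = sample_skewness(right_tailed)  # positive
skew_left = sample_skewness(left_tailed)    # negative, same magnitude

print(skew_right, skew_left)
```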

## Inferential statistics

In inferential statistics, data are analysed from a sample to make inferences in the larger collection of the population. The purpose is to answer or test the hypotheses. A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects.

Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).

In inferential statistics, the term ‘null hypothesis’ (H0, read ‘H-naught’ or ‘H-null’) denotes that there is no relationship (difference) between the population variables in question.[ 9 ]

The alternative hypothesis (H1 or Ha) denotes that a relationship (difference) between the variables is expected to be true.[ 9 ]

The P value (or the calculated probability) is the probability of obtaining a result at least as extreme as the one observed, by chance alone, if the null hypothesis is true. The P value is a number between 0 and 1 and is interpreted by researchers in deciding whether to reject or retain the null hypothesis [ Table 3 ].

Table 3: P values with interpretation

If the P value is less than the arbitrarily chosen value (known as α or the significance level), the null hypothesis (H0) is rejected [ Table 4 ]. However, if the null hypothesis (H0) is incorrectly rejected, this is known as a Type I error.[ 11 ] Further details regarding alpha error, beta error and sample size calculation and factors influencing them are dealt with in another section of this issue by Das S et al.[ 12 ]

Table 4: Illustration for null hypothesis

## PARAMETRIC AND NON-PARAMETRIC TESTS

Numerical data (quantitative variables) that are normally distributed are analysed with parametric tests.[ 13 ]

Two most basic prerequisites for parametric statistical analysis are:

- The assumption of normality which specifies that the means of the sample group are normally distributed
- The assumption of equal variance which specifies that the variances of the samples and of their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is unknown due to the small sample size, non-parametric[ 14 ] statistical techniques are used. Non-parametric tests are used to analyse ordinal and categorical data.

## Parametric tests

The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal distribution of the underlying population. The samples have the same variance (homogeneity of variances). The samples are randomly drawn from the population, and the observations within a group are independent of each other. The commonly used parametric tests are the Student's t -test, analysis of variance (ANOVA) and repeated measures ANOVA.

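Student's t-test

Student's t-test is used to test the null hypothesis that there is no difference between the means of the two groups. It is used in three circumstances:

- To test if a sample mean (as an estimate of a population mean) differs significantly from a given population mean (the one-sample t-test).

The formula for the one-sample t-test is:

$$t = \frac{\bar{X} - \mu}{SE}$$

where $\bar{X}$ = sample mean, $\mu$ = population mean and SE = standard error of mean.

- To test if the population means estimated by two independent samples differ significantly (the unpaired t-test).

The formula for the unpaired t-test is:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$$

where $\bar{X}_1 - \bar{X}_2$ is the difference between the means of the two groups and SE denotes the standard error of the difference.

- To test if the population means estimated by two dependent samples differ significantly (the paired t-test). A usual setting for the paired t-test is when measurements are made on the same subjects before and after a treatment.

The formula for the paired t-test is:

$$t = \frac{\bar{d}}{SE}$$

where $\bar{d}$ is the mean difference and SE denotes the standard error of this difference.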
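The t statistics for the one-sample, unpaired and paired cases can be sketched with the standard library alone. The function names and data below are hypothetical, and the unpaired version assumes equal variances (a pooled standard error), which is one common convention rather than something the article specifies. Turning a t statistic into a P value additionally requires the t distribution, which is not covered here.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """t = (sample mean - population mean) / standard error of the mean."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - population_mean) / se


def unpaired_t(group1, group2):
    """t for two independent samples, using a pooled variance (assumes equal variances)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * statistics.variance(group1) + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(group1) - statistics.mean(group2)) / se


def paired_t(before, after):
    """t for paired samples: a one-sample test on the differences against 0."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, 0.0)


# Hypothetical measurements.
t_one = one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], 4)
t_unpaired = unpaired_t([1, 2, 3], [3, 4, 5])
t_paired = paired_t(before=[120, 130, 125, 140], after=[115, 128, 120, 130])

print(t_one, t_unpaired, t_paired)
```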

The group variances can be compared using the F-test. The F-test is the ratio of variances (var 1/var 2). If F differs significantly from 1.0, then it is concluded that the group variances differ significantly.

Analysis of variance

The Student's t -test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant difference between the means of two or more groups.

In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.

However, the between-group (or effect variance) is the result of our treatment. These two estimates of variances are compared using the F-test.

A simplified formula for the F statistic is:

$$F = \frac{MS_b}{MS_w}$$

where $MS_b$ is the mean squares between the groups and $MS_w$ is the mean squares within groups.
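As a sketch of how the F statistic is assembled from the two variance estimates, the function below computes the between-group and within-group mean squares for a one-way design. The function name and the three small groups are hypothetical.

```python
def one_way_anova_f(groups):
    """Return F = MS_between / MS_within for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_values) - len(groups))
    return ms_between / ms_within


# Hypothetical data: three groups of three observations each.
f_statistic = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_statistic)
```

A large F means the group means differ by more than the random within-group variation would suggest; judging significance still requires comparing F with the F distribution for the relevant degrees of freedom.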

Repeated measures analysis of variance

As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measures ANOVA is used when all members of a sample are measured under different conditions or at different points in time.

As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should be used.

## Non-parametric tests

When the assumptions of normality are not met and the sample means are not normally distributed, parametric tests can lead to erroneous results. Non-parametric tests (distribution-free tests) are used in such situations as they do not require the normality assumption.[ 15 ] Non-parametric tests may fail to detect a significant difference when compared with a parametric test. That is, they usually have less power.

As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is accepted or rejected. The types of non-parametric analysis techniques and the corresponding parametric analysis techniques are delineated in Table 5 .

Table 5: Analogue of parametric and non-parametric tests

Median test for one sample: The sign test and Wilcoxon's signed rank test

The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These tests examine whether one instance of sample data is greater or smaller than the median reference value.

This test examines the hypothesis about the median θ0 of a population. It tests the null hypothesis H0: θ = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is marked with a + sign. If the observed value is smaller than the reference value, it is marked with a − sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.

If the null hypothesis is true, there will be an equal number of + signs and − signs.

The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is useful when it is difficult to measure the values.
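The sign test procedure described above can be sketched as follows. Under the null hypothesis the number of + signs follows a Binomial(n, 0.5) distribution, so an exact two-sided P value can be computed by doubling the smaller tail; that choice of two-sided P value, the function name and the data are assumptions made for illustration.

```python
from math import comb


def sign_test(values, reference_median):
    """Return (plus count, minus count, exact two-sided P value)."""
    plus = sum(1 for v in values if v > reference_median)
    minus = sum(1 for v in values if v < reference_median)
    n = plus + minus  # observations equal to the reference are dropped
    k = min(plus, minus)
    # Probability of k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return plus, minus, p_value


# Hypothetical observations tested against a reference median of 10.
plus, minus, p_value = sign_test([12, 15, 9, 14, 16, 10, 13, 17, 18, 10], 10)
print(plus, minus, p_value)
```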

Wilcoxon's signed rank test

There is a major limitation of sign test as we lose the quantitative information of the given data and merely use the + or – signs. Wilcoxon's signed rank test not only examines the observed values in comparison with θ0 but also takes into consideration the relative sizes, adding more statistical power to the test. As in the sign test, if there is an observed value that is equal to the reference value θ0, this observed value is eliminated from the sample.

Wilcoxon's rank sum test ranks all data points in order, calculates the rank sum of each sample and compares the difference in the rank sums.

Mann-Whitney test

It is used to test the null hypothesis that two samples have the same median or, alternatively, whether observations in one sample tend to be larger than observations in the other.

Mann–Whitney test compares all data (xi) belonging to the X group and all data (yi) belonging to the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null hypothesis states that P (xi > yi) = P (xi < yi) =1/2 while the alternative hypothesis states that P (xi > yi) ≠1/2.
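The pairwise comparison behind the Mann-Whitney test can be written directly. This sketch counts, over all (xi, yi) pairs, how often xi exceeds yi (ties counted as one half, a common convention assumed here) and divides by the number of pairs to estimate P (xi > yi). The function name and data are hypothetical, and no P value is computed.

```python
def mann_whitney_u(x, y):
    """Return (U statistic for x, estimated probability that an x exceeds a y)."""
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5  # ties contribute one half
    return u, u / (len(x) * len(y))


# Hypothetical samples.
u_stat, prob_x_greater = mann_whitney_u([8, 9, 11], [1, 2, 10])
print(u_stat, prob_x_greater)
```

An estimated probability near 1/2 is what the null hypothesis predicts; values far from 1/2 suggest that observations in one group tend to be larger.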

Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves.
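The KS statistic described above, the maximum absolute difference between the two empirical cumulative curves, can be sketched in a few lines. The function name and samples are hypothetical; computing a P value from the statistic is not shown.

```python
def ks_statistic(sample1, sample2):
    """Maximum absolute difference between the two empirical CDFs."""

    def ecdf(sample, t):
        # Fraction of observations less than or equal to t.
        return sum(1 for v in sample if v <= t) / len(sample)

    # The empirical CDFs only change at observed values, so checking those is enough.
    points = sorted(set(sample1) | set(sample2))
    return max(abs(ecdf(sample1, t) - ecdf(sample2, t)) for t in points)


# Hypothetical samples: the second is shifted to the right.
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
print(d_shifted, d_same)
```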

Kruskal-Wallis test

The Kruskal–Wallis test is a non-parametric test to analyse the variance.[ 14 ] It analyses if there is any difference in the median values of three or more independent samples. The data values are ranked in an increasing order, and the rank sums calculated followed by calculation of the test statistic.

Jonckheere test

In contrast to the Kruskal–Wallis test, the Jonckheere test assumes an a priori ordering, which gives it more statistical power than the Kruskal–Wallis test.[ 14 ]

Friedman test

The Friedman test is a non-parametric test for testing the difference between several related samples. The Friedman test is an alternative for repeated measures ANOVAs which is used when the same parameter has been measured under different conditions on the same subjects.[ 13 ]

## Tests to analyse the categorical data

Chi-square test, Fisher's exact test and McNemar's test are used to analyse the categorical or nominal variables. The Chi-square test compares the frequencies and tests whether the observed data differ significantly from that of the expected data if there were no differences between groups (i.e., the null hypothesis). It is calculated as the sum of the squared differences between the observed ( O ) and the expected ( E ) data (or the deviation, d ), each divided by the expected data, by the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

A Yates correction factor is used when the sample size is small. Fisher's exact test is used to determine if there are non-random associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to a 2 × 2 table with paired-dependent samples. It is used to determine whether the row and column frequencies are equal (that is, whether there is ‘marginal homogeneity’). The null hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a multivariate test as it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affects the primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.
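As a sketch of the Chi-square calculation described in this section, the function below derives the expected counts of a contingency table from its row and column totals and sums (O − E)²/E over all cells. The function name and the 2 × 2 table are hypothetical, and no Yates correction is applied.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if rows and columns were independent.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2


# Hypothetical 2 x 2 table: rows are groups, columns are outcomes.
chi2_value = chi_square_statistic([[10, 20], [20, 10]])
print(chi2_value)
```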

## SOFTWARES AVAILABLE FOR STATISTICS, SAMPLE SIZE CALCULATION AND POWER ANALYSIS

Numerous statistical software systems are currently available. The commonly used software systems are Statistical Package for the Social Sciences (SPSS, manufactured by IBM Corporation), Statistical Analysis System (SAS, developed by SAS Institute, North Carolina, United States of America), R (designed by Ross Ihaka and Robert Gentleman of the R core team), Minitab (developed by Minitab Inc), Stata (developed by StataCorp) and MS Excel (developed by Microsoft).

There are a number of web resources which are related to statistical power analyses. A few are:

- StatPages.net – provides links to a number of online power calculators
- G-Power – provides a downloadable power analysis program that runs under DOS
- Power analysis for ANOVA designs an interactive site that calculates power or sample size needed to attain a given power for one effect in a factorial ANOVA design
- SPSS makes a program called SamplePower. It gives an output of a complete report on the computer screen which can be cut and paste into another document.

It is important that a researcher knows the concepts of the basic statistical methods used for conduct of a research study. This will help to conduct an appropriately well-designed study leading to valid and reliable results. Inappropriate use of statistical techniques may lead to faulty conclusions, inducing errors and undermining the significance of the article. Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research which can be utilised for formulating the evidence-based guidelines.

## Financial support and sponsorship

## Conflicts of interest

There are no conflicts of interest.

## 5 research proposal titles that involve bivariate data

Bivariate data: examples, definition and analysis.

On this page:

- What is bivariate data? Definition.
- Examples of bivariate data: with table.
- Bivariate data analysis examples: including linear regression analysis, correlation (relationship), distribution, and scatter plot.

Let’s define bivariate data:

We have bivariate data when we are studying two variables. These variables change and are compared to find the relationships between them.

For example, if you are studying a group of students to find out their average math score and their age, you have two variables (math score and age).

If you are studying only one variable, for example only the math score for these students, then you have univariate data.

When we are examining bivariate data, the two variables could depend on each other. One variable could influence another. In this case, we say that the bivariate data has:

- an independent variable and
- a dependent variable.

A classic example of independent and dependent variables is the age and height of babies and toddlers. When age increases, height also increases.

Let’s move on to some real-life and practical bivariate data examples.

Look at the following bivariate data table. It represents the age and average height of a group of babies and kids.

Commonly, bivariate data is stored in a table with two columns.

There are 2 types of relationship between the dependent and independent variable:

- A positive relationship (also called positive correlation) – that means if the independent variable increases, then the dependent variable would also increase and vice versa. The above example about the kids’ age and height is a classical positive relationship.
- A negative relationship (negative correlation) – when the independent variable increases, the dependent variable decreases, and vice versa. Example: when the car age increases, the car price decreases.

So, we use bivariate data to compare two sets of data and to discover any relationships between them.

Bivariate Data Analysis

Bivariate analysis allows you to study the relationship between 2 variables and has many practical uses in real life. It aims to find out whether there is an association between the variables and how strong it is.

Bivariate analysis also allows you to test a hypothesis of association, although an association on its own does not establish causality. It also helps you to predict the values of a dependent variable based on changes in an independent variable.

Let’s see how bivariate data works with linear regression models.

Let’s say you have to study the relationship between the age and the systolic blood pressure in a company. You have a sample of 10 workers aged thirty to fifty-five years. The results are presented in the following bivariate data table.

Now, we need to display this table graphically to be able to make some conclusions.

Bivariate data is most often displayed using a scatter plot. This is a plot on a grid of y (y-axis) against x (x-axis) that shows the behavior of the given data sets.

A scatter plot is one of the popular types of graphs that give us a much clearer picture of a possible relationship between the variables.

Let’s build our Scatter Plot based on the table above:

The above scatter plot illustrates that the values seem to group around a straight line, i.e. it shows that there is a possible linear relationship between age and systolic blood pressure.

You can create scatter plots very easily with a variety of free graphing software available online.

What does this graph show us?

The graph suggests that there is a relationship between age and blood pressure and, moreover, that this relationship is positive (i.e. we have a positive correlation). The older the worker, the higher the systolic blood pressure tends to be.

The line that you see in the graph is called the “line of best fit” (or the regression line). The line of best fit summarizes the trend in the data and helps you judge whether the two variables are correlated.

Furthermore, how closely the points cluster around the line of best fit illustrates the strength of the correlation.

Let’s investigate further.

We observed that in our example there is a positive linear relationship between age and blood pressure. However, how strong is that relationship?

This is where the correlation coefficient comes in.

The correlation coefficient (R) is a numerical value between -1 and 1. It indicates the strength of the linear relationship between two given variables. For a linear relationship, the coefficient used is Pearson’s correlation coefficient.

When the correlation coefficient is close to 1, it shows a strong positive relationship. When it is close to -1, there is a strong negative relationship. A value of 0 tells us that there is no linear relationship.

We need to calculate our correlation coefficient between age and blood pressure. There is a long formula (for Pearson’s correlation coefficient) for this but you don’t need to remember it.

All you need to do is use a free or premium calculator such as those on www.socscistatistics.com. When we entered our bivariate data into this calculator, we got the following result:

The value of the correlation coefficient (R) is 0.8435. It shows a strong positive correlation.
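If you prefer to compute the coefficient yourself, Pearson’s formula is short enough to write out. The sketch below is a plain implementation of the standard formula; the `ages` and `pressures` lists are made-up values for illustration (they are not the workers’ data from the table above, so the result will differ from 0.8435).

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages and systolic blood pressures.
ages = [30, 35, 40, 45, 50, 55]
pressures = [120, 128, 135, 140, 152, 160]

r = pearson_r(ages, pressures)
print(round(r, 4))
```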

Now, let’s calculate the equation of the regression line (the best-fit line) to find out the slope of the line.

For that purpose, let’s recall the simple linear regression equation:

$$Y = \beta_0 + \beta_1 X$$

- $X$ – the value of the independent variable
- $Y$ – the value of the dependent variable
- $\beta_0$ – a constant (the value of Y when X = 0)
- $\beta_1$ – the regression coefficient (how much Y changes for each unit change in X)

Again, we will use the same online software ( socscistatistics.com ) to calculate the linear regression equation. The result is:

Y = 1.612*X + 74.35
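Calculators of this kind typically fit the line by ordinary least squares; the sketch below assumes that method. The function names `fit_line` and `predict_blood_pressure` are illustrative, and the small data set used to check the fit is made up (it is not the workers’ table). The second function simply applies the equation reported above.

```python
def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_blood_pressure(age):
    """Apply the reported equation Y = 1.612*X + 74.35."""
    return 1.612 * age + 74.35


# Hypothetical points lying exactly on y = 2x + 1, to check the fit.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)

# Predicted systolic blood pressure for a 40-year-old worker under the reported model.
print(predict_blood_pressure(40))
```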

You can find more on the linear regression equation and its explanation in our post on linear regression examples.

So, from the above bivariate data analysis example of the company’s workers, we can say that blood pressure tended to increase with age. This suggests that age is associated with blood pressure in this sample.

Other popular examples of positive bivariate correlation are: temperature and ice cream sales, alcohol consumption and cholesterol levels, weights and heights of college students, etc.

Let’s see bivariate data analysis examples for a negative correlation.

The bivariate data table below shows the number of student absences and their final grades in a class.

It is quite obvious that these two variables have a negative correlation between them.

When the number of student absences increases, the final grades decrease.

Now, let’s plot the bivariate data from the table on a scatter plot and create the best-fit line:

Note how the regression line looks – it has a downward slope.

This downward slope indicates there is a negative linear association.

We can calculate the correlation coefficient and linear regression equation. Here are the results:

- The value of the correlation coefficient (R) is -0.9061. This is a strong negative correlation.
- The linear regression equation is Y = -3.971*X + 90.71.

We can conclude that the fewer lessons the students skip, the higher their grades tend to be.

Conclusion:

The above bivariate data examples aim to help you better understand how bivariate analysis works.

Analyzing two variables is a common task in inferential statistics. Many business and scientific investigations involve only two continuous variables.

The main questions that bivariate analysis has to answer are:

- Is there a correlation between 2 given variables?
- Is the relationship positive or negative?
- What is the degree of the correlation? Is it strong or weak?

If you need other practical examples in the area of management and analysis, our posts Venn diagram examples and decision tree examples might be helpful for you.

## About The Author

## Silvia Valcheva

Silvia Valcheva is a digital marketer with over a decade of experience creating content for the tech industry. She has a strong passion for writing about emerging software and technologies such as big data, AI (Artificial Intelligence), IoT (Internet of Things), process automation, etc.


## 5 Examples of Bivariate Data in Real Life

Bivariate data refers to a dataset that contains exactly two variables.

This type of data occurs all the time in real-world situations and we typically use the following methods to analyze this type of data:

- Scatterplots
- Correlation Coefficients
- Simple Linear Regression

The following examples show different scenarios where bivariate data appears in real life.

## Example 1: Business

Businesses often collect bivariate data about total money spent on advertising and total revenue.

For example, a business may collect the following data for 12 consecutive sales quarters:

This is an example of bivariate data because it contains information on exactly two variables: advertising spend and total revenue.

The business may decide to fit a simple linear regression model to this dataset and find the following fitted model:

Total Revenue = 14,942.75 + 2.70*(Advertising Spend)

This tells the business that for each additional dollar spent on advertising, total revenue increases by an average of $2.70.
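To make the interpretation of the slope concrete, here is a minimal sketch that applies the fitted model quoted above. The coefficients come from the text; the function name `predict_revenue` and the spend levels are hypothetical, chosen only for illustration.

```python
# Coefficients taken from the fitted model quoted above.
INTERCEPT = 14_942.75
SLOPE = 2.70


def predict_revenue(ad_spend):
    """Predicted total revenue for a given advertising spend (in dollars)."""
    return INTERCEPT + SLOPE * ad_spend


# Hypothetical spend levels, chosen only to show the effect of the slope.
low = predict_revenue(1_000)
high = predict_revenue(1_001)

print(f"Predicted revenue at $1,000 spend: {low:,.2f}")
print(f"Change for one extra dollar: {high - low:.2f}")
```

The difference between the two predictions equals the slope, which is what "an average increase of $2.70 per additional dollar" means. Predictions far outside the range of the original data should be treated with caution.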

## Example 2: Medical

Medical researchers often collect bivariate data to gain a better understanding of the relationship between variables related to health.

For example, a researcher may collect the following data about age and resting heart rate for 15 individuals:

The researcher may then decide to calculate the correlation between the two variables and find it to be 0.812.

This indicates that there is a strong positive correlation between the two variables. That is, as age increases resting heart rate tends to increase in a predictable manner as well.


## Example 3: Academics

Researchers often collect bivariate data to understand what variables affect the performance of university students.

For example, a researcher may collect data on the number of hours studied per week and the corresponding GPA for students in a certain class:

She may then create a simple scatterplot to visualize the relationship between these two variables:

Clearly there is a positive association between the two variables: As the number of hours studied per week increases, the GPA of the student tends to increase as well.

## Example 4: Economics

Economists often collect bivariate data to understand the relationship between two socioeconomic variables.

For example, an economist may collect data on the total years of schooling and total annual income among individuals in a certain city:

He may then decide to fit the following simple linear regression model:

Annual Income = -45,353 + 7,120*(Years of Schooling)

This tells the economist that for each additional year of schooling, annual income increases by $7,120 on average.

## Example 5: Biology

Biologists often collect bivariate data to understand how two variables are related among plants or animals.

For example, a biologist may collect data on total rainfall and total number of plants in different regions:

The biologist may then decide to calculate the correlation between the two variables and find it to be 0.926.

This indicates that there is a strong positive correlation between the two variables.

That is, higher rainfall is closely associated with an increased number of plants in a region.

## Additional Resources

The following tutorials provide additional information about bivariate data and how to analyze it.

- Introduction to Bivariate Analysis
- Introduction to Univariate Analysis
- Introduction to the Pearson Correlation Coefficient
- Introduction to Simple Linear Regression

## Featured Posts

Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike. My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

- J Thorac Dis
- v.10(2); 2018 Feb

## How to describe bivariate data

Alessandro Bertani

1 Department for the Treatment and Study of Cardiothoracic Diseases and Cardiothoracic Transplantation, Division of Thoracic Surgery and Lung Transplantation, IRCCS ISMETT – UPMC, Palermo, Italy

Gioacchino Di Paola

2 Office of Research, IRCCS ISMETT, Palermo, Italy

Emanuele Russo

Fabio Tuzzolino

The role of scientific research is not limited to the description and analysis of single phenomena occurring independently of each other (univariate analysis). Even though univariate analysis has a pivotal role in statistical analysis, and is useful to find errors inside datasets, to become familiar with and aggregate data, and to describe and gather basic information on simple phenomena, it has a limited cognitive impact. Therefore, research also, and mostly, focuses on the relationship that single phenomena may have with each other. More specifically, bivariate analysis explores how the dependent (“outcome”) variable depends on or is explained by the independent (“explanatory”) variable (asymmetrical analysis), or it explores the association between two variables without any cause and effect relationship (symmetrical analysis). In this paper we will introduce the concepts of “causation” and of dependent (“outcome”) and independent (“explanatory”) variables. Also, some statistical techniques used for the analysis of the relationship between the two variables will be presented, based on the type of variable (categorical or continuous).

## Association between explanatory and outcome variables, causation and covariation

One of the main goals of statistical analysis is to study the association between variables.

There is an association between two variables if one variable tends to display specific values when the other one changes. For example, let’s take into account a variable called “Response to treatment” (displaying the values: “Worsened/Stable/Improved”) and a variable called “Treatment” (displaying the values “Treatment A” and “Treatment B”). If treatment B is a placebo, it is likely that more of the individuals receiving treatment A will improve than of those receiving treatment B. In this case, there is an association between the variables “Response to treatment” and “Treatment” because the proportion of individuals who respond to treatment changes with the type of treatment.

Usually, when an association between two variables is analyzed (so-called “bivariate analysis”), one variable is defined as the “Outcome variable” and its different values are compared based on the different values displayed by the other variable, which is defined as the “Explanatory variable”. The values displayed by the explanatory variable define a subset of groups that will be compared; differences among different groups will be assessed based on the values displayed by the outcome variable.

Bivariate analysis, as outlined above, allows an assessment of how the value of the outcome variable depends on (or is explained by) the values displayed by the explanatory variable (1). For example, if we try to compare gender and income, the latter is the outcome variable while the former is the explanatory variable; income may, in fact, be influenced by gender, but gender may not depend on income.

Two types of bivariate analysis may be defined, each with definite features and properties (2):

- Asymmetrical analysis:
  - Describes how the outcome variable changes when the independent or explanatory variable changes. The bond between the two variables is unidirectional or asymmetrical;
  - Logic dependence: there is a cause and effect relationship between two or more variables;
  - Logic independence: there isn’t any cause and effect relationship between the variables that are considered.
- Symmetrical analysis:
  - Describes the interaction between the values displayed by two variables (bidirectional or symmetrical bond);
  - A relationship of dependence is not possible;
  - A dependent character may not be found.

A causal explanation is one of the key goals of scientific research. When we define a cause and effect relationship, we are referring to the existence of a bond between two events, so that the occurrence of one specific event is the direct consequence of the occurrence of another event (or a group of events). A simple empirical relationship between two events does not necessarily define the concept of causation. In fact, “Co-variation” does not mean “Causation”.

Covariation (correlation or association) means that we are just looking at the fact that two variables called X and Y present concurrent variations: when one changes the other changes too. Causation means that the hypothesis that the variation of X is determining a variation of Y is true.

Attributing a causal bond to any relationship between two variables is actually a weak attribution. Reality is, per se, a “multi-variate” world, and every phenomenon is related to an infinity of other phenomena that interact and link with each other. In fact, multivariate analysis helps find a better approximation of reality and therefore represents the ultimate goal of data analysis. Nevertheless, univariate analysis and bivariate analysis are basic and necessary steps before proceeding to more complex multivariate analysis.

Unfortunately, there is no perfect statistical methodology available to define the true direction of causality. Other important available tools are the researchers’ experience and the ability to appropriately recognize the nature of the variables and the different types of studies, from cohort studies to randomized controlled studies and systematic reviews.

Therefore, bivariate statistics are used to analyze two variables simultaneously. Many studies are performed to analyze how the value of an outcome variable may change based on the modifications of an explanatory variable. The methodology used in these cases depends on the type of variable that is being considered:

- Qualitative nominal variables (in these cases we will be dealing with “Association”);
- Qualitative ordinal variables (in these cases we will be dealing with “Co-graduation”);
- Quantitative variables (in these cases we will be dealing with “Correlation”).

## Qualitative bivariate data

Given two categorical variables, a contingency table shows how many observations are recorded for all the different combinations of the values of each variable. It lets us observe how the values of a given outcome variable are contingent on the categories of the explanatory variable. Using this model, a first synthetic analysis may be given by the marginal, conditional or conjugate (joint) distribution (3-5). The marginal distributions correspond to the totals of the rows and of the columns of the table; conditional distributions correspond to the percentages of the outcome variable calculated within the categories of the explanatory variable; the conjugate distribution is given by a single group of percentages for all the cells of the table, each cell count being divided by the overall size of the sample (Table 1).

When it is possible to distinguish between an outcome and an explanatory variable, conditional distributions are much more informative than conjugate distributions. Using a contingency table to analyze the relationship between two categorical variables, we must distinguish between row percentages and column percentages. This choice is performed based on the position that a given dependent variable is holding. The column percentage is chosen if we want to analyze the influence that the variable placed in column has on the variable in the row; the row percentage is chosen when we want to assess the influence that the row variable has on the variable in the column ( Table 2 ).

The principle of assigning a percentage to the independent variable is our best choice when our aim is to study the causal relationship between the independent and the dependent variable. In other situations, it might be useful to calculate the percentages in the opposite directions or in both ways. This last approach is usually adopted when it is not clearly possible to distinguish between a dependent and an independent variable ( Table 1 ).
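The three distributions described above can be sketched in a few lines of code. The counts below are hypothetical and reuse the “Treatment” and “Response to treatment” variables from the earlier example; the explanatory variable is placed in the rows, so the conditional distributions are row percentages. What the text calls the conjugate distribution is more commonly known as the joint distribution.

```python
# Hypothetical counts: rows are the explanatory variable (treatment),
# columns are the outcome variable (response to treatment).
outcomes = ["Worsened", "Stable", "Improved"]
counts = {
    "Treatment A": [5, 15, 30],
    "Treatment B": [15, 25, 10],
}

n = sum(sum(row) for row in counts.values())

# Marginal distributions: row totals and column totals.
row_totals = {t: sum(row) for t, row in counts.items()}
col_totals = [sum(row[j] for row in counts.values()) for j in range(len(outcomes))]

# Conditional distributions: percentages of the outcome within each
# category of the explanatory variable (row percentages here).
conditional = {
    t: [100 * c / row_totals[t] for c in row] for t, row in counts.items()
}

# Joint ("conjugate") distribution: each cell as a percentage of the whole sample.
joint = {t: [100 * c / n for c in row] for t, row in counts.items()}

for t in counts:
    cond = ", ".join(f"{o}: {p:.0f}%" for o, p in zip(outcomes, conditional[t]))
    print(f"{t} (n={row_totals[t]}): {cond}")
```

Comparing the conditional rows side by side is what makes a difference between the two treatments visible; the joint percentages alone would not show it as directly.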

There are statistical techniques that are able to measure the strength of the relationship between the variables of the study, and these techniques may contribute to reduce the subjectivity of the analysis. As previously mentioned, we may distinguish measures of association for nominal variables and co-graduation measures for ordinal variables. Among these, the most common are:

- Association: chi-squared test (χ²), Fisher’s exact test;
- Co-graduation: Kendall’s tau-c (τc), Kruskal’s gamma (γ), Somers’ D.

Specific interest is provided by the “2×2” tables, which are tables where both variables are dichotomous. In this type of tables we may calculate other measures of association, for example:

- d = difference between proportions;
- OR (ψ) = odds ratio;
- RR = relative risk.

All these measures are used after verification of all the basic assumptions of the standard practice of calculation, and are based on the type of study that we need to perform (retrospective/prospective).
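As a minimal sketch of how the three measures listed above are computed from a 2×2 table, consider the following. The counts and the labels “exposed” and “unexposed” are hypothetical and serve only as an illustration.

```python
# Hypothetical 2x2 table: exposure (rows) by outcome (columns).
#            (event, no event)
exposed = (30, 70)
unexposed = (10, 90)

# Proportion of individuals with the event in each group.
risk_exposed = exposed[0] / sum(exposed)
risk_unexposed = unexposed[0] / sum(unexposed)

# d: difference between proportions
d = risk_exposed - risk_unexposed

# RR: relative risk (ratio of the two proportions)
rr = risk_exposed / risk_unexposed

# OR: odds ratio (ratio of the odds of the event in each group)
odds_exposed = exposed[0] / exposed[1]
odds_unexposed = unexposed[0] / unexposed[1]
odds_ratio = odds_exposed / odds_unexposed

print(f"d = {d:.2f}, RR = {rr:.2f}, OR = {odds_ratio:.2f}")
```

Which of these measures is appropriate depends, as noted above, on the design of the study.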

Sometimes it may be useful to graphically present the relationship between two categorical variables. This can be done with the same tool used to describe the frequency distribution of a single variable: the bar chart (Figure 1).

Figure 1. Example of a bar chart.

## Quantitative bivariate data

In the case of two quantitative variables, the most relevant techniques for bivariate analysis are correlation analysis and simple linear regression.

Using the latter methodology, it is possible to understand how the independent variable may influence the dependent variable or, more specifically, it is possible to assess the intensity of the effect of the independent variable on the dependent variable ( 6 ).

The first step in the construction of a model for bivariate analysis with quantitative variables is to display a graphical representation of the relationship by using a scatterplot (or dispersion diagram), which shows visually how the two variables co-vary (vary together) in a linear or non-linear fashion (Figure 2). This diagram shows the shape of the relationship but cannot measure the intensity of the causal effect.

Figure 2. Example of a scatterplot.

The second step is measuring the strength of the linear association between the variables, by using correlation analysis. This is expressed by a number between −1 and +1, and it shows whether the values of the two variables tend to increase or decrease simultaneously (positive correlation) or whether one increases while the other decreases (negative correlation) (Figure 3).

Figure 3. Examples of linear correlation.

What is even more interesting is to perform a quantitative assessment of the variation of one of the two variables (chosen as dependent variable) compared to the changes of the second variable (independent variable), using a mathematical equation. This equation, if the linear profile of the relationship is confirmed, is the basis of simple linear regression: a mathematical function describing how the mean value of the outcome variable changes according to the modifications of the explanatory variable.
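A minimal sketch of such an equation, fitted by ordinary least squares, is given below. It uses the age and height table from the beginning of this document (ages converted to months); the function name `fit_line` is hypothetical, and the fit is only meaningful insofar as the relationship is roughly linear over this age range.

```python
# Age (months) and average height (cm) from the table at the start of this document.
age_months = [3, 6, 9, 12, 24, 36, 48, 60]
height_cm = [58.5, 64, 68.5, 74, 81.2, 89.1, 95, 102.5]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(age_months, height_cm)
print(f"mean height = {intercept:.1f} + {slope:.2f} * age_in_months")
```

The slope is the estimated change in mean height (the outcome variable) for each additional month of age (the explanatory variable).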

## Comparing groups with bivariate analysis

The comparison of two populations displaying, for example, a quantitative and a qualitative variable may also be performed using bivariate analysis (7). In this case, it is particularly useful to compare the mean values of the continuous variable across the different categories of the other variable, using a box plot as a preliminary analysis.

Specific bivariate statistical models are available for the cases where a given variable is analyzed according to different categories of a further variable, for example the analysis of variance (ANOVA).
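To give an idea of what a one-way ANOVA computes, the sketch below derives the group means and the F statistic by hand. The three groups and their values are hypothetical, and the example stops at the F statistic: turning it into a p-value requires the F distribution, which is left to a statistics package.

```python
# Hypothetical measurements of a continuous variable in three categories.
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [5, 6, 7],
}


def mean(values):
    return sum(values) / len(values)


all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)
group_means = {name: mean(vals) for name, vals in groups.items()}

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(vals) * (group_means[name] - grand_mean) ** 2
    for name, vals in groups.items()
)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(
    (v - group_means[name]) ** 2 for name, vals in groups.items() for v in vals
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of between-group to within-group mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(group_means, f"F = {f_stat:.2f}")
```

A large F value indicates that the group means differ by more than the variation within the groups would suggest.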

## Take home messages

- Bivariate statistics are used in research in order to analyze two variables simultaneously;
- Real world phenomena such as many topics of scientific research are usually complex and multi-variate. Bivariate analysis is a mandatory step to describe the relationships between the observed variables;
- Many studies have the aim of analyzing how the values of a dependent variable may vary based on the modification of an explanatory variable (asymmetrical analysis);
- Bivariate statistical analysis, and, accordingly, the strength of the relationship between the observed variables, may change based on the type of variable that is observed (qualitative or quantitative).

## Acknowledgements

Conflicts of Interest: The authors have no conflicts of interest to declare.


## 15. Bivariate analysis

Chapter outline.

- What is bivariate data analysis? (5 minute read time)
- Chi-square (4 minute read time)
- Correlations (5 minute read time)
- T-tests (5-minute read time)
- ANOVA (6-minute read time)

Content warning: examples include discussions of anxiety symptoms.

So now we get to the math! Just kidding. Mostly. In this chapter, you are going to learn more about bivariate analysis, or analyzing the relationship between two variables. I don’t expect you to finish this chapter and be able to execute everything you just read about – instead, the big goal here is for you to be able to understand what bivariate analysis is, what kinds of analyses are available, and how you can use them in your research.

Take a deep breath, and let’s look at some numbers!

## 15.1 What is bivariate analysis?

Learning objectives.

Learners will be able to…

- Define bivariate analysis
- Explain when we might use bivariate analysis in social work research

Did you know that ice cream causes shark attacks? It’s true! When ice cream sales go up in the summer, so does the rate of shark attacks. So you’d better put down that ice cream cone, unless you want to make yourself look more delicious to a shark.

Ok, so it’s quite obviously not true that ice cream causes shark attacks. But if you looked at these two variables and how they’re related, you’d notice that during times of the year with high ice cream sales, there are also the most shark attacks. Despite the fact that the conclusion we drew about the relationship was wrong, it’s nonetheless true that these two variables appear related, and researchers figured that out through the use of bivariate analysis. (For a refresher on correlation versus causation, head back to Chapter 8.)

Bivariate analysis consists of a group of statistical techniques that examine the relationship between two variables. We could look at how anti-depressant medications and appetite are related, whether there is a relationship between having a pet and emotional well-being, or if a policy-maker’s level of education is related to how they vote on bills related to environmental issues.

Bivariate analysis forms the foundation of multivariate analysis, which we don’t get to in this book. All you really need to know here is that there are steps beyond bivariate analysis, which you’ve undoubtedly seen in scholarly literature already! But before we can move forward with multivariate analysis, we need to understand whether there are any relationships between our variables that are worth testing.

A study from Kwate, Loh, White, and Saldana (2012) illustrates this point. These researchers were interested in whether the lack of retail stores in predominantly Black neighborhoods in New York City could be attributed to the racial differences of those neighborhoods. Their hypothesis was that race had a significant effect on the presence of retail stores in a neighborhood, and that Black neighborhoods experience “retail redlining” – when a retailer decides not to put a store somewhere because the area is predominantly Black.

The researchers needed to know if the predominant race of a neighborhood’s residents was even related to the number of retail stores. With bivariate analysis, they found that “predominantly Black areas faced greater distances to retail outlets; percent Black was positively associated with distance to nearest store for 65 % (13 out of 20) stores” (p. 640). With this information in hand, the researchers moved on to multivariate analysis to complete their research.

## Statistical significance

Before we dive into analyses, let’s talk about statistical significance. Statistical significance is the extent to which our statistical analysis has produced a result that is likely to represent a real relationship instead of some random occurrence. But just because a relationship isn’t random doesn’t mean it’s useful for drawing a sound conclusion.

We went into detail about statistical significance in Chapter 5. You’ll hopefully remember that there, we laid out some key principles from the American Statistical Association for understanding and using p-values in social science:

- P-values can indicate how incompatible the data are with a specified statistical model. P-values can provide evidence against the null hypothesis or the underlying assumptions of the statistical model the researchers used.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Both are inaccurate, though common, misconceptions about statistical significance.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. More nuance is needed to interpret scientific findings, as a conclusion does not become true or false when it passes from p=0.051 to p=0.049.
- Proper inference requires full reporting and transparency, rather than cherry-picking promising findings or conducting multiple analyses and only reporting those with significant findings. For the authors of this textbook, we believe the best response to this issue is for researchers to make their data openly available to reviewers and the general public and to register their hypotheses in a public database prior to conducting analyses.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. In our culture, to call something significant is to say it is larger or more important, but any effect, no matter how tiny, can produce a small p-value if the study is rigorous enough. Statistical significance is not equivalent to scientific, human, or economic significance.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. (adapted from Wasserstein & Lazar, 2016, p. 131-132). [1]

A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. The word significant can cause people to interpret these differences as strong and important, to the extent that they might even affect someone’s behavior. As we have seen however, these statistically significant differences are actually quite weak—perhaps even “trivial.” The correlation between ice cream sales and shark attacks is statistically significant, but practically speaking, it’s meaningless.
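To see how a very weak relationship can still come out statistically significant, here is a rough sketch. It takes the same tiny correlation and computes an approximate p-value for a small and a very large sample. The correlation value, the sample sizes, and the function name `approx_p_value` are all made up for illustration, and the normal approximation used here is a shortcut rather than the exact test a statistics package would run.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for a correlation r from n observations.

    Uses the usual t statistic for a correlation but approximates the
    t distribution by a standard normal, which is only reasonable for
    fairly large n. This is an illustrative shortcut, not an exact test.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return math.erfc(abs(t) / math.sqrt(2))


weak_r = 0.02  # hypothetical, very weak correlation

p_small_sample = approx_p_value(weak_r, 100)
p_large_sample = approx_p_value(weak_r, 100_000)

print(f"n = 100:     p = {p_small_sample:.3f}")
print(f"n = 100,000: p = {p_large_sample:.2e}")
```

The correlation is identical in both cases; only the sample size changes. With the larger sample the p-value drops far below 0.05, even though a correlation of 0.02 is practically negligible.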

There is debate about acceptable p-values in some disciplines. In medical sciences, a p-value even smaller than 0.05 is often favored, given the stakes of biomedical research. Some researchers in social sciences and economics argue that a higher p-value of up to 0.10 still constitutes strong evidence. Other researchers think that p-values are entirely overemphasized and that there are better measures of statistical significance. At this point in your research career, it’s probably best to stick with 0.05 because you’re learning a lot at once, but it’s important to know that there is some debate about p-values and that you shouldn’t automatically discount relationships with a p-value of 0.06.

## A note about “assumptions”

For certain types of bivariate, and in general for multivariate, analysis, we assume a few things about our data and the way it’s distributed. The characteristics we assume about our data that make it suitable for certain types of statistical tests are called assumptions. For instance, we assume that our data has a normal distribution. While I’m not going to go into detail about these assumptions because it’s beyond the scope of the book, I want to point out that it is important to check these assumptions before your analysis.

Something else that’s important to note is that going through this chapter, the data analyses presented are merely for illustrative purposes – the necessary assumptions have not been checked. So don’t draw any conclusions based on the results shared.

For this chapter, I’m going to use a data set from IPUMS USA, where you can get individual-level, de-identified U.S. Census and American Community Survey data. The data are clean and the data sets are large, so it can be a good place to get data you can use for practice.

## Key Takeaways

- Bivariate analysis is a group of statistical techniques that examine the relationship between two variables.
- You need to conduct bivariate analyses before you can begin to draw conclusions from your data, including in future multivariate analyses.
- Statistical significance and p-values help us understand the extent to which the relationships we see in our analyses are real relationships, and not just random or spurious.
- Find a study from your literature review that uses quantitative analyses. What kind of bivariate analyses did the authors use? You don’t have to understand everything about these analyses yet!
- What do the p -values of their analyses tell you?

## 15.2 Chi-square

- Explain the uses of Chi-square test for independence
- Explain what kind of variables are appropriate for a Chi-square test
- Interpret results of a Chi-square test and draw a conclusion about a hypothesis from the results

The first test we’re going to introduce you to is known as a Chi-square test (sometimes denoted as χ²) and is foundational to analyzing relationships between nominal or ordinal variables. A Chi-square test for independence (Chi-square for short) is a statistical test to determine whether there is a significant relationship between two nominal or ordinal variables. The “test for independence” refers to the null hypothesis of our comparison – that the two variables are independent and have no relationship.

A Chi-square can only be used for the relationship between two nominal or ordinal variables – there are other tests for relationships between other types of variables that we’ll talk about later in this chapter. For instance, you could use a Chi-square to determine whether there is a significant relationship between a person’s self-reported race and whether they have health insurance through their employer. (We will actually take a look at this a little later.)

Chi-square tests the hypothesis that there is a relationship between two categorical variables by comparing the values we actually observed and the values we would expect to occur based on our null hypothesis. The expected value is a calculation based on your data when it’s in a summarized form called a contingency table, which is a visual representation of a cross-tabulation of categorical variables to demonstrate all the possible occurrences of your categories. I know that sounds complex, so let’s look at an example.
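If it helps to see the mechanics first, here is a small sketch of how expected values and the Chi-square statistic are calculated from a contingency table. The counts are made up (they are not from the IPUMS data), and the shortcut used for the p-value only works for a 2×2 table like this one; a statistical program handles larger tables for you.

```python
import math

# Hypothetical observed counts for two categorical variables
# (rows: group 1 / group 2, columns: yes / no).
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count if the variables were independent:
# (row total * column total) / overall total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
# (the Pearson statistic, without any continuity correction).
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1); this is 1 for a 2x2 table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the p-value has this closed form.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(expected, f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.5f}")
```

The further the observed counts are from the expected ones, the larger the statistic and the smaller the p-value.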

Earlier, we talked about looking at the relationship between a person’s race and whether they have health insurance through an employer. Based on 2017 American Community Survey data from IPUMS, this is what a contingency table for these two variables would look like.

So now we know what our observed values for these categories are. Next, let’s think about our expected values. We don’t need to get so far into it as to put actual numbers to it, but we can come up with a hypothesis based on some common knowledge about racial differences in employment. (We’re going to be making some generalizations here, so remember that there can be exceptions.)

## An applied example

Let’s say research shows that people who identify as black, indigenous, and people of color (BIPOC) tend to hold multiple part-time jobs and have a higher unemployment rate in general. Given that, our hypothesis based on this data could be that BIPOC people are less likely to have employer-provided health insurance. Before we can assess a likelihood, we need to know if these two variables are even significantly related. Here’s where our Chi-square test comes in!

I’ve used SPSS to run these tests, so depending on what statistical program you use, your outputs might look a little different.

There are a number of different statistics reported here. What I want you to focus on is the first line, the Pearson Chi-Square, which is the most commonly used statistic for larger samples that have more than two categories each. (The other two lines are alternatives to Pearson that SPSS puts out automatically, but they are appropriate for data that is different from ours, so you can ignore them. You can also ignore the “df” column for now, as it’s a little advanced for what’s in this chapter.)

The last column gives us our statistical significance level, which in this case is 0.00. So what conclusion can we draw here? The significant Chi-square statistic means we can reject the null hypothesis (which is that our two variables are not related). There is likely a relationship between our two variables that is probably not random, meaning that we should further explore the relationship between a person’s race and whether they have employer-provided health insurance. Are there other factors that affect the relationship between these two variables? That seems likely. (One thing to keep in mind is that this is a large data set, which can inflate statistical significance levels. However, for the purposes of our exercises, we’ll ignore that for now.)

What we cannot conclude is that these two variables are causally related. That is, someone’s race doesn’t cause them to have employer-provided health insurance or not. It just appears to be a contributing factor, but we are not accounting for the effect of other variables on the relationship we observe (yet).

- The Chi-square test is designed to test the null hypothesis that our two variables are not related to each other.
- The Chi-square test is only appropriate for nominal and/or ordinal variables.
- A statistically significant Chi-square statistic means we can reject the null hypothesis and assume our two variables are, in fact, related.
- A Chi-square test doesn’t let us draw any conclusions about causality because it does not account for the influence of other variables on the relationship we observe.
- Which two variables would you most like to use in the analysis?
- What about the relationship between these two variables interests you in light of what your literature review has shown so far?

## 15.3 Correlations

- Define correlation and understand how to use it in quantitative analysis
- Explain what kind of variables are appropriate for a correlation
- Interpret a correlation coefficient
- Define the different types of correlation – positive and negative
- Interpret results of a correlation and draw a conclusion about a hypothesis from the results

A correlation is a relationship between two variables in which their values change together. For instance, we might expect education and income to be correlated – as a person’s educational attainment (how much schooling they have completed) goes up, so does their income. What about minutes of exercise each week and blood pressure? We would probably expect those who exercise more to have lower blood pressure than those who don’t. We can test these relationships using correlation analyses. Correlations are appropriate only for two interval/ratio variables.

It’s very important to understand that correlations can tell you about relationships, but not causes – as you’ve probably already heard, correlation is not causation! Go back to our example about shark attacks and ice cream sales from the beginning of the chapter. Clearly, ice cream sales don’t cause shark attacks, but the two are strongly correlated (most likely because both increase in the summer for other reasons). This relationship is an example of a spurious relationship, or a relationship that appears to exist between two variables, but in fact does not and is caused by other factors. We hear about these all the time in the news, and correlation analyses are often misrepresented. As we talked about in Chapter 4 when discussing critical information literacy, your job as a researcher and informed social worker is to make sure people aren’t misstating what these analyses actually mean, especially when they are being used to harm vulnerable populations.

Let’s say we’re looking at the relationship between age and income among indigenous people in the United States. In the data set we’ve been using so far, these folks generally fall into the racial category of American Indian/Alaska native, so we’ll use that category because it’s the best we can do. Using SPSS, this is the output you’d get with these two variables for this group. We’ll also limit the analysis to people age 18 and over since children are unlikely to report an individual income.

Here’s Pearson again, but don’t be confused – this is not the same test as the Chi-square, it just happens to be named after the same person. First, let’s talk about the number next to Pearson Correlation, which is the correlation coefficient. The correlation coefficient is a statistically derived value between -1 and 1 that tells us the magnitude and direction of the relationship between two variables. A statistically significant correlation coefficient like the one in this table (denoted by a p-value of 0.01) means the relationship is unlikely to be the result of random chance.

The magnitude of the relationship is how strong the relationship is and can be determined by the absolute value of the coefficient. In the case of our analysis in the table above, the correlation coefficient is 0.108, which denotes a pretty weak relationship. This means that, among the people in our sample, age and income are only weakly related to each other. (If the correlation coefficient were -0.108, the conclusion about its strength would be the same.)

In general, you can say that a correlation coefficient with an absolute value below 0.5 represents a weak correlation. Between 0.5 and 0.75 represents a moderate correlation, and above 0.75 represents a strong correlation. Although the relationship between age and income in our population is statistically significant, it’s also very weak.

The sign on your correlation coefficient tells you the direction of your relationship. A positive correlation or direct relationship occurs when two variables move together in the same direction – as one increases, so does the other, or, as one decreases, so does the other. The correlation coefficient will be positive, so the correlation we calculated (0.108) is a positive correlation and the two variables have a direct, though very weak, relationship. For instance, in our example about shark attacks and ice cream, the number of both shark attacks and pints of ice cream sold would go up, meaning there is a direct relationship between the two.

A negative correlation or inverse relationship occurs when two variables change in opposite directions – one goes up, the other goes down and vice versa. The correlation coefficient will be negative. For example, if you were studying social media use and found that time spent on social media corresponded to lower scores on self-esteem scales, this would represent an inverse relationship.
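If it helps to see the arithmetic, the Python sketch below computes a Pearson correlation coefficient by hand and then labels its strength and direction using the rough cutoffs given in this section. The data values are made up, and the function names `pearson_r` and `describe` are ours for illustration; they are not part of SPSS or any particular library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How much the two variables vary together around their means.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


def describe(r):
    """Label magnitude (this chapter's rough cutoffs) and direction (sign)."""
    size = abs(r)
    if size < 0.5:
        strength = "weak"
    elif size <= 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    if r > 0:
        direction = "positive (direct)"
    elif r < 0:
        direction = "negative (inverse)"
    else:
        direction = "none"
    return strength, direction


# Made-up values: weekly hours on social media and a self-esteem score.
hours_social_media = [1, 2, 3, 4, 5]
self_esteem_score = [9, 8, 8, 6, 5]

r = pearson_r(hours_social_media, self_esteem_score)
print(round(r, 3), describe(r))

# The coefficient reported for age and income in this section.
print(describe(0.108))
```

Note that `describe` says nothing about statistical significance; as the age and income example shows, a correlation can be statistically significant and still very weak.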

Correlations are important to run at the outset of your analyses so you can start thinking about how variables relate to each other and whether you might want to include them in future multivariate analyses. For instance, if you’re trying to understand the relationship between receipt of an intervention and a particular outcome, you might want to test whether client characteristics like race or gender are correlated with your outcome; if they are, they should be plugged into subsequent multivariate models. If not, you might want to consider whether to include them in multivariate models.

## A final note

Just because the correlation between your dependent variable and your primary independent variable is weak or not statistically significant doesn’t mean you should stop your work. For one thing, disproving your hypothesis is important for knowledge-building. For another, the relationship can change when you consider other variables in multivariate analysis, as they could mediate or moderate the relationships.

- Correlations are a basic measure of the strength of the relationship between two interval/ratio variables.
- A correlation between two variables does not mean one variable causes the other one to change. Drawing conclusions about causality from a simple correlation is likely to lead you to describe a spurious relationship, or one that exists at face value, but doesn’t hold up when more factors are considered.
- Correlations are a useful starting point for almost all data analysis projects.
- The magnitude of a correlation describes its strength and is indicated by the correlation coefficient, which can range from -1 to 1.
- A positive correlation, or direct relationship, occurs when the values of two variables move together in the same direction.
- A negative correlation, or inverse relationship, occurs when the value of one variable moves one direction, while the value of the other variable moves the opposite direction.

## 15.4 T-tests

- Describe the three different types of t-tests and when to use them.
- Explain what kind of variables are appropriate for t-tests.

At a very basic level, t-tests compare the means between two groups, the same group at two points in time, or a group and a hypothetical mean. By doing so using this set of statistical analyses, you can learn whether these differences are reflective of a real relationship or not (whether they are statistically significant).

Say you’ve got a data set that includes information about marital status and personal income (which we do!). You want to know if married people have higher personal (not family) incomes than non-married people, and whether the difference is statistically significant. Essentially, you want to see if the difference in average income between these two groups is down to chance or if it warrants further exploration. What analysis would you run to find this information? A t-test!

A lot of social work research focuses on the effect of interventions and programs, so t-tests can be particularly useful. Say you were studying the effect of a smoking cessation hotline on the number of days participants went without smoking a cigarette. You might want to compare the effect for men and women, in which case you’d use an independent samples t-test. If you wanted to compare the effect of your smoking cessation hotline to others in the country and knew the results of those, you would use a one-sample t-test. And if you wanted to compare the average number of cigarettes per day for your participants before they started a tobacco education group and then again when they finished, you’d use a paired-samples t-test. Don’t worry – we’re going into each of these in detail below.

So why are they called t-tests? Basically, when you conduct a t-test, you calculate a t statistic from your data and compare it to a theoretical distribution known as the t distribution. The t distribution is bell-shaped and looks much like the normal distribution, but with heavier tails when samples are small; as sample size grows, it gets closer and closer to the normal distribution. (Remember our discussion of assumptions in section 15.1 – one of them is that data be normally distributed.) Ultimately, the t statistic that the test produces allows you to determine if any differences are statistically significant.

For t-tests, you need to have an interval/ratio dependent variable and a nominal or ordinal independent variable. Basically, you need an average (using an interval or ratio variable) to compare across mutually exclusive groups (using a nominal or ordinal variable).

Let’s jump into the three different types of t-tests.

## Paired samples t-test

The paired samples t-test is used to compare two means for the same sample tested at two different times or under two different conditions. This comparison is appropriate for pretest-post-test designs or within-subjects experiments. The null hypothesis is that the means at the two times or under the two conditions are the same in the population. The alternative hypothesis is that they are not the same.

For example, say you are testing the effect of pet ownership on anxiety symptoms. You have access to a group of people who have the same diagnosis involving anxiety who do not have pets, and you give them a standardized anxiety inventory questionnaire. Then, each of these participants gets some kind of pet and after 6 months, you give them the same standardized anxiety questionnaire.

To compare their scores on the questionnaire at the beginning of the study and after 6 months of pet ownership, you would use a paired samples t-test. Since the sample includes the same people, the samples are “paired” (hence the name of the test). If the t-statistic is statistically significant, there is evidence that owning a pet has an effect on scores on your anxiety questionnaire.
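As a rough illustration of what the software does behind the scenes, here is a small Python sketch of the paired samples t statistic: take each person’s difference between the two time points, then divide the mean difference by its standard error. The anxiety scores are invented for this example, and `paired_t` is a hypothetical helper name.

```python
import math
import statistics

# Made-up anxiety scores for six participants (not real study data).
before = [30, 28, 35, 32, 29, 31]  # at the start of the study
after = [26, 27, 30, 29, 28, 27]   # after 6 months of pet ownership


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```

A negative t here simply reflects that scores went down from the first measurement to the second. You would still need statistical software or a t table to turn the statistic and degrees of freedom into a significance level.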

## Independent samples/two samples t-test

An independent/two samples t-test is used to compare the means of two separate samples. The two samples might have been tested under different conditions in a between-subjects experiment, or they could be pre-existing groups in a cross-sectional design (e.g., women and men, extroverts and introverts). The null hypothesis is that the means of the two populations are the same. The alternative hypothesis is that they are not the same.

Let’s go back to our example related to anxiety diagnoses and pet ownership. Say you want to know if people who own pets have different scores on certain elements of your standard anxiety questionnaire than people who don’t own pets.

You have access to two groups of participants: pet owners and non-pet owners. These groups both fit your other study criteria. You give both groups the same questionnaire at one point in time. You are interested in two questions, one about self-worth and one about feelings of loneliness. You can calculate mean scores for the questions you’re interested in and then compare them across two groups. If the t-statistic is statistically significant, then there is evidence of a difference in these scores that may be due to pet ownership.
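Here is a comparable sketch for two separate groups. The loneliness scores are made up, and `independent_t` is a hypothetical helper name. We assume Welch’s version of the statistic, which does not require the two groups to have equal variances; this chapter does not specify which version to use, and SPSS reports results both with and without the equal-variance assumption.

```python
import math
import statistics

# Made-up loneliness scores (not real study data).
pet_owners = [4, 5, 3, 4, 4]
non_owners = [6, 5, 7, 6, 6]


def independent_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    # Standard error of the difference, using each group's own variance.
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (mean_a - mean_b) / standard_error


t = independent_t(pet_owners, non_owners)
print(f"t = {t:.2f}")
```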

## One-sample t-test

Finally, let’s talk about a one-sample t-test. This t-test is appropriate when there is an external benchmark to use for your comparison mean, either known or hypothesized. The null hypothesis for this kind of test is that the mean in your sample is the same as the mean of the population. The alternative hypothesis is that the means are different.

Let’s say you know the average years of post-high school education for Black women, and you’re interested in learning whether the Black women in your study are on par with the average. You could use a one-sample t-test to determine how your sample’s average years of post-high school education compares to the known value in the population. This kind of t-test is useful when a phenomenon or intervention has already been studied, or to see how your sample compares to your larger population.
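The one-sample version can be sketched the same way: subtract the benchmark from your sample mean and divide by the standard error of the sample mean. Both the sample values and the benchmark of 3.0 years below are hypothetical numbers chosen for illustration, and `one_sample_t` is our own helper name.

```python
import math
import statistics

# Made-up years of post-high school education for six study participants.
sample_years = [2, 4, 3, 5, 4, 6]
benchmark_mean = 3.0  # hypothetical population value, not a real statistic


def one_sample_t(sample, benchmark):
    """Return (t statistic, degrees of freedom) against a fixed benchmark."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - benchmark) / standard_error
    return t, n - 1


t, df = one_sample_t(sample_years, benchmark_mean)
print(f"t = {t:.2f}, df = {df}")
```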

- There are three types of t-tests that are each appropriate for different situations. T-tests can only be used with an interval/ratio dependent variable and a nominal/ordinal independent variable.
- T-tests in general compare the means of one variable between either two points in time or conditions for one group, two different groups, or one group to an external benchmark variable.
- In a paired-samples t-test, you are comparing the means of one variable in your data for the same group, either at two different times or under two different conditions, and testing whether the difference is statistically significant.
- In an independent samples t-test, you are comparing the means of one variable in your data for two different groups to determine if any difference is statistically significant.
- In a one-sample t-test, you are comparing the mean of one variable in your data to an external benchmark, either observed or hypothetical.
- Which t-test makes the most sense for your data and research design? Why?
- Which variable would be an appropriate dependent variable? Why?
- Which variable would be an interesting independent variable? Why?

## 15.5 ANOVA (ANalysis Of VAriance)

- Explain what kind of variables are appropriate for ANOVA
- Explain the difference between one-way and two-way ANOVA
- Come up with an example of when each type of ANOVA is appropriate

Analysis of variance, generally abbreviated to ANOVA, is a statistical method to examine how a dependent variable changes as the value of a categorical independent variable changes. It serves the same purpose as the t-tests we learned in 15.4: it tests for differences in group means. ANOVA is more flexible in that it can handle any number of groups, unlike t-tests, which are limited to two groups (independent samples) or two time points (dependent samples). Thus, the purpose and interpretation of ANOVA will be the same as it was for t-tests.

There are two types of ANOVA: a one-way ANOVA and a two-way ANOVA. One-way ANOVAs are far more common than two-way ANOVAs.

## One-way ANOVA

The most common type of ANOVA that researchers use is the one-way ANOVA, which is a statistical procedure to compare the means of a variable across three or more groups of an independent variable. Let’s take a look at some data about income of different racial and ethnic groups in the United States. The data in Table 15.2 below comes from the US Census Bureau’s 2018 American Community Survey [2]. The racial and ethnic designations in the table reflect what’s reported by the Census Bureau, which is not fully representative of how people identify racially.

Off the bat, of course, we can see a difference in the average income between these groups. Now, we want to know if the difference between average income of these racial and ethnic groups is statistically significant, which is the perfect situation to use one-way ANOVA. To conduct this analysis, we need the person-level data that underlies this table, which I was able to download from IPUMS. For this analysis, race is the independent variable (nominal) and total income is the dependent variable (interval/ratio). Let’s assume for this exercise that we have no other data about the people in our data set besides their race and income. (If we did, we’d probably do another type of analysis.)

I used SPSS to run a one-way ANOVA using this data. With the basic analysis, the first table in the output was the following.

Without going deep into the statistics, the column labeled “F” represents our F statistic, which is similar to the t statistic in a t-test in that it gives a statistical point of comparison for our analysis. The important thing to notice here, however, is our significance level, which is .000. Sounds great! But we actually get very little information here – all we know is that the between-group differences are statistically significant as a whole, but not anything about the individual groups.
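For readers who want a peek at where the F statistic comes from, the Python sketch below compares the variation between group means to the variation within groups. The three groups and their incomes (in thousands of dollars) are invented and much tidier than real data, and `one_way_anova_f` is a hypothetical helper name rather than anything SPSS exposes.

```python
def mean(values):
    return sum(values) / len(values)


def one_way_anova_f(groups):
    """Return (F statistic, between-groups df, within-groups df)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Variation of each group's mean around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group's mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up incomes (in thousands of dollars) for three hypothetical groups.
incomes = [
    [20, 22, 24],
    [30, 32, 34],
    [40, 42, 44],
]

f, df_between, df_within = one_way_anova_f(incomes)
print(f"F = {f:.1f}, df = ({df_between}, {df_within})")
```

A large F means the group means differ a lot relative to how much individuals vary within each group. Like the output discussed above, it tells you about the groups as a whole and not about any particular pair.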

This is where post hoc tests come into the picture. Because we are comparing each race to each other race, that adds up to a lot of comparisons, and statistically, this increases the likelihood of a type I error. A post hoc test in ANOVA is a way to correct and reduce this error after the fact (hence “post hoc”). I’m only going to talk about one type – the Bonferroni correction – because it’s commonly used. However, there are other types of post hoc tests you may encounter.
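To see why the number of comparisons matters, consider this short Python sketch. The five group labels are placeholders, and the calculation of the overall error rate assumes the comparisons are independent of one another, which is a simplification. It shows one common way to express the Bonferroni correction: divide your significance threshold by the number of comparisons you are making.

```python
from itertools import combinations

# Placeholder labels for five groups (hypothetical, not the chapter's categories).
groups = ["A", "B", "C", "D", "E"]
alpha = 0.05

# Every pair of groups is one comparison.
pairs = list(combinations(groups, 2))
number_of_comparisons = len(pairs)

# Chance of at least one type I error across all comparisons if each is
# tested at alpha with no correction (assuming independent comparisons).
uncorrected_error_rate = 1 - (1 - alpha) ** number_of_comparisons

# Bonferroni: hold each individual comparison to a stricter threshold.
adjusted_alpha = alpha / number_of_comparisons

print(number_of_comparisons)
print(round(uncorrected_error_rate, 3))
print(adjusted_alpha)
```

With only five groups there are already ten pairwise comparisons, which is why the uncorrected chance of a false positive climbs so quickly.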

When I tell SPSS to run the ANOVA with a Bonferroni correction, in addition to the table above, I get a very large table that runs through every single comparison I asked it to make among the groups in my independent variable – in this case, the different races. Figure 15.4 below is the first grouping in that table – they will all give the same conceptual information, though some of the signs on the mean difference and, consequently the confidence intervals, will vary.

Now we see some points of interest. As you’d expect knowing what we know from prior research, race seems to have a pretty strong influence on a person’s income. (Notice I didn’t say “effect” – we don’t have enough information to establish causality!) The significance levels for the mean of White people’s incomes compared to the mean of several races are .000. Interestingly, for Asian people in the US, race appears to have no influence on their income compared to White people in the US. The significance level for Native Hawaiians and Pacific Islanders is also relatively high.

So what does this mean? We can say with some confidence that, overall, race seems to influence a person’s income. In our hypothetical data set, since we only have race and income, this is a great analysis to conduct. But do we think that’s the only thing that influences a person’s income? Probably not. To look at other factors if we have them, we can use a two-way ANOVA.

## Two-way ANOVA and n-way ANOVA

A two-way ANOVA is a statistical procedure to compare the means of a variable across groups using multiple independent variables to distinguish among groups. For instance, we might want to examine income by both race and gender, in which case, we would use a two-way ANOVA. Fundamentally, the procedures and outputs for two-way ANOVA are almost identical to one-way ANOVA, just with more cross-group comparisons, so I am not going to run through an example in SPSS for you.

You may also see textbooks or scholarly articles refer to n-way ANOVAs. Essentially, just like you’ve seen throughout this book, the n can equal just about any number. However, going far beyond a two-way ANOVA increases your likelihood of a type I error, for the reasons discussed in the previous section.

You may notice that this book doesn’t get into multivariate analysis at all. Regression analysis, which you’ve no doubt seen in many academic articles you’ve read, is an incredibly complex topic. There are entire courses and textbooks on the multiple different types of regression analysis, and we did not think we could adequately cover regression analysis at this level. Don’t let that scare you away from learning about it – just understand that we don’t expect you to know about it at this point in your research learning.

- One-way ANOVA is a statistical procedure to compare the means of a variable across three or more categories of an independent variable. This analysis can help you understand whether there are meaningful differences in your sample based on different categories like race, geography, gender, or many others.
- Two-way ANOVA is almost identical to one-way ANOVA, except that you can compare the means of a variable across multiple independent variables.
- Would you want to conduct a two-way or n-way ANOVA? If so, what other independent variables would you use, and why?
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70, p. 129-133.
- Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0

a group of statistical techniques that examines the relationship between two variables

"Assuming that the null hypothesis is true and the study is repeated an infinite number times by drawing random samples from the same populations(s), less than 5% of these results will be more extreme than the current result" (Cassidy et al., 2019, p. 233).

The characteristics we assume about our data, like that it is normally distributed, that makes it suitable for certain types of statistical tests

A relationship where it appears that two variables are related BUT they aren't. Another variable is actually influencing the relationship.

a statistical test to determine whether there is a significant relationship between two categorical variables

variables whose values are organized into mutually exclusive groups but whose numerical values cannot be used in mathematical operations.

a visual representation of across-tabulation of categorical variables to demonstrate all the possible occurrences of categories

a relationship between two variables in which their values change together.

when a relationship between two variables appears to be causal but can in fact be explained by influence of a third variable

Graduate research methods in social work Copyright © 2020 by Matthew DeCarlo, Cory Cummings, Kate Agnelli is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Quantitative Data Analysis

## 4 Bivariate Analyses: Crosstabulation

Roger Clark

In most research projects involving variables, researchers do indeed investigate the central tendency and variation of important variables, and such investigations can be very revealing. But the typical researcher, using quantitative data analysis, is interested in testing hypotheses or answering research questions that involve at least two variables. A relationship is said to exist between two variables when certain categories of one variable are associated, or go together, with certain categories of the other variable. Thus, for example, one might expect that in any given sample of men and women (assume, for the purposes of this discussion, that the sample leaves out nonbinary folks), men would tend to be taller than women. If this turned out to be true, one would have shown that there is a relationship between gender and height.

But before we go further, we need to make a couple of distinctions. One crucial distinction is that between an independent variable and a dependent variable. An independent variable is a variable a researcher suspects may affect or influence another variable. A dependent variable, on the other hand, is a variable that a researcher suspects may be affected or influenced by (or dependent upon) another variable. In the example of the previous paragraph, gender is the variable that is expected to affect or influence height and is therefore the independent variable. Height is the variable that is expected to be affected or influenced by gender and is therefore the dependent variable. Any time one states an expected relationship between two (or more) variables, one is stating a hypothesis. The hypothesis stated in the second-to-last sentence of the previous paragraph is that men will tend to be taller than women. We can map two-variable hypotheses in the following way (Figure 3.1):

When mapping a hypothesis, we normally put the variable we think to be affecting the other variable on the left and the variable we expect to be affected on the right and then draw arrows between the categories of the first variable and the categories of the second that we expect to be connected.

Quiz at the End of The Paragraph

Read the following report by Annie Lowrey about a study done by two researchers, Kearney and Levine. What is the main hypothesis, or at least the main finding, of Kearney and Levine’s study on the effects of watching 16 and Pregnant on adolescent women? How might you map this hypothesis (or finding)?

https://www.nytimes.com/2014/01/13/business/media/mtvs-16-and-pregnant-derided-by-some-may-resonate-as-a-cautionary-tale.html

We’d like to say a couple of things about what we think Kearney and Levine’s major hypothesis was and then introduce you to a way you might analyze data collected to test the hypothesis. Kearney and Levine’s basic hypothesis is that adolescent women who watched 16 and Pregnant were less likely to become pregnant than women who did not watch it. They find some evidence not only to support this basic hypothesis but also to support the idea that the ones who watched the show were less likely to get pregnant because they were more likely to seek information about contraception (and presumably to use it) than others. Your map of the basic hypothesis, at least as it applied to individual adolescent women, might look like this:

Let’s look at a way of showing a relationship between two nominal level variables: crosstabulation. Crosstabulation is the process of making a bivariate table for nominal level variables to show their relationship. But how does crosstabulation work?

Suppose you collected data from 8 adolescent women and the data looked like this:

Table 1: Data from Hypothetical Sample A

Quick Check : What percentage of those who have watched 16 and Pregnant in the sample have become pregnant? What percentage of those who have NOT watched 16 and Pregnant have become pregnant?

If you found that 25 percent of those who had watched the show became pregnant, while 75 percent of those who had not watched it did so, you have essentially done a crosstabulation in your head. But here’s how you can do it more formally and more generally.

First you need to take note of the number of categories in your independent variable (for “Watched 16 and Pregnant” it was 2: Yes and No). Then note the number of categories in your dependent variable (for “Got Pregnant” it was also 2: again, Yes and No). Now you prepare a “2 by 2” table like the one in Table 3.2, [1] labeling the columns with the categories of the independent variable and the rows with the categories of the dependent variable. Then decide where the first case should be put, as we’ve done, by determining which cell is where its appropriate row and column “cross.” We’ve “crosstabulated” Person 1’s data by putting a mark in the box where the “Yes” for “watched” and the “No” for “got pregnant” cross.

Table 2. Crosstabulating Person 1’s Data from Table 3.1 Above

We’ve “crosstabulated” the first case for you. Can you crosstabulate the other seven cases? We’re going to call the cell in the upper left corner of the table cell “A,” the one in the upper right, cell “B,” the one in the lower left, cell “C,” and the one in the lower right, cell “D.” If you’ve finished your crosstabulation and had one case in cell A, 3 in cell B, 3 in cell C, and 1 in cell D, you’ve done great!

In order to interpret and understand the meaning of your crosstabulation, you need to take one more step, and that is converting those tally marks to percentages. To do this, you add up all the tally marks in each column, and then you determine what percentage of the column total is found in each cell in that column. You’ll see what that looks like in Table 3 below.

## Direction of the Relationship

Now, there are three characteristics of a crosstabulated relationship that researchers are often interested in: its direction, its strength, and its generalizability. We’ll define each of these in turn, and as we come to it. The direction of a relationship refers to how categories of the independent variable are related to categories of the dependent variable. There are two steps involved in working out the direction of a crosstabulated relationship… and these are almost indecipherable until you’ve seen it done:

1. Percentage in the direction of the independent variable.

2. Compare percentages in one category of the dependent variable.

The first step actually involves three substeps. First you change the tally marks to numbers. Thus, in the example above, cell A would get a 1, B, a 3, C, a 3, and D, a 1. Second, you’d add up all the numbers in each category of the independent variable and put the total on the side of the table at the end of that column. Third, you would calculate the percentage of that total that falls into each cell along that column (as noted above). Once you’d done all that with the data we gave you above, you should get a table that looks like this (Table 3.3):

Table 3 Crosstabulation of Our Imaginary Data from a 16 and Pregnant Study

Step 2 in determining the direction of a crosstabulated relationship involves comparing percentages in one category of the dependent variable. When we look at the “yes” category, we find that 25% of those who watched the show got pregnant, while 75% of those who did NOT watch the show got pregnant. Turning this from a percentage comparison to plain English, this crosstabulation would have shown us that those who did watch the show were less likely to get pregnant than those who did not. And that is the direction of the relationship.
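The two steps can also be sketched in a few lines of Python. The cell counts are the ones from our imaginary sample (1 in cell A, 3 in cell B, 3 in cell C, and 1 in cell D); the dictionary layout and variable names are just one possible way to represent the table and are our own choice.

```python
# Cell counts from the imaginary sample: A=1, B=3 (top row), C=3, D=1 (bottom row).
# Columns hold the independent variable; rows hold the dependent variable.
counts = {
    "Got pregnant: Yes": {"Watched: Yes": 1, "Watched: No": 3},
    "Got pregnant: No": {"Watched: Yes": 3, "Watched: No": 1},
}
columns = ["Watched: Yes", "Watched: No"]

# Step 1: percentage in the direction of the independent variable,
# which means dividing each cell by its column total.
column_totals = {c: sum(row[c] for row in counts.values()) for c in columns}
percentages = {
    label: {c: 100 * row[c] / column_totals[c] for c in columns}
    for label, row in counts.items()
}

# Step 2: compare percentages within one category of the dependent variable.
for label, row in percentages.items():
    print(label, {c: f"{p:.0f}%" for c, p in row.items()})
```

Reading across the “Got pregnant: Yes” row gives the same comparison described above: 25% of those who watched versus 75% of those who did not.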

Note: because we are designing our crosstabulations to have the independent variable in the columns, one of the simplest ways to look at the direction or nature of the relationship is to compare the percentages across the rows. Whenever you look at a crosstabulation, start by making sure you know which is the independent and which is the dependent variable and comparing the percentages accordingly.

## Strength of the Relationship

When we deal with the strength of a relationship, we’re dealing with the question of how reliably we can predict a sample member’s value or category of the dependent variable based on knowledge of that member’s value or category on the independent variable, just knowing the direction of the relationship. Thus, for the table above, it’s clear that if you knew that a person had watched 16 and Pregnant and you guessed she’d not gotten pregnant, you’d have a 75% (3 out of 4) chance of being correct; if you knew she hadn’t watched, and you guessed she had gotten pregnant, you’d also have a 75% (3 out of 4) chance of being correct. Knowing the direction of this relationship would greatly improve your chances of making good guesses…but they wouldn’t necessarily be perfect all the time.

There are several measures of the strength of association and, if they’ve been designed for nominal level variables, they all vary between 0 and 1. When one of the measures is 0.00, it indicates that knowing a value of the independent variable won’t help you at all in guessing what a value of the dependent variable will be. When one of these measures is 1.00, it indicates that knowing a value of the independent variable and the direction of the relationship, you could make perfect guesses all the time. One of the simplest of these measures of strength, which can only be used when you have 2 categories in both the independent and dependent variables, is the absolute value of Yule’s Q. Because the “absolute value of Yule’s Q” is so relatively easy to compute, we will be using it a lot from now on, and it is the one formula in this book we would like you to learn by heart. We will be referring to it simply as |Yule’s Q|; note that the “|” symbols on both sides of the ‘Yule’s Q’ are asking us to take whatever Yule’s Q computes to be and turn it into a positive number (its absolute value). So here’s the formula for |Yule’s Q|:

[latex]|\mbox{Yule's Q}| = \frac{|(A \times D) - (B \times C)|}{|(A \times D) + (B \times C)|} \\ \text{Where }\\ \text{$A$ is the number of cases in cell $A$ }\\ \text{$B$ is the number of cases in cell $B$ }\\ \text{$C$ is the number of cases in cell $C$ }\\ \text{$D$ is the number of cases in cell $D$}[/latex]

For the crosstabulation of Table 3,

[latex]|{\mbox{Yule's Q}}| = \frac{|(1 \times 1) - (3 \times 3)|}{|(1 \times 1) + (3 \times 3)|} = \frac{|1-9|}{|1+9|} = \frac{|-8|}{|10|} = \frac{8}{10} = .80[/latex]

In other words, the Yule’s Q is .80, much closer to the upper limit of Yule’s Q (1.00) than it is to its lower limit (0.00). So the relationship is very strong, indicating, as we already knew, that, given knowledge of the direction of the relationship, we could make a pretty good guess about what value on the dependent variable a case would have if we knew what value on the independent variable it had.
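If you would like to check this arithmetic by computer, here is a minimal Python sketch of the formula. The function name `yules_q_abs` is our own invention for illustration, not something SPSS or SDA provides.

```python
def yules_q_abs(a, b, c, d):
    """Absolute value of Yule's Q for a 2 x 2 table with cell counts A, B, C, D."""
    numerator = abs(a * d - b * c)
    denominator = abs(a * d + b * c)
    if denominator == 0:
        # Happens only when both cross-products are zero; Q is undefined then.
        raise ValueError("Yule's Q is undefined when A*D + B*C is zero")
    return numerator / denominator


# Cell counts from the worked example: A = 1, B = 3, C = 3, D = 1
q = yules_q_abs(1, 3, 3, 1)
print(q)  # 0.8
```

Because the formula uses only the two cross-products (A times D, and B times C), swapping the rows or the columns of the table leaves |Yule’s Q| unchanged.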

Practice Exercise

Suppose you took three samples of four adolescent women apiece and obtained the following data on the 16 and Pregnant topic:

See if you can determine both the direction and strength of the relationship between having watched “16 and Pregnant” and becoming pregnant in each of these imaginary samples. In what ways does each sample, other than sample size, differ from Sample A above? Answers to be found in the footnote. [2]

Roger now wants to share with you a discovery he made after analyzing some data that two now post-graduate students of his, Angela Leonardo and Alyssa Pollard, have made using crosstabulation. At the time of this writing, they had just coded their first night of TV commercials, looking for the gender of the authoritative “voice-over”—the disembodied voice that tells viewers key stuff about the product. It’s been generally found in gender studies that these voice-overs are overwhelmingly male (e.g., O’Donnell and O’Donnell 1978; Lovdal 1989; Bartsch et al. 2000), even though the percentage of such voice-overs that were male had dropped from just over 90 percent in the 1970s and 1980s to just over 70 percent in 1998. We will be looking at considerably more data, but so far things are so interesting that Roger wants to share them with you…and you’re now sophisticated enough about crosstabs (shorthand for crosstabulations) to appreciate them. Thus, Table 3.4 suggests that things have changed a great deal. In fact the direction of the relationship between the time period of the commercials and the gender of the voice-over is clearly that more recent commercials are much more likely to have a female voice-over than older ones. While only 29 percent of commercials in 1998 had a female voice-over, 71 percent in 2020 did so. And a Yule’s Q of .72 indicates that the relationship is very strong.

Table 3.4 Crosstabulation of Year of Commercial and Gender of the Voice-Over

Yule’s Q, while relatively easy to calculate, has a couple of notable limitations. One is that if one of the four cells in a 2 x 2 table (a table based on an independent variable with 2 categories and a dependent variable with 2 categories) has no cases, the calculated Yule’s Q will be 1.00, even if the relationship isn’t anywhere near that strong. (Why don’t you try it with a sample that has 5 cases on cell A, 5 in cell B, 5 in cell C, and 0 in cell D?)
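Here is one way to try that suggestion in Python. This is only a sketch: the helper name is ours, and the column percentages assume a layout (not spelled out above) in which cells A and C share one column of the independent variable and cells B and D share the other.

```python
def yules_q_abs(a, b, c, d):
    """Absolute value of Yule's Q for a 2 x 2 table."""
    return abs(a * d - b * c) / abs(a * d + b * c)


# The sample suggested in the text: A = 5, B = 5, C = 5, D = 0
a, b, c, d = 5, 5, 5, 0
print(yules_q_abs(a, b, c, d))  # 1.0

# Assumed layout: A and C form the first column, B and D the second.
pct_top_row_col1 = 100 * a / (a + c)
pct_top_row_col2 = 100 * b / (b + d)
print(pct_top_row_col1, pct_top_row_col2)  # 50.0 100.0
```

Under that assumed layout, half of the cases in the first column fall in the top row, so knowing the column would not let you guess perfectly, yet the empty cell forces |Yule’s Q| to 1.00.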

Another problem with Yule’s Q is that it can only be used to describe 2 x 2 tables. But not all variables have just 2 categories. As a consequence, there are several other measures of strength of association for nominal level variables that can handle bigger tables. (One that we recommend for sheep farmers is lambda. Bahhh!) But, we most typically use one called Cramer’s V, which shares with Yule’s Q (and lambda) the property of varying between 0 and 1. Roger normally advises students that values of Cramer’s V between 0.00 and 0.10 suggest that the relationship is weak; between 0.11 and 0.30, that the relationship is moderately strong; between 0.31 and 0.59, that the relationship is strong; and between 0.60 and 1.00, that the relationship is very strong. Associations (a fancy word for the strength of the relationship) above 0.59 are not so common in social science research.
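Roger’s rules of thumb can be written down as a small lookup, sketched below in Python. The function name is hypothetical, and one detail is our own assumption: the advice leaves small gaps between bands (for instance between 0.10 and 0.11), so this sketch assigns any value in a gap to the next band up.

```python
def describe_strength(v):
    """Label a Cramer's V (or |Yule's Q|) value using the rule-of-thumb bands."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("the measure must lie between 0 and 1")
    if v <= 0.10:
        return "weak"
    if v <= 0.30:
        return "moderately strong"
    if v <= 0.59:
        return "strong"
    return "very strong"


for value in (0.05, 0.25, 0.45, 0.80):
    print(value, describe_strength(value))
```

These cutoffs are a teaching convention, not a law of statistics, so other texts may draw the lines elsewhere.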

An example of the use of Cramer’s V? Roger used statistical software called the Statistical Package for the Social Sciences (SPSS) to analyze the data Angela, Alyssa and he collected about commercials (on one night) to see whether men or women, both or neither, were more likely to appear as the main characters in commercials focused on domestic goods (goods used inside the home) and non-domestic goods (goods used outside the home). Who (men or women or both) would you expect to be such (main) characters in commercials involving domestic products? Non-domestic products? If you guessed that females might be the major characters in commercials for domestic products (e.g., food, laundry detergent, and home remedies) and males might be major characters in commercials for non-domestic products (e.g., cars, trucks, cameras), your guesses would be consistent with findings of previous researchers (e.g., O’Donnell and O’Donnell, 1978; Lovdal, 1989; Bartsch et al., 2001). The data we collected on our first night of data collection suggest some support for these findings (and your expectations), but also some support for another viewpoint. Table 3.5, for instance, shows that women were, in fact, the main characters in about 48 percent of commercials for domestic products, while they were the main characters in only about 13 percent of commercials for non-domestic products. So far, so good. But males, too, were more likely to be main characters in commercials for domestic products (they were these characters about 24 percent of the time) than they were in commercials for non-domestic products (for which they were the main character only about 4 percent of the time). So who were the main product “representatives” for non-domestic commercials? 
We found that in these commercials at least one man and one woman were together the main characters about 50 percent of the time, while men and women together were the main characters only about 18 percent of the time in commercials for domestic products.

But the analysis involving gender of main character and whether products were domestic or non-domestic involved more than a 2 x 2 table. In fact, it involved a 2 x 4 table because our dependent variable, gender of main character, had four categories: female, male, both, and neither. Consequently, we couldn’t use Yule’s Q as a measure of strength of association. But we could ask, and did ask (using SPSS), for Cramer’s V, which turned out to be about 0.53, suggesting (if you re-examine Roger’s advice above) that the relationship is a strong one.

Table 3.5 Crosstabulation of Type of Commercial and Gender of Main Character

## Generalizability of the Relationship

When we speak of the generalizability of a relationship, we’re dealing with the question of whether something like the relationship (in direction, if not strength) that is found in the sample can be safely generalized to the larger population from which the sample was drawn. If, for instance, we drew a probability sample of eight adolescent women like the ones we pretended to draw in the first example above, we’d know we have a sample in which a strong relationship existed between watching “16 and Pregnant” and not becoming pregnant. But how could one tell that this sample relationship was likely to be representative of the true relationship in the larger population?

If you recall the distinction we drew between descriptive and inferential statistics in the Chapter on Univariate Analysis, you won’t be surprised to learn that we are now entering the realm of inferential statistics for bivariate relationships. When we use percentage comparisons within one category of the dependent variable to determine the direction of a relationship and measures like Yule’s Q and Cramer’s V to get at its strength, we’re using descriptive statistics, ones that describe the relationship in the sample. But when we talk about Pearson’s chi-square (or χ²), we’re referring to an inferential statistic, one that can help us determine whether we can generalize that something like the relationship in the sample exists in the larger population from which the sample was drawn.

But, before we learn how to calculate and interpret Pearson’s chi-square, let’s get a feel for the logic of this inferential statistic first. Scientists generally, and social scientists in particular, are very nervous about inferring that a relationship exists in the larger population when it really doesn’t exist there. This kind of error (the one you’d make if you inferred that a relationship existed in the larger population when it didn’t really exist there) has a special name: a Type 1 error. Social scientists are so anxious about making Type 1 errors that they want to keep the chances of making them very low, but not impossibly low. If they made them impossibly low, then they’d risk making the opposite of a Type 1 error: a Type 2 error, the kind of error you’d make when you failed to infer that a relationship existed in the larger population when it really did exist there. The chances, or probability, of something happening can vary from 0.00 (when there’s no chance at all of it happening) to 1.00, when there’s a perfect chance that it will happen. In general, social scientists aim to keep the chances of making a Type 1 error below .05, or below a 1 in 20 chance. They thus aim for a very small, but not impossibly small, chance of making the inference that a relationship exists in the larger population when it doesn’t really exist there.

Karl Pearson, the statistician whose name is associated with Pearson’s chi-square, studied the statistic’s properties in about 1900. He found, among other things, that crosstabulations of different sizes (i.e., different numbers of cells) required a different chi-square to be associated with a .05 chance, or probability (p), of making a Type 1 error or less. As the number of cells increases, the required chi-square increases as well. For a 2 x 2 table, the critical chi-square is 3.84 (that is, the computed chi-square value should be 3.84 or more for you to infer that a relationship exists in the larger population with only a .05 chance, or less, of being wrong); for a 2 x 3 table, the critical chi-square is 5.99, and so on. Before we were able to use statistical processing software like SPSS, statistical researchers relied on tables that outlined the critical values of chi-square for different size tables (degrees of freedom, to be discussed below) and different probabilities of making a Type 1 error. A truncated (shortened) version of such a table can be seen in Table 6.
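If you are curious where 3.84 and 5.99 come from, the two smallest cases can be reproduced with Python’s standard library alone. The sketch below relies on two facts about the chi-square distribution that this chapter does not derive: with 1 degree of freedom (a 2 x 2 table) a chi-square value is the square of a standard normal value, and with 2 degrees of freedom (a 2 x 3 table) the chance of exceeding x is exp(-x/2).

```python
import math
from statistics import NormalDist

alpha = 0.05  # the accepted chance of making a Type 1 error

# 1 degree of freedom: square the two-tailed cutoff of the standard normal.
z_cutoff = NormalDist().inv_cdf(1 - alpha / 2)
critical_df1 = z_cutoff ** 2

# 2 degrees of freedom: solve exp(-x / 2) = alpha for x.
critical_df2 = -2 * math.log(alpha)

print(round(critical_df1, 2))  # 3.84
print(round(critical_df2, 2))  # 5.99
```

For larger tables there is no such shortcut, which is why printed tables of critical values (and now software) are used instead.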

Table 6: Table of Critical Values of the Chi-Square Distribution

Now you’re ready to see how to calculate chi-square. The formula for chi-square (Χ²) is:

[latex]\chi^2 = \sum\frac{(O-E)^2}{E}\\ \text{where}\\ \text{$\Sigma$ means "the sum of"}\\ \text{$O = $ the observed number of cases in each cell in the sample}\\ \text{$E =$ the expected number in each cell, if there were no relationship between the two variables}[/latex]

Let’s see how this would work with the example of the imaginary data in Table 3.3. This table, if you recall, looked (mostly) like this:

Table 7 (Slightly Revised) Crosstabulation of Our Imaginary Data from a “16 and Pregnant” Study

How do you figure out what the expected number of cases would be in each cell? You use the following formula:

[latex]E = \frac{M_r \times M_c}{N}\\ \text{Where}\\ \text{$M_r$ is the row marginal for the cell}\\ \text{$M_c$ is the column marginal for the cell}\\ \text{$N$ is the total number of cases in the sample}[/latex]

A row marginal is the total number of cases in a given row of a table. A column marginal is the total number of cases in a given column of a table. For this table, the N is 8, the total number of cases involved in the crosstabulation. For cell A, the row marginal is 4 and the column marginal is 4, which means its expected number of cases would be (4 x 4)/8 = 16/8 = 2. In this particular table, all the cells would have had an expected frequency (or number of cases) of 2. So now all we have to do to compute χ² is to make a series of calculation columns:

And the sum of all the numbers in the (O-E)²/E column is 2.00. This is less than the 3.84 that χ² needs to be for us to conclude that the chances of making a Type 1 error are less than .05 (see Table 3.6), so we cannot safely generalize that something like the relationship in this small sample exists in the larger population. Aren’t you glad that these days programs like SPSS can do these calculations for us? Even though they can, it’s important to go through the process a few times on your own so that you understand what it is that the computer is doing.
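The calculation columns can also be produced with a few lines of Python. This is a sketch of the hand calculation, not of what SPSS does internally; the cell counts (1, 3, 3, 1) and the marginals of 4 come from the example, while the variable names are ours.

```python
# Observed counts in cells A, B, C, D of the imaginary sample
observed = {"A": 1, "B": 3, "C": 3, "D": 1}

n = 8  # total number of cases
row_marginal = 4  # every row in this table holds 4 cases
col_marginal = 4  # every column in this table holds 4 cases
expected = row_marginal * col_marginal / n  # 2.0 for every cell

chi_square = 0.0
print("cell  O    E  O-E  (O-E)^2  (O-E)^2/E")
for cell, o in observed.items():
    difference = o - expected
    contribution = difference ** 2 / expected
    chi_square += contribution
    print(cell, o, expected, difference, difference ** 2, contribution)

print("chi-square =", chi_square)  # chi-square = 2.0
```

Each cell contributes 0.5, and the four contributions add up to the 2.00 reported above.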

Chi-square varies based on three characteristics of the sample relationship. The first of these is the number of cells. Higher chi-squares are more easily achieved in tables with more cells; hence the 3.84 standard for 2 x 2 tables and the 5.99 standard for 2 x 3 tables. You’ll recall from Table 3.6 that we used the term degrees of freedom to refer to the calculation of table size. To figure out the degrees of freedom for a crosstabulation, you simply count the number of columns in the table (only the columns with data in them, not columns with category names) and subtract one. Then you count the number of rows in the table, again only those with data in them, and subtract one. Finally, you multiply the two numbers you have computed. Therefore, the degrees of freedom for a 2×2 table will be 1 [(2-1)*(2-1)], while the degrees of freedom for a 4×6 table will be 15 [(4-1)*(6-1)].
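The counting rule for degrees of freedom is short enough to express as a one-line function. The Python sketch below (the function name is ours) takes the number of data rows and data columns and reproduces the two examples just given.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a crosstabulation, counting only data rows and columns."""
    return (n_rows - 1) * (n_cols - 1)


print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(4, 6))  # 15
```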

Higher chi-squares will also be achieved when the relationship is stronger. If, instead of the 1, 3, 3, 1 pattern in the four cells above (a relationship that yields a Yule’s Q of 0.80), one had a 0, 4, 4, 0 pattern (a relationship that yields a Yule’s Q of 1.00), the chi-square would be 8.00, [3] considerably greater than the 3.84 standard, and one could then generalize that something like the relationship in the sample also existed in the larger population.

But chi-square also varies with the size of the sample. Thus, if instead of the 1, 3, 3, 1 pattern above, one had a 10, 30, 30, 10 pattern—both of which would yield a Yule’s Q of 0.80 and are therefore of the same strength, and both of which have the same number of cells (4)—the chi-square would compute to be 20, instead of 2, and give pretty clear guidance to infer that a relationship exists in the larger population. The message of this last co-variant of chi-square—that it grows as the sample grows—implies that researchers who want to find generalizable results do well to increase sample size. A sample that tells us that the relationship under investigation is generalizable is said to be significant —sometimes a desirable and often an interesting thing. [4] Incidentally, SPSS computed the chi-square for the crosstabulation in Table 3.5, the one that showed the relationship between type of product advertised (domestic or non-domestic) and the gender of the product representative, to be 17.5. Even for a 2 x 4 table like that one, this is high enough to infer that a relationship exists in the larger population, with less than a .05 chance of being wrong. In fact, SPSS went even further, telling us that the chances of making a Type 1 error were less than .001. (Aren’t computers great?)
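To see strength and sample size at work side by side, the Python sketch below computes |Yule’s Q| and chi-square for the three patterns discussed above. Both helper functions are our own illustrations, and the chi-square helper assumes no row or column of the table is completely empty.

```python
def chi_square(table):
    """Pearson's chi-square for a crosstabulation given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def yules_q_abs(table):
    """Absolute value of Yule's Q for a 2 x 2 table."""
    (a, b), (c, d) = table
    return abs(a * d - b * c) / abs(a * d + b * c)


patterns = {
    "original (1, 3, 3, 1)": [[1, 3], [3, 1]],
    "stronger (0, 4, 4, 0)": [[0, 4], [4, 0]],
    "larger (10, 30, 30, 10)": [[10, 30], [30, 10]],
}
for label, table in patterns.items():
    print(label, yules_q_abs(table), chi_square(table))
```

The original and larger samples share a |Yule’s Q| of 0.8, but their chi-squares are 2.0 and 20.0; the stronger pattern has a |Yule’s Q| of 1.0 and a chi-square of 8.0.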

## Crosstabulation with Two Ordinal Level Variables

We’ve introduced crosstabulation as a technique designed for the analysis of the relationship between two nominal level variables. But because all variables are at least nominal level, one could theoretically use crosstabulation to analyze the relation between variables of any scale. [5] In the case of two interval level variables, however, there are much more elegant techniques for doing so and we’ll be looking at those in the chapter on correlation and regression . If one were looking into the relationship between a nominal level variable (say, gender, with the categories male and female) [6] and an ordinal level variable (say, happiness with marriage with the three categories: very happy, happy, not so happy), one could simply use all the same techniques for determining the direction, strength, and generalizability we’ve discussed above.

If we chose to analyze the relationship between two ordinal level variables, however, we could still use crosstabulation, but we might want to use a more elegant way of determining direction and strength of relationship than by comparing percentages and seeing what Cramer’s V tells us. One very cool statistic used for determining the direction and strength of a relationship between two ordinal level variables is gamma . Unlike Cramer’s V and Yule’s Q, whose values only vary between 0.00 and 1.00, and therefore can only speak to the strength of a relationship, gamma’s possible values are between -1.00 and 1.00. This one statistic can tell us about both the direction and the strength of the relationship. Thus, a gamma of zero still means there is no relationship between the two variables. But a gamma with a positive sign not only reveals strength (a gamma of 0.30 indicates a stronger relationship than one of 0.10), but it also says that as values of the independent variable increase, so do values of the dependent variable. And a gamma with a negative sign not only reveals strength (a gamma of -0.30 indicates a stronger relationship than one of -0.10), but also says that as values of the independent variable increase, values of the dependent variable decrease. But what exactly do we mean by “values,” here?

Let’s explore a couple of examples from the GSS (via the Social Data Archive, or SDA ). Table 8 shows the relationship between the happiness of GSS respondents’ marriages (HAPMAR) and their general happiness (HAPPY) over the years. Using our earlier way of determining direction, we can see that 90 percent of those that are “very happy” generally are also happy in their marriages, while only 19.5 percent of those who are “not too happy” generally are pretty happy in their marriages. Pretty clear that marital happiness and general happiness are related, right?

Table 8. Crosstabulation of Marital Happiness and General Happiness, GSS data from SDA

The more elegant way is to look at the statistics at the bottom of the table. Most of these statistics aren’t helpful to us now. But one, gamma, certainly is. You’ll note that gamma is 0.75. There are two important attributes of this statistic: its sign (positive) and its magnitude (0.75). The former tells you that as coded values of marital happiness (1=very happy; 2=happy; 3=not so happy) go up, values of general happiness (1=very happy; 2=happy; 3=not so happy) tend to go up as well. We can interpret this by saying that respondents who are less happy with their marriages are likely to be less happy generally than others. (Notice that this also means that people who are happy in their marriages are also likely to be more generally happy than others.) But the 0.75, independent of the sign, means that this relationship is very strong. By the way, you might also notice that there is a little parenthetical expression at the end of the row gamma is on in the statistics box: (p=0.00). The “p” stands for the chances (probability) of making a Type 1 error, and is sometimes called the “p value” or the significance level. The fact that the “p value” here is 0.00 does NOT mean that there is zero chance of making an error if you infer that there is a relationship between marital happiness and general happiness in the larger population. There will always be such a chance. But the SDA printouts of such values give up after two digits to the right of the decimal point. All one can really say is that the chances of making a Type 1 error, then, are less than 0.01 (which itself is less than 0.05), and so researchers would conclude that they could reasonably generalize.

To emphasize the importance of the sign of gamma (+ or -), let’s have a look at Table 9, which displays the relationship between job satisfaction, whose coded values are 1=very dissatisfied; 2=a little dissatisfied; 3=moderately satisfied; 4=very satisfied, and general happiness, whose codes are the same as they were in Table 3.7. You can probably tell from looking at the internal percentages of the table that as job satisfaction increases so does general happiness, as one might expect. But the sign of the gamma of -0.43 might at first persuade you that there is a negative association between job satisfaction and happiness, until you remember that what it’s really telling you is that when the coded values of job satisfaction go up, from 1 (very dissatisfied) to 4 (very satisfied), the coded values of happiness go down, from 3 (not so happy) to 1 (very happy). Which really means that as job satisfaction goes up, happiness goes up as well, right? Note, however, that if we reversed the coding for the job satisfaction variable, so that 1 represented being very satisfied with your job while 4 represented being very dissatisfied, the direction of gamma would reverse. Thus, it is essential that data analysts do not stop at looking at whether gamma is positive or negative, but rather also ensure they understand the way the variable is coded (its attributes).
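The effect of coding on the sign of gamma can be illustrated with a short Python sketch. Two things here are assumptions of ours: the counts in the table are invented purely for illustration (they are not the GSS figures), and gamma is computed with the standard Goodman and Kruskal definition, (concordant pairs minus discordant pairs) divided by (concordant pairs plus discordant pairs), which this chapter does not spell out.

```python
def gamma(table):
    """Goodman and Kruskal's gamma for a table whose rows and columns are in coded order."""
    concordant = 0
    discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):  # only rows with a higher code
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:
                        concordant += pairs  # both codes go up together
                    elif m < j:
                        discordant += pairs  # one code goes up, the other down
    return (concordant - discordant) / (concordant + discordant)


# Invented counts. Rows: happiness coded 1 (very happy) to 3 (not so happy).
# Columns: job satisfaction coded 1 (very dissatisfied) to 4 (very satisfied).
made_up_table = [
    [5, 10, 20, 40],
    [10, 20, 30, 20],
    [20, 10, 5, 2],
]

g = gamma(made_up_table)

# Reverse the job satisfaction coding so that 1 = very satisfied.
recoded_table = [row[::-1] for row in made_up_table]
g_recoded = gamma(recoded_table)

print(g < 0, g_recoded > 0)  # True True
```

The two gammas have the same magnitude and opposite signs, even though the people and their answers have not changed at all; only the numbering of the categories has.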

Also note here that the 0.43 portion of the gamma tells you how strong this relationship is—it’s strong, but not as strong as the relationship between marital happiness and general happiness (which had a gamma of 0.75). The “p value” here again is .00, which means that it’s less than .01, which of course is less than .05, and we can infer that there’s very probably a relationship between job satisfaction and general happiness in the larger population from which this sample was drawn.

Table 9. Crosstabulation of Job Satisfaction and General Happiness, GSS data from SDA

We haven’t shown you the formula for gamma, but it’s not that difficult to compute. In fact, when you have a 2 x 2 table gamma is the same as Yule’s Q, except that it can take on both positive and negative values. Obviously, Yule’s Q could do that as well, if it weren’t for the absolute value symbols surrounding it. As a consequence, you can use gamma as a substitute for Yule’s Q for 2 x 2 tables when using the SDA interface to access GSS data—as long as you remember to take the absolute value of gamma that is calculated for you. Thus, in Table 10, showing the relationship between gender and whether or not a respondent was married, the absolute value of the reported gamma—that is, |-0.11|=0.11—is the Yule’s Q for the relationship. And it is clearly weak. By the way, the p value here, 0.07, indicates that we cannot safely infer that a similar relationship existed in the larger population in 2010.
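As a check on the claim that gamma and Yule’s Q agree for 2 x 2 tables, the Python sketch below counts pairs of cases directly for the 1, 3, 3, 1 pattern used earlier. It assumes the standard pair-counting definition of gamma and an arbitrary coding of the two variables as 1 and 2; the function name and the coding are ours.

```python
from itertools import combinations


def gamma_from_cases(cases):
    """Gamma computed by comparing every pair of (x, y) cases."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(cases, 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1  # the case higher on x is also higher on y
        elif product < 0:
            discordant += 1  # the case higher on x is lower on y
        # pairs tied on either variable are ignored
    return (concordant - discordant) / (concordant + discordant)


# Assumed coding: cell A = (1, 1), cell B = (2, 1), cell C = (1, 2), cell D = (2, 2)
a, b, c, d = 1, 3, 3, 1
cases = [(1, 1)] * a + [(2, 1)] * b + [(1, 2)] * c + [(2, 2)] * d

g = gamma_from_cases(cases)
signed_q = (a * d - b * c) / (a * d + b * c)

print(g, signed_q, abs(g))  # -0.8 -0.8 0.8
```

The sign depends entirely on how the categories happen to be coded, which is why we take the absolute value when using gamma as a stand-in for Yule’s Q.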

Table 10. Crosstabulation of Gender and Marital Status in 2010, GSS data from SDA

One problem with an SDA output is that none of the statistics reported (not the Eta, the R, the Tau-b, etc.) are actually designed to measure the strength of relationship between two purely nominal level variables—Cramer’s V and Yule’s Q, for instance, are not provided in the output. All of the measures that are provided, however, do have important uses. To learn more about these and other measures of association and the circumstances in which they should be used, see the chapter focusing on measures of association .

- independent variable
- dependent variable
- crosstabulation
- direction of a relationship
- strength of a relationship
- generalizability of relationship
- Type 1 error
- Type 2 error
- Pearson’s chi-square
- null hypothesis

Return to the Social Data Archive we’ve explored before. The data, again, are available at https://sda.berkeley.edu/ . Go down to the second full paragraph and click on the “SDA Archive” link you’ll find there. Then scroll down to the section labeled “General Social Surveys” and click on the first link there: General Social Survey (GSS) Cumulative Datafile 1972-2021 release.

Now type “hapmar” in the row box and “satjob” in the column box. Hit “output options” and find the “percentaging” options and make sure “column” is clicked. (Satjob will be our independent variable here, so we want column percentages.) Now click on “summary statistics,” under “other options.” Hit on “run the table,” examine the resulting printout and write a short paragraph in which you use gamma and the p-value to evaluate the hypothesis that people who are more satisfied with their jobs are more likely to be happily married than those who are less satisfied with their jobs. Your paragraph should mention the direction, strength and generalizability of the relationship as well as what determinations you can make in terms of the null and research hypotheses.

## Media Attributions

- A Mapping of the Hypothesis that Men Will Tend to be Taller than Women © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
- A Mapping of Kearney and Levine’s Hypothesis © Mikaila Mariel Lemonik Arthur is licensed under a CC BY-NC-SA (Attribution NonCommercial ShareAlike) license
- If one of your variables had three categories, it might be a “2 by 3” table. If both variables had 3 categories, you’d want a 3 by 3 table, etc. ↵
- Answers: In Sample 1, the direction of the relationship is the same as it was in Sample A (those who watched the show were less likely to get pregnant than those who didn’t), but its strength is greater (Yule’s Q = 1.00, rather than 0.80). In Sample 2, there is no direction of the relationship (those who watched the show were just as likely to get pregnant as those who didn’t) and its strength is as weak as it could be (Yule’s Q = 0.00). In Sample 3, the direction of the relationship is the opposite of what it was in Sample A. In this case, those who watched the show were more likely to get pregnant than those who didn’t. And the strength of the relationship was as strong as it could be (Yule’s Q = 1.00). ↵
- Can you double-check Roger’s calculation of chi-square for this arrangement to make sure he’s right? He’d appreciate the help. ↵
- Of course, with very large samples, like the entire General Social Survey (GSS) since it was begun, it is sometimes possible to uncover significant relationships—i.e., ones that almost surely exist in the larger population—that aren’t all that strong. Does that make sense? ↵
- You would generate some pretty gnarly tables that would be very hard to interpret, though. ↵
- While there are clearly more than two genders, we are at the mercy of the way the General Social Survey asked its questions in any given year, and thus for the examples presented in this text only data for males and females is available. While this is unfortunate, it's also an important lesson about the limits of existing survey data and the importance of ensuring proper survey question design. ↵

When certain categories of one variable are associated, or go together, with certain categories of the other variable(s).

A variable that may affect or influence another variable; the cause in a causal relationship.

A variable that is affected or influenced by (or depends on) another variable; the effect in a causal relationship.

A statement of the expected or predicted relationship between two or more variables.

An analytical method in which a bivariate table is created using discrete variables to show their relationship.

How categories of an independent variable are related to categories of a dependent variable.

A measure of how well we can predict the value or category of the dependent variable for any given unit in our sample based on knowing the value or category of the independent variable(s).

A measure of the strength of association used with binary variables.

The degree to which a finding based on data from a sample can be assumed to be true for the larger population from which the sample was drawn.

A measure of statistical significance used in crosstabulation to determine the generalizability of results.

The error made if one infers that a relationship exists in a larger population when it does not really exist; in other words, a false positive error.

The error you make when you do not infer a relationship exists in the larger population when it actually does exist; in other words, a false negative conclusion.

The total number of cases in a given row of a table.

The total number of cases in a given column of a table.

The number of cells in a table that can vary if we know something about the row and column totals of that table, calculated according to the formula (# of columns-1)*(# of rows-1).

A statistical measure that suggests that sample results can be generalized to the larger population, based on a low probability of having made a Type 1 error.

A measure of the direction and strength of a crosstabulated relationship between two ordinal-level variables.

The measure of statistical significance typically used in quantitative analysis. The lower the p value, the more likely you are to reject the null hypothesis.

The possible levels or response choices of a given variable.

Social Data Analysis Copyright © 2021 by Roger Clark is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


## 9.1: Introduction to Bivariate Data


- Rice University


Learning Objectives

- Define "bivariate data"
- Define "scatter plot"
- Distinguish between a linear and a nonlinear relationship
- Identify positive and negative associations from a scatter plot

Measures of central tendency, variability, and spread summarize a single variable by providing important information about its distribution. Often, more than one variable is collected on each individual. For example, in large health studies of populations it is common to obtain variables such as age, sex, height, weight, blood pressure, and total cholesterol on each individual. Economic studies may be interested in, among other things, personal income and years of education. As a third example, most university admissions committees ask for an applicant's high school grade point average and standardized admission test scores (e.g., SAT). In this chapter we consider bivariate data, which for now consists of two quantitative variables for each individual. Our first interest is in summarizing such data in a way that is analogous to summarizing univariate (single variable) data.

By way of illustration, let's consider something with which we are all familiar: age. Let’s begin by asking if people tend to marry other people of about the same age. Our experience tells us "yes," but how good is the correspondence? One way to address the question is to look at pairs of ages for a sample of married couples. Table \(\PageIndex{1}\) below shows the ages of \(10\) married couples. Going across the columns we see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives. This is no big surprise, but at least the data bear out our experiences, which is not always the case.

The pairs of ages in Table \(\PageIndex{1}\) are from a dataset consisting of \(282\) pairs of spousal ages, too many to make sense of from a table. What we need is a way to summarize the \(282\) pairs of ages. We know that each variable can be summarized by a histogram (see Figure \(\PageIndex{1}\)) and by a mean and standard deviation (see Table \(\PageIndex{2}\)).

Each distribution is fairly skewed with a long right tail. From Table \(\PageIndex{1}\) we see that not all husbands are older than their wives, and it is important to see that this fact is lost when we separate the variables. That is, even though we provide summary statistics on each variable, the pairing within each couple is lost by separating the variables. We cannot say, for example, based on the means alone what percentage of couples has younger husbands than wives. We have to count across pairs to find this out. Only by maintaining the pairing can meaningful answers be found about couples per se. Another example of information not available from the separate descriptions of husbands' and wives' ages is the mean age of husbands with wives of a certain age. For instance, what is the average age of husbands with \(45\)-year-old wives? Finally, we do not know the relationship between the husband's age and the wife's age.
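A small Python sketch can make the point about lost pairing concrete. The eight age pairs below are invented for illustration (they are not the values in the table or the \(282\)-pair dataset), and the helper name `share_husband_younger` is likewise an assumption. Re-pairing the same husbands with the same wives in a different order leaves both means unchanged but changes the share of couples with a younger husband, which is why that share cannot be read off the separate summaries.

```python
from statistics import mean

# Hypothetical (husband_age, wife_age) pairs, invented for illustration only.
couples = [(30, 28), (45, 47), (52, 50), (38, 38),
           (61, 55), (27, 29), (40, 36), (33, 31)]

def share_husband_younger(pairs):
    """Fraction of couples in which the husband is younger than the wife."""
    return sum(h < w for h, w in pairs) / len(pairs)

husbands = [h for h, _ in couples]
wives = [w for _, w in couples]

# Pair the same husbands with the same wives in reverse order. Each column,
# and therefore each mean, is unchanged; only the pairing differs.
repaired = list(zip(husbands, reversed(wives)))

print("mean husband age:", mean(husbands))
print("mean wife age:", mean(wives))
print("share with younger husband (original pairing):", share_husband_younger(couples))
print("share with younger husband (re-paired):", share_husband_younger(repaired))
```

With these made-up values the two pairings share identical univariate summaries yet give different answers to the question about younger husbands, so the answer has to come from counting across pairs.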

We can learn much more by displaying the bivariate data in a graphical form that maintains the pairing. Figure \(\PageIndex{2}\) shows a scatter plot of the paired ages. The \(x\)-axis represents the age of the husband and the \(y\)-axis the age of the wife.

There are two important characteristics of the data revealed by Figure \(\PageIndex{2}\). First, it is clear that there is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife. When one variable (\(Y\)) increases with the second variable (\(X\)), we say that \(X\) and \(Y\) have a positive association. Conversely, when \(Y\) decreases as \(X\) increases, we say that they have a negative association.

Second, the points cluster along a straight line. When this occurs, the relationship is called a linear relationship.

Figure \(\PageIndex{3}\) shows a scatter plot of Arm Strength and Grip Strength from \(149\) individuals working in physically demanding jobs including electricians, construction and maintenance workers, and auto mechanics. Not surprisingly, the stronger someone's grip, the stronger their arm tends to be. There is therefore a positive association between these variables. Although the points cluster along a line, they are not clustered quite as closely as they are for the scatter plot of spousal age.

Not all scatter plots show linear relationships. Figure \(\PageIndex{4}\) shows the results of an experiment conducted by Galileo on projectile motion. In the experiment, Galileo rolled balls down an incline and measured how far they traveled as a function of the release height. It is clear from Figure \(\PageIndex{4}\) that the relationship between "Release Height" and "Distance Traveled" is not described well by a straight line: If you drew a line connecting the lowest point and the highest point, all of the remaining points would be above the line. The data are better fit by a parabola.

D. Dickey and T. Arnold's description of the study including a movie
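The "connect the lowest and highest points" check described above can be sketched in a few lines of Python. The data here are not Galileo's measurements: they are hypothetical values generated from a curved rule (distance proportional to the square root of height) chosen only to illustrate the idea, and the function name `chord_residuals` is an assumption.

```python
def chord_residuals(points):
    """Vertical distance of each interior point from the straight line that
    joins the point with the smallest x and the point with the largest x.
    A positive value means the point lies above that line."""
    pts = sorted(points)
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    slope = (y1 - y0) / (x1 - x0)
    return [y - (y0 + slope * (x - x0)) for x, y in pts[1:-1]]

# Hypothetical (release_height, distance_traveled) pairs, NOT Galileo's data.
heights = (100, 200, 300, 450, 600, 800, 1000)
data = [(h, 20 * h ** 0.5) for h in heights]

residuals = chord_residuals(data)
print([round(r, 1) for r in residuals])
print("all interior points above the line:", all(r > 0 for r in residuals))
```

If every interior point falls on the same side of the line, as it does for these invented values, a straight line is a poor description of the relationship and a curve is worth considering. Residuals that scatter on both sides of the line would be more consistent with a linear relationship.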

Scatter plots that show linear relationships between variables can differ in several ways including the slope of the line about which they cluster and how tightly the points cluster about the line. A statistical measure of the strength of the relationship between two quantitative variables that takes these factors into account is the subject of the section "Values of Pearson's Correlation."

## Contributor

Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University.

- Rudy Guerra and David M. Lane

## 14.3: Bivariate Analysis

- Anol Bhattacherjee
- University of South Florida via Global Text Project

Bivariate analysis examines how two variables are related to one another. The most common bivariate statistic is the bivariate correlation (often simply called ‘correlation’), which is a number between -1 and +1 denoting the strength and direction of the relationship between two variables. Say that we wish to study how age is related to self-esteem in a sample of 20 respondents, i.e., as age increases, does self-esteem increase, decrease, or remain unchanged? If self-esteem increases, then we have a positive correlation between the two variables; if self-esteem decreases, then we have a negative correlation; and if it remains the same, we have a zero correlation. To calculate the value of this correlation, consider the hypothetical dataset shown in Table 14.1.
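As a minimal sketch of the calculation, the Python below applies the standard Pearson product-moment formula, \(r = \sum (x - \bar{x})(y - \bar{y}) / \sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}\). The eight age and self-esteem values are invented for illustration and are not the values of Table 14.1, and the function name `pearson_r` is an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical respondents (made-up values, not Table 14.1).
age = [21, 25, 30, 34, 40, 46, 52, 58]
self_esteem = [3.1, 3.4, 3.3, 3.8, 3.9, 4.1, 4.0, 4.4]

r = pearson_r(age, self_esteem)
print(f"r = {r:.3f}")
```

A value near +1 indicates that self-esteem tends to rise with age in these invented data, a value near -1 would indicate that it tends to fall, and a value near 0 would indicate little or no linear relationship.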

After computing bivariate correlation, researchers are often interested in knowing whether the correlation is significant (i.e., a real one) or caused by mere chance. Answering such a question would require testing the following hypothesis:

## Bivariate Analysis: Associations, Hypotheses, and Causal Stories

- Open Access
- First Online: 04 October 2022

- Mark Tessler

Part of the book series: SpringerBriefs in Sociology ((BRIEFSSOCY))

Every day, we encounter various phenomena that make us question how, why, and with what implications they vary. In responding to these questions, we often begin by considering bivariate relationships, meaning the way that two variables relate to one another. Such relationships are the focus of this chapter.

## 3.1 Description, Explanation, and Causal Stories

There are many reasons why we might be interested in the relationship between two variables. Suppose we observe that some of the respondents interviewed in Arab Barometer surveys and other surveys report that they have thought about emigrating, and we are interested in this variable. We may want to know how individuals’ consideration of emigration varies in relation to certain attributes or attitudes. In this case, our goal would be descriptive, sometimes described as the mapping of variance. Our goal may also or instead be explanation, such as when we want to know why individuals have thought about emigrating.

## Description

Description means that we seek to increase our knowledge and refine our understanding of a single variable by looking at whether and how it varies in relation to one or more other variables. Descriptive information makes a valuable contribution when the structure and variance of an important phenomenon are not well known, or not well known in relation to other important variables.

Returning to the example about emigration, suppose you notice that among Jordanians interviewed in 2018, 39.5 percent of the 2400 men and women interviewed reported that they have considered the possibility of emigrating.

Our objective may be to discover what these might-be migrants look like and what they are thinking. We do this by mapping the variance of emigration across attributes and orientations that provide some of this descriptive information, with the descriptions themselves each expressed as bivariate relationships. These relationships are also sometimes labeled “associations” or “correlations” since they are not considered causal relationships and are not concerned with explanation.

Of the 39.5 percent of Jordanians who told interviewers that they have considered emigrating, 57.3 percent are men and 42.7 percent are women. With respect to age, 34 percent are age 29 or younger and 19.2 percent are age 50 or older. These figures describe the age composition of those who have considered emigrating, and the share of younger respondents may seem lower than expected. Looked at the other way, however, 56 percent of the 575 men and women age 29 or younger have considered emigrating, well above the 39.5 percent for the sample as a whole. And with respect to destination, the Arab country most frequently mentioned by those who have considered emigration is the UAE, named by 17 percent, followed by Qatar at 10 percent and Saudi Arabia at 9.8 percent. Non-Arab destinations were mentioned more frequently, with Turkey named by 18.1 percent, Canada by 21.1 percent, and the U.S. by 24.2 percent.
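The age figures can be cross-checked with a little arithmetic, sketched here in Python. All inputs are the numbers quoted in the text; the variable names are assumptions, and the only added step is converting percentages into approximate head counts.

```python
# Figures quoted in the text for Jordanians interviewed in 2018.
total_respondents = 2400
pct_considered = 39.5             # percent of all respondents
pct_young_among_considered = 34   # percent of those who considered emigrating
young_respondents = 575           # respondents age 29 or younger
pct_considered_among_young = 56   # percent of the young who considered emigrating

# Convert each percentage into an approximate number of respondents.
considered = total_respondents * pct_considered / 100
young_considered_from_share = considered * pct_young_among_considered / 100
young_considered_from_rate = young_respondents * pct_considered_among_young / 100
pct_young_in_sample = 100 * young_respondents / total_respondents

print(f"considered emigrating: {considered:.0f}")
print(f"young and considered (34% of those): {young_considered_from_share:.0f}")
print(f"young and considered (56% of 575): {young_considered_from_rate:.0f}")
print(f"young as share of whole sample: {pct_young_in_sample:.1f}%")
```

Both routes arrive at roughly the same number of young respondents who have considered emigrating, so the 34 percent and the 56 percent are consistent with each other. They differ because they use different denominators: the first is a share of would-be emigrants, the second a share of young respondents.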

With the variables sex, age, and prospective destination added to the original variable, which is consideration of emigration, there are clearly more than two variables under consideration. But the variables are described two at a time and so each relationship is bivariate.

These bivariate relationships, between having considered emigration on the one hand and sex, age, and prospective destination on the other, provide descriptive information that is likely to be useful to analysts, policymakers, and others concerned with emigration. They tell, or begin to tell, as noted above, what might-be migrants look like and what they are thinking. Still additional insight may be gained by adding descriptive bivariate relationships for Jordanians interviewed in a different year to those interviewed in 2018. In addition, of course, still more information and possibly a more refined understanding, may be gained by examining the attributes and orientations of prospective emigrants who are citizens of other Arab (and perhaps also non-Arab) countries.

With a focus on description, these bivariate relationships are not constructed to shed light on explanation, that is to contribute to causal stories that seek to account for variance and tell why some individuals but not others have considered the possibility of emigrating. In fact, however, as useful as bivariate relationships that provide descriptive information may be, researchers usually are interested as much if not more in bivariate relationships that express causal stories and purport to provide explanations.

## Explanation and Causal Stories

There is a difference in the origins of bivariate relationships that seek to provide descriptive information and those that seek to provide explanatory information. The former can be thought to be responding to what questions: What characterizes potential emigrants? What do they look like? What are their thoughts about this or that subject? If the objective is description, a researcher collects and uses her data to investigate the relationship between two variables without a specific and firm prediction about the relationship between them. Rather, she simply wonders about the “what” questions listed above and believes that finding out the answers will be instructive. In this case, therefore, she selects the bivariate relationships to be considered based on what she thinks it will be useful to know, and not based on assessing the accuracy of a previously articulated causal story that specifies the strength and structure of the effect that one variable has on the other.

A researcher is often interested in causal stories and explanation, however, and this usually begins with thinking about the relationship between two variables, one of which is the presumed cause and the other the presumed effect. The presumed cause is the independent variable, and the presumed effect is the dependent variable. Offering evidence that there is a strong relationship between two variables is not sufficient to demonstrate that the variables are likely to be causally related, but it is a necessary first step. In this respect it is a point of departure for the fuller, probably multivariate, analysis required to persuasively argue that a relationship is likely to be causal. In addition, as discussed in Chap. 4, multivariate analysis often not only strengthens the case for inferring that a relationship is causal, but also provides a more elaborate and more instructive causal story. The foundation on which a multivariate analysis aimed at causal inference is built, however, is a bivariate relationship composed of a presumed independent variable and a presumed dependent variable.

A hypothesis that posits a causal relationship between two variables is not the same as a causal story, although the two are of course closely connected. The former specifies a presumed cause, a presumed determinant of variance on the dependent variable. It probably also specifies the structure of the relationship, such as linear as opposed to non-linear, or positive (also called direct) as opposed to negative (also called inverse).

On the other hand, a causal story describes in more detail what the researcher believes is actually taking place in the relationship between the variables in her hypothesis; and accordingly, why she thinks this involves causality. A causal story provides a fuller account of operative processes, processes that the hypothesis references but does not spell out. These processes may, for example, involve a pathway or a mechanism that tells how it is that the independent variable causes and thus accounts for some of the variance on the dependent variable. Expressed yet another way, the causal story describes the researcher’s understandings, or best guesses, about the real world, understandings that have led her to believe, and then propose for testing, that there is a causal connection between her variables that deserves investigation. The hypothesis itself does not tell this story; it is rather a short formulation that references and calls attention to the existence, or hypothesized existence, of a causal story. Research reports present the causal story as well as the hypothesis, as the hypothesis is often of limited interpretability without the causal story.

A causal story is necessary for causal inference. It enables the researcher to formulate propositions that purport to explain rather than merely describe or predict. There may be a strong relationship between two variables, and if this is the case, it will be possible to predict with relative accuracy the value, or score, of one variable from knowledge of the value, or score, of the other variable. Prediction is not explanation, however. To explain, or attribute causality, there must be a causal story to which a hypothesized causal relationship is calling attention.

An instructive illustration is provided by a recent study of Palestinian participation in protest activities that express opposition to Israeli occupation (see Footnote 1). There is plenty of variance on the dependent variable: There are many young Palestinians who take part in these activities, and there are many others who do not take part. Education is one of the independent variables that the researcher thought would be an important determinant of participation, and so she hypothesized that individuals with more education would be more likely to participate in protest activities than individuals with less education.

But why would the researcher think this? The answer is provided by the causal story. To the extent that this as yet untested story is plausible, or preferably, persuasive, at least in the eyes of the investigator, it gives the researcher a reason to believe that education is indeed a determinant of participation in protest activities in Palestine. By spelling out in some detail how and why the hypothesized independent variable, education in this case, very likely impacts a person’s decision about whether or not to protest, the causal story provides a rationale for the researcher’s hypothesis.

In the case of Palestinian participation in protest activities, another investigator offered an insightful causal story about the ways that education pushes toward greater participation, with emphasis on its role in communication and coordination (see Footnote 2). Schooling, as the researcher theorizes and subsequently tests, integrates young Palestinians into a broader institutional environment that facilitates mass mobilizations and lowers informational and organizational barriers to collective action. More specifically, she proposes that those individuals who have had at least a middle school education, compared to those who have not finished middle school, have access to better and more reliable sources of information, which, among other things, enables would-be protesters to assess risks. More schooling also makes would-be protesters better able to forge inter-personal relationships and establish networks that share information about needs, opportunities, and risks, and that in this way facilitate engaging in protest activities in groups, rather than on an individual basis. This study offers some additional insights to be discussed later.

The variance motivating the investigation of a causal story may be thought of as the “variable of interest,” and it may be either an independent variable or a dependent variable. It is a variable of interest because the way that it varies poses a question, or puzzle, that a researcher seeks to investigate. It is the dependent variable in a bivariate relationship if the researcher seeks to know why this variable behaves, or varies, as it does, and in pursuit of this objective, she will seek to identify the determinants and drivers that account for this variance. The variable of interest is an independent variable in a particular research project if the researcher seeks to know what difference it makes—on what does its variance have an impact, of what other variable or variables is it a driver or determinant.

The variable in which a researcher is initially interested, that is to say the variable of interest, can also be both a dependent variable and an independent variable. Returning to the variable pertaining to consideration of emigration, but this time with country as the unit of analysis, the variance depicted in Table 3.1 provides an instructive example. The data are based on Arab Barometer surveys conducted in 2018–2019, and the table shows that there is substantial variation across twelve countries. Taking the countries together, the mean percentage of citizens that have thought about relocating to another country is 30.25 percent. But in fact, there is very substantial variation around this mean. Kuwait is an outlier, with only 8 percent having considered emigration. There are also countries in which only 21 percent or 22 percent of the adult population have thought about this, figures that may be high in absolute terms but are low relative to other Arab countries. At the other end of the spectrum are countries in which 45 percent or even 50 percent of the citizens report having considered leaving their country and relocating elsewhere.

The very substantial variance shown in Table 3.1 invites reflection on both the causes and the consequences of this country-level variable, aggregate thinking about emigration. As a dependent variable, the cross-country variance raises the question of why the proportion of citizens that have thought about emigrating is higher in some countries than in others; and the search for an answer begins with the specification of one or more bivariate relationships, each of which links this dependent variable to a possible cause or determinant. As an independent variable, the cross-country variance raises the question of what difference it makes: of what is it a determinant or driver, and what are the consequences for a country if more of its citizens, rather than fewer, have thought about moving to another country?

## 3.2 Hypotheses and Formulating Hypotheses

Hypotheses emerge from the research questions to which a study is devoted. Accordingly, a researcher interested in explanation will have something specific in mind when she decides to hypothesize and then evaluate a bivariate relationship in order to determine whether, and if so how, her variable of interest is related to another variable. For example, if the researcher’s variable of interest is attitude toward gender equality and one of her research questions asks why some people support gender equality and others do not, she might formulate the hypothesis below to see if education provides part of the answer.

Hypothesis 1. Individuals who are better educated are more likely to support gender equality than are individuals who are less well-educated.

The usual case, and the preferred case, is for an investigator to be specific about the research questions she seeks to answer, and then to formulate hypotheses that propose for testing part of the answer to one or more of these questions. Sometimes, however, a researcher will proceed without formulating specific hypotheses based on her research questions. Sometimes she will simply look at whatever relationships between her variable of interest and a second variable her data permit her to identify and examine, and she will then follow up and incorporate into her study any findings that turn out to be significant and potentially instructive. This is sometimes described as allowing the data to “speak.” When this hit-or-miss strategy of trial and error is used in bivariate and multivariate analysis, findings that are significant and potentially instructive are sometimes described as “grounded theory.” Some researchers also describe the latter process as “inductive” and the former as “deductive.”

Although the inductive, atheoretical approach to data analysis might yield some worthwhile findings that would otherwise have been missed, it can sometimes prove misleading, as you may discover relationships between variables that happened by pure chance and are not instructive about the variable of interest or research question. Data analysis in research aimed at explanation should be, in most cases, preceded by the formulation of one or more hypotheses. In this context, when the focus is on bivariate relationships and the objective is explanation rather than description, each hypothesis will include a dependent variable and an independent variable and make explicit the way the researcher thinks the two are, or probably are, related. As discussed, the dependent variable is the presumed effect; its variance is what a hypothesis seeks to explain. The independent variable is the presumed cause; its impact on the variance of another variable is what the hypothesis seeks to determine.

Hypotheses are usually in the form of if-then, or cause-and-effect, propositions. They posit that if there is variance on the independent variable, the presumed cause, there will then be variance on the dependent variable, the presumed effect. This is because the former impacts the latter and causes it to vary.

An illustration of formulating hypotheses is provided by a study of voting behavior in seven Arab countries: Algeria, Bahrain, Jordan, Lebanon, Morocco, Palestine, and Yemen (see Footnote 3). The variable of interest in this individual-level study is electoral turnout, and prominent among the research questions is why some citizens vote and others do not. The dependent variable in the hypotheses proposed in response to this question is whether a person did or did not vote in the country’s most recent parliamentary election. The study initially proposed a number of hypotheses, which include the two listed here and which would later be tested with data from Arab Barometer surveys in the seven countries in 2006–2007. We will return to this illustration later in this chapter.

Hypothesis 1: Individuals who have used clientelist networks in the past are more likely to turn out to vote than are individuals who have not used clientelist networks in the past.

Hypothesis 2: Individuals with a positive evaluation of the economy are more likely to vote than are individuals with a negative evaluation of the economy.

Another example pertaining to voting, which this time is hypothetical but might be instructively tested with Arab Barometer data, considers the relationship between perceived corruption and turning out to vote at the individual level of analysis.

The normal expectation in this case would be that perceptions of corruption influence the likelihood of voting. Even here, however, competing causal relationships are plausible. More perceived corruption might increase the likelihood of voting, presumably to register discontent with those in power. But greater perceived corruption might also reduce the likelihood of voting, presumably in this case because the would-be voter sees no chance that her vote will make a difference. In this hypothetical case, moreover, even the direction of the causal connection might be ambiguous. If voting is complicated, cumbersome, and overly bureaucratic, it might be that the experience of voting plays a role in shaping perceptions of corruption. In cases like this, certain variables might be both independent and dependent variables, with causal influence pushing in both directions (often called “endogeneity”), and the researcher will need to carefully think through and be particularly clear about the causal story to which her hypothesis is designed to call attention.

The need to assess the accuracy of these hypotheses, or any others proposed to account for variance on a dependent variable, will guide and shape the researcher’s subsequent decisions about data collection and data analysis. Moreover, in most cases, the finding produced by data analysis is not a statement that the hypothesis is true or that the hypothesis is false. It is rather a statement that the hypothesis is probably true or it is probably false. And more specifically still, when testing a hypothesis with quantitative data, it is often a statement about the odds, or probability, that the researcher will be wrong if she concludes that the hypothesis is correct—if she concludes that the independent variable in the hypothesis is indeed a significant determinant of the variance on the dependent variable. The lower the probability of being wrong, of course, the more confident a researcher can be in concluding, and reporting, that her data and analysis confirm her hypothesis.
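One common way to put a number on such a probability is a permutation test, a technique this chapter does not itself name and which is offered here only as an illustrative sketch. It estimates how often chance alone would produce a gap at least as large as the one observed if the independent variable made no difference. The data are invented (1 = voted, 0 = did not vote, grouped by a hypothetical evaluation of the economy), and the function names are assumptions.

```python
import random

def diff_in_rates(group_a, group_b):
    """Difference between the share of 1s in group_a and in group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """One-sided p-value: share of random relabelings whose difference in
    rates is at least as large as the observed difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = diff_in_rates(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties.
        if diff_in_rates(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Hypothetical respondents: 30 of 40 with a positive evaluation of the
# economy voted, versus 20 of 40 with a negative evaluation.
positive_view = [1] * 30 + [0] * 10
negative_view = [1] * 20 + [0] * 20

observed_gap = diff_in_rates(positive_view, negative_view)
p_value = permutation_p_value(positive_view, negative_view)
print(f"observed gap in turnout: {observed_gap:.2f}")
print(f"estimated one-sided p-value: {p_value:.4f}")
```

A small p-value means that a gap this large would rarely arise by chance in these invented data, which is the sense in which a researcher can be more confident in reporting that the data support her hypothesis. It does not by itself establish that the relationship is causal.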

## Exercise 3.1

Hypotheses emerge from the research questions to which a study is devoted. Thinking about one or more countries with which you are familiar: (a) Identify the independent and dependent variables in each of the example research questions below. (b) Formulate at least one hypothesis for each question. Make sure to include your expectations about the directionality of the relationship between the two variables; is it positive/direct or negative/inverse? (c) In two or three sentences, describe a plausible causal story to which each of your hypotheses might call attention.

- Does religiosity affect people’s preference for democracy?
- Does preference for democracy affect the likelihood that a person will vote? (See Footnote 4.)

## Exercise 3.2

Since its establishment in 2006, the Arab Barometer has, as of spring 2022, conducted 68 social and political attitude surveys in the Middle East and North Africa. It has conducted one or more surveys in 16 different Arab countries, and it has recorded the attitudes, values, and preferences of more than 100,000 ordinary citizens.

The Arab Barometer website (arabbarometer.org) provides detailed information about the Barometer itself and about the scope, methodology, and conduct of its surveys. Data from the Barometer’s surveys can be downloaded in SPSS, Stata, or CSV format. The website also contains numerous reports, articles, and summaries of findings.

In addition, the Arab Barometer website contains an Online Data Analysis Tool that makes it possible, without downloading any data, to find the distribution of responses to any question asked in any country in any wave. The tool is found in the “Survey Data” menu. After selecting the country and wave of interest, click the “See Results” tab to select the question(s) for which you want to see the response distributions. Click the “Cross by” tab to see the distributions of respondents who differ on one of the available demographic attributes.

The charts below present, in percentages, the response distributions of Jordanians interviewed in 2018 to two questions about gender equality. Below the charts are questions that you are asked to answer. These questions pertain to formulating hypotheses and to the relationship between hypotheses and causal stories.

1. For each of the two distributions, do you think (hypothesize) that the attitudes of Jordanian women are:

   - About the same as those of Jordanian men
   - More favorable toward gender equality than those of Jordanian men
   - Less favorable toward gender equality than those of Jordanian men

2. For each of the two distributions, do you think (hypothesize) that the attitudes of younger Jordanians are:

   - About the same as those of older Jordanians
   - More favorable toward gender equality than those of older Jordanians
   - Less favorable toward gender equality than those of older Jordanians

3. Restate your answers to Questions 1 and 2 as hypotheses.

4. Give the reasons for your answers to Questions 1 and 2. In two or three sentences, make explicit the presumed causal story on which your hypotheses are based.

5. Using the Arab Barometer’s Online Analysis Tool, check to see whether your answers to Questions 1 and 2 are correct. For those instances in which an answer is incorrect, suggest in a sentence or two a causal story on which the correct relationship might be based.

6. In which other country surveyed by the Arab Barometer in 2018 do you think the distributions of responses to the questions about gender equality are very similar to the distributions in Jordan? What attributes of Jordan and the other country informed your selection of the other country?

7. In which other country surveyed by the Arab Barometer in 2018 do you think the distributions of responses to the questions about gender equality are very different from the distributions in Jordan? What attributes of Jordan and the other country informed your selection of the other country?

8. Using the Arab Barometer’s Online Analysis Tool, check to see whether your answers to Questions 6 and 7 are correct. For those instances in which an answer is incorrect, suggest in a sentence or two a causal story on which the correct relationship might be based.

We will shortly return to and expand the discussion of probabilities and of hypothesis testing more generally. First, however, some additional discussion of hypothesis formulation is in order. Three important topics will be briefly considered. The first concerns the origins of hypotheses; the second concerns the criteria by which the value of a particular hypothesis or set of hypotheses should be evaluated; and the third, requiring a bit more discussion, concerns the structure of the hypothesized relationship between an independent variable and a dependent variable, or between any two variables that are hypothesized to be related.

## Origins of Hypotheses

Where do hypotheses come from? How should an investigator identify independent variables that may account for much, or at least some, of the variance on a dependent variable that she has observed and in which she is interested? Or, how should an investigator identify dependent variables whose variance has been determined, presumably only in part, by an independent variable whose impact she deems it important to assess?

Previous research is one place the investigator may look for ideas that will shape her hypotheses and the associated causal stories. This may include previous hypothesis-testing research, and this can be particularly instructive, but it may also include less systematic and structured observations, reports, and testimonies. The point, very simply, is that the investigator almost certainly is not the first person to think about, and offer information and insight about, the topic and questions in which the researcher herself is interested. Accordingly, attention to what is already known will very likely give the researcher some guidance and ideas as she strives for originality and significance in delineating the relationship between the variables in which she is interested.

Consulting previous research will also enable the researcher to determine what her study will add to what is already known—what it will contribute to the collective and cumulative work of researchers and others who seek to reduce uncertainty about a topic in which they share an interest. Perhaps the researcher’s study will fill an important gap in the scientific literature. Perhaps it will challenge and refine, or perhaps even place in doubt, distributions and explanations of variance that have thus far been accepted. Or perhaps her study will produce findings that shed light on the generalizability or scope conditions of previously accepted variable relationships. It need not do any of these things, but that will be for the researcher to decide, and her decision will be informed by knowledge of what is already known and reflection on whether and in what ways her study should seek to add to that body of knowledge.

Personal experience will also inform the researcher’s search for meaningful and informative hypotheses. It is almost certainly the case that a researcher’s interest in a topic in general, and in questions pertaining to this topic in particular, have been shaped by her own experience. The experience itself may involve many different kinds of connections or interactions, some more professional and work-related and some flowing simply and perhaps unintentionally from lived experience. The hypotheses about voting mentioned earlier, for example, might be informed by elections the researcher has witnessed and/or discussions with friends and colleagues about elections, their turnout, and their fairness. Or perhaps the researcher’s experience in her home country has planted questions about the generalizability of what she has witnessed at home.

All of this is to some extent obvious. But the take-away is that an investigator should not endeavor to set aside what she has learned about a topic in the name of objectivity, but rather, she should embrace whatever personal experience has taught her as she selects and refines the puzzles and propositions she will investigate. Should it happen that her experience leads her to incorrect or perhaps distorted understandings, this will be brought to light when her hypotheses are tested. It is in the testing that objectivity is paramount. In hypothesis formation, by contrast, subjectivity is permissible, and, in fact, it may often be unavoidable.

A final arena in which an investigator may look for ideas that will shape her hypotheses overlaps with personal experience and is also to some extent obvious. This is referenced by terms like creativity and originality and is perhaps best captured by the term “sociological imagination.” The take-away here is that hypotheses that deserve attention and, if confirmed, will provide important insights, may not all be somewhere out in the environment waiting to be found, either in the relevant scholarly literature or in recollections about relevant personal experience. They can and sometimes will be the product of imagination and wondering, of discernments that a researcher may come upon during moments of reflection and deliberation.

As in the case of personal experience, the point to be retained is that hypothesis formation may not only be a process of discovery, of finding the previous research that contains the right information. Hypothesis formation may also be a creative process, a process whereby new insights and proposed original understandings are the product of an investigator’s intellect and sociological imagination.

## Crafting Valuable Hypotheses

What are the criteria by which the value of a hypothesis or set of hypotheses should be evaluated? What elements define a good hypothesis? Some of the answers to these questions that come immediately to mind pertain to hypothesis testing rather than hypothesis formation. A good hypothesis, it might be argued, is one that is subsequently confirmed. But whether or not a confirmed hypothesis makes a positive contribution depends on the nature of the hypothesis and goals of the research. It is possible that a researcher will learn as much, and possibly even more, from findings that lead to rejection of a hypothesis. In any event, findings, whatever they may be, are valuable only to the extent that the hypothesis being tested is itself worthy of study.

Two important considerations, albeit somewhat obvious ones, are that a hypothesis should be non-trivial and non-obvious. If a proposition is trivial, suggesting a variable relationship with little or no significance, discovering whether and how the variables it brings together are related will not make a meaningful contribution to knowledge about the determinants and/or impact of the variance at the heart of the researcher’s concern. Few will be interested in findings, however rigorously derived, about a trivial proposition. The same is true of an obvious hypothesis, obvious being an attribute that makes a proposition trivial. As stated, these considerations are themselves somewhat obvious, barely deserving mention. Nevertheless, an investigator should self-consciously reflect on these criteria when formulating hypotheses. She should be sure that she is proposing variable relationships that are neither trivial nor obvious.

A third criterion, also somewhat obvious but nonetheless essential, has to do with the significance and salience of the variables being considered. Will findings from research about these variables be important and valuable, and perhaps also useful? If the primary variable of interest is a dependent variable, meaning that the primary goal of the research is to account for variance, then the significance and salience of the dependent variable will determine the value of the research. Similarly, if the primary variable of interest is an independent variable, meaning that the primary goal of the research is to determine and assess impact, then the significance and salience of the independent variable will determine the value of the research.

These three criteria—non-trivial, non-obvious, and variable importance and salience—are not very different from one another. They collectively mean that the researcher must be able to specify why and how the testing of her hypothesis, or hypotheses, will make a contribution of value. Perhaps her propositions are original or innovative; perhaps knowing whether they are true or false makes a difference or will be of practical benefit; perhaps her findings add something specific and identifiable to the body of existing scholarly literature on the subject. While calling attention to these three connected and overlapping criteria might seem unnecessary since they are indeed somewhat obvious, it remains the case that the value of a hypothesis, regardless of whether or not it is eventually confirmed, is itself important to consider, and an investigator should, therefore, know and be able to articulate the reasons and ways that consideration of her hypothesis, or hypotheses, will indeed be of value.

## Hypothesizing the Structure of a Relationship

Relevant in the process of hypothesis formation are, as discussed, questions about the origins of hypotheses and the criteria by which the value of any particular hypothesis or set of hypotheses will be evaluated. Relevant, too, is consideration of the structure of a hypothesized variable relationship and the causal story to which that relationship is believed to call attention.

The point of departure in considering the structure of a hypothesized variable relationship is an understanding that such a relationship may or may not be linear. In a direct, or positive, linear relationship, each increase in the independent variable brings a constant increase in the dependent variable. In an inverse, or negative, linear relationship, each increase in the independent variable brings a constant decrease in the dependent variable. But these are only two of the many ways that an independent variable and a dependent variable may be related, or hypothesized to be related. This is easily illustrated by hypotheses in which level of education or age is the independent variable, and this is relevant in hypothesis formation because the investigator must be alert to and consider the possibility that the variables in which she is interested are in fact related in a non-linear way.

Consider, for example, the relationship between age and support for gender equality, the latter measured by an index based on several questions about the rights and behavior of women that are asked in Arab Barometer surveys. A researcher might expect, and might therefore want to hypothesize, that an increase in age brings increased support for, or alternatively increased opposition to, gender equality. But these are not the only possibilities. Likely, perhaps, is the possibility of a curvilinear relationship, in which case increases in age bring increases in support for gender equality until a person reaches a certain age, maybe 40, 45, or 50, after which additional increases in age bring decreases in support for gender equality. Or the researcher might hypothesize that the curve is in the opposite direction, that support for gender equality initially decreases as a function of age until a particular age is reached, after which additional increases in age bring an increase in support.

Of course, there are also other possibilities. In the case of education and gender equality, for example, increased education may initially have no impact on attitudes toward gender equality. Individuals who have not finished primary school, those who have finished primary school, and those who have gone somewhat beyond primary school and completed a middle school program may all have roughly the same attitudes toward gender equality. Thus, increases in education, within a certain range of educational levels, are not expected to bring an increase or a decrease in support for gender equality. But the level of support for gender equality among high school graduates may be higher and among university graduates may be higher still. Accordingly, in this hypothetical illustration, an increase in education does bring increased support for gender equality but only beginning after middle school.

A middle school level of education is a “floor” in this example. Education does not begin to make a difference until this floor is reached, and thereafter it does make a difference, with increases in education beyond middle school bringing increases in support for gender equality. Another possibility might be for middle school to be a “ceiling.” This would mean that increases in education through middle school would bring increases in support for gender equality, but the trend would not continue beyond middle school. In other words, level of education makes a difference and appears to have explanatory power only until, and so not after, this ceiling is reached. This latter pattern was found in the study of education and Palestinian protest activity discussed earlier. Increases in education through middle school brought increases in the likelihood that an individual would participate in demonstrations and protests of Israeli occupation. However, additional education beyond middle school was not associated with greater likelihood of taking part in protest activities.
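The structures described above (linear, curvilinear, floor, and ceiling) can be made concrete with a short sketch. Everything numeric here is an assumption chosen for illustration: the turning point at age 45, the coding of education into five levels with middle school as level 3, and all slopes and baseline values. None of these are estimates from Arab Barometer data.

```python
def linear(x, slope=1.0, intercept=0.0):
    """Linear relationship: positive if slope > 0, negative if slope < 0."""
    return intercept + slope * x


def curvilinear(age, peak_age=45.0, peak_value=1.0, curvature=0.001):
    """Inverted U: the outcome rises until peak_age and falls afterwards."""
    return peak_value - curvature * (age - peak_age) ** 2


def floor_effect(education, floor=3, base=0.0, slope=0.5):
    """No change up to the floor level, then a constant gain per level."""
    return base + slope * max(0, education - floor)


def ceiling_effect(education, ceiling=3, base=0.0, slope=0.5):
    """A constant gain per level up to the ceiling, then no further change."""
    return base + slope * min(education, ceiling)


# Hypothetical coding: 1 = less than primary, 2 = primary, 3 = middle school,
# 4 = high school, 5 = university.
levels = range(1, 6)
floor_values = [floor_effect(e) for e in levels]
ceiling_values = [ceiling_effect(e) for e in levels]
print("floor:  ", floor_values)
print("ceiling:", ceiling_values)
```

In the floor pattern the values stay flat through level 3 and rise afterwards; in the ceiling pattern they rise through level 3 and stay flat afterwards.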

This discussion of variation in the structure of a hypothesized relationship between two variables is certainly not exhaustive, and the examples themselves are straightforward and not very complicated. The purpose of the discussion is, therefore, to emphasize that an investigator must be open to and think through the possibility and plausibility of different kinds of relationships between her two variables, that is to say, relationships with different structures. Bivariate relationships with several different kinds of structures are depicted visually by the scatter plots in Fig. 3.4.

These possibilities with respect to structure do not determine the value of a proposed hypothesis. As discussed earlier, the value of a proposed relationship depends first and foremost on the importance and salience of the variable of interest. Accordingly, a researcher should not assume that the value of a hypothesis varies as a function of the degree to which it posits a complicated variable relationship. More complicated hypotheses are not necessarily better or more correct. But while she should not strive for or give preference to variable relationships that are more complicated simply because they are more complicated, she should, again, be alert to the possibility that a more complicated pattern does a better job of describing the causal connection between the two variables in the place and time in which she is interested.

This brings the discussion of formulating hypotheses back to our earlier account of causal stories. In research concerned with explanation and causality, a hypothesis for the most part is a simplified stand-in for a causal story. It represents the causal story, as it were. Expressing this differently, the hypothesis states the causal story’s “bottom line;” it posits that the independent variable is a determinant of variance on the dependent variable, and it identifies the structure of the presumed relationship between the independent variable and the dependent variable. But it does not describe the interaction between the two variables in a way that tells consumers of the study why the researcher believes that the relationship involves causality rather than an association with no causal implications. This is left to the causal story, which will offer a fuller account of the way the presumed cause impacts the presumed effect.

## 3.3 Describing and Visually Representing Bivariate Relationships

Once a researcher has collected or otherwise obtained data on the variables in a bivariate relationship she wishes to examine, her first step will be to describe the variance on each of the variables using the univariate statistics described in Chap. 2. She will need to understand the distribution on each variable before she can understand how these variables vary in relation to one another. This is important whether she is interested in description or wishes to explore a bivariate causal story.

Once she has described each one of the variables, she can turn to the relationship between them. She can prepare and present a visual representation of this relationship, which is the subject of the present section. She can also use bivariate statistical tests to assess the strength and significance of the relationship, which is the subject of the next section of this chapter.

## Contingency Tables

Contingency tables are used to display the relationship between two categorical variables. They are similar to the univariate frequency distributions described in Chap. 2, the difference being that they juxtapose the two univariate distributions and display the interaction between them. Contingency tables are also called cross-tabulation tables, and their cells may present frequencies, row percentages, column percentages, and/or total percentages. Total frequencies and/or percentages are displayed in a total row and a total column, each one of which is the same as the univariate distribution of one of the variables taken alone.

Table 3.2, based on Palestinian data from Wave V of the Arab Barometer, crosses gender and the average number of hours watching television each day. Frequencies are presented in the cells of the table. In the cell showing the number of Palestinian men who do not watch television at all, row percentage, column percentage, and total percentage are also presented. Note that total percentage is based on the 10 cells showing the two variables taken together, which are summed in the lower right-hand cell. Thus, total percent for this cell is 342/2488 = 13.7 percent. Only frequencies are given in the other cells of the table; but in a full table, these four figures (frequency, row percent, column percent, and total percent) would be given in every cell.
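The three percentages differ only in their denominator: the row total, the column total, or the grand total. The sketch below shows the arithmetic for one cell. The function name is hypothetical, and the counts are invented for illustration; they are not the figures in Table 3.2.

```python
def cell_percentages(table, row, col):
    """Return (row %, column %, total %) for one cell of a table of counts.

    `table` is a list of rows; each row is a list of cell frequencies.
    """
    cell = table[row][col]
    row_total = sum(table[row])
    col_total = sum(r[col] for r in table)
    grand_total = sum(sum(r) for r in table)
    return (
        100 * cell / row_total,    # share of the cell within its row
        100 * cell / col_total,    # share of the cell within its column
        100 * cell / grand_total,  # share of the cell within the whole table
    )


# Invented counts (NOT Table 3.2): two groups crossed with three categories.
counts = [
    [30, 50, 20],
    [10, 60, 30],
]
row_pct, col_pct, total_pct = cell_percentages(counts, row=0, col=0)
print(round(row_pct, 1), round(col_pct, 1), round(total_pct, 1))
```

The same arithmetic reproduces the one total percentage reported in the text: 100 × 342 / 2488 rounds to 13.7.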

## Exercise 3.3

Compute the row percentage, the column percentage, and the total percentage in the cell showing the number of Palestinian women who do not watch television at all.

Describe the relationship between gender and watching television among Palestinians that is shown in the table. Do the television watching habits of Palestinian men and women appear to be generally similar or fairly different? You might find it helpful to convert the frequencies in other cells to row or column percentages.

## Stacked Column Charts and Grouped Bar Charts

Stacked column charts and grouped bar charts are used to visually describe how two categorical variables, or one categorical and one continuous variable, relate to one another. Much like contingency tables, they show the percentage or count of each category of one variable within each category of the second variable. This information is presented in columns stacked on each other or next to each other. The charts below show the number of male Palestinians and the number of female Palestinians who watch television for a given number of hours each day. Each chart presents the same information as the other chart and as the contingency table shown above (Fig. 3.1).

Fig. 3.1 Stacked column charts and grouped bar charts comparing Palestinian men and Palestinian women on hours watching television

## Box Plots and Box and Whisker Plots

Box plots, box and whisker plots, and other types of plots can also be used to show the relationship between one categorical variable and one continuous variable. They are particularly useful for showing how spread out the data are. Box plots show five important numbers in a variable’s distribution: the minimum value; the median; the maximum value; and the first and third quartiles (Q1 and Q3), which represent, respectively, the number below which are 25 percent of the distribution’s values and the number below which are 75 percent of the distribution’s values. The minimum value is sometimes called the lower extreme, the lower bound, or the lower hinge. The maximum value is sometimes called the upper extreme, the upper bound, or the upper hinge. The middle 50 percent of the distribution, the range between Q1 and Q3 that represents the “box,” constitutes the interquartile range (IQR). In box and whisker plots, the “whiskers” are the short perpendicular lines extending outside the upper and lower quartiles. They are included to indicate variability below Q1 and above Q3. Values are usually categorized as outliers if they are less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR. A visual explanation of a box and whisker plot is shown in Fig. 3.2a and an example of a box plot that uses actual data is shown in Fig. 3.2b.
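The five numbers and the 1.5 × IQR rule can be computed directly. The sketch below is one way to do so using Python's standard library. The ages are invented for illustration, and the quartile convention (the `inclusive` method of `statistics.quantiles`) is an assumption: statistical packages use several slightly different conventions, so quartiles from other software may differ a little.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary, IQR, and outliers under the 1.5 x IQR rule."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Invented ages, for illustration only.
ages = [18, 22, 25, 27, 30, 34, 38, 41, 45, 90]
summary = box_plot_summary(ages)
for name, value in summary.items():
    print(name, value)
```

With these invented values, the age of 90 lies above the upper fence and is the only value flagged as an outlier.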

The box plot in Fig. 3.2b uses Wave V Arab Barometer data from Tunisia and shows the relationship between age, a continuous variable, and interpersonal trust, a dichotomous categorical variable. The line representing the median value is shown in bold. Interpersonal trust, sometimes known as generalized trust, is an important personal value. Previous research has shown that social harmony and prospects for democracy are greater in societies in which most citizens believe that their fellow citizens for the most part are trustworthy. Although the interpersonal trust variable is dichotomous in Fig. 3.2b, the variance in interpersonal trust can also be measured by a set of ordered categories or a scale that yields a continuous measure, the latter not being suitable for presentation by a box plot. Figure 3.2b shows that the median age of Tunisians who are trusting is slightly higher than the median age of Tunisians who are mistrustful of other people. Notice also that the box plot for the mistrustful group has an outlier.

Fig. 3.2 (a) A box and whisker plot. (b) Box plot comparing the ages of trusting and mistrustful Tunisians in 2018

## Line Plots

Line plots may be used to visualize the relationship between two continuous variables or a continuous variable and a categorical variable. They are often used when time, or a variable related to time, is one of the two variables. If a researcher wants to show whether and how a variable changes over time for more than one subgroup of the units about which she has data (looking at men and women separately, for example), she can include multiple lines on the same plot, with each line showing the pattern over time for a different subgroup. These lines will generally be distinguished from each other by color or pattern, with a legend provided for readers.

Line plots are a particularly good way to visualize a relationship if an investigator thinks that important events over time may have had a significant impact. The line plot in Fig. 3.3 shows the average support for gender equality among men and among women in Tunisia from 2013 to 2018. Support for gender equality is a scale based on four questions related to gender equality in the three waves of the Arab Barometer. An answer supportive of gender equality on a question adds +.5 to the scale and an answer unfavorable to gender equality adds −.5 to the scale. Accordingly, a scale score of 2 indicates maximum support for gender equality and a scale score of −2 indicates maximum opposition to gender equality.
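The scoring rule just described can be sketched in a few lines. The function name and the representation of answers as booleans are assumptions made for illustration. How missing or neutral answers are handled is not stated here, so the sketch simply requires four valid answers.

```python
def gender_equality_scale(answers):
    """Score four answers: +0.5 for each supportive answer, -0.5 otherwise.

    `answers` is a sequence of four booleans (True = supportive of gender
    equality). Requiring exactly four valid answers is an assumption.
    """
    if len(answers) != 4:
        raise ValueError("expected answers to exactly four questions")
    return sum(0.5 if supportive else -0.5 for supportive in answers)


print(gender_equality_scale([True, True, True, True]))      # maximum support
print(gender_equality_scale([True, True, False, False]))    # evenly split
print(gender_equality_scale([False, False, False, False]))  # maximum opposition
```

A respondent's score therefore moves in steps of 1 between −2 and 2, and the group averages plotted in a line plot fall somewhere within that range.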

Fig. 3.3 Line plot showing level of support for gender equality among Tunisian women and men in 2013, 2016, and 2018

## Scatter Plots

Scatter plots are used to visualize a bivariate relationship when both variables are numerical. The independent variable is put on the x-axis, the horizontal axis, and the dependent variable is put on the y-axis, the vertical axis. Each data point becomes a dot in the scatter plot’s two-dimensional field, with its precise location being the point at which its value on the x-axis intersects with its value on the y-axis. The scatter plot shows how the variables are related to one another, including with respect to linearity, direction, and other aspects of structure. The scatter plots in Fig. 3.4 illustrate a strong positive linear relationship, a moderately strong negative linear relationship, a strong non-linear relationship, and a pattern showing no relationship (Footnote 5). If the scatter plot displays no visible and clear pattern, as in the lower left-hand plot shown in Fig. 3.4, the scatter plot would indicate that the independent variable, by itself, has no meaningful impact on the dependent variable.

Fig. 3.4 Scatter plots showing bivariate relationships with different structures

Scatter plots are also a good way to identify outliers—data points that do not follow a pattern that characterizes most of the data. These are also called non-scalar types. Figure 3.5 shows a scatter plot with outliers.

Outliers can be informative, making it possible, for example, to identify the attributes of cases for which the measures of one or both variables are unreliable and/or invalid. Nevertheless, the inclusion of outliers may not only distort the assessment of measures, raising unwarranted doubts about measures that are actually reliable and valid for the vast majority of cases, they may also bias bivariate statistics and make relationships seem weaker than they really are for most cases. For this reason, researchers sometimes remove outliers prior to testing a hypothesis. If one does this, it is important to have a clear definition of what is an outlier and to justify the removal of the outlier, both using the definition and perhaps through substantive analysis. There are several mathematical formulas for identifying outliers, and researchers should be aware of these formulas and their pros and cons if they plan to remove outliers.

If there are relatively few outliers, perhaps no more than 5–10 percent of the cases, it may be justifiable to remove them in order to better discern the relationship between the independent variable and the dependent variable. If outliers are much more numerous, however, it is probably because there is not a significant relationship between the two variables being considered. The researcher might in this case find it instructive to introduce a third variable and disaggregate the data. Disaggregation will be discussed in Chap. 4.

Fig. 3.5 A scatter plot with outliers marked in red

## Exercise 3.4 Exploring Hypotheses through Visualizing Data: Exercise with the Arab Barometer Online Analysis Tool

Go to the Arab Barometer Online Analysis Tool (https://www.arabbarometer.org/survey-data/data-analysis-tool/)

Select Wave V and a country that interests you

Select “See Results”

Select “Social, Cultural and Religious topics”

Select “Religion: frequency: pray”

Questions: What does the distribution of this variable look like? How would you describe the variance?

Click on “Cross by,” then

Select “Show all variables”

Select “Kind of government preferable” and click

Select “Options,” then “Show % over Row total,” then “Apply”

Questions: Does there seem to be a relationship between religiosity and preference for democracy? If so, what might explain the relationship you observe—what is a plausible causal story? Is it consistent with the hypothesis you wrote for Exercise 3.1?

What other variables could be used to measure religiosity and preference for democracy? Explore your hypothesis using different items from the list of Arab Barometer variables.

Do these distributions support the previous results you found? Do you learn anything additional about the relationship between religiosity and preference for democracy?

Now it is your turn to explore variables and variable relationships that interest you!

Pick two variables that interest you from the list of Arab Barometer variables. Are they continuous or categorical? Ordinal or nominal? (Hint: Most Arab Barometer variables are categorical, even if you might be tempted to think of them as continuous. For example, age is divided into the ordinal categories 18–29, 30–49, and 50 and more.)

Do you expect there to be a relationship between the two variables? If so, what do you think will be the structure of that relationship, and why?

Select the wave (year) and the country that interest you

Select one of your two variables of interest

Click on “Cross by,” and then select your second variable of interest.

On the left side of the page, you’ll see a contingency table. On the right side at the top, you’ll see several options to graphically display the relationship between your two variables. Which type of graph best represents the relationship between your two variables of interest?

Do the two variables seem to be independent of each other, or do you think there might be a relationship between them? Is the relationship you see similar to what you had expected?

## 3.4 Probabilities and Type I and Type II Errors

As in visual presentations of bivariate relationships, selecting the appropriate measure of association or bivariate statistical test depends on the types of the two variables. The data on both variables may be categorical; the data on both may be continuous; or the data may be categorical on one variable and continuous on the other variable. These characteristics of the data will guide the way in which our presentation of these measures and tests is organized. Before briefly describing some specific measures of association and bivariate statistical tests, however, it is necessary to lay a foundation by introducing a number of terms and concepts. Relevant here are the distinction between population and sample and the notions of the null hypothesis, of Type I and Type II errors, and of probabilities and confidence intervals. As concepts, or abstractions, these notions may influence the way a researcher thinks about drawing conclusions about a hypothesis from qualitative data, as was discussed in Chap. 2. In their precise meaning and application, however, these terms and concepts come into play when hypothesis testing involves the statistical analysis of quantitative data.

To begin, it is important to distinguish between, on the one hand, the population of units—individuals, countries, ethnic groups, political movements, or any other unit of analysis—in which the researcher is interested and about which she aspires to advance conclusions and, on the other hand, the units on which she has actually acquired the data to be analyzed. The latter, the units on which she actually has data, is her sample. In cases where the researcher has collected or obtained data on all of the units in which she is interested, there is no difference between the sample and the population, and drawing conclusions about the population based on the sample is straightforward. Most often, however, a researcher does not possess data on all of the units that make up the population in which she is interested, and so the possibility of error when making inferences about the population based on the analysis of data in the sample requires careful and deliberate consideration.

This concern for error is present regardless of the size of the sample and the way it was constructed. The likelihood of error declines as the size of the sample increases and thus comes closer to representing the full population. It also declines if the sample was constructed in accordance with random or other sampling procedures designed to maximize representation. It is useful to keep these criteria in mind when looking at, and perhaps downloading and using, Arab Barometer data. The Barometer’s website gives information about the construction of each sample. But while it is possible to reduce the likelihood of error when characterizing the population from findings based on the sample, it is not possible to eliminate entirely the possibility of erroneous inference. Accordingly, a researcher must endeavor to make the likelihood of this kind of error as small as possible and then decide if it is small enough to advance conclusions that apply to the population as well as the sample.
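The claim that error shrinks as the sample grows, without ever reaching zero, can be illustrated with a small simulation. The sketch below assumes simple random sampling from a population in which the true share is 50 percent. That share, the sample sizes, and the number of trials are all arbitrary choices for illustration, and real survey designs such as the Arab Barometer's are more complex than simple random sampling.

```python
import random


def mean_absolute_error(sample_size, true_share=0.5, trials=200, seed=0):
    """Average gap between the sample share and the true population share.

    Each trial draws `sample_size` units at random from a population in
    which a fraction `true_share` has the attribute of interest.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_error = 0.0
    for _ in range(trials):
        hits = sum(rng.random() < true_share for _ in range(sample_size))
        total_error += abs(hits / sample_size - true_share)
    return total_error / trials


errors = {n: mean_absolute_error(n) for n in (50, 500, 5000)}
for n, err in errors.items():
    print(f"sample size {n:>5}: average error {err:.4f}")
```

The average error falls as the sample size rises, but it remains above zero even for the largest sample, which is the point made in the paragraph above.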

The null hypothesis, frequently designated as H0, is a statement to the effect that there is no meaningful and significant relationship between the independent variable and the dependent variable in a hypothesis, or indeed between two variables even if the relationship between them has not been formally specified in a hypothesis and does not purport to be causal or explanatory. The null hypothesis may or may not be stated explicitly by an investigator, but it is nonetheless present in her thinking; it stands in opposition to the hypothesized variable relationship. In a point and counterpoint fashion, the hypothesis, H1, posits that the variables are significantly related, and the null hypothesis, H0, replies and says no, they are not significantly related. It further says that they are not related in any meaningful way, neither in the way proposed in H1 nor in any other way that could be proposed.

Based on her analysis, the researcher needs to determine whether her findings permit rejecting the null hypothesis and concluding that there is indeed a significant relationship between the variables in her hypothesis, concluding in effect that the research hypothesis, H1, has been confirmed. This is most relevant and important when the investigator is basing her analysis on some but not all of the units to which her hypothesis purports to apply—when she is analyzing the data in her sample but seeks to advance conclusions that apply to the population in which she is interested. The logic here is that the findings produced by an analysis of some of the data, the data she actually possesses, may be different than the findings her analysis would hypothetically produce were she able to use data from very many more, or ideally even all, of the units that make up her population of interest.

This means, of course, that there will be uncertainty as the researcher adjudicates between H0 and H1 on the basis of her data. An analysis of these data may suggest that there is a strong and significant relationship between the variables in H1. The stronger the relationship, the more unlikely it is that the researcher’s sample is a subset of a population characterized by H0, and the more reasonable it is, therefore, for the researcher to consider H1 to have been confirmed. Yet, it remains at least possible that the researcher’s sample, although it provides strong support for H1, is actually a subset of a population characterized by the null hypothesis. This may be unlikely, but it is not impossible, and so to consider H1 to have been confirmed is to run the risk, at least a small risk, of what is known as a Type I error. A Type I error is made when a researcher accepts a research hypothesis that is actually false, when she judges to be true a hypothesis that does not characterize the population of which her sample is a subset. Because of the possibility of a Type I error, even if quite unlikely, researchers will often write something like “We can reject the null hypothesis,” rather than “We can confirm our hypothesis.”
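A simulation can show how a sample drawn from a population characterized by H0 may still appear to support H1. In the sketch below, two groups have exactly the same true rate, so any gap between them in a sample is due to chance alone. Treating a gap of at least 10 percentage points as apparent support for H1 is a hypothetical decision rule adopted only for illustration. It is not a statistical test described in this chapter, and the rates, sample sizes, and number of trials are likewise assumptions.

```python
import random


def false_alarm_rate(n_per_group, threshold_points=10, true_share=0.5,
                     trials=1000, seed=1):
    """Share of samples with a gap of at least `threshold_points` percentage
    points between two groups whose true rates are identical (H0 is true)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    false_alarms = 0
    for _ in range(trials):
        a = sum(rng.random() < true_share for _ in range(n_per_group))
        b = sum(rng.random() < true_share for _ in range(n_per_group))
        # Integer comparison avoids floating-point rounding at the threshold.
        if abs(a - b) * 100 >= threshold_points * n_per_group:
            false_alarms += 1
    return false_alarms / trials


small = false_alarm_rate(100)    # 100 units sampled per group
large = false_alarm_rate(1000)   # 1000 units sampled per group
print(f"n = 100 per group:  {small:.3f}")
print(f"n = 1000 per group: {large:.3f}")
```

Each false alarm corresponds to a potential Type I error: the sample shows a sizable difference although the population has none. Such samples are far less frequent with the larger sample size.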

Another analysis related to voter turnout provides a ready illustration. In the Arab Barometer Wave V surveys in 12 Arab countries (footnote 6), 13,899 respondents answered a question about voting in the most recent parliamentary election. Of these, 46.6 percent said they had voted, and the remainder, 53.4 percent, said they had not voted in the last parliamentary election (footnote 7). Seeking to identify some of the determinants of voting, that is, the attitudes and experiences of an individual that increase the likelihood that she will vote, the researcher might hypothesize that a judgment that the country is going in the right direction will push toward voting. More formally:

H1. An individual who believes that her country is going in the right direction is more likely to vote in a national election than is an individual who believes her country is going in the wrong direction.

Arab Barometer surveys provide data with which to test this proposition, and in fact there is a difference associated with views about the direction in which the country is going. Among those who judged that their country is going in the right direction, 52.4 percent voted in the last parliamentary election. By contrast, among those who judged that their country is going in the wrong direction, only 43.8 percent voted in the last parliamentary election.

This illustrates the choice a researcher faces when deciding what to conclude from a study. Does the analysis of her data from a subset of her population of interest confirm or not confirm her hypothesis? In this example, based on Arab Barometer data, the findings are in the direction of her hypothesis, and differences in voting associated with views about the direction the country is going do not appear to be trivial. But are these differences big enough to justify the conclusion that judgements about the country’s path going forward are a determinant of voting, one among others of course, in the population from which her sample was drawn? In other words, although this relationship clearly characterizes the sample, it is unclear whether it characterizes the researcher’s population of interest, the population from which the sample was drawn.

Unless the researcher can gather data on the entire population of eligible voters, or at least almost all of this population, it is not possible to entirely eliminate uncertainty when the researcher makes inferences about the population of voters based on findings from the subset, or sample, of voters on which she has data. She can either conclude that her findings are sufficiently strong and clear to propose that the pattern she has observed characterizes the population as well, and that H1 is therefore confirmed; or she can conclude that her findings are not strong enough to make such an inference about the population, and that H1, therefore, is not confirmed. Either conclusion could be wrong, and so there is a chance of error no matter which conclusion the researcher advances.

The terms Type I error and Type II error are often used to designate the possible error associated with each of these inferences about the population based on the sample. Type I error refers to the rejection of a true null hypothesis. This means, in other words, that the investigator could be wrong if she concludes that her finding of a strong, or at least fairly strong, relationship between her variables characterizes Arab voters in the 12 countries in general, and if she thus judges H1 to have been confirmed when the population from which her sample was drawn is in fact characterized by H0. Type II error refers to acceptance of a false null hypothesis. This means, in other words, that the investigator could be wrong if she concludes that her finding of a somewhat weak relationship, or no relationship at all, between her variables characterizes Arab voters in the 12 countries in general, and that she thus judges H0 to be true when the population from which her sample was drawn is in fact characterized by H1.

In statistical analyses of quantitative data, decisions about whether to risk a Type I error or a Type II error are usually based on probabilities. More specifically, they are based on the probability of a researcher being wrong if she concludes that the variable relationship—or hypothesis in most cases—that characterizes her data, meaning her sample, also characterizes the population on which the researcher hopes her sample and data will shed light. To say this in yet another way, she computes the odds that her sample does not represent the population of which it is a subset; or more specifically still, she computes the odds that from a population that is characterized by the null hypothesis she could have obtained, by chance alone, a subset of the population, her sample, that is not characterized by the null hypothesis. The lower the odds, or probability, the more willing the researcher will be to risk a Type I error.

There are numerous statistical tests that are used to compute such probabilities. The nature of the data and the goals of the analysis will determine the specific test to be used in a particular situation. Most of these tests, frequently called tests of significance or tests of statistical significance, provide output in the form of probabilities, which always range from 0 to 1. The lower the value, meaning the closer to 0, the less likely it is that a researcher has collected and is working with data that produce findings that differ from what she would find were she to somehow have data on the entire population. Another way to think about this is the following:

- If the researcher provisionally assumes that the population is characterized by the null hypothesis with respect to the variable relationship under study, what is the probability of obtaining from that population, by chance alone, a subset or sample that is not characterized by the null hypothesis but instead shows a strong relationship between the two variables?
- The lower the probability value, meaning the closer to 0, the less likely it is that the researcher’s data, which support H1, have come from a population that is characterized by H0.
- The lower the probability that her sample could have come from a population characterized by H0, the lower the possibility that the researcher will be wrong, that she will make a Type I error, if she rejects the null hypothesis and accepts that the population, as well as her sample, is characterized by H1.
- When the probability value is low, the chance of actually making a Type I error is small. But while small, the possibility of an error cannot be entirely eliminated.

If it helps you to think about probability and Type I and Type II error, imagine that you will be flipping a coin 100 times and your goal is to determine whether the coin is unbiased, H0, or biased in favor of either heads or tails, H1. How many times more than 50 would heads have to come up before you would be comfortable concluding that the coin is in fact biased in favor of heads? Would 60 be enough? What about 65? To begin to answer these questions, you would want to know the odds of getting 60 or 65 heads from a coin that is actually unbiased, a coin that would come up heads and come up tails roughly the same number of times if it were flipped many more than 100 times, maybe 1000 times, maybe 10,000. With this many flips, the ratio of heads to tails would even out. The lower the odds, the less likely it is that the coin is unbiased. In this analogy, you can think of the mathematical calculations about an unbiased coin’s odds of getting heads as the population, and your actual flips of the coin as the sample.
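To make the coin example concrete, the short sketch below computes the probability that an unbiased coin produces at least 60, or at least 65, heads in 100 flips. It assumes a simple binomial model; the function name `prob_at_least` is only illustrative. The calculation is one-sided (bias toward heads); for a test of bias in either direction, the probability would be doubled.

```python
from math import comb

def prob_at_least(k, n=100, p=0.5):
    """Probability of k or more heads in n flips of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p60 = prob_at_least(60)  # chance of 60+ heads from a fair coin
p65 = prob_at_least(65)  # chance of 65+ heads from a fair coin

print(f"P(60 or more heads) = {p60:.4f}")
print(f"P(65 or more heads) = {p65:.4f}")
```

Under these assumptions, 60 or more heads occurs a little less than 3 percent of the time with a fair coin, and 65 or more heads occurs about 0.2 percent of the time. Concluding that the coin is biased after 65 heads therefore carries a much smaller risk of a Type I error than drawing the same conclusion after 60 heads.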

But exactly how low does the probability of a Type I error have to be for a researcher to run the risk of rejecting H0 and accepting that her variables are indeed related? This depends, of course, on the implications of being wrong. If there are serious and harmful consequences of being wrong, of accepting a research hypothesis that is actually false, the researcher will reject H0 and accept H1 only if the odds of being wrong, of making a Type I error, are very low.

There are some widely used probability thresholds, known as significance levels, that help researchers and those who read their reports to think about the likelihood that a Type I error is being made. In the social sciences, rejecting H0 and running the risk of a Type I error is usually thought to require a probability value of less than .05, written as p < .05. The less stringent value of p < .10 is sometimes accepted as sufficient for rejecting H0, although such a conclusion would be advanced with caution and only when the consequences of a Type I error are not very harmful. Frequently considered safer, meaning that the likelihood of accepting a false hypothesis is lower, are p < .01 and p < .001. The next section introduces and briefly describes some of the bivariate statistics that may be used to calculate these probabilities.

## 3.5 Measures of Association and Bivariate Statistical Tests

The following section introduces some of the bivariate statistical tests that can be used to compute probabilities and test hypotheses. The accounts are not very detailed. They will provide only a general overview and refresher for readers who are already fairly familiar with bivariate statistics. Readers without this familiarity are encouraged to consult a statistics textbook, for which the accounts presented here will provide a useful guide. While the account below will emphasize calculating these test statistics by hand, it is also important to remember that they can be calculated with the assistance of statistical software as well. A discussion of statistical software is available in Appendix 4.

## Parametric and Nonparametric Statistics

Parametric and nonparametric are two broad classifications of statistical procedures. A parameter in statistics refers to an attribute of a population. For example, the mean of a population is a parameter. Parametric statistical tests make certain assumptions about the shape of the distribution of values in a population from which a sample is drawn, generally that it is normally distributed, and about its parameters, that is to say the means and standard deviations of the assumed distributions. Nonparametric statistical procedures rely on no or very few assumptions about the shape or parameters of the distribution of the population from which the sample was drawn. Among the procedures described below, chi-squared, Spearman’s rank-order correlation, and the Wilcoxon-Mann-Whitney test are nonparametric; the others are parametric.

## Degrees of Freedom

Degrees of freedom (df) is the number of values in the calculation of a statistic that are free to vary. Statistical software programs usually give degrees of freedom in the output, so it is generally unnecessary to know the number of the degrees of freedom in advance. It is nonetheless useful to understand what degrees of freedom represent. Consistent with the definition above, it is the number of values that are not predetermined, and thus are free to vary, within the variables used in a statistical test.

This is illustrated by the contingency tables below, which are constructed to examine the relationship between two categorical variables. The marginal row and column totals are known since these are just the univariate distributions of each variable. df = 1 for Table 3.3a, which is a 4-cell table. You can enter any one value in any one cell, but thereafter the values of the other three cells are determined. Only one number is free to vary and thus not predetermined. df = 2 for Table 3.3b, which is a 6-cell table. You can enter any two values in any two cells, but thereafter the values of all the other cells are determined. Only two numbers are free to vary and thus not predetermined. For contingency tables, the formula for calculating df is:

$$df = (r - 1)(c - 1)$$

where r is the number of rows and c is the number of columns in the table.
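The idea can be sketched in a few lines of code, using the standard rule that df equals (rows − 1) × (columns − 1). The marginal totals and the cell value below are hypothetical and are not taken from Table 3.3a or Table 3.3b; the function names are likewise only illustrative.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for a contingency table: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def fill_2x2(row_totals, col_totals, top_left):
    """With the marginal totals fixed, one cell determines the other three."""
    a = top_left
    b = row_totals[0] - a   # rest of the first row
    c = col_totals[0] - a   # rest of the first column
    d = row_totals[1] - c   # the last cell follows from the second row total
    return [[a, b], [c, d]]

# Hypothetical marginals: rows sum to 60 and 40, columns to 70 and 30.
table = fill_2x2(row_totals=(60, 40), col_totals=(70, 30), top_left=45)

print(degrees_of_freedom(2, 2))  # 4-cell table
print(degrees_of_freedom(2, 3))  # 6-cell table
print(table)
```

Once the value 45 is entered in the first cell, the remaining cells have no freedom left: they must be 15, 25, and 15 for the row and column totals to hold.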

## Chi-Squared

Chi-squared, frequently written χ², is a statistical test used to determine whether two categorical variables are significantly related. As noted, it is a nonparametric test. The most common version of the chi-squared test is the Pearson chi-squared test, which gives a value for the chi-squared statistic and permits determining as well a probability value, or p-value. For a given number of degrees of freedom, the magnitude of the statistic and of the probability value are inversely related; the higher the value of the chi-squared statistic, the lower the probability value, and thus the lower the risk of making a Type I error, of rejecting a true null hypothesis, when asserting that the two variables are strongly and significantly related.

The simplicity of the chi-squared statistic permits giving a little more detail in order to illustrate several points that apply to bivariate statistical tests in general. The formula for computing chi-squared is given below, with O being the observed (actual) frequency in each cell of a contingency table for two categorical variables and E being the frequency that would be expected in each cell if the two variables are not related. Put differently, the distribution of E values across the cells of the two-variable table constitutes the null hypothesis, and chi-squared provides a number that expresses the magnitude of the difference between an investigator’s actual observed values and the values of E.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

The computation of chi-squared involves the following procedures, which are illustrated using the data in Table 3.4.

- The values of O in the cells of the table are based on the data collected by the investigator. For example, Table 3.4 shows that of the 200 women on whom she collected information, 85 are majoring in social science.
- The value of E for each cell is computed by multiplying the marginal total of the column in which the cell is located by the marginal total of the row in which the cell is located and dividing by N, N being the total number of cases. For the female students majoring in social science in Table 3.4, this is: 200 × 150/400 = 30,000/400 = 75. For the female students majoring in math and natural science in Table 3.4, this is: 200 × 100/400 = 20,000/400 = 50.
- The difference between the value of O and the value of E is computed for each cell using the formula for chi-squared. For the female students majoring in social science in Table 3.4, this is: (85 − 75)²/75 = 10²/75 = 100/75 = 1.33. For the female students majoring in math and natural science, the value resulting from the application of the chi-squared formula is: (45 − 50)²/50 = (−5)²/50 = 25/50 = .50.
- The values in each cell of the table resulting from the application of the chi-squared formula are summed (Σ). This chi-squared value expresses the magnitude of the difference between a distribution of values indicative of the null hypothesis and what the investigator actually found about the relationship between gender and field of study. In Table 3.4, the cell for female students majoring in social science adds 1.33 to the sum of the values in the eight cells, the cell for female students majoring in math and natural science adds .50 to the sum, and so forth for the remaining six cells.
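The procedure just described can be sketched as a short program. The counts for social science and for math and natural science follow from the figures quoted in the text (200 women, 400 students in all, column totals of 150 and 100); the counts in the last two columns are hypothetical placeholders, since the full table is not reproduced here, and the function name is only illustrative.

```python
def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E = row total * column total / N
            expected = row_totals[i] * col_totals[j] / n
            # each cell contributes (O - E)^2 / E to the sum
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: women, men. Columns: social science, math and natural science,
# then two further fields whose counts are HYPOTHETICAL.
observed = [
    [85, 45, 40, 30],
    [65, 55, 40, 40],
]

statistic, df = chi_squared(observed)
print(f"chi-squared = {statistic:.2f}, df = {df}")
```

For the first cell, the expected value is 200 × 150/400 = 75 and the contribution is (85 − 75)²/75 = 1.33; for the second, the expected value is 50 and the contribution is (45 − 50)²/50 = .50. A statistical package or a chi-squared table would then convert the statistic and its degrees of freedom into a p-value.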

A final point to be noted, which applies to many other statistical tests as well, is that the application of chi-squared and other bivariate (and multivariate) statistical tests yields a value from which one can compute the probability of obtaining a pattern like the one observed if the null hypothesis were true, and thus the probability that a Type I error will be made if the null hypothesis is rejected and the research hypothesis is judged to be true. The lower the probability, of course, the lower the likelihood of an error if the null hypothesis is rejected.

Prior to the advent of computer assisted statistical analysis, the value of the statistic and the number of degrees of freedom were used to find the probability value in a table of probability values in an appendix in most statistics books. At present, however, the probability value, or p-value, and also the degrees of freedom, are routinely given as part of the output when analysis is done by one of the available statistical software packages.

Table 3.5 shows the relationship between economic circumstance and trust in the government among 400 ordinary citizens in a hypothetical country. The observed data were collected to test the hypothesis that greater wealth pushes people toward greater trust and less wealth pushes people toward lesser trust. In the case of all three patterns, the probability that the null hypothesis is true is very low. All three patterns have the same high chi-squared value and low probability value. Thus, the chi-squared and p-values show only that the patterns all differ significantly from what would be expected were the null hypothesis true. They do not show whether the data support the hypothesized variable relationship or any other particular relationship.

As the three patterns in Table 3.5 show, variable relationships with very different structures can yield similar or even identical statistical test and probability values, and thus these tests provide only some of the information a researcher needs to draw conclusions about her hypothesis. To draw the right conclusion, it may also be necessary for the investigator to “look at” her data. For example, as Table 3.5 suggests, looking at a tabular or visual presentation of the data may also be needed to draw the proper conclusion about how two variables are related.
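A small sketch can show why this happens. The three tables below are hypothetical (they are not the data in Table 3.5): each cross-tabulates three levels of wealth (rows) against low and high trust (columns). One pattern is positive, one is negative, and one is curvilinear, yet all three yield exactly the same chi-squared value, because the statistic measures only the distance from the null hypothesis, not the shape of the relationship.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

# Rows: low, middle, high wealth. Columns: low trust, high trust.
# All counts are hypothetical.
patterns = {
    "positive":    [[80, 20], [50, 50], [20, 80]],
    "negative":    [[20, 80], [50, 50], [80, 20]],
    "curvilinear": [[80, 20], [20, 80], [50, 50]],
}

results = {name: chi_squared(table) for name, table in patterns.items()}
for name, value in results.items():
    print(f"{name:12s} chi-squared = {value:.1f}")
```

Only the first pattern is consistent with a hypothesis that greater wealth pushes toward greater trust, which is why looking at the table itself remains necessary.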

How would you describe the three patterns shown in the table, each of which differs significantly from the null hypothesis? Which pattern is consistent with the research hypothesis? How would you describe the other two patterns? Try to visualize a plot of each pattern.

## Pearson Correlation Coefficient

The Pearson correlation coefficient, more formally known as the Pearson product-moment correlation, is a parametric measure of linear association. It gives a numerical representation of the strength and direction of the relationship between two continuous numerical variables. The coefficient, which is commonly represented as r , will have a value between −1 and 1. A value of 1 means that there is a perfect positive, or direct, linear relationship between the two variables; as one variable increases, the other variable consistently increases by some amount. A value of −1 means that there is a perfect negative, or inverse, linear relationship; as one variable increases, the other variable consistently decreases by some amount. A value of 0 means that there is no linear relationship; as one variable increases, the other variable neither consistently increases nor consistently decreases.
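The three benchmark values of r can be illustrated with a minimal sketch. The data are invented for the purpose, and the function below is a plain implementation of the usual product-moment formula rather than the output of any particular statistical package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                        # hypothetical values
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises steadily with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls steadily as x rises
r_none = pearson_r(x, [4, 1, 0, 1, 4])       # symmetric U shape

print(r_positive, r_negative, r_none)
```

The third case is worth noting: the two variables are clearly related, but the relationship is U-shaped rather than linear, so the coefficient is 0. A value of 0 indicates the absence of a linear relationship, not necessarily the absence of any relationship.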

It is easy to think of relationships that might be assessed by a Pearson correlation coefficient. Consider, for example, the relationship between age and income and the proposition that as age increases, income consistently increases or consistently decreases as well. The closer a coefficient is to 1 or −1, the greater the likelihood that the data on which it is based are not the subset of a population in which age and income are unrelated, meaning that the population of interest is not characterized by the null hypothesis. Coefficients very close to 1 or −1 are rare. Although it depends on the number of units on which the researcher has data and also on the nature of the variables, coefficients higher than .3 or lower than −.3 are frequently high enough, in absolute terms, to yield a low probability value and justify rejecting the null hypothesis. The relationship in this case would be described as “statistically significant.”

## Exercise 3.5

Estimating Correlation Coefficients from scatter plots

Look at the scatter plots in Fig. 3.4 and estimate the correlation coefficient that the bivariate relationship shown in each scatter plot would yield.

Explain the basis for each of your estimates of the correlation coefficient.

## Spearman’s Rank-Order Correlation Coefficient

Spearman’s rank-order correlation coefficient is a nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient (ρ, also written r_s) measures the strength and direction of the association between two ranked variables.
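One common way to obtain Spearman’s coefficient is to convert each variable to ranks (giving tied values the average of their ranks) and then compute the Pearson coefficient on the ranks. The sketch below follows that approach with invented data; the helper names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]            # hypothetical values
y = [1, 8, 27, 64, 125]        # rises with x, but not in a straight line

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(f"Spearman = {rho:.3f}, Pearson = {r:.3f}")
```

Because y always increases when x increases, the ranks agree perfectly and Spearman’s coefficient is 1, while the Pearson coefficient is somewhat lower because the relationship is not linear.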

## Bivariate Regression

Bivariate regression is a parametric measure of association that, like correlation analysis, assesses the strength and direction of the relationship between two variables. Also, like correlation analysis, regression assumes linearity. It may give misleading results if used with variable relationships that are not linear.

Regression is a powerful statistic that is widely used in multivariate analyses. This includes ordinary least squares (OLS) regression, which requires that the dependent variable be continuous and assumes linearity; binary logistic regression, which may be used when the dependent variable is dichotomous; and ordinal logistic regression, which is used with ordinal dependent variables. The use of regression in multivariate analysis will be discussed in the next chapter. In bivariate analysis, regression analysis yields coefficients that indicate the strength and direction of the relationship between two variables. Researchers may opt to “standardize” these coefficients. Standardized coefficients from a bivariate regression are the same as the coefficients produced by Pearson product-moment correlation analysis.
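The equivalence between a standardized bivariate regression coefficient and the Pearson coefficient can be checked numerically. The sketch below uses invented age and income figures and a plain least-squares slope; it is an illustration under those assumptions, not a substitute for a statistical package.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    m = mean(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

age = [25, 32, 41, 48, 55, 63]       # hypothetical, in years
income = [28, 35, 44, 41, 52, 50]    # hypothetical, in thousands

raw_slope = ols_slope(age, income)
standardized_slope = ols_slope(standardize(age), standardize(income))
r = pearson_r(age, income)

print(f"unstandardized slope = {raw_slope:.3f}")
print(f"standardized slope   = {standardized_slope:.3f}")
print(f"Pearson r            = {r:.3f}")
```

The unstandardized slope is expressed in the units of the variables (here, thousands of income per year of age), whereas the standardized slope is unit-free and matches r.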

## T-Test

The t-test, also sometimes called a “difference of means” test, is a parametric statistical test that compares the means of two groups and determines whether they are different enough from each other to reject the null hypothesis and risk a Type I error. The dependent variable in a t-test must be continuous or ordinal; otherwise the investigator cannot calculate a mean. The independent variable must be categorical since t-tests are used to compare two groups.

An example, drawing again on Arab Barometer data, tests the relationship between voting and support for democracy. The hypothesis might be that men and women who voted in the last parliamentary election are more likely than men and women who did not vote to believe that democracy is suitable for their country. Whether a person did or did not vote would be the categorical independent variable, and the dependent variable would be the response to a question like, “To what extent do you think democracy is suitable for your country?” The question about democracy asked respondents to situate their views on an 11-point scale, with 0 indicating completely unsuitable and 10 indicating completely suitable.

Focusing on Tunisia in 2018, Arab Barometer Wave V data show that the mean response on the 11-point suitability question is 5.11 for those who voted and 4.77 for those who did not vote. Is this difference of .34 large enough to be statistically significant? A t-test will determine the probability of getting a difference of this magnitude from a population of interest, most likely all Tunisians of voting age, in which there is no difference between voters and non-voters in views about the suitability of democracy for Tunisia. In this example, the t-test showed p = .086. With this p-value, which is higher than the generally accepted standard of .05, a researcher cannot with confidence reject the null hypothesis, and she is unable, therefore, to assert that the proposed relationship has been confirmed.
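The mechanics of a difference-of-means test can be sketched as follows. The responses are invented and are not Arab Barometer data, and the sketch uses Welch’s version of the t-test, which does not assume equal variances in the two groups; that choice is an assumption made here, not something the example above specifies.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    # squared standard error of each group mean
    se2_a = variance(group_a) / len(group_a)
    se2_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(group_a) - 1) + se2_b**2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical answers on the 0-10 suitability scale.
voters = [6, 5, 7, 4, 6, 5, 8, 3]
non_voters = [5, 4, 6, 3, 5, 4, 7, 4]

t, df = welch_t(voters, non_voters)
print(f"mean difference = {mean(voters) - mean(non_voters):.2f}")
print(f"t = {t:.3f}, df = {df:.1f}")
```

The t statistic and its degrees of freedom would then be converted into a p-value using the t distribution, a step that statistical software performs automatically.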

This question can also be explored at the country level of analysis with, for example, regime type as the independent variable. In this illustration, the hypothesis is that citizens of monarchies are more likely than citizens of republics to believe that democracy is suitable for their country. Of course, a researcher proposing this hypothesis would also advance an associated causal story that provides the rationale for the hypothesis and specifies what is really being tested. To test this proposition, an investigator might merge data from surveys in, say, three monarchies, perhaps Morocco, Jordan, and Kuwait, and then also merge data from surveys in three republics, perhaps Algeria, Egypt, and Iraq. A t-test would then be used to compare the means of people in republics and people in monarchies and give the p-value.

A similar test, the Wilcoxon-Mann-Whitney test, is a nonparametric test that does not require that the dependent variable be normally distributed.
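The statistic behind this test, usually called U, can be computed by simple counting: for every pairing of a case from one group with a case from the other, count how often the first group’s value is higher, with ties counting as one half. The sketch below uses invented responses and an illustrative function name.

```python
def mann_whitney_u(group_a, group_b):
    """U for group_a: pairs in which a exceeds b, counting ties as one half."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )

# Hypothetical answers on the 0-10 suitability scale.
voters = [6, 5, 7, 8]
non_voters = [5, 4, 3, 6]

u_voters = mann_whitney_u(voters, non_voters)
u_non_voters = mann_whitney_u(non_voters, voters)
print(u_voters, u_non_voters)
```

With four cases in each group there are 16 pairings; voters have the higher value in the equivalent of 14 of them. Because only the ordering of the values matters, no assumption about the shape of the distribution is needed. Converting U into a p-value is again a task for statistical software or published tables.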

## ANOVA

Analysis of variance, or ANOVA, is closely related to the t-test. It may be used when the dependent variable is continuous and the independent variable is categorical. A one-way ANOVA compares the mean and variance values of a continuous dependent variable in two or more categories of a categorical independent variable in order to determine if the latter affects the former.

ANOVA calculates the F-ratio based on the variance between the groups and the variance within each group. The F-ratio can then be used to calculate a p-value. However, if there are more than two categories of the independent variable, the ANOVA test will not indicate which pairs of categories differ enough to be statistically significant, making it necessary, again, to look at the data in order to draw correct conclusions about the structure of the bivariate relationships. Two-way ANOVA is used when an investigator has two categorical independent variables rather than one.
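The F-ratio for a one-way ANOVA can be sketched in a few lines. The three groups below are hypothetical, and the function name is illustrative; the calculation divides the between-group variance by the within-group variance.

```python
def one_way_f(groups):
    """Return (F-ratio, df between, df within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # variation of individual cases around their own group mean
    ss_within = sum(
        (value - m) ** 2 for g, m in zip(groups, group_means) for value in g
    )
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores for three categories of the independent variable.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

f_ratio, df_between, df_within = one_way_f(groups)
print(f"F = {f_ratio:.1f} with df = ({df_between}, {df_within})")
```

Here the group means (5, 7, and 9) are far apart relative to the spread within each group, so the F-ratio is large. The ratio alone does not say which pairs of groups differ, which is why the data themselves still need to be examined.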

Table 3.6 presents a summary list of the visual representations and bivariate statistical tests that have been discussed. It reminds readers of the procedures that can be used when both variables are categorical, when both variables are numerical/continuous, and when one variable is categorical and one variable is numerical/continuous.

## Bivariate Statistics and Causal Inference

It is important to remember that bivariate statistical tests only assess the association or correlation between two variables. The tests described above can help a researcher estimate how much confidence her hypothesis deserves and, more specifically, the probability that any significant variable relationships she has found characterize the larger population from which her data were drawn and about which she seeks to offer information and insight.

The finding that two variables in a hypothesized relationship are related to a statistically significant degree is not evidence that the relationship is causal, only that the independent variable is related to the dependent variable. The finding is consistent with the causal story that the hypothesis represents, and to that extent, it offers support for this story. Nevertheless, there are many reasons why an observed statistically significant relationship might be spurious. The correlation might, for example, reflect the influence of one or more other and uncontrolled variables. This will be discussed more fully in the next chapter. The point here is simply that bivariate statistics do not, by themselves, address the question of whether a statistically significant relationship between two variables is or is not a causal relationship.

## Only an Introductory Overview

As has been emphasized throughout, this chapter seeks only to offer an introductory overview of the bivariate statistical tests that may be employed when an investigator seeks to assess the relationship between two variables. Additional information will be presented in Chap. 4. The focus in Chap. 4 will be on multivariate analysis, on analyses involving three or more variables. In this case again, however, the chapter will provide only an introductory overview. The overviews in the present chapter and the next provide a foundation for understanding social statistics, for understanding what statistical analyses involve and what they seek to accomplish. This is important and valuable in and of itself. Nevertheless, researchers and would-be researchers who intend to incorporate statistical analyses into their investigations, perhaps to test hypotheses and decide whether to risk a Type I error or a Type II error, will need to build on this foundation and become familiar with the contents of texts on social statistics. If this guide offers a bird’s eye view, researchers who implement these techniques will also need to expose themselves to the view of the worm at least once.

Chapter 2 makes clear that the concept of variance is central and foundational for much and probably most data-based and quantitative social science research. Bivariate relationships, which are the focus of the present chapter, are building blocks that rest on this foundation. The goal of this kind of research is very often the discovery of causal relationships, relationships that explain rather than merely describe or predict. Such relationships are also frequently described as accounting for variance. This is the focus of Chap. 4, and it means that there will be, first, a dependent variable, a variable that expresses and captures the variance to be explained, and then, second, an independent variable, and possibly more than one independent variable, that impacts the dependent variable and causes it to vary.

Bivariate relationships are at the center of this enterprise, establishing the empirical pathway leading from the variance discussed in Chap. 2 to the causality discussed in Chap. 4. Finding that there is a significant relationship between two variables, a statistically significant relationship, is not sufficient to establish causality, to conclude with confidence that one of the variables impacts the other and causes it to vary. But such a finding is necessary.

The goal of social science inquiry that investigates the relationship between two variables is not always explanation. It might be simply to describe and map the way two variables interact with one another. And there is no reason to question the value of such research. But the goal of data-based social science research is very often explanation; and while the inter-relationships between more than two variables will almost always be needed to establish that a relationship is very likely to be causal, these inter-relationships can only be examined by empirics that begin with consideration of a bivariate relationship, a relationship with one variable that is a presumed cause and one variable that is a presumed effect.

Against this background, with the importance of two-variable relationships in mind, the present chapter offers a comprehensive overview of bivariate relationships, including but not only those that are hypothesized to be causally related. The chapter considers the origin and nature of hypotheses that posit a particular relationship between two variables, a causal relationship if the larger goal of the research is explanation and the delineation of a causal story to which the hypothesis calls attention. This chapter then considers how a bivariate relationship might be described and visually represented, and thereafter it discusses how to think about and determine whether the two variables actually are related.

Presenting tables and graphs to show how two variables are related and using bivariate statistics to assess the likelihood that an observed relationship differs significantly from the null hypothesis, the hypothesis of no relationship, will be sufficient if the goal of the research is to learn as much as possible about whether and how two variables are related. And there is plenty of excellent research that has this kind of description as its primary objective, that makes use for purposes of description of the concepts and procedures introduced in this chapter. But there is also plenty of research that seeks to explain, to account for variance, and for this research, use of these concepts and procedures is necessary but not sufficient. For this research, consideration of a two-variable relationship, the focus of the present chapter, is a necessary intermediate step on a pathway that leads from the observation of variance to explaining how and why that variance looks and behaves as it does.

Dana El Kurd. 2019. “Who Protests in Palestine? Mobilization Across Class Under the Palestinian Authority.” In Alaa Tartir and Timothy Seidel, eds. Palestine and Rule of Power: Local Dissent vs. International Governance . New York: Palgrave Macmillan.

Yael Zeira. 2019. The Revolution Within: State Institutions and Unarmed Resistance in Palestine . New York: Cambridge University Press.

Carolina de Miguel, Amaney A. Jamal, and Mark Tessler. 2015. “Elections in the Arab World: Why do citizens turn out?” Comparative Political Studies 48, (11): 1355–1388.

Question 1: Independent variable is religiosity; dependent variable is preference for democracy. Example of hypothesis for Question 1: H1. More religious individuals are more likely than less religious individuals to prefer democracy to other political systems. Question 2: Independent variable is preference for democracy; dependent variable is turning out to vote. Example of hypothesis for Question 2: H2. Individuals who prefer democracy to other political systems are more likely than individuals who do not prefer democracy to other political systems to turn out to vote.

Mike Yi. “A complete Guide to Scatter Plots,” posted October 16, 2019 and seen at https://chartio.com/learn/charts/what-is-a-scatter-plot/

The countries are Algeria, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Palestine, Sudan, Tunisia, and Yemen. The Wave V surveys were conducted in 2018–2019.

Not considered in this illustration are the substantial cross-country differences in voter turnout. For example, 63.6 percent of the Lebanese respondents reported voting, whereas in Algeria the proportion who reported voting was only 20.3 percent. In addition to testing hypotheses about voting in which the individual is the unit of analysis, country could also be the unit of analysis, and hypotheses seeking to account for country-level variance in voting could be formulated and tested.

## Author information

Authors and affiliations.

Department of Political Science, University of Michigan, Ann Arbor, MI, USA

Mark Tessler


## Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


## Copyright information

© 2023 The Author(s)

## About this chapter

Tessler, M. (2023). Bivariate Analysis: Associations, Hypotheses, and Causal Stories. In: Social Science Research in the Arab World and Beyond. SpringerBriefs in Sociology. Springer, Cham. https://doi.org/10.1007/978-3-031-13838-6_3

## Download citation

DOI : https://doi.org/10.1007/978-3-031-13838-6_3

Published : 04 October 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-13837-9

Online ISBN : 978-3-031-13838-6

eBook Packages : Social Sciences Social Sciences (R0)


## Bivariate Data: Types & Characteristics with 5 Examples

Bivariate data is like a duo—a pair of pieces of information that go hand in hand. It’s like looking at two variables together to see if there’s a connection between them. Let’s delve into what bivariate data is all about, its types, characteristics, measures, and formulas. We’ll explore some fascinating examples from the biosciences realm, including healthcare, genomics, environmental science, clinical research, and pharmaceuticals.

## What is Bivariate Data?

Bivariate data is a fancy term for saying we’re looking at two things at once. It’s like comparing apples and oranges—you’re interested in how they relate to each other. In statistics, we use bivariate data to see if there’s a connection or relationship between two variables.

## Types of Bivariate Data

- Positive Bivariate Data: When one variable increases, the other tends to increase as well. It’s like saying more sunshine leads to more happiness.
- Negative Bivariate Data: Here, when one variable increases, the other decreases. Think of it as saying more pollution leads to fewer trees.


## Characteristics of Bivariate Data

- Interdependence: The variables depend on each other in some way. For example, in healthcare, a patient’s weight might depend on their height.
- Scatterplot Representation: Bivariate data is often represented using scatterplots, where each point represents a pair of values for the two variables.
- Correlation: This tells us how closely the two variables are related. It ranges from -1 to 1. A correlation of 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 means no linear relationship (the variables may still be related in a non-linear way).

## Bivariate Measure

The measure used to assess the relationship between two variables is called correlation. It helps us understand how changes in one variable are associated with changes in another.

## Bivariate Data Formula

Correlation is usually calculated using a formula known as Pearson’s correlation coefficient, r. It looks complicated, but it’s not so bad once you get the hang of it:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

where:

- n = number of data points
- x = values of the first variable
- y = values of the second variable
- ∑ = summation (add up all the values)
- ∑xy = sum of the products of the paired values
- ∑x = sum of the first variable values
- ∑y = sum of the second variable values
- ∑x² = sum of the squares of the first variable values
- ∑y² = sum of the squares of the second variable values
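To make the sums listed above concrete, here is a minimal Python sketch that builds Pearson’s r from exactly those quantities (n, ∑x, ∑y, ∑xy, ∑x², ∑y²). The function name `pearson_r` and the sample values are hypothetical and chosen only for illustration; they do not come from any study mentioned in this article.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from raw sums (the computational formula)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # Happens when one variable never changes: r is undefined.
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical illustration data (assumed values, not real measurements).
hours_of_sun = [1, 2, 3, 4, 5]
mood_score = [2, 4, 5, 4, 5]

print(round(pearson_r(hours_of_sun, mood_score), 3))  # prints 0.775
```

A value near 0.775 indicates a fairly strong positive linear relationship in this made-up data. For real work, a tested library routine is usually a safer choice than a hand-rolled function.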

## The Concept Behind the Bivariate Data Formula:

Now, let’s delve into why this formula works. Pearson’s correlation coefficient measures the strength and direction of the linear relationship between two variables. Here’s how it does that:

## Covariance:

The numerator of the formula, n(∑xy)−(∑x)(∑y), is proportional to the covariance between the two variables (it equals n² times the covariance). Covariance tells us how much two variables vary together. If the variables tend to increase or decrease together, the covariance will be positive; if one variable tends to increase as the other decreases, the covariance will be negative.

## Standard Deviations:

The denominator of the formula involves the standard deviations of each variable. Standard deviation is a measure of how spread out the values of a variable are. Dividing the covariance by the product of the standard deviations standardizes the measure, giving us the correlation coefficient.

## Normalization:

Dividing the covariance by the product of the standard deviations normalizes the measure, ensuring that the correlation coefficient r falls within the range of -1 to 1. This allows us to compare the strength and direction of the relationship between different pairs of variables, regardless of their units or scales.
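The link between the raw-sum formula and "covariance divided by the product of the standard deviations" can be written out in a few lines. The sketch below assumes the population versions of covariance and standard deviation (dividing by n); using the sample versions (dividing by n − 1) gives the same r, because the extra factors cancel.

```latex
\begin{aligned}
\operatorname{cov}(x,y) &= \frac{1}{n}\sum xy - \bar{x}\,\bar{y}
  = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n^{2}} \\[4pt]
\sigma_x^{2} &= \frac{n\sum x^{2} - \left(\sum x\right)^{2}}{n^{2}},
\qquad
\sigma_y^{2} = \frac{n\sum y^{2} - \left(\sum y\right)^{2}}{n^{2}} \\[4pt]
r &= \frac{\operatorname{cov}(x,y)}{\sigma_x\,\sigma_y}
  = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
\end{aligned}
```

The factor n² appears in both the numerator and the denominator and cancels, which is why the formula can be stated purely in terms of sums.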


## Examples of Bivariate Data in Biosciences

## 1. Healthcare: Body Mass Index (BMI) vs. Blood Pressure

Researchers study how BMI (body mass index) relates to blood pressure. They collect data on both BMI and blood pressure from a group of patients. By analyzing this bivariate data, they can see if there’s a correlation between high BMI and high blood pressure. Understanding this relationship helps in managing cardiovascular health.

## 2. Genomics: Gene Expression vs. Disease Susceptibility

In genomics, scientists analyze how gene expression levels relate to the susceptibility to certain diseases. By collecting bivariate data on gene expression and disease status across different individuals, they can identify genes that might play a role in disease development. This understanding contributes to personalized medicine and targeted therapies.

## 3. Environmental Science: Pollution Levels vs. Respiratory Illness

Environmental scientists examine how pollution levels affect respiratory illness rates in a community. By gathering bivariate data on pollution levels (such as air quality measurements) and the incidence of respiratory illnesses (such as asthma or bronchitis cases), they can assess the impact of environmental factors on public health. This information guides policies aimed at reducing pollution and protecting public health.

## 4. Clinical Research: Drug Dosage vs. Treatment Efficacy

Clinical researchers investigate the relationship between drug dosage and treatment efficacy for various medical conditions. By collecting bivariate data on the dosage of a medication administered to patients and their corresponding treatment outcomes, they can determine the optimal dosage that maximizes effectiveness while minimizing side effects. This research helps improve patient care and drug development processes.

## 5. Pharmaceuticals: Drug Concentration vs. Toxicity

Pharmaceutical companies conduct studies to assess the relationship between drug concentration levels in the body and the occurrence of adverse effects or toxicity. By analyzing bivariate data on drug concentrations measured in patients’ blood or tissues and any observed toxicity symptoms, they can establish safe dosage ranges for medications and identify potential safety concerns. This knowledge informs drug labeling, prescribing guidelines, and regulatory decisions to ensure patient safety.


Bivariate data analysis is a powerful tool in various biosciences fields, allowing researchers to explore relationships between two variables and draw meaningful insights. Whether it’s understanding disease mechanisms, optimizing treatments, or safeguarding public health and the environment, bivariate data analysis plays a crucial role in advancing scientific knowledge and improving outcomes in the biosciences domain. So next time you’re faced with a bunch of data, don’t forget to look at it bivariately—you might uncover some fascinating connections!



## Tanzeela Arshad


## How To Write A Research Proposal – Step-by-Step [Template]


## How To Write a Research Proposal

Writing a research proposal involves several steps to ensure a well-structured and comprehensive document. Here is an explanation of each step:

## 1. Title and Abstract

- Choose a concise and descriptive title that reflects the essence of your research.
- Write an abstract summarizing your research question, objectives, methodology, and expected outcomes. It should provide a brief overview of your proposal.

## 2. Introduction:

- Provide an introduction to your research topic, highlighting its significance and relevance.
- Clearly state the research problem or question you aim to address.
- Discuss the background and context of the study, including previous research in the field.

## 3. Research Objectives

- Outline the specific objectives or aims of your research. These objectives should be clear, achievable, and aligned with the research problem.

## 4. Literature Review:

- Conduct a comprehensive review of relevant literature and studies related to your research topic.
- Summarize key findings, identify gaps, and highlight how your research will contribute to the existing knowledge.

## 5. Methodology:

- Describe the research design and methodology you plan to employ to address your research objectives.
- Explain the data collection methods, instruments, and analysis techniques you will use.
- Justify why the chosen methods are appropriate and suitable for your research.

## 6. Timeline:

- Create a timeline or schedule that outlines the major milestones and activities of your research project.
- Break down the research process into smaller tasks and estimate the time required for each task.

## 7. Resources:

- Identify the resources needed for your research, such as access to specific databases, equipment, or funding.
- Explain how you will acquire or utilize these resources to carry out your research effectively.

## 8. Ethical Considerations:

- Discuss any ethical issues that may arise during your research and explain how you plan to address them.
- If your research involves human subjects, explain how you will ensure their informed consent and privacy.

## 9. Expected Outcomes and Significance:

- Clearly state the expected outcomes or results of your research.
- Highlight the potential impact and significance of your research in advancing knowledge or addressing practical issues.

## 10. References:

- Provide a list of all the references cited in your proposal, following a consistent citation style (e.g., APA, MLA).

## 11. Appendices:

- Include any additional supporting materials, such as survey questionnaires, interview guides, or data analysis plans.

## Research Proposal Format

The format of a research proposal may vary depending on the specific requirements of the institution or funding agency. However, the following is a commonly used format for a research proposal:

1. Title Page:

- Include the title of your research proposal, your name, your affiliation or institution, and the date.

2. Abstract:

- Provide a brief summary of your research proposal, highlighting the research problem, objectives, methodology, and expected outcomes.

3. Introduction:

- Introduce the research topic and provide background information.
- State the research problem or question you aim to address.
- Explain the significance and relevance of the research.

4. Literature Review:

- Review relevant literature and studies related to your research topic.
- Summarize key findings and identify gaps in the existing knowledge.
- Explain how your research will contribute to filling those gaps.

5. Research Objectives:

- Clearly state the specific objectives or aims of your research.
- Ensure that the objectives are clear, focused, and aligned with the research problem.

6. Methodology:

- Describe the research design and methodology you plan to use.
- Explain the data collection methods, instruments, and analysis techniques.
- Justify why the chosen methods are appropriate for your research.

7. Timeline:

8. Resources:

- Explain how you will acquire or utilize these resources effectively.

9. Ethical Considerations:

- If applicable, explain how you will ensure informed consent and protect the privacy of research participants.

10. Expected Outcomes and Significance:

11. References:

12. Appendices:

## Research Proposal Template

Here’s a template for a research proposal:

1. Introduction:

2. Literature Review:

3. Research Objectives:

4. Methodology:

5. Timeline:

6. Resources:

7. Ethical Considerations:

8. Expected Outcomes and Significance:

9. References:

10. Appendices:

## Research Proposal Sample

Title: The Impact of Online Education on Student Learning Outcomes: A Comparative Study

1. Introduction

Online education has gained significant prominence in recent years, especially due to the COVID-19 pandemic. This research proposal aims to investigate the impact of online education on student learning outcomes by comparing them with traditional face-to-face instruction. The study will explore various aspects of online education, such as instructional methods, student engagement, and academic performance, to provide insights into the effectiveness of online learning.

2. Objectives

The main objectives of this research are as follows:

- To compare student learning outcomes between online and traditional face-to-face education.
- To examine the factors influencing student engagement in online learning environments.
- To assess the effectiveness of different instructional methods employed in online education.
- To identify challenges and opportunities associated with online education and suggest recommendations for improvement.

3. Methodology

3.1 Study Design

This research will utilize a mixed-methods approach to gather both quantitative and qualitative data. The study will include the following components:

3.2 Participants

The research will involve undergraduate students from two universities, one offering online education and the other providing face-to-face instruction. A total of 500 students (250 from each university) will be selected randomly to participate in the study.

3.3 Data Collection

The research will employ the following data collection methods:

- Quantitative: Pre- and post-assessments will be conducted to measure students’ learning outcomes. Data on student demographics and academic performance will also be collected from university records.
- Qualitative: Focus group discussions and individual interviews will be conducted with students to gather their perceptions and experiences regarding online education.

3.4 Data Analysis

Quantitative data will be analyzed using statistical software, employing descriptive statistics, t-tests, and regression analysis. Qualitative data will be transcribed, coded, and analyzed thematically to identify recurring patterns and themes.

4. Ethical Considerations

The study will adhere to ethical guidelines, ensuring the privacy and confidentiality of participants. Informed consent will be obtained, and participants will have the right to withdraw from the study at any time.

5. Significance and Expected Outcomes

This research will contribute to the existing literature by providing empirical evidence on the impact of online education on student learning outcomes. The findings will help educational institutions and policymakers make informed decisions about incorporating online learning methods and improving the quality of online education. Moreover, the study will identify potential challenges and opportunities related to online education and offer recommendations for enhancing student engagement and overall learning outcomes.

6. Timeline

The proposed research will be conducted over a period of 12 months, including data collection, analysis, and report writing.

7. Budget

The estimated budget for this research includes expenses related to data collection, software licenses, participant compensation, and research assistance. A detailed budget breakdown will be provided in the final research plan.

8. Conclusion

This research proposal aims to investigate the impact of online education on student learning outcomes through a comparative study with traditional face-to-face instruction. By exploring various dimensions of online education, this research will provide valuable insights into the effectiveness and challenges associated with online learning. The findings will contribute to the ongoing discourse on educational practices and help shape future strategies for maximizing student learning outcomes in online education settings.

## About the author

## Muhammad Hassan

Researcher, Academic Writer, Web developer


In statistics, many bivariate data examples can be given to help you understand the relationship between two variables and to grasp the idea behind the bivariate data analysis definition and meaning. Bivariate analysis is a statistical method that helps you study relationships (correlation) between data sets. Many businesses, marketing, and social science questions and problems could be solved ...

Example 1: Business. Businesses often collect bivariate data about total money spent on advertising and total revenue. For example, a business may collect the following data for 12 consecutive sales quarters: This is an example of bivariate data because it contains information on exactly two variables: advertising spend and total revenue.

More specifically, bivariate analysis explores how the dependent ("outcome") variable depends or is explained by the independent ("explanatory") variable (asymmetrical analysis), or it explores the association between two variables without any cause and effect relationship (symmetrical analysis). In this paper we will introduce the ...

The formula is: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$. There are n - 2 degrees of freedom. This can be demonstrated with the example of Gini coefficients and poverty rates as provided in Chapter 4 and using a level of significance of 0.05. The correlation is -0.650. The sample size is 7, so there are 5 degrees of freedom.
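As a rough sketch of how the numbers quoted above plug into the formula, the following Python snippet computes t for r = -0.650 and n = 7. The function name `t_from_r` is hypothetical. The comparison assumes a two-tailed test, which the excerpt does not state; the critical value 2.571 is the standard t-table entry for 5 degrees of freedom at the 0.05 level.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| >= 1")
    return r * sqrt(n - 2) / sqrt(1 - r * r)


r, n = -0.650, 7          # values quoted in the text
df = n - 2                # degrees of freedom
t = t_from_r(r, n)

# Assumption: two-tailed test at alpha = 0.05; 2.571 is the t-table value for df = 5.
critical_t = 2.571

print(f"t = {t:.3f}, df = {df}")  # prints t = -1.913, df = 5
print("reject H0" if abs(t) > critical_t else "fail to reject H0")
```

Under these assumptions |t| is smaller than the critical value, so a correlation of -0.650 from only seven observations would not be judged statistically significant. With such a small sample, even a fairly large correlation can fall short of significance.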


Analysis of variance, generally abbreviated to ANOVA for short, is a statistical method to examine how a dependent variable changes as the value of a categorical independent variable changes. It serves the same purpose as the t-tests we learned in 15.4: it tests for differences in group means.

There are two steps involved in working out the direction of a crosstabulated relationship… and these are almost indecipherable until you've seen it done: 1. Percentage in the direction of the independent variable. 2. Compare percentages in one category of the dependent variable.

This page titled 9.1: Introduction to Bivariate Data is shared under a Public Domain license and was authored, remixed, and/or curated by David Lane via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. In this chapter we consider bivariate data, which for ...

The bivariate scatter plot in the right panel of Figure 14.3 is essentially a plot of self-esteem on the vertical axis, against age on the horizontal axis. This plot roughly resembles an upward sloping line—i.e., positive slope—which is also indicative of a positive correlation. If the two variables were negatively correlated, the scatter ...

Abstract. Quantitative data analysis serves as part of an essential process of evidence-making in health and social sciences. It is adopted for any types of research question and design whether it is descriptive, explanatory, or causal. However, compared with qualitative counterpart, quantitative data analysis has less flexibility.

Research proposal examples. Writing a research proposal can be quite challenging, but a good starting point could be to look at some examples. We've included a few for you below. Example research proposal #1: "A Conceptual Framework for Scheduling Constraint Management" Example research proposal #2: "Medical Students as Mediators of ...


Data analysis in research aimed at explanation should be, in most cases, preceded by the formulation of one or more hypotheses. In this context, when the focus is on bivariate relationships and the objective is explanation rather than description, each hypothesis will include a dependent variable and an independent variable and make explicit ...


Bivariate data. In statistics, bivariate data is data on each of two variables, where each value of one of the variables is paired with a value of the other variable. [1] It is a specific but very common case of multivariate data. The association can be studied via a tabular or graphical display, or via sample statistics which might be used for ...


Namely, bivariate hypothesis tests help us to answer the question, "Are X and Y related?". By definition - "bivariate" means "two variables" - these tests cannot help us with the important question, "Have we controlled for all confounding variables Z that might make the observed association between X and Y spurious?". Type ...

Choosing the correct one is not difficult. You choose the bivariate statistic based on: (1) the type of risk factor and outcome variable you have; and (2) whether the data are unpaired or paired (repeated observations or matched data). Bivariate statistics for unpaired data are shown in Table 5.1. Bivariate statistics for repeated observations ...

12.2 Graphical Representation of Bivariate Data A standard plot on a grid paper of y (y-axis) against x (x-axis) gives a very good indication of the behaviour of data. This coordinate plot of the points (x1;y1), (x2;y2);:::; (xn;yn) on a grid paper is called a scatter plot. Note: The ﬁrst step of the analysis of bivariate data is to plot the ...

This chapter explores the fundamental aspects of quantitative data analysis and explains the most commonly used analytical techniques: descriptive, univariate, and bivariate statistics. Statistics ...


Bivariate statistics are used in research in order to analyze two variables simultaneously; Real world phenomena such as many topics of scientific research are usually complex and multi-variate. Bivariate analysis is a mandatory step to describe the relationships between the observed variables;

A list of bivariate data examples: including linear bivariate regression analysis, correlation (relationship), distribution, and scatter plot. What is bivariate data? Definition.

This tutorial provides several examples of bivariate data in real-life situations along with how to analyze it.

Quantitative bivariate data In case of two quantitative variables, the most relevant technique for bivariate analysis is correlation analysis and

Let's delve into what bivariate data is with fascinating examples from the biosciences, including healthcare, genomics, environmental science, clinical research, and pharmaceuticals.

A research proposal aims to show why your project is worthwhile. It should explain the context, objectives, and methods of your research.

4.1: Introduction to Bivariate Data. In this chapter we consider bivariate data, which for now consists of two quantitative variables for each individual. Our first interest is in summarizing such data in a way that is analogous to summarizing univariate (single variable) data.

Bivariate analysis is a fundamental technique in data science. It involves analyzing the relationship between two variables. Through bivariate analysis, data scientists can uncover patterns, correlations, and associations between variables, providing valuable insights into various fields, including biology, healthcare, genomics, the environment, and clinical research.

This tutorial provides a quick introduction to bivariate analysis, including a formal definition and several examples.

Roger Clark In most research projects involving variables, researchers do indeed investigate the central tendency and variation of important variables, and such investigations can be very revealing. But the typical researcher, using quantitative data analysis, is interested in testing hypotheses or answering research questions that involve at least two variables. A relationship is said to ...

Abstract The role of scientific research is not limited to the description and analysis of single phenomena occurring independently one from each other (univariate analysis). Even though univariate analysis has a pivotal role in statistical analysis, and is useful to find errors inside datasets, to familiarize with and to aggregate data, to describe and to gather basic information on simple ...

How can we analyze the relationship between two quantitative variables? This chapter introduces the concepts and methods of bivariate data analysis, such as scatterplots, correlation, and regression. Learn how to use these tools to explore and describe the patterns and trends in real-world data sets.


Bivariate analysis is a group of statistical techniques that examine the relationship between two variables. You need to conduct bivariate analyses before you can begin to draw conclusions from your data, including in future multivariate analyses.

In this blog article, we will explore how to create a data analysis plan: the content and structure. This data analysis plan serves as a roadmap to how data collected will be organised and analysed. It includes the following aspects: Clearly states the research objectives and hypothesis. Identifies the dataset to be used.

3.2 Hypotheses and Formulating Hypotheses Hypotheses emerge from the research questions to which a study is devoted. Accordingly, a researcher interested in explanation will have something specific in mind when she decides to hypothesize and then evaluate a bivariate relationship in order to determine whether, and if so how, her variable of interest is related to another variable. For example ...

12.1 Introduction (P.187-191) Many scientific investigations often involve two continuous vari-ables and researchers are interested to know whether there is a (linear) relationship between the two variables. For example, a researcher wishes to investigate whether there is a relationship between the age and the blood pressure of people 50 years or older.

Academic Research Proposal. This is the most common type of research proposal, which is prepared by students, scholars, or researchers to seek approval and funding for an academic research project. It includes all the essential components mentioned earlier, such as the introduction, literature review, methodology, and expected outcomes.

In the preceding chapters we introduced the core concepts of hypothesis testing. In this chapter we discuss the basic mechanics of hypothesis testing with three different examples of bivariate hypothesis testing. It is worth noting that, although this type of analysis was the main form of hypothesis testing in the professional journals up through the 1970s, it is seldom used as the primary ...


Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. The statistical analysis gives meaning to the meaningless numbers, ...
