
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.

Exploratory data analysis: frequencies, descriptive statistics, histograms, and boxplots.

Jacob Shreffler ; Martin R. Huecker .

Last Update: November 3, 2023 .

  • Definition/Introduction

Researchers must utilize exploratory data techniques to present findings to a target audience and create appropriate graphs and figures. Researchers can determine if outliers exist, data are missing, and statistical assumptions will be upheld by understanding data. Additionally, it is essential to comprehend these data when describing them in conclusions of a paper, in a meeting with colleagues invested in the findings, or while reading others’ work.

  • Issues of Concern

This comprehension begins with exploring these data through the outputs discussed in this article. Individuals who do not conduct research must still comprehend new studies, and knowledge of fundamentals in analyzing data and interpretation of histograms and boxplots facilitates the ability to appraise recent publications accurately. Without this familiarity, decisions could be implemented based on inaccurate delivery or interpretation of medical studies.

Frequencies and Descriptive Statistics

Effective presentation of study results, in presentation or manuscript form, typically starts with frequencies and descriptive statistics (eg, means, medians, standard deviations). One can get a better sense of the variables by examining these data to determine whether a balanced and sufficient research design exists. Frequencies also reveal missing data and give a sense of outliers (discussed below).

Luckily, software programs are available to conduct exploratory data analysis. For this chapter, we will be examining the following research question.

RQ: Are there differences in drug life (length of effect) for Drug 23 based on the administration site?

A more precise hypothesis could be: Is Drug 23 longer-lasting when administered via site A compared to site B?

To address this research question, exploratory data analysis is conducted. First, it is essential to start with the frequencies of the variables. To keep things simple, only the variables of minutes (drug life effect) and administration site (A vs B) are included. See Figure 1 for the frequency outputs.

Figure 1 shows that the administration site appears to be a balanced design with 50 individuals in each group. The excerpt for minutes frequencies is the bottom portion of Figure 1 and shows how many cases fell into each time frame with the cumulative percent on the right-hand side. In examining Figure 1, one suspiciously low measurement (135) was observed, considering time variables. If a data point seems inaccurate, a researcher should find this case and confirm if this was an entry error. For the sake of this review, the authors state that this was an entry error and should have been entered 535 and not 135. Had the analysis occurred without checking this, the data analysis, results, and conclusions would have been invalid. When finding any entry errors and determining how groups are balanced, potential missing data is explored. If not responsibly evaluated, missing values can nullify results.  
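The frequency check and entry-error correction described above can be sketched in Python. This is a minimal illustration, not the authors' software output: the six records, the 300-minute plausibility threshold, and the names records, PLAUSIBLE_MINIMUM, and corrections are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical records: (administration site, minutes of drug effect).
# None marks a missing measurement.
records = [("A", 560), ("A", 548), ("A", 135), ("B", 520), ("B", 515), ("B", None)]

# Frequency of each administration site, to check that the design is balanced.
site_counts = Counter(site for site, _ in records)

# Count missing measurements before any statistics are computed.
n_missing = sum(1 for _, minutes in records if minutes is None)

# Flag implausibly low values for manual review (the threshold is an assumption).
PLAUSIBLE_MINIMUM = 300
suspects = [
    (index, minutes)
    for index, (_, minutes) in enumerate(records)
    if minutes is not None and minutes < PLAUSIBLE_MINIMUM
]

# Only after confirming the entry error against the source record,
# apply the correction explicitly so that the change is documented.
corrections = {2: 535}  # record index -> confirmed value
cleaned = [
    (site, corrections.get(index, minutes))
    for index, (site, minutes) in enumerate(records)
]

print(site_counts)
print("missing:", n_missing)
print("flagged for review:", suspects)
```

Keeping the corrections in a separate mapping, rather than editing the raw data in place, leaves an audit trail of what was changed and why.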

After replacing the incorrect 135 with 535, descriptive statistics, including the mean, median, mode, minimum/maximum scores, and standard deviation, were examined. Output for the research example for the variable of minutes can be seen in Figure 2. Observe each variable to ensure that the mean seems reasonable and that the minimum and maximum are within an appropriate range based on medical competence or an available codebook. One assumption common in statistical analyses is a normal distribution. Figure 2 shows that the mode differs from the mean and the median. Visualization tools such as histograms can be used to examine these scores for normality and outliers before making decisions.
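As a rough sketch of how these descriptive statistics can be produced, the Python standard library's statistics module covers each of them. The ten minute values below are invented for illustration and are not the data behind Figure 2.

```python
import statistics

# Hypothetical drug-life measurements in minutes (not the article's data).
minutes = [505, 512, 520, 520, 531, 540, 548, 555, 560, 572]

summary = {
    "n": len(minutes),
    "mean": statistics.mean(minutes),
    "median": statistics.median(minutes),
    "mode": statistics.mode(minutes),
    "minimum": min(minutes),
    "maximum": max(minutes),
    # Sample standard deviation (n - 1 in the denominator).
    "sd": statistics.stdev(minutes),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the mode (520) sits below both the median (535.5) and the mean (536.3), the kind of disagreement that prompts a closer look at the distribution.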

Histograms are useful in assessing normality, as many statistical tests (eg, ANOVA and regression) assume the data have a normal distribution. When data deviate from a normal distribution, the deviation is quantified using skewness and kurtosis. [1]  Skewness occurs when one tail of the curve is longer than the other. If the tail is longer on the left side of the curve (more cases at the higher values), the distribution is negatively skewed, whereas if the tail is longer on the right side, it is positively skewed. Kurtosis is another facet of normality. Positive kurtosis occurs when the distribution has a sharp peak and heavy tails, whereas negative kurtosis occurs when the distribution is flatter with light tails. [2]
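Skewness and kurtosis can be computed directly from the central moments of a sample. The sketch below uses the simple moment-based definitions (third and fourth standardized moments, with 3 subtracted so that a normal distribution has kurtosis near 0). Statistical packages often apply small-sample bias corrections, so their output may differ slightly; the function names and sample values are assumptions made for illustration.

```python
def central_moment(values, k):
    """Return the k-th central moment (average of deviations raised to k)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution is near 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 12]   # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]       # mirror image: long left tail
flat = [1, 2, 3, 4, 5]                         # evenly spread, no peak

print("right-tailed skewness:", skewness(right_tailed))
print("left-tailed skewness:", skewness(left_tailed))
print("flat sample excess kurtosis:", excess_kurtosis(flat))
```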

Additionally, histograms reveal outliers: data points either entered incorrectly or truly very different from the rest of the sample. When there are outliers, one must determine whether they reflect random chance or an error in the experiment and provide strong justification if the decision is to exclude them. [3]  Outliers require attention to ensure the data analysis accurately reflects the majority of the data and is not influenced by extreme values; cleaning these outliers can result in better quality decision-making in clinical practice. [4]  A common approach to determining if a variable is approximately normally distributed is converting values to z scores and determining if any scores are less than -3 or greater than 3. For a normal distribution, about 99.7% of scores should lie within three standard deviations of the mean. [5]  Importantly, one should not automatically throw out any values outside of this range but should consider them alongside the other factors mentioned above. Outliers are relatively common, so when these are prevalent, one must assess the risks and benefits of exclusion. [6]
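The z score screen can be sketched as follows. The data, the function names, and the default threshold of 3 are assumptions for illustration; flagged values are candidates for review, not automatic exclusions. The last lines use the standard library's NormalDist to compute the share of a normal distribution that falls within three standard deviations of the mean.

```python
import statistics
from statistics import NormalDist


def z_scores(values):
    """Convert each value to the number of sample SDs it lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def flag_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z score exceeds the threshold."""
    return [
        (index, value)
        for index, (value, z) in enumerate(zip(values, z_scores(values)))
        if abs(z) > threshold
    ]


# Hypothetical minutes: 20 typical measurements plus one extreme value.
minutes = [520, 525, 530, 535, 540] * 4 + [900]

print("flagged for review:", flag_outliers(minutes))

# Share of a normal distribution within 3 standard deviations of the mean.
coverage = NormalDist().cdf(3) - NormalDist().cdf(-3)
print(f"within 3 SD: {coverage:.4f}")
```

Note that in very small samples a single extreme value inflates the standard deviation enough that its own z score may never reach 3, so this screen is more informative with larger samples.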

Figure 3 provides examples of histograms. In Figure 3A, 2 possible outliers causing kurtosis are observed. If only values within 3 standard deviations are used, the result in Figure 3B is observed. This histogram appears much closer to an approximately normal distribution, with the kurtosis being treated. Remember, all evidence should be considered before eliminating outliers. When reporting outliers in scientific papers, state the number of outliers excluded and justify why they were excluded.

Boxplots can be used to check for outliers, assess the range of data, and show differences among groups. Boxplots provide a visual representation of ranges and medians, illustrating differences among groups, and are useful in various settings, including evidence-based medicine. [7]  Boxplots provide a picture of data distribution when there are numerous values and all values cannot be displayed individually (eg, in a scatterplot). [8]  Figure 4 illustrates the differences between drug site administration and the length of drug life from the above example.

Figure 4 shows differences with potential clinical impact. Had any outliers existed (the data were cleaned after examining the histogram), they would appear beyond the line endpoints. The red boxes represent the middle 50% of scores. The lines within each red box represent the median number of minutes within each administration site. The edges of each red box represent the 25th and 75th percentiles, and the horizontal lines at the ends of the lines extending from each box mark the lowest and highest values that are not outliers. In examining the boxplots, an overlap in minutes between the 2 administration sites was observed: the approximate top 25 percent from site B had the same time noted as the bottom 25 percent at site A. Site B had a median under 525 minutes, whereas site A had a median greater than 550 minutes. If there were no differences in adverse reactions at site A, analysis of this figure provides evidence that healthcare providers should administer the drug via site A. Researchers could follow by testing a third administration site, site C. Figure 5 shows what would happen if site C led to a longer drug life compared to site A.

Figure 5 displays the same site A data as Figure 4, but something looks different. The large variance at site C makes site A’s variance appear smaller. In other words, patients who were administered the drug via site C had a larger range of scores. Thus, some patients experience a longer drug life when the drug is administered via site C than the median of site A; however, the broad range (lack of precision) and lower median should be the focus. The minutes are much more tightly clustered at site A: its median is higher, and its range is narrower. One may conclude that this makes site A a more desirable site.
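The comparison of center and spread that a boxplot makes visually can also be tabulated. The sketch below uses invented values for sites A and C (not the data behind Figure 5) and the standard library's statistics.quantiles with the "inclusive" method; other quartile conventions give slightly different numbers.

```python
import statistics

# Hypothetical minutes of drug effect per administration site.
site_minutes = {
    "A": [548, 552, 555, 558, 560, 563, 566],
    "C": [480, 510, 535, 550, 585, 620, 660],
}


def spread_summary(values):
    """Return the median, interquartile range, and full range of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": q2, "iqr": q3 - q1, "range": max(values) - min(values)}


summaries = {site: spread_summary(values) for site, values in site_minutes.items()}

for site, stats in summaries.items():
    print(site, stats)
```

In this made-up example site C reaches higher individual values than site A, yet its median is lower and its spread is far wider, which mirrors the reasoning applied to Figure 5.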

  • Clinical Significance

Ultimately, by understanding basic exploratory data methods, medical researchers and consumers of research can make quality and data-informed decisions. These data-informed decisions will result in the ability to appraise the clinical significance of research outputs. By overlooking these fundamentals in statistics, critical errors in judgment can occur.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All interprofessional healthcare team members need to be at least familiar with, if not well-versed in, these statistical analyses so they can read and interpret study data and apply the data implications in their everyday practice. This approach allows all practitioners to remain abreast of the latest developments and provides valuable data for evidence-based medicine, ultimately leading to improved patient outcomes.

Exploratory Data Analysis Figure 1 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 2 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 3 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 4 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Exploratory Data Analysis Figure 5 Contributed by Martin Huecker, MD and Jacob Shreffler, PhD

Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Exploratory Data Analysis: Frequencies, Descriptive Statistics, Histograms, and Boxplots. [Updated 2023 Nov 3]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression.
  • Predictive models, such as linear regression, use statistics and data to predict outcomes.
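The K-means procedure described in the list above (assign each point to its nearest centroid, then move each centroid to the mean of its group) can be sketched in a few lines. This is a simplified one-dimensional version with hand-picked starting centroids, written for illustration only; the function names and data are assumptions, and real analyses would normally rely on an established library.

```python
def assign_to_clusters(values, centroids):
    """Assignment step: each value joins the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def k_means_1d(values, initial_centroids, max_iterations=10):
    """Run K-means on one-dimensional data from the given starting centroids."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        clusters = assign_to_clusters(values, centroids)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments no longer change
            break
        centroids = updated
    return centroids, assign_to_clusters(values, centroids)


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(values, initial_centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```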

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consist of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Graphical methods give a fuller picture of a single variable than summary statistics alone. Common types of univariate graphics include stem-and-leaf plots, which show all data values and the shape of the distribution; histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values; and box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arise from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
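As a small illustration of the cross-tabulation mentioned under multivariate non-graphical EDA, the sketch below counts how often each pair of categories occurs together. The two variables (site and outcome) and all observations are invented for the example.

```python
from collections import Counter

# Hypothetical paired observations: (administration site, outcome).
observations = [
    ("A", "improved"), ("A", "improved"), ("A", "improved"), ("A", "unchanged"),
    ("B", "improved"), ("B", "unchanged"), ("B", "unchanged"), ("B", "unchanged"),
]

# Each cell of the cross-tabulation is the count of one (row, column) pair.
table = Counter(observations)

row_labels = sorted({site for site, _ in observations})
column_labels = sorted({outcome for _, outcome in observations})

# Print the table with one row per site and one column per outcome.
print("site", *column_labels)
for site in row_labels:
    print(site, *(table[(site, outcome)] for outcome in column_labels))
```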

Other common types of multivariate graphics include:

  • Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.

Some of the most common data science tools used to create an EDA include:

  • Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.
  • R: An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science in developing statistical observations and data analysis.
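To make the point about missing values concrete, here is a minimal sketch that counts empty fields per column using only the Python standard library. The inline CSV text and its column names are assumptions for illustration; in practice many analysts would use a data-frame library for the same check.

```python
import csv
import io

# Hypothetical data set; empty fields represent missing values.
raw = """site,minutes,age
A,560,34
B,,41
A,548,
B,515,29
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Count empty (missing) fields in each column.
missing = {
    column: sum(1 for row in rows if row[column].strip() == "")
    for column in rows[0]
}

print(missing)
```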

For a deep dive into the differences between these approaches, check out " Python vs. R: What's the Difference? "

What Is Exploratory Data Analysis?

Exploratory data analysis is one of the first steps in the data analytics process. In this post, we explore what EDA is, why it’s important, and a few techniques worth familiarizing yourself with.

Data analytics requires a mixed range of skills. This includes practical expertise, such as knowing how to scrape and store data. It also requires more nuanced problem-solving abilities, such as how to analyze data and draw conclusions from it. As a statistical approach, exploratory data analysis (or EDA) is vital for learning more about a new dataset.

Applied early on in the data analytics process, EDA can help you learn a great deal about a dataset’s inherent attributes and properties.

In this post, we’ll introduce the topic in more detail, answering the following questions:

  • What is exploratory data analysis?
  • Why is exploratory data analysis important?
  • What are the underlying principles of exploratory data analysis?
  • What are some techniques you can use for exploratory data analysis?

First up, though…

1. What is exploratory data analysis?

In data analytics, exploratory data analysis is how we describe the practice of investigating a dataset and summarizing its main features. It’s a form of descriptive analytics.

EDA aims to spot patterns and trends, to identify anomalies, and to test early hypotheses. Although exploratory data analysis can be carried out at various stages of the data analytics process, it is usually conducted before a firm hypothesis or end goal is defined.

In general, EDA focuses on understanding the characteristics of a dataset before deciding what we want to do with that dataset.

Exploratory data analytics often uses visual techniques, such as graphs, plots, and other visualizations. This is because our natural pattern-detecting abilities make it much easier to spot trends and anomalies when they’re represented visually.

As a simple example, outliers (or data points that skew a trend) stand out much more immediately on a scatter graph than they do in columns on a spreadsheet.

Source: Indoor-Fanatiker, derivative work: Indoor-Fanatiker, CC0, via Wikimedia Commons

In the image, the outlier (in red) is immediately clear. Even if you’re new to data analytics, this approach will seem familiar. If you ever plotted a graph in math or science at school in order to infer information about a dataset, then you’ve carried out a basic EDA.

The American mathematician John Tukey began championing exploratory data analysis in the early 1960s and set out its methods in his 1977 book Exploratory Data Analysis. The idea of summarizing a dataset’s key characteristics was later supported by statistical programming languages such as S and its successor, R. This was also a time when scientists and engineers were working on new data-driven problems related to early computing.

Since then, EDA has been widely adopted as a core tenet of data analytics and data science more generally. It is now considered a common—indeed, indispensable—part of the data analytics process.

2. Why is exploratory data analysis important?

At this stage, you might be asking yourself: why bother carrying out an EDA?

After all, data analytics today is far more sophisticated than it was in the 1960s. We have algorithms that can automate so many tasks. Surely it’s easier (and even preferable) to skip this step of the process altogether?

In truth, it has been shown time and again that effective EDA provides invaluable insights that an algorithm cannot. You can think of this a bit like running a document through a spellchecker versus reading it yourself. While software is useful for spotting typos and grammatical errors, only a critical human eye can detect the nuance.

An EDA is similar in this respect—tools can help you, but it requires our own intuition to make sense of it. This personal, in-depth insight will support detailed data analysis further down the line.

Specifically, some key benefits of an EDA include:

Spotting missing and incorrect data

As part of the data cleaning process, an initial data analysis (IDA) can help you spot any structural issues with your dataset.

You may be able to fix these, or you might find that you need to reprocess the data or collect new data entirely. While this can be a nuisance, it’s better to know upfront, before you dive in with a deeper analysis.

Understanding the underlying structure of your data

Properly mapping your data ensures that you maintain high data quality when transferring it from its source to your database, spreadsheet, data warehouse, etc. Understanding how your data is structured means you can avoid mistakes from creeping in.  

Testing your hypothesis and checking assumptions

Before diving in with a full analysis, it’s important to make sure any assumptions or hypotheses you’re working on stand up to scrutiny.

While an EDA won’t give you all the details, it will help you spot if you’re inferring the right outcomes based on your understanding of the data. If not, then you know that your assumptions are wrong, or that you are asking the wrong questions about the dataset.

Calculating the most important variables

When carrying out any data analysis, it’s necessary to identify the importance of different variables.

This includes how they relate to each other. For example, which independent variables affect which dependent variables? Determining this early on will help you extract the most useful information later on.

Creating the most efficient model

When carrying out your full analysis, you’ll need to remove any extraneous information. This is because needless additional data can either skew your results or simply obscure key insights with unnecessary noise.

In pursuit of your goal, aim to include the fewest number of necessary variables. EDA helps identify information that you can extract.

Determining error margins

EDA isn’t just about finding helpful information. It’s also about determining which data might lead to unavoidable errors in your later analysis.

Knowing which data will impact your results helps you to avoid wrongly accepting false conclusions or incorrectly labeling an outcome as statistically significant when it isn’t.

Identifying the most appropriate statistical tools to help you

Perhaps the most practical outcome of your EDA is that it will help you determine which techniques and statistical models will help you get what you need from your dataset.

For instance, do you need to carry out a predictive analysis or a sentiment analysis? An EDA will help you decide.

As is hopefully clear by now, intuition and reflection are key skills for carrying out exploratory data analysis. While EDA can involve executing defined tasks, interpreting the results of these tasks is where the real skill lies.

3. What are the underlying principles of exploratory data analysis?

Now we know what exploratory data analysis is and why it’s important, how exactly does it work?

In short, exploratory data analysis considers what to look for, how to look for it, and, finally, how to interpret what we discover. At its core, EDA is more of an attitude than it is a step-by-step process. Exploring data with an open mind tends to reveal its underlying nature far more readily than making assumptions about the rules we think (or want) it to adhere to.

In data analytics terms, we can generally say that exploratory data analysis is a qualitative investigation, not a quantitative one. This means that it involves looking at a dataset’s inherent qualities with an inquisitive mindset. Usually, it does not attempt to make cold measurements or draw insights about a dataset’s content. This comes later on.

You’d be forgiven for thinking this sounds a bit esoteric for a scientific field like data analytics! But don’t worry. There are some practical principles to exploratory data analysis that can help you proceed. A key one of these is known as the five-number summary.

What is the five-number summary?

The five-number summary is a set of five descriptive statistics. Simple though these are, they make a useful starting point for any exploratory data analysis.

The aim of the five-number summary is not to make a value judgment on which statistics are the most important or appropriate, but to offer a concise overview of how different observations in the dataset are distributed. This allows us to ask more nuanced questions about the data, such as ‘why are the data distributed this way?’ or ‘what factors might impact the shape of these data?’ These sorts of questions are vital for obtaining insights that will help us determine the goals for our later analysis.

The five-number summary includes the five most common sample percentiles:

  • The sample minimum (the smallest observation)
  • The lower quartile (the median of the lower half of the data)
  • The median (the middle value)
  • The upper quartile (the median of the upper half of the data)
  • The sample maximum (the largest observation)

The lower and upper quartiles are essentially the medians of the lower and upper halves of the dataset. The difference between them is the interquartile range, which spans the middle 50% of the dataset.

In turn, this helps describe the overall spread of the data, allowing you to identify any outliers. These five statistics can be easily shown using a box plot.

Source: Dcbmariano, CC BY-SA 4.0 via  Wikimedia Commons

The five-number summary can be used to determine a great number of additional attributes about a given dataset. This is why it is such a foundational part of data exploration.

To make matters easier, many programming languages, including R and Python, have inbuilt functions for determining the five-number summary and producing the corresponding box plots.
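A five-number summary can also be written out by hand, which makes the definitions above explicit. The sketch below follows the "median of each half" convention described in this section (for an odd number of observations, the overall median is left out of both halves). Built-in routines may use other quartile conventions and return slightly different values; the function name and sample data are assumptions for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Quartiles are the medians of the lower and upper halves of the sorted
    data; with an odd count, the middle value belongs to neither half.
    Expects at least two observations.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[half + 1:] if len(data) % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )


sample = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
minimum, lower_q, median, upper_q, maximum = five_number_summary(sample)

print(five_number_summary(sample))
print("interquartile range:", upper_q - lower_q)
```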

4. What are some techniques you can use for exploratory data analysis?

As we’ve already explained, most (though not all) EDA techniques are graphical in nature.

Graphical tools, like the box plot described previously, are very helpful for revealing a dataset’s hidden secrets. What follows are some common techniques for carrying out exploratory data analysis. Many of these rely on visualizations that can be easily created using tools like R, Python, S-Plus, or KNIME, to name a few popular ones.

Univariate analysis

Source: Michiel1972, CC BY-SA 3.0 via Wikimedia Commons

Univariate analysis is one of the simplest forms of data analysis. It looks at the distribution of a single variable (or column of data) at a time.

While univariate analysis does not strictly need to be visual, it commonly uses visualizations such as tables, pie charts, histograms, or bar charts.
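As a quick illustration of univariate analysis, the sketch below groups a single variable into fixed-width bins and prints a text histogram. The values and the 20-minute bin width are assumptions for the example, and the bin labels assume whole-number measurements.

```python
from collections import Counter


def histogram_bins(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    return Counter((value // bin_width) * bin_width for value in values)


# Hypothetical single variable: drug-life measurements in whole minutes.
minutes = [505, 512, 520, 520, 531, 540, 548, 555, 560, 572]
bin_width = 20
bins = histogram_bins(minutes, bin_width)

# One row per bin, with one '#' per observation in that bin.
for lower_edge in sorted(bins):
    label = f"{lower_edge}-{lower_edge + bin_width - 1}"
    print(f"{label}: {'#' * bins[lower_edge]}")
```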

Multivariate analysis

Source: Public domain via  Wikimedia Commons

Multivariate analysis looks at the distribution of two or more variables and explores the relationship between them. Most multivariate analyses compare two variables at a time (bivariate analysis).

However, it sometimes involves three or more variables. Either way, it is good practice to carry out univariate analysis on each of the variables before doing a multivariate EDA.

Any plot or graph that shows two or more variables can be used to create a multivariate visualization (for example, a line graph that plots speed against time).
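Alongside a plot, the strength of a linear relationship between two variables can be summarized numerically. The sketch below computes the Pearson correlation coefficient by hand for invented speed and time readings; the function name and data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings: time in seconds and speed in meters per second.
time_s = [0, 1, 2, 3, 4]
speed = [0.0, 2.1, 3.9, 6.2, 8.0]

r = pearson_r(time_s, speed)
print(f"r = {r:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association; it says nothing about nonlinear patterns, which is one reason to look at the plot as well.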

Classification or clustering analysis

Source: Chire, CC BY-SA 3.0  via Wikimedia Commons

Clustering analysis is when we place objects into groups based on their common properties.

It is similar to classification. The key difference is that classification involves grouping items using explicit, predefined classes (e.g. categorizing a dataset of people based on a range of their heights).

Clustering, meanwhile, involves grouping data based on what they implicitly tell us (such as whether someone’s height means they are highly likely, quite likely, or not at all likely to bang their head on a doorframe!)

Predictive analysis

Source: Berland, Public domain, via Wikimedia Commons

Although predictive analysis is commonly used in machine learning and AI to make (as the name suggests) predictions, it’s also popular for EDA.

In this context, it doesn’t always refer to uncovering future information but simply using predictive methods—such as linear regression—to find unknown attributes (for example, using existing data to infer the values for gaps in historical data).
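The gap-filling idea can be sketched as follows: fit a straight line to the observations you have, then read off the value at the missing point. The yearly figures and the `fit_line` helper are hypothetical, and a straight line is only one of many possible models for such data.

```python
from statistics import fmean

# Hypothetical yearly measurements with a gap in 2019 (illustrative values only).
years = [2015, 2016, 2017, 2018, 2020, 2021]
values = [10.0, 12.0, 14.0, 16.0, 20.0, 22.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

slope, intercept = fit_line(years, values)

# Use the fitted line to estimate the missing historical value.
estimate_2019 = slope * 2019 + intercept
print(f"Estimated value for 2019: {estimate_2019:.1f}")
```

An estimate like this is only as trustworthy as the assumption that the trend is roughly linear over the gap.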

These represent just a small handful of the techniques you can use for conducting an EDA. But they hopefully offer a taste of the kinds of approaches you can take when trying to better understand a dataset.

5. In summary

In this post, we’ve introduced the topic of exploratory data analysis, why it’s important, and some techniques you might want to familiarize yourself with before carrying one out. We’ve learned that exploratory data analysis:

  • Summarizes the features and characteristics of a dataset.
  • Is a philosophy, or approach, rather than a defined process.
  • Often draws out insights using visualizations, e.g. graphs and plots.
  • Is important for spotting errors, checking assumptions, identifying relationships between variables, and selecting the right data modeling tools.
  • Builds on the five-number summary, namely: the sample minimum and maximum, the lower and upper quartile, and the median.
  • Employs various techniques, such as univariate and multivariate analysis, clustering, and predictive analytics, to name a few.
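As a small illustration of the five-number summary mentioned in the list above, the sketch below computes it with Python's standard `statistics` module. The sample values are invented, and the choice of `method="inclusive"` is an assumption: quartile conventions differ between tools, so other software may report slightly different quartiles.

```python
from statistics import median, quantiles

# Hypothetical sample (illustrative values only).
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 21]

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles(..., n=4) returns the three quartile cut points;
    # "inclusive" is one of several common quartile conventions.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median(data), q3, max(data)

summary = five_number_summary(sample)
print(summary)
```

These five values are exactly what a box plot draws: the box spans the quartiles, the line inside marks the median, and the whiskers reach toward the extremes.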


Exploratory Data Analysis

Saul Mcleod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul Mcleod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

Exploratory Data Analysis (EDA) is an approach to analyzing data that emphasizes exploring datasets for patterns and insights without any predetermined hypotheses. The goal is to let the data “speak for themselves” and guide analysis, rather than imposing rigid structures or theories.

Exploratory data analysis (EDA) is mainly used to investigate datasets in order to discover patterns, spot anomalies, test hypotheses, and check assumptions.

EDA has several key goals:

  • Quickly summarize and describe the characteristics of a dataset. By visualizing data distributions and calculating descriptive statistics, we can get an overview of the salient properties of variables in the dataset.
  • Check the quality of the data and identify any issues. Data visualization and summary statistics readily reveal missing values, errors, and outliers that may need mitigation before proceeding with analysis.
  • Formulate hypotheses and derive insights by exploring interesting aspects of the data. Patterns may suggest causal hypotheses to test via statistical modeling. Outliers often contain useful domain insights.
  • Understand relations between variables. Visualizations can uncover the nature of bivariate relationships: shape, direction, form, outliers, etc. This guides correlation and regression modeling choices.
  • Test assumptions for statistical models you intend to apply later. Histogram shapes indicate whether distributional assumptions (such as normality) are plausible. Scatterplots check linearity assumptions. Identifying where assumptions break provides guidance for requisite data transformations or alternate models.

In essence, EDA entails active investigation of what our data contains even before formal modeling to guide choices, reveal issues needing resolution, and ensure we squeeze all potential value from our data resources.

The flexibility and lack of stringent assumptions make EDA invaluable for open-ended understanding.

Exploratory Data Analysis (EDA) emphasizes flexibility and exploring different approaches to let key aspects of datasets emerge, rather than rigidly testing hypotheses from the start.

It is an iterative cycle where we analyze, visualize, and transform data to extract meaning.

EDA principles underlie “data science” and complement traditional statistical inference. Smart use of EDA provides a rich understanding of phenomena that can guide the construction of causal theories and models.

Visualizations

Creating graphs, charts, and plots to visually inspect data distributions, relationships between variables, outliers, etc. Pictures allow our powerful visual perception to notice things numerical summaries may miss. For example, a scatterplot may reveal that income and education level have a curvilinear relationship, challenging the assumption that the relationship is linear.

Summarizing

Describing key statistics of datasets to understand central tendency (mean, median), spread (variance, percentiles), shape (skewness), outliers, and so on.

These numerical summaries complement visual inspection.
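A minimal sketch of such numerical summaries, using only Python's standard library, might look like the following. The reaction-time values and the `skewness` helper (the moment coefficient of skewness) are assumptions chosen for illustration.

```python
from statistics import fmean, median, pstdev, pvariance, quantiles

# Hypothetical reaction times in milliseconds (illustrative values only).
reaction_ms = [310, 325, 330, 342, 350, 355, 361, 370, 388, 620]

# Central tendency.
mean_ms = fmean(reaction_ms)
median_ms = median(reaction_ms)

# Spread.
variance = pvariance(reaction_ms)        # population variance
std_dev = pstdev(reaction_ms)            # population standard deviation
deciles = quantiles(reaction_ms, n=10)   # nine cut points: 10th to 90th percentile

def skewness(data):
    """Shape: moment coefficient of skewness (mean of cubed z-scores)."""
    m, s = fmean(data), pstdev(data)
    return fmean(((x - m) / s) ** 3 for x in data)

print(f"mean={mean_ms:.1f} median={median_ms:.1f} sd={std_dev:.1f}")
print(f"skewness={skewness(reaction_ms):.2f}")
```

Here the single large value pulls the mean above the median, which is the kind of discrepancy that should prompt a look at the histogram or box plot.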

Data transformations

Applying mathematical functions to “reshape” datasets to simplify observed patterns. For example, taking the logarithm of an extremely skewed variable like income may make its distribution more normal and amend issues for certain statistical tests.
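The effect of a log transformation can be sketched by comparing skewness before and after. The income figures below are invented, and the `skewness` helper is one common definition (the moment coefficient); treat this as an illustration rather than a recipe.

```python
import math
from statistics import fmean, pstdev

# Hypothetical annual incomes (illustrative values only); a few large values
# create a long right tail.
incomes = [18_000, 22_000, 25_000, 27_000, 30_000,
           34_000, 40_000, 55_000, 90_000, 400_000]

def skewness(data):
    """Moment coefficient of skewness: mean of cubed z-scores."""
    m, s = fmean(data), pstdev(data)
    return fmean(((x - m) / s) ** 3 for x in data)

# The natural log is only defined for positive values.
log_incomes = [math.log(x) for x in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The transformed values are less skewed than the raw ones, though with an extreme value like the last one the distribution may still be far from normal.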

Outlier detection

Identifying anomalies that distort overall patterns, and either correcting erroneous values or analyzing outliers specifically since they often reveal useful insights about the phenomenon under study. Fence methods, studentized residuals, Cook’s distance, and other techniques help detect outliers.
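As an example of a fence method, the sketch below applies Tukey's rule, which flags values lying more than 1.5 interquartile ranges beyond the quartiles. The measurements are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption that affects exactly where the fences fall.

```python
from statistics import quantiles

# Hypothetical measurements (illustrative values only).
measurements = [12, 13, 13, 14, 15, 15, 16, 17, 18, 45]

def tukey_fences(data, k=1.5):
    """Return (lower_fence, upper_fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(measurements)

# Values outside the fences are candidates for closer inspection,
# not automatically errors.
outliers = [x for x in measurements if x < lower or x > upper]
print(f"fences=({lower}, {upper}) flagged={outliers}")
```

A flagged value is a prompt to investigate: it may be a data-entry error, or it may be the most interesting observation in the dataset.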

Interrogating with different analysis techniques

Trying various statistical and machine learning techniques to understand different facets of datasets. Rather than sticking to predetermined analysis plans, the focus is using diverse tools suited for particular datasets. Techniques could include clustering algorithms, decision trees, linear regression, ANOVA, etc., based on the data characteristics and research goals.

Descriptive vs. Explorative Analysis

Descriptive analysis focuses on summarizing what the data shows on the surface. Exploratory analysis digs deeper to uncover subtle patterns and non-obvious trends in the data.

Descriptive analysis might report a dataset’s average, median, and standard deviation. Exploratory analysis would use visualizations, transformations, and different analysis techniques to model the relationships between variables, going beyond summary statistics.

So descriptive analysis describes what the data shows, while exploratory analysis explores nuances in the data to extract deeper meaning.

But good data analysis uses both techniques – summary statistics to complement the graphs and visuals revealing relationships.

Descriptive Analysis

  • Summarizes and presents the data without making inferences or models
  • Uses simple graphics like histograms, bar charts, summary statistics
  • Goal is to describe patterns in the data

Exploratory Analysis

  • Makes inferences about patterns, relationships, effects in the data
  • Relies heavily on graphics and visualization
  • Transforms/manipulates the data to extract meaning
  • Iterative cycle to understand the data
  • Goal is to extract deeper insights from the data



The Exploratory Essay

What is an Exploratory Essay?

The purpose of an exploratory essay is to investigate a topic through critical inquiry and present the research findings to your audience. An exploratory essay can take different forms, depending on its purpose. For example, a literature review in the social sciences is a form of exploratory essay. The writer identifies and evaluates the research that others in the field have already done in order to provide the background information needed for either a new interpretation of existing research or new research to add to the existing body of knowledge. Market analyses and environment scans are forms that the exploratory essay can take in business and industry. The company identifies and analyzes key external factors in order to inform strategic decisions.

The CRIT 602 Exploratory Essay & Your Audience

In the case of the exploratory essay in CRIT 602, its purpose is to answer the following question:

How will current trends, concerns, and research related to my academic discipline inform the decisions I make about my academic and professional development or personal interests?

A good way to think about your approach to the CRIT 602 exploratory essay is to go back to your hypothetical external audience. Your readers are at the same stage of academic and professional development as you are and have similar goals. They are looking for credible information to examine key trends in the field, but they want the information to be evaluated by someone who understands where they are coming from.

You have already provided your readers with relevant trending information about professional organizations, social media, and information resources. They have appreciated the information you’ve presented from these contexts because it is current and relevant to their own academic and professional development. Moreover, you have demonstrated through your analysis and citations that the information you’ve shared is valid.

Purpose & Content:

Currently, your readers are coming back to you because they would really like to know what conclusions you’ve reached about the critical inquiry question you have in common, now that you’ve had the opportunity to look at your research findings in their entirety. How are the different contexts related to each other? Are there trends or patterns across contexts? Contradictions? Controversies? What about employment trends related to the field? Based on your research, are there assumptions you’ve identified that should be challenged?  What implications do you see for someone at your stage of academic and professional development? Finally, your readers would be interested in how you plan to apply this information to your own academic and professional development, so that they can more fully understand the implications of your research findings and apply them to their personal situations.

This is not an analytical paper.   You are presenting information and drawing conclusions based on your research, but you are not tasked with providing a thesis statement.  You should have a clear purpose stated, but your aim is to show your audience the information you’ve collected this term and explore the contexts and trends in the field that you’ve discovered.  After presenting that information, you’ll want to discuss assumptions and implications you see and how this newfound information would affect your audience’s academic and career goals.

Purdue University has an excellent site that describes how to approach writing an exploratory essay. As you plan your paper, please use the Assignment Guidelines below in conjunction with the Purdue OWL sources on Organizing an Exploratory Essay, as well as other resources found on that website. Remember: this is NOT a research paper. You are to EXPLORE the research you have compiled throughout this term and present it in an organized and synthesized way.

Although there are many ways to write an essay like this, a sample of a strong Exploratory Essay can be found here. Any assignment submitted later than three days after the due date will receive a zero for the assignment.

  • ALL essays should be formatted using APA style and in-text citations (or citation format used in your major or industry).
  • You should include a Works Cited page or Bibliography
  • Even though this is an exploratory, partially reflective, essay, it should use formal language and adhere to common grammatical and stylistic guidelines.

Exploratory Essay Guidelines

Area of Exploration: Your essay should focus on the following question: What current external factors impacting my field of study and its associated professions should inform the decisions of someone at my stage of academic and professional development?

Format: This essay should be 7-10 pages, double-spaced, with proper citations and a works cited page. It should be thorough, evaluative, well-written, and free of syntax and grammatical errors.

Outline: You may want to use the following outline and prompts to organize all the information you have gathered in the past few weeks. For information on grading, please consult the exploratory essay rubric.

Introduction:

Purpose, Audience & Context: Provide information and analysis about the larger context for your field of study and its associated professions to people at the same stage of academic and professional development as you are.

Keep in mind: This is an exploratory and evaluative assignment, so a thesis/argument is not necessary; however, the purpose of the assignment should be clear in the introduction.

  • Discuss each of the contexts you explored in the research assignments and how they are related to each other.  This should not be a list, or copied from your Annotated Bibliography.  It should be in paragraph format with topic sentences and smooth transitions.

Influence of Context & Assumptions:

  • Address your own assumptions about your field of study and its associated professions. What did you assume about the field before you started researching?  What do you find the most important aspects of the profession?
  • Identify how contexts outside of higher education influence the study and/or practice of your field.
  • Show how the different contexts are related to each other. Compare and contrast them.
  • Discuss trends or patterns across contexts. Explain any contradictions or controversies you encountered in your research.
  • Assess the employment trends related to the field.

Conclusions & Related Implications & Consequences:

  • After presenting your research, draw any conclusions you’ve reached about the critical inquiry question.
  • Identify the contexts outside of higher education that are most relevant for someone at your stage of academic and professional development.
  • Discuss any implications you see for someone at your stage of academic and professional development or personal interest in the field.
  • Based on your research, assess any assumptions you’ve identified that should be challenged.
  • Show how you plan to apply this information to your own academic and professional development or personal interest, so that the audience can more fully understand the implications of your research findings and apply them to their personal situations.

References: Access and Use Information Ethically and Legally:

  • Have you clearly identified information and ideas taken from outside sources through the use of citations?

Formatting the Exploratory Essay

Paper Length, Format & Citation of Sources

The body of your paper should be 7-10 pages.

Your paper must align with the style guide used by your major: American Psychological Association (APA), Modern Language Association (MLA), or Chicago Manual of Style (CMOS). Use the following sample papers to ensure that you’re following the style conventions appropriately:

  • Sample APA Paper,
  • Sample MLA Paper,
  • Sample Chicago-Style Paper.

For questions about citation format for the most common sources of information, you can begin with the CPS Library & Information Commons:

  • Citation Resources.

For questions about citation format for sources that don’t exactly fit the customary models, consult the style manual or go to the applicable site below, and type your question in the search box. Chances are that someone else has already asked the question and received an answer:

  • APA Style Blog,
  • Ask the MLA,
  • Browse Q&A (Chicago Manual of Style).

Integrating Research Findings into the Exploratory Essay

The following link to the Excelsior Online Writing Lab (OWL) will take you to resources for paragraphing, summarizing, paraphrasing, and quoting. The resources are intended for writing traditional research papers; however, they apply to our Exploratory Essay as well:

  • Drafting & Integrating

CRIT 602 Readings and Resources Copyright © 2019 by Granite State College (USNH) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.


Role of Exploratory Data Analysis in Data Science


Visual Display of Data: Exploratory Data Analysis Essay

Figures in the original essay (images not reproduced here):

  • Descriptive Statistics for Festival.sav
  • Gendered Boxplot for Day1 CorrectedFestival
  • Bar Chart and Error Bars Created from Chick-Flick.sav
  • Bar Charts and Error Bars Created from Hiccups.sav
  • Clustered Bar Chart Created from TextMessages.sav
  • Scatterplot and Regression Line Created from Exam Anxiety.sav

Exploratory data analysis (EDA) forms the basis of confirmatory statistical analysis because it plays a significant role in the characterization of variables, summarization of data, and visualization of patterns and trends. According to Wickham and Grolemund (2017), summarizing, visualizing, transforming, and modeling are major methods employed in EDA. One of the reasons for performing EDA is to check for errors to confirm the existence of expected values, distributions, and relationships. Common errors emanating from typos create missing values and transposition errors, which distort the nature of the distribution of data and relationships between variables. In this view, EDA methods, such as boxplots, normality plots, scatterplots, symmetric analysis, and correlation analysis, aid in checking errors in data. Checking for assumptions is another reason for undertaking EDA before confirmatory statistical analysis. Since confirmatory statistical analysis requires data to meet specific assumptions, checking for assumptions using EDA is critical to ensure the validity of findings. EDA tests of measurement scales, normality, multicollinearity, homoscedasticity, and homogeneity of variances are ordinary tests of statistical assumptions.

The third purpose of EDA is to summarize data so as to reveal patterns and trends in distributions. For example, statisticians normally use descriptive statistics to summarize data. Downey (2015) explains that descriptive statistics comprise measures of central tendency and dispersion, which provide a substantial summary of patterns and trends in the data. Through descriptive statistics, statisticians can establish the magnitude and range of each variable and make informed interpretations of inferential statistics. The fourth purpose of EDA is that it permits the preliminary selection of appropriate tools for designing and formulating statistical models (Wickham & Grolemund, 2017). For instance, regression analysis requires selecting an appropriate model that predicts relationships between the variables of interest. The stepwise method of regression analysis is an iterative EDA procedure that sequentially selects significant predictors and eliminates insignificant ones.

The fifth benefit of EDA is the selection of appropriate tools and techniques for data collection and analysis. Cluster analysis and dimension reduction are examples of EDA methods that help assess the validity of questionnaires and variables used in data collection (Downey, 2015). Cluster analysis eases data analysis because it sorts variables into groups whose members are similar to one another and different from those in other groups. Since raw data often contain many redundancies, dimension reduction eliminates redundant variables and creates principal variables, which explain most of the variation in the data. Thus, EDA helps ensure that research instruments generate not only valid but also reliable data for meaningful inferential statistics.

Downey, A. (2015). Think stats. Sebastopol, CA: O’Reilly Media.

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize and model data. Sebastopol, CA: O’Reilly Media.


IvyPanda. (2024, March 22). Visual Display of Data: Exploratory Data Analysis. https://ivypanda.com/essays/visual-display-of-data-exploratory-data-analysis/



Data Analytics for Process Engineers, pp. 13–57

Exploratory Data Analysis

  • Daniela Galatro &
  • Stephen Dawe
  • First Online: 03 December 2023

Part of the book series: Synthesis Lectures on Mechanical Engineering ((SLME))

Exploratory Data Analysis (EDA) is an approach employed to analyze datasets. Primarily, EDA uses data visualization methods and often statistical models to (i) assess a dataset’s general structure, (ii) obtain descriptive summaries of the data, and (iii) provide the basis for model formulation. EDA includes checks on data quality, calculation of summary statistics, and data plotting. The data quality is checked regarding errors, outliers, and missing observations. This chapter explores simple visualization EDA techniques, algorithms to detect and handle outliers and missing values, more advanced tools such as correlograms and clustering, and dimensionality reduction techniques.


Behera, S., & Suresh, A. K. (2019). Data on interfacial hydrolysis kinetics of an aromatic acid chloride. Data in Brief, 26, 104337. https://doi.org/10.1016/j.dib.2019.104337

Andrady, A. (2015). Degradation of plastics in the environment. Plastics and Environmental Sustainability, 145–184. https://doi.org/10.1002/9781119009405.ch6 .

Mallick, J., Singh, C., AlMesfer, M., Kumar, A., Khan, R., Islam, S., & Rahman, A. (2018). Hydro-geochemical assessment of groundwater quality in ASEER region Saudi Arabia. Water, 10 (12), 1847. https://doi.org/10.3390/w10121847

Souza, F., Araújo, R., Matias, T., & Mendes, M. (2013). A Multilayer-Perceptron Based Method for Variable Selection in Soft Sensor Design. Journal of Process Control, 23 (10), 1371–1378. https://doi.org/10.1016/j.jprocont.2013.09.014

California Air Resources Board. Inhalable Particulate Matter and Health (PM2.5 and PM10) | California Air Resources Board. (n.d.). https://ww2.arb.ca.gov/resources/inhalable-particulate-matter-and-health .

Taiyun Wei, V. S. (2021, November 18). An introduction to corrplot package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html


Author information

Authors and affiliations.

Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada

Daniela Galatro & Stephen Dawe


Corresponding author

Correspondence to Daniela Galatro .

2.1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (ZIP 64 kb)

Data disclosure.

The data were generated so as to preserve the physical meaning of the phenomenon or process analyzed in each problem.

Analyze the current Exploratory Data Analysis on the data for modelling nitrogen dioxide concentration levels across Germany. The article can be downloaded from: https://www.sciencedirect.com/science/article/pii/S2352340921006089 . The dataset can be downloaded from: https://zenodo.org/record/5148684#.Yp-J-ajMJro , file DATA_MonitoringSites_DE.csv. Perform your own EDA with the tools provided in this chapter and compare it with the current one.

Perform an Exploratory Data Analysis on the water treatment plant experiment described by Souza et al. (2013) [4]. The objective is to estimate the fluoride concentration in the effluent of an urban water treatment plant. The corresponding dataset can be downloaded from: https://home.isr.uc.pt/~rui/publications/datasets.html , see ‘WWTP’. Discuss and report your conclusions about the dataset.

K-means clustering in R: https://rpkgs.datanovia.com/factoextra/index.html .

PCA in R: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html .

Sobol indices in R: https://cran.r-project.org/web/packages/sensitivity/sensitivity.pdf .

R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT .

Recommended Readings

Chakraborty, S., & Dey, L. (2023). Computing for data analysis: Theory and practices. Springer Verlag, Singapore.

Roy, Kavika (2022). Dimensionality reduction techniques in Data Science. KDnuggets. https://www.kdnuggets.com/2022/09/dimensionality-reduction-techniques-data-science.html .

Frost, J. (2023, May 18). Box plot explained with examples. Statistics By Jim. https://statisticsbyjim.com/basics/graph-groups-boxplots-individual-values/ .

Glen, G., & Isaacs, K. (2012). Estimating sobol sensitivity indices using correlations. Environmental Modelling & Software , 37, 157–166. https://doi.org/10.1016/j.envsoft.2012.03.014 .

Glen, S. (2022, January 14). Cook’s distance / cook’s D: Definition, interpretation. Statistics How To. https://www.statisticshowto.com/cooks-distance/ .

Giudici, P. (2013). Statistical models for data analysis. Springer.

Irizarry, R. A. (n.d.). Introduction to data science. Chapter 28 Smoothing. http://rafalab.dfci.harvard.edu/dsbook/smoothing.html .

McGregor, M. (2020, September 21). 8 clustering algorithms in machine learning that all data scientists should know. freeCodeCamp.org. https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/ .

Menon, K. (2023, January 12). The complete guide to Skewness and kurtosis: Simplilearn. Simplilearn.com. https://www.simplilearn.com/tutorials/statistics-tutorial/skewness-and-kurtosis .

n.d. (2017, October 7). Principal component analysis in R: Prcomp VS princomp. STHDA. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/ .

Pearson, R. K. (2018). Exploratory Data Analysis Using R. CRC Press.

Prakash, K. B. (2022). Data science handbook: A practical approach. Wiley-Scrivener.

Snehal_bm. (2021, July 8). How to treat outliers in a data set?. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/07/how-to-treat-outliers-in-a-data-set/ .

What is cluster analysis? when should you use it for your results?. Qualtrics. (2022, November 30). https://www.qualtrics.com/experience-management/research/cluster-analysis/ .

Zhang, X., Trame, M., Lesko, L., & Schmidt, S. (2015). Sobol Sensitivity Analysis: A tool to guide the development and evaluation of Systems Pharmacology Models. CPT: Pharmacometrics & Systems Pharmacology , 4(2), 69–79. https://doi.org/10.1002/psp4.6 .


© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter.

Galatro, D., Dawe, S. (2024). Exploratory Data Analysis. In: Data Analytics for Process Engineers. Synthesis Lectures on Mechanical Engineering. Springer, Cham. https://doi.org/10.1007/978-3-031-46866-7_2


DOI : https://doi.org/10.1007/978-3-031-46866-7_2

Published : 03 December 2023

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-46865-0

Online ISBN : 978-3-031-46866-7

eBook Packages : Synthesis Collection of Technology (R0)


Statistics: Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) comprises statistical procedures that give a researcher a view of a data set's distribution and general characteristics. SPSS offers a variety of procedures for carrying out exploratory data analysis. These include descriptive statistics such as the mean, median, mode, minimum, maximum, range, variance, and standard deviation, among other numerical summaries. In addition, graphical procedures help in exploring data visually. These include scatter plots, bar charts, histograms, pie charts, and stem-and-leaf plots, among others.
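The numerical summaries listed above can also be computed outside SPSS. The following is a minimal sketch using only Python's standard library; the `scores` sample is hypothetical and is not taken from the essay.

```python
import statistics

# Hypothetical sample (not from the source): eight exam scores.
scores = [52, 61, 61, 67, 70, 74, 79, 96]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "minimum": min(scores),
    "maximum": max(scores),
    "range": max(scores) - min(scores),
    "variance": statistics.variance(scores),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(scores),      # sample standard deviation
}

# Print each summary statistic rounded to two decimals.
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

Here the mean (70) sits above the median (68.5), a first hint that the single high score of 96 pulls the distribution to the right.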

There are several reasons why it is critical to perform exploratory data analysis. Any data set is prone to errors introduced during collection or data entry. EDA helps identify such errors, and outliers are among the best indicators of them. Box plots make outliers easy to spot because they display abnormal values visually, as points beyond the whiskers. Some features of the data, such as the shape of its distribution, cannot be identified during collection and only emerge from analysis. This is where exploratory data analysis proves invaluable. The skewness measure shows how asymmetrically the data are distributed around the mean. When data are skewed to the left (negative skew), the longer tail extends to the left and most values lie to the right of the mean; when data are skewed to the right (positive skew), the longer tail extends to the right and most values lie to the left of the mean. When values are concentrated symmetrically around the mean, the graph assumes a dome (bell) shape. Scatter plots are useful for displaying the relationship between two variables plotted along the X- and Y-axes. A scatter plot can be enriched with a regression line, the line of best fit, which shows how far individual points deviate from the overall trend. The regression line can have a positive or a negative gradient, indicating a direct or an inverse relationship between the variables (Field, 2009).
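The three checks described in this paragraph (box-plot outliers, skewness, and the gradient of a regression line) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the procedure SPSS uses: the 1.5 × IQR fence is the common box-plot convention, the skewness formula is the simple moment-based one, and all data and function names are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def skewness(values):
    """Moment-based skewness: negative means a longer left tail, positive a longer right tail."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: one suspicious entry (30) among small values.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
print("outliers:", iqr_outliers(data))
print("skewness sign:", "positive" if skewness(data) > 0 else "negative")

# Hypothetical paired data with an inverse relationship.
slope, intercept = fit_line([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print("slope:", slope, "intercept:", intercept)
```

In this sample the value 30 falls outside the upper fence, and it also drags the mean (6.6) above the median (4), which is the pattern a right-skewed distribution produces.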

Exploratory data analysis is also helpful in testing whether data are normally distributed. Normal probability plots let a researcher judge how closely the data follow a normal distribution; in a perfectly normal distribution, the mean, mode, and median are the same. To test formally for departures from normality, one can use the Kolmogorov-Smirnov and Shapiro-Wilk tests. Performing EDA helps in choosing between parametric and non-parametric tests for further data analysis. For numerical data with a normal distribution, one can use a t-test or ANOVA. Non-parametric methods, on the other hand, include the chi-square test and the Spearman correlation coefficient (Field, 2009). Missing data can also be handled during exploratory data analysis using pairwise or listwise deletion.
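The difference between listwise and pairwise deletion is easiest to see on a small example. The sketch below assumes missing values are stored as `None`; the records and the helper names `listwise` and `pairwise` are hypothetical illustrations, not SPSS functions.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "height": 170, "weight": 68},
    {"age": 29, "height": None, "weight": 75},
    {"age": None, "height": 182, "weight": 90},
    {"age": 41, "height": 165, "weight": None},
    {"age": 52, "height": 178, "weight": 81},
]

def listwise(rows, columns):
    """Listwise deletion: keep only rows complete on every listed column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def pairwise(rows, a, b):
    """Pairwise deletion: keep rows complete on just the two columns analysed."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

complete = listwise(rows, ["age", "height", "weight"])
age_weight = pairwise(rows, "age", "weight")
height_weight = pairwise(rows, "height", "weight")

print("listwise rows kept:", len(complete))
print("age/weight pairs kept:", len(age_weight))
print("height/weight pairs kept:", len(height_weight))
```

Listwise deletion keeps the same cases for every analysis at the cost of discarding more data, whereas pairwise deletion retains more cases per analysis but bases each statistic on a different subset.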

Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Los Angeles: Sage. ISBN: 9781847879073.

StudyCorgi. (2022, March 15). Statistics: Exploratory Data Analysis (EDA). https://studycorgi.com/statistics-exploratory-data-analysis-eda/


Computer Science > Software Engineering

Title: Model Generation from Requirements with LLMs: an Exploratory Study

Abstract: Complementing natural language (NL) requirements with graphical models can improve stakeholders' communication and provide directions for system design. However, creating models from requirements involves manual effort. The advent of generative large language models (LLMs), ChatGPT being a notable example, offers promising avenues for automated assistance in model generation. This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., UML sequence diagrams, from NL requirements. We conduct a qualitative study in which we examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains. Observations from the analysis of the generated diagrams have systematically been captured through evaluation logs, and categorized through thematic analysis. Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges. This issue is particularly pronounced in the presence of requirements smells, such as ambiguity and inconsistency. The insights derived from this study can influence the practical utilization of LLMs in the RE process, and open the door to novel RE-specific prompting strategies targeting effective model generation.


  • PRESS RELEASE

Results of the ECB Survey of Professional Forecasters for the second quarter of 2024

12 April 2024

  • Inflation expectations unchanged across all horizons; longer-term inflation expectations (for 2028) stand at 2.0%
  • Real GDP growth expectations broadly unchanged; economic activity expected to strengthen gradually throughout 2024
  • Unemployment rate expectations largely unchanged

In the ECB’s Survey of Professional Forecasters (SPF) for the second quarter of 2024, respondents’ expectations for headline inflation, as measured in terms of the Harmonised Index of Consumer Prices (HICP), were unchanged across all horizons. Headline HICP inflation was expected to decline from 2.4% in 2024 to 2.0% in both 2025 and 2026. Expectations for core HICP inflation, which excludes energy and food, were also unchanged for the years from 2024 to 2026. Longer-term expectations for headline and core HICP inflation were unchanged at 2.0%.

Respondents expected real GDP growth of 0.5% in 2024, 1.4% in 2025 and 1.4% in 2026. Compared with the previous survey round, the expectations for 2024 and 2025 were revised down and up by 0.1 percentage point respectively, while those for 2026 were unchanged. Respondents’ short-term GDP outlook was for a gradual strengthening of economic activity throughout 2024. Longer-term growth expectations remained unchanged at 1.3%.

The expected profile of the unemployment rate was revised downwards slightly over the entire horizon. Respondents continued to expect the unemployment rate to increase in 2024, to 6.6%, but to decline to 6.5% in 2026 and to 6.4% in the longer term.

For media queries, please contact Silvia Margiocco , tel.: +49 69 1344 6619.

  • The SPF for the second quarter of 2024 was conducted between 18 and 21 March 2024 and 61 responses were received. The SPF is conducted on a quarterly basis and gathers expectations for the rates of inflation, real GDP growth and unemployment in the euro area for several horizons, together with a quantitative assessment of the uncertainty about them. The survey participants are experts affiliated with financial or non-financial institutions based within Europe. The survey results do not represent the views of the ECB’s decision-making bodies or its staff. The next Eurosystem staff macroeconomic projections for the euro area will be published on 6 June 2024.
  • Since 2015 the results of the SPF have been published on the ECB’s website. For surveys prior to the first quarter of 2015, see the ECB’s Monthly Bulletin (2002-14: Q1 – February, Q2 – May, Q3 – August, Q4 – November).
  • A report on this survey round and more detailed data are available via the SPF webpage and the ECB’s Statistical Data Warehouse .

European Central Bank

Directorate General Communications

Reproduction is permitted provided that the source is acknowledged.



COMMENTS

  1. Exploratory Data Analysis: Frequencies, Descriptive Statistics, Histograms, and Boxplots

    Researchers must utilize exploratory data techniques to present findings to a target audience and create appropriate graphs and figures. Researchers can determine if outliers exist, data are missing, and statistical assumptions will be upheld by understanding data. Additionally, it is essential to comprehend these data when describing them in conclusions of a paper, in a meeting with ...

  2. A Data Scientist's Essential Guide to Exploratory Data Analysis

    Introduction. Exploratory Data Analysis (EDA) is the single most important task to conduct at the beginning of every data science project. In essence, it involves thoroughly examining and characterizing your data in order to find its underlying characteristics, possible anomalies, and hidden patterns and relationships.

  3. (PDF) Exploratory Data Analysis

    15.1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and ...

  4. What is Exploratory Data Analysis?

    What is EDA? Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot ...

  5. Goals, Process, and Challenges of Exploratory Data Analysis: An

    Exploratory data analysis stems from the collection of work by the statistician John Tukey in the 1960s and 1970s [24,39,40,67]. His seminal book [67] compiles a collection of data visualization techniques as well as robust and non-parametric statistics for data exploration. Many communities including Statistics, Human-Computer In-

  6. Academic analytics: Anatomy of an exploratory essay

    Predictive Analytics —Once the data is collected, the reports are prepared and the information is analyzed, the next step is to use this data to anticipate scenarios ("Predictive Analytics") and make decisions based on the information ("Action Analytics"). This predictive analysis serves all levels of higher education.

  7. PDF Exploratory Data Analysis

    Such critiques motivate our proposal that research on supporting exploratory visual analysis should embrace theories of graphical inference. In the following section we propose an alternative understanding of exploratory visual analysis as guided by model checks, and describe possible formalizations of this theory. 4.

  8. Exploratory Data Analysis

    Exploratory data analysis is an approach to data analysis where the features and characteristics of the data are reviewed with an "open mind"; in other words, without attempting to apply any particular model to the data. It is often used upon first contact with the data, before any models have been chosen for the structural or stochastic ...

  9. Exploratory Data Analysis

    1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data usually ...

  10. What Is Exploratory Data Analysis?

    In data analytics, exploratory data analysis is how we describe the practice of investigating a dataset and summarizing its main features. It's a form of descriptive analytics. EDA aims to spot patterns and trends, to identify anomalies, and to test early hypotheses. Although exploratory data analysis can be carried out at various stages of ...

  11. Exploratory Data Analysis

    Exploratory Data Analysis. Exploratory Data Analysis (EDA) is an approach to analyzing data that emphasizes exploring datasets for patterns and insights without any predetermined hypotheses. The goal is to let the data "speak for themselves" and guide analysis, rather than imposing rigid structures or theories.

  12. PDF Exploratory Data Analysis in Schools: A Logic Model to Guide ...

    Abstract. Exploratory data analysis (EDA) is an iterative, open-ended data analysis procedure that allows practitioners to examine data without pre-conceived notions to advise improvement processes and make informed decisions. Education is a data-rich field that is primed for a transition into a deeper, more purposeful use of data.

  13. Assignment Guidelines: Exploratory Essay

    Market analyses and environment scans are forms that the exploratory essay can take in business and industry. The company identifies and analyzes key external factors in order to inform strategic decisions. The CRIT 602 Exploratory Essay & Your Audience. In the case of the exploratory essay in CRIT 602, its purpose is to answer the following ...

  14. A Comprehensive Guide to Exploratory Data Analysis (EDA) for Data

    4. Exploratory Data Analysis 🕵️‍♀️. Univariate Analysis: Include a breakdown of the univariate analysis. Provide summary statistics for each feature, histograms, kernel density plots ...

  15. PDF Futzing and Moseying: Interviews with Professional Data Analysts on

    famously described exploratory data analysis (EDA), "looking at data to see what it seems to say", in his 1977 book on the subject [23]. To better understand the less structured, more exploratory aspects of data analysis, we conducted and coded interviews with thirty experienced professionals in the field. These participants worked for

  16. PDF Chapter 15 Exploratory Data Analysis

    15.1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data ...

  17. (PDF) Visualization and Explorative Data Analysis

    Exploratory data analysis (EDA) is a well-established statistical tradition that provides conceptual and computational tools for discovering patterns to foster hypothesis development and refinement.

  18. Role of Exploratory Data Analysis in Data Science

    Exploratory Data analysis (EDA) is one of the hidden and mundane tasks in analysis of Data, as a Model, Project or analysis is based on data, which is intuitive, extremely heterogenous and distorted in its form. (Data has become an integral part of every project, Model &) The analyzed data is more insightful for identifying and improving extremely critical business insights across the ...

  19. Practical Machine Learning Tutorial: Part.1 (Data Exploratory Analysis)

    Although there are tons of great books and papers outside to practice machine learning, I always wanted to see something short, simple, and with a descriptive manuscript. I always wanted to see an example with an appropriate explanation of the procedure accompanied by detailed results interpretation. ... Exploratory Data Analysis, Part.2: Build ...

  20. Visual Display of Data: Exploratory Data Analysis Essay

    Exploratory data analysis (EDA) forms the basis of confirmatory statistical analysis because it plays a significant role in the characterization of variables, summarization of data, and visualization of patterns and trends. According to Wickham and Grolemund (2017), summarizing, visualizing, transforming, and modeling are major methods employed ...

  21. Exploratory Data Analysis

    Exploratory Data Analysis (EDA) is an approach employed to analyze datasets. Primarily, EDA uses data visualization methods and often statistical models to (i) assess a dataset's general structure, (ii) obtain descriptive summaries of the data, and (iii) provide the basis for model formulation. EDA includes checks on data quality, calculation ...

  22. Statistics: Exploratory Data Analysis (EDA)

    These include scatter plots, bar charts, histograms, pie charts, stem and leaf plots among others. There are several reasons why it is critical to perform exploratory data analysis during data analysis.

  23. Model Generation from Requirements with LLMs: an Exploratory Study

    Complementing natural language (NL) requirements with graphical models can improve stakeholders' communication and provide directions for system design. However, creating models from requirements involves manual effort. The advent of generative large language models (LLMs), ChatGPT being a notable example, offers promising avenues for automated assistance in model generation. This paper ...

  24. Results of the ECB Survey of Professional Forecasters for the second

    For media queries, please contact Silvia Margiocco, tel.: +49 69 1344 6619. Notes. The SPF for the second quarter of 2024 was conducted between 18 and 21 March 2024 and 61 responses were received.