
10 Unique Data Science Capstone Project Ideas

A capstone project is a culminating assignment that allows students to demonstrate the skills and knowledge they’ve acquired throughout their degree program. For data science students, it’s a chance to tackle a substantial real-world data problem.

If you’re short on time, here’s a quick answer to your question: Some great data science capstone ideas include analyzing health trends, building a predictive movie recommendation system, optimizing traffic patterns, forecasting cryptocurrency prices, and more.

In this comprehensive guide, we will explore 10 unique capstone project ideas for data science students. We’ll outline potential data sources, analysis methods, and practical applications for each idea.

Whether you want to work with social media datasets, geospatial data, or anything in between, you’re sure to find an interesting capstone topic.

Project Idea #1: Analyzing Health Trends

When it comes to data science capstone projects, analyzing health trends is an intriguing idea that can have a significant impact on public health. By leveraging data from various sources, data scientists can uncover valuable insights that can help improve healthcare outcomes and inform policy decisions.

Data Sources

There are several data sources that can be used to analyze health trends. One of the most common sources is electronic health records (EHRs), which contain a wealth of information about patient demographics, medical history, and treatment outcomes.

Other sources include health surveys, wearable devices, social media, and even environmental data.

Analysis Approaches

When analyzing health trends, data scientists can employ a variety of analysis approaches. Descriptive analysis can provide a snapshot of current health trends, such as the prevalence of certain diseases or the distribution of risk factors.

Predictive analysis can be used to forecast future health outcomes, such as predicting disease outbreaks or identifying individuals at high risk for certain conditions. Machine learning algorithms can be trained to identify patterns and make accurate predictions based on large datasets.
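As a minimal sketch of that predictive route, the snippet below trains a logistic regression to flag high-risk individuals. All feature names and the risk rule are invented for illustration; a real project would use actual EHR or survey fields.

```python
# Hypothetical sketch: flagging high-risk individuals from tabular
# health-survey features. Column meanings are assumptions, not a real schema.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
age = rng.integers(20, 80, n)
bmi = rng.normal(27, 4, n)
smoker = rng.integers(0, 2, n)
# Synthetic label: risk rises with age, BMI, and smoking.
risk = (0.04 * age + 0.08 * bmi + 0.9 * smoker + rng.normal(0, 1, n)) > 4.5

X = np.column_stack([age, bmi, smoker])
X_train, X_test, y_train, y_test = train_test_split(X, risk, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same shape of pipeline carries over to real data; the hard part in practice is the data wrangling and validation, not the model call.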

Applications

The applications of analyzing health trends are vast and far-reaching. By understanding patterns and trends in health data, policymakers can make informed decisions about resource allocation and public health initiatives.

Healthcare providers can use these insights to develop personalized treatment plans and interventions. Researchers can uncover new insights into disease progression and identify potential targets for intervention.

Ultimately, analyzing health trends has the potential to improve overall population health and reduce healthcare costs.

Project Idea #2: Movie Recommendation System

When developing a movie recommendation system, there are several data sources that can be used to gather information about movies and user preferences. One popular data source is the MovieLens dataset, which contains a large collection of movie ratings provided by users.

Another source is IMDb, a trusted website that provides comprehensive information about movies, including user ratings and reviews. Streaming platforms like Netflix and Amazon Prime also provide access to user ratings and viewing history, which can be valuable for building an accurate recommendation system.

There are several analysis approaches that can be employed to build a movie recommendation system. One common approach is collaborative filtering, which uses user ratings and preferences to identify patterns and make recommendations based on similar users’ preferences.

Another approach is content-based filtering, which analyzes the characteristics of movies (such as genre, director, and actors) to recommend similar movies to users. Hybrid approaches that combine both collaborative and content-based filtering techniques are also popular, as they can provide more accurate and diverse recommendations.
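To make the collaborative-filtering idea concrete, here is a tiny user-based sketch on an invented ratings matrix (rows are users, columns are movies, 0 means unrated). It predicts a score for an unrated movie as a similarity-weighted average of other users' ratings.

```python
# Minimal user-based collaborative filtering on a toy ratings matrix.
# Movie titles and ratings are invented for illustration.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])
movies = ["Movie A", "Movie B", "Movie C", "Movie D"]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0  # recommend for the first user
sims = np.array([cosine(ratings[target], ratings[u]) if u != target else 0.0
                 for u in range(len(ratings))])
# Predict scores for the target user's unrated movies.
unrated = np.where(ratings[target] == 0)[0]
scores = {movies[m]: sims @ ratings[:, m] / sims.sum() for m in unrated}
print(scores)
```

A production system built on the MovieLens dataset would use the same idea at scale, typically via matrix factorization rather than explicit pairwise similarity.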

A movie recommendation system has numerous applications in the entertainment industry. One application is to enhance the user experience on streaming platforms by providing personalized movie recommendations based on individual preferences.

This can help users discover new movies they might enjoy and improve overall satisfaction with the platform. Additionally, movie recommendation systems can be used by movie production companies to analyze user preferences and trends, aiding in the decision-making process for creating new movies.

Finally, movie recommendation systems can also be utilized by movie critics and reviewers to identify movies that are likely to be well-received by audiences.

For more information on movie recommendation systems, you can visit https://www.kaggle.com/rounakbanik/movie-recommender-systems or https://www.researchgate.net/publication/221364567_A_new_movie_recommendation_system_for_large-scale_data.

Project Idea #3: Optimizing Traffic Patterns

When it comes to optimizing traffic patterns, there are several data sources that can be utilized. One of the most prominent is real-time traffic data collected from GPS devices, traffic cameras, and mobile applications.

This data provides valuable insights into the current traffic conditions, including congestion, accidents, and road closures. Additionally, historical traffic data can also be used to identify recurring patterns and trends in traffic flow.

Other data sources that can be used include weather data, which can help in understanding how weather conditions impact traffic patterns, and social media data, which can provide information about events or incidents that may affect traffic.

Optimizing traffic patterns requires the use of advanced data analysis techniques. One approach is to use machine learning algorithms to predict traffic patterns based on historical and real-time data.

These algorithms can analyze various factors such as time of day, day of the week, weather conditions, and events to predict traffic congestion and suggest alternative routes.

Another approach is to use network analysis to identify bottlenecks and areas of congestion in the road network. By analyzing the flow of traffic and identifying areas where traffic slows down or comes to a halt, transportation authorities can make informed decisions on how to optimize traffic flow.
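A very small illustration of bottleneck detection: rank road segments by how far observed speeds fall below free-flow speeds. The segment names and speeds below are made up; real inputs would come from GPS probe data or loop detectors.

```python
# Illustrative sketch: locating a bottleneck by ranking road segments by
# observed speed relative to free-flow speed. All values are invented.
segments = {
    "Main St A-B": {"free_flow_kmh": 60, "observed_kmh": 22},
    "Main St B-C": {"free_flow_kmh": 60, "observed_kmh": 55},
    "Ring Rd N":   {"free_flow_kmh": 80, "observed_kmh": 30},
    "Bridge West": {"free_flow_kmh": 50, "observed_kmh": 48},
}

def congestion_index(seg):
    # 0 = free flowing, 1 = fully stopped
    return 1 - seg["observed_kmh"] / seg["free_flow_kmh"]

ranked = sorted(segments, key=lambda s: congestion_index(segments[s]), reverse=True)
print("worst bottleneck:", ranked[0])
```

A fuller network analysis would model the road graph explicitly (e.g., betweenness of segments) rather than ranking segments in isolation.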

The optimization of traffic patterns has numerous applications and benefits. One of the main benefits is the reduction of traffic congestion, which can lead to significant time and fuel savings for commuters.

By optimizing traffic patterns, transportation authorities can also improve road safety by reducing the likelihood of accidents caused by congestion.

Additionally, optimizing traffic patterns can have positive environmental impacts by reducing greenhouse gas emissions. By minimizing the time spent idling in traffic, vehicles can operate more efficiently and emit fewer pollutants.

Furthermore, optimizing traffic patterns can have economic benefits by improving the flow of goods and services. Efficient traffic patterns can reduce delivery times and increase productivity for businesses.

Project Idea #4: Forecasting Cryptocurrency Prices

With the growing popularity of cryptocurrencies like Bitcoin and Ethereum, forecasting their prices has become an exciting and challenging task for data scientists. This project idea involves using historical data to predict future price movements and trends in the cryptocurrency market.

When working on this project, data scientists can gather cryptocurrency price data from various sources such as cryptocurrency exchanges, financial websites, or APIs. Websites like CoinMarketCap (https://coinmarketcap.com/) provide comprehensive data on various cryptocurrencies, including historical price data.

Additionally, platforms like CryptoCompare (https://www.cryptocompare.com/) offer real-time and historical data for different cryptocurrencies.

To forecast cryptocurrency prices, data scientists can employ various analysis approaches. Some common techniques include:

  • Time Series Analysis: This approach involves analyzing historical price data to identify patterns, trends, and seasonality in cryptocurrency prices. Techniques like moving averages, autoregressive integrated moving average (ARIMA), or exponential smoothing can be used to make predictions.
  • Machine Learning: Machine learning algorithms, such as random forests, support vector machines, or neural networks, can be trained on historical cryptocurrency data to predict future price movements. These algorithms can consider multiple variables, such as trading volume, market sentiment, or external factors, to make accurate predictions.
  • Sentiment Analysis: This approach involves analyzing social media sentiment and news articles related to cryptocurrencies to gauge market sentiment. By considering the collective sentiment, data scientists can predict how positive or negative sentiment can impact cryptocurrency prices.
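As the simplest of the time-series baselines listed above, here is simple exponential smoothing implemented by hand and used to make a one-step-ahead forecast. The daily closing prices are invented for illustration.

```python
# One-step-ahead forecast via simple exponential smoothing:
# s_t = alpha * x_t + (1 - alpha) * s_{t-1}. Prices are invented.
def exp_smooth_forecast(prices, alpha=0.5):
    level = prices[0]
    for x in prices[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

daily_close = [100.0, 104.0, 101.0, 99.0, 103.0, 106.0]
forecast = exp_smooth_forecast(daily_close)
print(f"next-day forecast: {forecast:.2f}")
```

In a real project this baseline would be compared against ARIMA or machine-learning models; beating a naive baseline is notoriously hard in cryptocurrency markets.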

Forecasting cryptocurrency prices can have several practical applications:

  • Investment Decision Making: Accurate price forecasts can help investors make informed decisions when buying or selling cryptocurrencies. By considering the predicted price movements, investors can optimize their investment strategies and potentially maximize their returns.
  • Trading Strategies: Traders can use price forecasts to develop trading strategies, such as trend following or mean reversion. By leveraging predicted price movements, traders can make profitable trades in the volatile cryptocurrency market.
  • Risk Management: Cryptocurrency price forecasts can help individuals and organizations manage their risk exposure. By understanding potential price fluctuations, risk management strategies can be implemented to mitigate losses.

Project Idea #5: Predicting Flight Delays

One interesting and practical data science capstone project idea is to create a model that can predict flight delays. Flight delays can cause a lot of inconvenience for passengers and can have a significant impact on travel plans.

By developing a predictive model, airlines and travelers can be better prepared for potential delays and take appropriate actions.

To create a flight delay prediction model, you would need to gather relevant data from various sources. Some potential data sources include:

  • Flight data from airlines or aviation organizations
  • Weather data from meteorological agencies
  • Historical flight delay data from airports

By combining these different data sources, you can build a comprehensive dataset that captures the factors contributing to flight delays.

Once you have collected the necessary data, you can employ different analysis approaches to predict flight delays. Some common approaches include:

  • Machine learning algorithms such as decision trees, random forests, or neural networks
  • Time series analysis to identify patterns and trends in flight delay data
  • Feature engineering to extract relevant features from the dataset

By applying these analysis techniques, you can develop a model that can accurately predict flight delays based on the available data.
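The sketch below shows the random-forest route on synthetic flight features. The columns and the delay rule are assumptions for illustration only; a real model would be trained on airline, weather, and airport datasets.

```python
# Hedged sketch: random-forest delay classifier on synthetic features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
dep_hour = rng.integers(0, 24, n)         # scheduled departure hour
precip_mm = rng.exponential(2.0, n)       # precipitation at origin
airport_load = rng.uniform(0, 1, n)       # relative airport congestion
# Synthetic rule: evening flights in bad weather at busy airports run late.
delayed = ((dep_hour > 16) & (precip_mm > 2) & (airport_load > 0.5)).astype(int)

X = np.column_stack([dep_hour, precip_mm, airport_load])
X_tr, X_te, y_tr, y_te = train_test_split(X, delayed, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.2f}")
```

Note that delay datasets are heavily imbalanced (most flights are on time), so precision/recall or AUC are usually more informative metrics than raw accuracy.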

The applications of a flight delay prediction model are numerous. Airlines can use the model to optimize their operations, improve scheduling, and minimize disruptions caused by delays. Travelers can benefit from the model by being alerted in advance about potential delays and making necessary adjustments to their travel plans.

Additionally, airports can use the model to improve resource allocation and manage passenger flow during periods of high delay probability. Overall, a flight delay prediction model can significantly enhance efficiency and customer satisfaction in the aviation industry.

Project Idea #6: Fighting Fake News

With the rise of social media and the easy access to information, the spread of fake news has become a significant concern. Data science can play a crucial role in combating this issue by developing innovative solutions.

Here are some aspects to consider when working on a project that aims to fight fake news.

When it comes to fighting fake news, having reliable data sources is essential. There are several trustworthy platforms that provide access to credible news articles and fact-checking databases. Websites like Snopes and FactCheck.org are good starting points for obtaining accurate information.

Additionally, social media platforms such as Twitter and Facebook can be valuable sources for analyzing the spread of misinformation.

One approach to analyzing fake news is by utilizing natural language processing (NLP) techniques. NLP can help identify patterns and linguistic cues that indicate the presence of misleading information.

Sentiment analysis can also be employed to determine the emotional tone of news articles or social media posts, which can be an indicator of potential bias or misinformation.

Another approach is network analysis, which focuses on understanding how information spreads through social networks. By analyzing the connections between users and the content they share, it becomes possible to identify patterns of misinformation dissemination.

Network analysis can also help in identifying influential sources and detecting coordinated efforts to spread fake news.
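A toy version of the NLP route: TF-IDF features plus a linear classifier over a handful of invented headlines. A real project would train on a labeled corpus such as a public fake-news dataset rather than these eight examples.

```python
# Toy text-classification sketch for the NLP approach.
# Headlines and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

headlines = [
    "Miracle cure doctors don't want you to know about",
    "You won't believe this one weird trick",
    "Shocking secret the government is hiding",
    "Scientists erase aging overnight, experts stunned",
    "City council approves new budget for road repairs",
    "Central bank holds interest rates steady",
    "Local university publishes annual enrollment report",
    "Regional rainfall slightly above seasonal average",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = suspicious, 0 = credible

vec = TfidfVectorizer()
X = vec.fit_transform(headlines)
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(vec.transform(["Shocking trick doctors are hiding"]))
print(pred)
```

Surface-level word cues only go so far; stronger systems combine text features with source credibility and propagation patterns from the network analysis described above.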

The applications of a project aiming to fight fake news are numerous. One possible application is the development of a browser extension or a mobile application that provides users with real-time fact-checking information.

This tool could flag potentially misleading articles or social media posts and provide users with accurate information to help them make informed decisions.

Another application could be the creation of an algorithm that automatically identifies fake news articles and separates them from reliable sources. This algorithm could be integrated into news aggregation platforms to help users distinguish between credible and non-credible information.

Project Idea #7: Analyzing Social Media Sentiment

Social media platforms have become a treasure trove of valuable data for businesses and researchers alike. When analyzing social media sentiment, there are several data sources that can be tapped into. The most popular ones include:

  • Twitter: With its vast user base and real-time nature, Twitter is often the go-to platform for sentiment analysis. Researchers can gather tweets containing specific keywords or hashtags to analyze the sentiment of a particular topic.
  • Facebook: Facebook offers rich data for sentiment analysis, including posts, comments, and reactions. Analyzing the sentiment of Facebook posts can provide valuable insights into user opinions and preferences.
  • Instagram: Instagram’s visual nature makes it an interesting platform for sentiment analysis. By analyzing the comments and captions on Instagram posts, researchers can gain insights into the sentiment associated with different images or topics.
  • Reddit: Reddit is a popular platform for discussions on various topics. By analyzing the sentiment of comments and posts on specific subreddits, researchers can gain insights into the sentiment of different communities.

These are just a few examples of the data sources that can be used for analyzing social media sentiment. Depending on the research goals, other platforms such as LinkedIn, YouTube, and TikTok can also be explored.

When it comes to analyzing social media sentiment, there are various approaches that can be employed. Some commonly used analysis techniques include:

  • Lexicon-based analysis: This approach involves using predefined sentiment lexicons to assign sentiment scores to words or phrases in social media posts. By aggregating these scores, researchers can determine the overall sentiment of a post or a collection of posts.
  • Machine learning: Machine learning algorithms can be trained to classify social media posts into positive, negative, or neutral sentiment categories. These algorithms learn from labeled data and can make predictions on new, unlabeled data.
  • Deep learning: Deep learning techniques, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), can be used to capture the complex patterns and dependencies in social media data. These models can learn to extract sentiment information from textual or visual content.

It is important to note that the choice of analysis approach depends on the specific research objectives, available resources, and the nature of the social media data being analyzed.
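The lexicon-based approach is the easiest to sketch. The tiny lexicon and example posts below are invented; real projects use established lexicons such as VADER or SentiWordNet.

```python
# Minimal lexicon-based sentiment scoring. Lexicon and posts are invented.
LEXICON = {"love": 2, "great": 2, "good": 1, "bad": -1, "terrible": -2, "hate": -2}

def sentiment_score(post):
    words = post.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

def label(post):
    s = sentiment_score(post)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

print(label("I love this great product"))   # positive
print(label("terrible service I hate it"))  # negative
```

Lexicon methods are fast and interpretable but miss negation, sarcasm, and slang, which is where the machine-learning and deep-learning approaches earn their keep.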

Analyzing social media sentiment has a wide range of applications across different industries. Here are a few examples:

  • Brand reputation management: By analyzing social media sentiment, businesses can monitor and manage their brand reputation. They can identify potential issues, respond to customer feedback, and take proactive measures to maintain a positive image.
  • Market research: Social media sentiment analysis can provide valuable insights into consumer opinions and preferences. Businesses can use this information to understand market trends, identify customer needs, and develop targeted marketing strategies.
  • Customer feedback analysis: Social media sentiment analysis can help businesses understand customer satisfaction levels and identify areas for improvement. By analyzing sentiment in customer feedback, companies can make data-driven decisions to enhance their products or services.
  • Public opinion analysis: Researchers can analyze social media sentiment to study public opinion on various topics, such as political events, social issues, or product launches. This information can be used to understand public sentiment, predict trends, and inform decision-making.

These are just a few examples of how analyzing social media sentiment can be applied in real-world scenarios. The insights gained from sentiment analysis can help businesses and researchers make informed decisions, improve customer experience, and drive innovation.

Project Idea #8: Improving Online Ad Targeting

Improving online ad targeting involves analyzing various data sources to gain insights into users’ preferences and behaviors. These data sources may include:

  • Website analytics: Gathering data from websites to understand user engagement, page views, and click-through rates.
  • Demographic data: Utilizing information such as age, gender, location, and income to create targeted ad campaigns.
  • Social media data: Extracting data from platforms like Facebook, Twitter, and Instagram to understand users’ interests and online behavior.
  • Search engine data: Analyzing search queries and user behavior on search engines to identify intent and preferences.

By combining and analyzing these diverse data sources, data scientists can gain a comprehensive understanding of users and their ad preferences.

To improve online ad targeting, data scientists can employ various analysis approaches:

  • Segmentation analysis: Dividing users into distinct groups based on shared characteristics and preferences.
  • Collaborative filtering: Recommending ads based on users with similar preferences and behaviors.
  • Predictive modeling: Developing algorithms to predict users’ likelihood of engaging with specific ads.
  • Machine learning: Utilizing algorithms that can continuously learn from user interactions to optimize ad targeting.

These analysis approaches help data scientists uncover patterns and insights that can enhance the effectiveness of online ad campaigns.
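As a sketch of the predictive-modeling approach, the snippet below fits a logistic regression that estimates click probability from two simple features. The features and the underlying click behavior are synthetic assumptions.

```python
# Sketch: estimating click probability from user/ad features.
# All data is synthetic; feature names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
interest_match = rng.uniform(0, 1, n)    # overlap of ad topic and user interests
ads_seen_today = rng.integers(0, 30, n)  # ad-fatigue proxy
# Synthetic behavior: clicks rise with relevance, fall with fatigue.
p_click = 1 / (1 + np.exp(-(3 * interest_match - 0.15 * ads_seen_today - 1)))
clicked = rng.random(n) < p_click

X = np.column_stack([interest_match, ads_seen_today])
model = LogisticRegression().fit(X, clicked)
# Score a well-matched ad for a user who has seen few ads today.
prob = model.predict_proba([[0.9, 2]])[0, 1]
print(f"predicted click probability: {prob:.2f}")
```

Ranking candidate ads by predicted click probability is the core of most targeting pipelines; production systems add many more features and calibration steps.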

Improved online ad targeting has numerous applications:

  • Increased ad revenue: By delivering more relevant ads to users, advertisers can expect higher click-through rates and conversions.
  • Better user experience: Users are more likely to engage with ads that align with their interests, leading to a more positive browsing experience.
  • Reduced ad fatigue: By targeting ads more effectively, users are less likely to feel overwhelmed by irrelevant or repetitive advertisements.
  • Maximized ad budget: Advertisers can optimize their budget by focusing on the most promising target audiences.

Project Idea #9: Enhancing Customer Segmentation

Enhancing customer segmentation involves gathering relevant data from various sources to gain insights into customer behavior, preferences, and demographics. Some common data sources include:

  • Customer transaction data
  • Customer surveys and feedback
  • Social media data
  • Website analytics
  • Customer support interactions

By combining data from these sources, businesses can create a comprehensive profile of their customers and identify patterns and trends that will help in improving their segmentation strategies.

There are several analysis approaches that can be used to enhance customer segmentation:

  • Clustering: Using clustering algorithms to group customers based on similar characteristics or behaviors.
  • Classification: Building predictive models to assign customers to different segments based on their attributes.
  • Association Rule Mining: Identifying relationships and patterns in customer data to uncover hidden insights.
  • Sentiment Analysis: Analyzing customer feedback and social media data to understand customer sentiment and preferences.

These analysis approaches can be used individually or in combination to enhance customer segmentation and create more targeted marketing strategies.
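Clustering is the most common starting point, so here is a minimal k-means sketch on two invented customer features (monthly spend and visit frequency) with two deliberately well-separated groups.

```python
# Clustering sketch: grouping customers by spend and visit frequency.
# The two synthetic groups are big/frequent spenders vs. occasional buyers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
high_value = rng.normal([500, 12], [50, 2], size=(40, 2))  # spend, visits/month
occasional = rng.normal([60, 2], [15, 1], size=(40, 2))
X = np.vstack([high_value, occasional])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print("cluster sizes:", np.bincount(labels))
```

With real data, features should be scaled before clustering (spend would otherwise dominate visit frequency), and the number of clusters chosen via silhouette scores or domain knowledge.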

Enhancing customer segmentation can have numerous applications across industries:

  • Personalized marketing campaigns: By understanding customer preferences and behaviors, businesses can tailor their marketing messages to individual customers, increasing the likelihood of engagement and conversion.
  • Product recommendations: By segmenting customers based on their purchase history and preferences, businesses can provide personalized product recommendations, leading to higher customer satisfaction and sales.
  • Customer retention: By identifying at-risk customers and understanding their needs, businesses can implement targeted retention strategies to reduce churn and improve customer loyalty.
  • Market segmentation: By identifying distinct customer segments, businesses can develop tailored product offerings and marketing strategies for each segment, maximizing the effectiveness of their marketing efforts.

Project Idea #10: Building a Chatbot

A chatbot is a computer program that uses artificial intelligence to simulate human conversation. It can interact with users in natural language through text or voice. Building a chatbot can be an exciting and challenging data science capstone project.

It requires a combination of natural language processing, machine learning, and programming skills.

When building a chatbot, data sources play a crucial role in training and improving its performance. There are various data sources that can be used:

  • Chat logs: Analyzing existing chat logs can help in understanding common user queries, responses, and patterns. This data can be used to train the chatbot on how to respond to different types of questions and scenarios.
  • Knowledge bases: Integrating a knowledge base can provide the chatbot with a wide range of information and facts. This can be useful in answering specific questions or providing detailed explanations on certain topics.
  • APIs: Utilizing APIs from different platforms can enhance the chatbot’s capabilities. For example, integrating a weather API can allow the chatbot to provide real-time weather information based on user queries.

There are several analysis approaches that can be used to build an efficient and effective chatbot:

  • Natural Language Processing (NLP): NLP techniques enable the chatbot to understand and interpret user queries. This involves tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.
  • Intent recognition: Identifying the intent behind user queries is crucial for providing accurate responses. Machine learning algorithms can be trained to classify user intents based on the input text.
  • Contextual understanding: Chatbots need to understand the context of the conversation to provide relevant and meaningful responses. Techniques such as sequence-to-sequence models or attention mechanisms can be used to capture contextual information.
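Intent recognition can be prototyped with a simple bag-of-words classifier. The intents and training phrases below are invented; real chatbots train on hundreds of utterances per intent.

```python
# Intent-recognition sketch: bag-of-words + naive Bayes.
# Intents and training phrases are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

training = [
    ("what's the weather like today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("is it sunny outside", "weather"),
    ("book a table for two", "reservation"),
    ("reserve a room for friday", "reservation"),
    ("i want to make a booking", "reservation"),
]
texts, intents = zip(*training)

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, intents)

intent = clf.predict(vec.transform(["will it be sunny on friday"]))[0]
print(intent)
```

Once the intent is known, the bot routes the utterance to the matching handler (a weather API call, a booking workflow, and so on), with the contextual models above handling multi-turn conversation.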

Chatbots have a wide range of applications in various industries:

  • Customer support: Chatbots can be used to handle customer queries and provide instant support. They can assist with common troubleshooting issues, answer frequently asked questions, and escalate complex queries to human agents when necessary.
  • E-commerce: Chatbots can enhance the shopping experience by assisting users in finding products, providing recommendations, and answering product-related queries.
  • Healthcare: Chatbots can be deployed in healthcare settings to provide preliminary medical advice, answer general health-related questions, and assist with appointment scheduling.

Building a chatbot as a data science capstone project not only showcases your technical skills but also allows you to explore the exciting field of artificial intelligence and natural language processing.

It can be a great opportunity to create a practical and useful tool that can benefit users in various domains.

Completing an in-depth capstone project is the perfect way for data science students to demonstrate their technical skills and business acumen. This guide outlined 10 unique project ideas spanning industries like healthcare, transportation, finance, and more.

By identifying the ideal data sources, analysis techniques, and practical applications for their chosen project, students can produce an impressive capstone that solves real-world problems and showcases their abilities.


Big Data – Capstone Project


Welcome to the Capstone Project for Big Data! In this culminating project, you will build a big data ecosystem using tools and methods from the earlier courses in this specialization. You will analyze a data set simulating big data generated from a large number of users who are playing our imaginary game “Catch the Pink Flamingo”. During the five-week Capstone Project, you will walk through the typical big data science steps for acquiring, exploring, preparing, analyzing, and reporting.

In the first two weeks, we will introduce you to the data set and guide you through some exploratory analysis using tools such as Splunk and OpenOffice. Then we will move into more challenging big data problems requiring the more advanced tools you have learned, including KNIME, Spark's MLlib, and Gephi. Finally, during the fifth and final week, we will show you how to bring it all together to create engaging and compelling reports and slide presentations.


As a result of our collaboration with Splunk, a software company focused on analyzing machine-generated big data, learners with the top projects will be eligible to present to Splunk and meet Splunk recruiters and engineering leadership.


21 Interesting Data Science Capstone Project Ideas [2024]

Data science, encompassing the analysis and interpretation of data, stands as a cornerstone of modern innovation. 

Capstone projects in data science education play a pivotal role, offering students hands-on experience to apply theoretical concepts in practical settings. 

These projects serve as a culmination of their learning journey, providing invaluable opportunities for skill development and problem-solving. 

Our blog is dedicated to guiding prospective students through the selection process of data science capstone project ideas. It offers curated ideas and insights to help them embark on a fulfilling educational experience. 

Join us as we navigate the dynamic world of data science, empowering students to thrive in this exciting field.

Data Science Capstone Project: A Comprehensive Overview

Data science capstone projects are an essential component of data science education, providing students with the opportunity to apply their knowledge and skills to real-world problems. 

Capstone projects challenge students to acquire and analyze data to solve real-world problems. These projects are designed to test students’ skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning. 

In addition, capstone projects are conducted with industry, government, and academic partners, and most projects are sponsored by an organization. 

The projects are drawn from real-world problems, and students work in teams consisting of two to four students and a faculty advisor. 

Ultimately, the goal of the capstone project is to create a usable, public data product that demonstrates students' skills to potential employers.

Best Data Science Capstone Project Ideas – According to Skill Level

Data science capstone projects are a great way to showcase your skills and apply what you’ve learned in a real-world context. Here are some project ideas categorized by skill level:

Beginner-Level Data Science Capstone Project Ideas

1. Exploratory Data Analysis (EDA) on a Dataset

Start by analyzing a dataset of your choice and exploring its characteristics, trends, and relationships. Practice using basic statistical techniques and visualization tools to gain insights and present your findings clearly and understandably.
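A first pass at EDA can be very short. The sketch below uses pandas on a tiny invented dataset (the column names and numbers are made up for illustration) to compute summary statistics and a correlation:

```python
# Minimal EDA sketch using pandas; the dataset here is invented.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31, 35, 52, 29],
    "income": [32_000, 80_000, 45_000, 52_000, 95_000, 40_000],
    "churned": [1, 0, 1, 0, 0, 1],
})

summary = df.describe()               # central tendency and spread per column
corr = df[["age", "income"]].corr()   # pairwise correlation

print(summary.loc["mean", "age"])     # average age
print(corr.loc["age", "income"])      # age/income correlation
```

From here you would typically add histograms and scatter plots (e.g. with matplotlib or seaborn) to visualize the relationships the numbers hint at.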

2. Predictive Modeling with Linear Regression

Build a simple linear regression model to predict a target variable based on one or more input features. Learn about model evaluation techniques such as mean squared error and R-squared, and interpret the results to make meaningful predictions.
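A hedged sketch of this workflow with scikit-learn, fitting a line to synthetic data (generated from y = 3x + 2 plus a little noise, so the "true" answer is known) and scoring it with MSE and R-squared:

```python
# Linear regression sketch on synthetic data (true relationship: y = 3x + 2).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one input feature
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=100)  # target with small noise

model = LinearRegression().fit(X, y)
pred = model.predict(X)
print(mean_squared_error(y, pred), r2_score(y, pred))
print(model.coef_[0], model.intercept_)  # should recover roughly 3 and 2
```

On real data you would hold out a test set (e.g. `train_test_split`) rather than scoring on the training data as this toy does.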

3. Classification with Decision Trees

Use decision tree algorithms to classify data into distinct categories. Learn how to preprocess data, train a decision tree model, and evaluate its performance using metrics like accuracy, precision, and recall. Apply your model to practical scenarios like predicting customer churn or classifying spam emails.
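As a minimal sketch of the churn example (the features and labels below are invented: support calls and tenure, with 1 meaning the customer churned):

```python
# Decision tree sketch on a tiny invented "churn" dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Columns: [support_calls, tenure_months]; label 1 = churned.
X = np.array([[1, 50], [2, 40], [8, 5], [9, 3], [1, 45], [7, 4], [2, 38], [8, 6]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred), precision_score(y, pred), recall_score(y, pred))
```

Limiting `max_depth` is a simple guard against overfitting; on a real dataset you would tune it with cross-validation and evaluate on held-out data.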

4. Clustering with K-Means

Explore unsupervised learning by applying the K-Means algorithm to group similar data points together. Practice feature scaling and model evaluation to identify meaningful clusters within your dataset. Apply your clustering model to segment customers or analyze patterns in market data.
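A small sketch of K-Means with feature scaling, on two invented customer groups (low-spend vs. high-spend; the numbers are illustrative only):

```python
# K-Means sketch: scale first, because K-Means is distance-based.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: [monthly_spend, purchases]; two obvious groups by construction.
X = np.array([[20, 1], [22, 2], [25, 1], [80, 9], [85, 10], [90, 8]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)  # first three points share one label, last three the other
```

In practice you would pick the number of clusters with the elbow method or silhouette score rather than hard-coding it.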

5. Sentiment Analysis on Text Data

Dive into natural language processing (NLP) by analyzing text data to determine sentiment polarity (positive, negative, or neutral). 

Learn about tokenization, text preprocessing, and sentiment analysis techniques using libraries like NLTK or spaCy. Apply your skills to analyze product reviews or social media comments.
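NLTK and spaCy offer richer pipelines, but the core idea can be sketched with a plain bag-of-words classifier in scikit-learn. The tiny labeled corpus below is invented (1 = positive, 0 = negative):

```python
# Sentiment sketch: bag-of-words + logistic regression on invented reviews.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product, loved it", "terrible quality, broke fast",
         "loved the battery life", "awful support, terrible experience",
         "great value and great build", "broke after one day, awful"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["great battery, loved it"])[0])
```

A real project would use thousands of reviews, proper train/test splits, and likely TF-IDF weighting or a pretrained model instead of raw counts.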

6. Time Series Forecasting

Predict future trends or values based on historical time series data. Learn about time series decomposition, trend analysis, and seasonal patterns using methods like ARIMA or exponential smoothing. Apply your forecasting skills to predict stock prices, weather patterns, or sales trends.
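Libraries like statsmodels provide full ARIMA and exponential smoothing implementations; to show the mechanics, here is simple exponential smoothing written by hand on an invented sales series (each smoothed value blends the latest observation with the previous smoothed value):

```python
# Hand-rolled simple exponential smoothing (illustrative data).
def exp_smooth(series, alpha=0.5):
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [100, 110, 105, 120, 130]
fitted = exp_smooth(sales, alpha=0.5)
next_forecast = fitted[-1]  # one-step-ahead forecast for simple smoothing
print(fitted, next_forecast)
```

Larger `alpha` values react faster to recent changes; smaller values smooth more aggressively. Trend and seasonality need the Holt-Winters extensions or ARIMA.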

7. Image Classification with Convolutional Neural Networks (CNNs)

Explore deep learning concepts by building a basic CNN model to classify images into different categories. 

Learn about convolutional layers, pooling, and fully connected layers, and experiment with different architectures to improve model performance. Apply your CNN model to tasks like recognizing handwritten digits or classifying images of animals.
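Frameworks like Keras or PyTorch handle the full training loop, but the core operation of a convolutional layer is easy to write out by hand. This sketch slides a kernel over an image and takes dot products (a "valid" cross-correlation, as deep learning frameworks implement it; the image and kernel here are toy values):

```python
# The core CNN operation, written out by hand on a toy 4x4 "image".
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1  # "valid" convolution: no padding
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])   # horizontal-difference kernel
feat = conv2d(img, edge)
print(feat)  # every horizontal neighbor differs by 1, so all entries are -1
```

A real CNN stacks many such kernels, applies a nonlinearity and pooling, and learns the kernel weights by backpropagation.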

Intermediate-Level Data Science Capstone Project Ideas

8. Customer Segmentation and Market Basket Analysis

Utilize advanced clustering techniques to segment customers based on their purchasing behavior. Conduct market basket analysis to identify frequent item associations and recommend personalized product suggestions. 

Implement techniques like the Apriori algorithm or association rules mining to uncover valuable insights for targeted marketing strategies.
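The quantities Apriori builds on, support and confidence, can be computed directly. The baskets below are invented for illustration:

```python
# Market basket sketch: support and confidence computed by hand.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
           {"bread", "milk", "butter"}, {"bread", "milk"}]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # how often bread and milk co-occur
print(confidence({"bread"}, {"milk"}))  # of baskets with bread, share with milk
```

Apriori's contribution is pruning: it only counts itemsets whose subsets already meet a minimum support, which keeps this tractable on real transaction logs.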

9. Time Series Anomaly Detection

Apply anomaly detection algorithms to identify unusual patterns or outliers in time series data. Utilize techniques such as moving average, Z-score, or autoencoders to detect anomalies in various domains, including finance, IoT sensors, or network traffic. 

Develop robust anomaly detection models to enhance data security and support predictive maintenance.
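The Z-score approach is the simplest of these. A sketch on invented sensor readings (note that a single large outlier inflates the standard deviation, which is why thresholds are tuned or rolling windows used in practice):

```python
# Z-score anomaly detection sketch on invented sensor readings.
import numpy as np

def zscore_anomalies(series, threshold=2.5):
    """Return indices of points more than `threshold` std devs from the mean."""
    mean, std = series.mean(), series.std()
    z = (series - mean) / std
    return np.where(np.abs(z) > threshold)[0]

readings = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8])
anomalies = zscore_anomalies(readings)
print(anomalies)  # flags the spike at index 5
```

Moving-average variants compute the mean and deviation over a sliding window instead, which adapts to drifting baselines.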

10. Recommendation System Development

Build a recommendation engine to suggest personalized items or content to users based on their preferences and behavior. Implement collaborative filtering, content-based filtering, or hybrid recommendation approaches to improve user engagement and satisfaction. 

Evaluate the performance of your recommendation system using metrics like precision, recall, and mean average precision.
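Item-based collaborative filtering can be sketched in a few lines: items whose rating columns are similar (by cosine similarity) get recommended together. The ratings matrix below is invented, with items A and B liked by the same users:

```python
# Item-based collaborative filtering sketch on an invented ratings matrix.
import numpy as np

ratings = np.array([   # rows = users, columns = items A, B, C
    [5, 4, 1],
    [4, 5, 1],
    [1, 2, 5],
    [2, 1, 4],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_ab = cosine(ratings[:, 0], ratings[:, 1])
sim_ac = cosine(ratings[:, 0], ratings[:, 2])
print(sim_ab, sim_ac)  # A resembles B far more than C
```

A user who rated item A highly would therefore be recommended B before C. Real systems also handle missing ratings, mean-center per user, and scale via approximate nearest neighbors or matrix factorization.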

11. Natural Language Processing for Topic Modeling

Dive deeper into NLP by exploring topic modeling techniques to extract meaningful topics from text data. 

Implement algorithms like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) to identify hidden themes or subjects within large text corpora. Apply topic modeling to analyze customer feedback, news articles, or academic papers.
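A compact NMF sketch on an invented six-document corpus with two artificial themes (cooking and programming); the deterministic `nndsvd` initialization keeps the result reproducible:

```python
# NMF topic modeling sketch on a tiny invented corpus.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["bake bread with flour and yeast", "knead the dough and bake",
        "flour yeast dough bread",
        "write python code and tests", "debug python code", "tests code python"]

X = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-word weights
print(W.shape, H.shape)    # (n_docs, n_topics), (n_topics, vocab_size)
```

Inspecting the largest entries of each row of `H` recovers the top words per topic; on real corpora you would also tune the number of topics and filter stop words.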

12. Fraud Detection in Financial Transactions

Develop a fraud detection system using machine learning algorithms to identify suspicious activities in financial transactions. Utilize supervised learning techniques such as logistic regression, random forests, or gradient boosting to classify transactions as fraudulent or legitimate. 

Employ feature engineering and model evaluation to improve fraud detection accuracy and minimize false positives.
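Class imbalance is the defining difficulty here: fraud is rare. A sketch on synthetic transactions (features and distributions invented) using `class_weight="balanced"` so the minority class is not ignored:

```python
# Fraud detection sketch: imbalanced synthetic data, reweighted classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
legit = rng.normal(loc=[50, 1], scale=5, size=(200, 2))  # [amount, risk_score]
fraud = rng.normal(loc=[90, 8], scale=5, size=(10, 2))   # rare, shifted cluster
X = np.vstack([legit, fraud])
y = np.array([0] * 200 + [1] * 10)

clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.predict([[92, 8]])[0], clf.predict([[48, 1]])[0])  # fraud-like, legit-like
```

With real data you would evaluate with precision/recall (not accuracy, which is misleading at 95% legitimate) and consider resampling methods or gradient boosting.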

13. Predictive Maintenance for Industrial Equipment

Implement predictive maintenance techniques to anticipate equipment failures and prevent costly downtime. 

Analyze sensor data from machinery using machine learning algorithms like support vector machines or recurrent neural networks to predict when maintenance is required. Optimize maintenance schedules to minimize downtime and maximize operational efficiency.

14. Healthcare Data Analysis and Disease Prediction

Utilize healthcare datasets to analyze patient demographics, medical history, and diagnostic tests to predict the likelihood of disease occurrence or progression. 

Apply machine learning algorithms such as logistic regression, decision trees, or support vector machines to develop predictive models for diseases like diabetes, cancer, or heart disease. Evaluate model performance using metrics like sensitivity, specificity, and area under the ROC curve.

Advanced Level Data Science Capstone Project Ideas

15. Deep Learning for Image Generation

Explore generative adversarial networks (GANs) or variational autoencoders (VAEs) to generate realistic images from scratch. Experiment with architectures like DCGAN or StyleGAN to create high-resolution images of faces, landscapes, or artwork. 

Evaluate image quality and diversity using perceptual metrics and human judgment.

16. Reinforcement Learning for Game Playing

Implement reinforcement learning algorithms like deep Q-learning or policy gradients to train agents to play complex games like Atari or board games. 

Experiment with exploration-exploitation strategies and reward-shaping techniques to improve agent performance and achieve superhuman levels of gameplay.

17. Anomaly Detection in Streaming Data

Develop real-time anomaly detection systems to identify abnormal behavior in streaming data such as network traffic, sensor readings, or financial transactions. 

Utilize online learning algorithms like streaming k-means or Isolation Forest to detect anomalies and trigger timely alerts for intervention.
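The streaming constraint means you cannot store the whole history. One classic trick (a sketch, not the full streaming k-means or Isolation Forest machinery) is Welford's online algorithm, which maintains a running mean and variance so each new point can be scored in O(1):

```python
# Streaming anomaly sketch: Welford's online mean/variance, invented stream.
import math

class StreamingDetector:
    def __init__(self, threshold=3.0, warmup=10):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold, self.warmup = threshold, warmup

    def update(self, x):
        """Score x against history seen so far, then fold it into the stats."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

det = StreamingDetector()
stream = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0, 9.9, 10.1, 10.0, 30.0]
flags = [det.update(x) for x in stream]
print(flags)  # only the final spike is flagged
```

Production systems often skip updating the statistics on flagged points (so one anomaly does not inflate the baseline) and use windowed or exponentially decayed statistics to track drift.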

18. Multi-Modal Sentiment Analysis

Extend sentiment analysis to incorporate multiple modalities such as text, images, and audio to capture rich emotional expressions. 

Utilize deep learning architectures like multimodal transformers or fusion models to analyze sentiment across different modalities and improve understanding of complex human emotions.

19. Graph Neural Networks for Social Network Analysis

Apply graph neural networks (GNNs) to model and analyze complex relational data in social networks. Use techniques like graph convolutional networks (GCNs) or graph attention networks (GATs) to learn node embeddings and predict node properties such as community detection or influential users.

20. Time Series Forecasting with Deep Learning

Explore advanced deep learning architectures like long short-term memory (LSTM) networks or transformer-based models for time series forecasting. 

Utilize attention mechanisms and multi-horizon forecasting to capture long-term dependencies and improve prediction accuracy in dynamic and volatile environments.

21. Adversarial Robustness in Machine Learning

Investigate techniques to improve the robustness of machine learning models against adversarial attacks. 

Explore methods like adversarial training, defensive distillation, or certified robustness to mitigate vulnerabilities and ensure model reliability under adversarial perturbations, particularly in critical applications like autonomous vehicles or healthcare.

These project ideas cater to various skill levels in data science, ranging from beginners to experts. Choose a project that aligns with your interests and skill level, and don’t hesitate to experiment and learn along the way!

Factors to Consider When Choosing a Data Science Capstone Project

Choosing the right data science capstone project is crucial for your learning experience and effectively showcasing your skills. Here are some factors to consider when selecting a data science capstone project:

Personal Interest

Select a project that aligns with your passions and career goals to stay motivated and engaged throughout the process.

Data Availability

Ensure access to relevant and sufficient data to complete the project and draw meaningful insights effectively.

Complexity Level

Consider your current skill level and choose a project that challenges you without overwhelming you, allowing for growth and learning.

Real-World Impact

Aim for projects with practical applications or societal relevance to showcase your ability to solve tangible problems.

Resource Requirements

Evaluate the availability of resources such as time, computing power, and software tools needed to execute the project successfully.

Mentorship and Support

Seek projects with opportunities for guidance and feedback from mentors or peers to enhance your learning experience.

Novelty and Innovation

Explore projects that push boundaries and explore new techniques or approaches to demonstrate creativity and originality in your work.

Tips for Successfully Completing a Data Science Capstone Project

Successfully completing a data science capstone project requires careful planning, effective execution, and strong communication skills. Here are some tips to help you navigate through the process:

  • Plan and Prioritize: Break down the project into manageable tasks and create a timeline to stay organized and focused.
  • Understand the Problem: Clearly define the project objectives, requirements, and expected outcomes before analyzing.
  • Explore and Experiment: Experiment with different methodologies, algorithms, and techniques to find the most suitable approach.
  • Document and Iterate: Document your process, results, and insights thoroughly, and iterate on your analyses based on feedback and new findings.
  • Collaborate and Seek Feedback: Collaborate with peers, mentors, and stakeholders, actively seeking feedback to improve your work and decision-making.
  • Practice Communication: Communicate your findings effectively through clear visualizations, reports, and presentations tailored to your audience’s understanding.
  • Reflect and Learn: Reflect on your challenges, successes, and lessons learned throughout the project to inform your future endeavors and continuous improvement.

By following these tips, you can successfully navigate the data science capstone project and demonstrate your skills and expertise in the field.

Wrapping Up

In wrapping up, data science capstone project ideas are invaluable in bridging the gap between theory and practice, offering students a chance to apply their knowledge in real-world scenarios.

They are a cornerstone of data science education, fostering critical thinking, problem-solving, and practical skills development. 

As you embark on your journey, don’t hesitate to explore diverse and challenging project ideas. Embrace the opportunity to push boundaries, innovate, and make meaningful contributions to the field. 

Share your insights, challenges, and successes with others, and invite fellow enthusiasts to exchange ideas and experiences. 

1. What is the purpose of a data science capstone project?

A data science capstone project serves as a culmination of a student’s learning experience, allowing them to apply their knowledge and skills to solve real-world problems in the field of data science. It provides hands-on experience and showcases their ability to analyze data, derive insights, and communicate findings effectively.

2. What are some examples of data science capstone projects?

Data science capstone projects can cover a wide range of topics and domains, including predictive modeling, natural language processing, image classification, recommendation systems, and more. Examples may include analyzing customer behavior, predicting stock prices, sentiment analysis on social media data, or detecting anomalies in financial transactions.

3. How long does it typically take to complete a data science capstone project?

The duration of a data science capstone project can vary depending on factors such as project complexity, available resources, and individual pace. Generally, it may take several weeks to several months to complete a project, including tasks such as data collection, preprocessing, analysis, modeling, and presentation of findings.


25+ Solved End-to-End Big Data Projects with Source Code

Ace your big data analytics interview by adding some unique and exciting Big Data projects to your portfolio. This blog lists over 20 big data analytics projects you can work on to showcase your big data skills and gain hands-on experience in big data tools and technologies. You will find several big data projects depending on your level of expertise- big data projects for students, big data projects for beginners, etc.

Have you ever looked for sneakers on Amazon and then seen advertisements for similar sneakers while searching the internet for the perfect cake recipe? Maybe you started using Instagram to search for some fitness videos, and now Instagram keeps recommending videos from fitness influencers to you. And even if you're not very active on social media, I'm sure you now and then check your phone before leaving the house to see what the traffic is like on your route, to know how long it could take you to reach your destination. None of this would be possible without the big data analysis processes applied by modern data-driven companies. We bring you the top big data projects for 2023, specially curated for students, beginners, and anybody looking to get started with mastering data skills.

Table of Contents

  • What is a Big Data Project?
  • How Do You Create a Good Big Data Project?
  • 25+ Big Data Project Ideas to Help Boost Your Resume
  • Big Data Project Ideas for Beginners
  • Intermediate Projects on Data Analytics
  • Advanced Level Examples of Big Data Projects
  • Real-Time Big Data Projects with Source Code
  • Sample Big Data Project Ideas for Final Year Students
  • Big Data Project Ideas Using Hadoop
  • Big Data Projects Using Spark
  • GCP and AWS Big Data Projects
  • Best Big Data Project Ideas for Masters Students
  • Fun Big Data Project Ideas
  • Top 5 Apache Big Data Projects
  • Top Big Data Projects on GitHub with Source Code
  • Level-Up Your Big Data Expertise with ProjectPro's Big Data Projects
  • FAQs on Big Data Projects

A big data project is a data analysis project that uses machine learning algorithms and different data analytics techniques on structured and unstructured data for several purposes, including predictive modeling and other advanced analytics applications. Before actually working on any big data projects, data engineers must acquire proficient knowledge in the relevant areas, such as deep learning, machine learning, data visualization, data analytics, and data science.

Many platforms, like GitHub and ProjectPro, offer various big data projects for professionals at all skill levels- beginner, intermediate, and advanced. However, before moving on to a list of big data project ideas worth exploring and adding to your portfolio, let us first get a clear picture of what big data is and why everyone is interested in it.

Kicking off a big data analytics project is always the most challenging part. You always encounter questions like: what are the project goals, how can you become familiar with the dataset, what challenges are you trying to address, what are the necessary skills for this project, and what metrics will you use to evaluate your model?

Well! The first crucial step to launching your project initiative is to have a solid project plan. To build a big data project, you should always adhere to a clearly defined workflow. Before starting any big data project, it is essential to become familiar with the fundamental processes and steps involved, from gathering raw data to creating a machine learning model to its effective implementation.

Understand the Business Goals of the Big Data Project

The first step of any good big data analytics project is understanding the business or industry that you are working on. Go out and speak with the individuals whose processes you aim to transform with data before you even consider analyzing the data. Establish a timeline and specific key performance indicators afterward. Although planning and procedures can appear tedious, they are a crucial step to launching your data initiative! A definite purpose of what you want to do with data must be identified, such as a specific question to be answered, a data product to be built, etc., to provide motivation, direction, and purpose.

Collect Data for the Big Data Project

The next step in a big data project is looking for data once you've established your goal. To create a successful data project, collect and integrate data from as many different sources as possible. 

Here are some options for collecting data that you can utilize:

Connect to an existing database that is already public or access your private database.

Consider the APIs for all the tools your organization has been utilizing and the data they have gathered. You must put in some effort to set up those APIs so that you can use the email open and click statistics, the support request someone sent, etc.

There are plenty of datasets on the Internet that can provide more information than what you already have. There are open data platforms in several regions (like data.gov in the U.S.). These open data sets are a fantastic resource if you're working on a personal project for fun.

Data Preparation and Cleaning

The data preparation step, which may consume up to 80% of the time allocated to any big data or data engineering project, comes next. Once you have the data, it's time to start using it. Start exploring what you have and how you can combine everything to meet the primary goal. To understand the relevance of all your data, start making notes on your initial analyses and ask significant questions to businesspeople, the IT team, or other groups. Data Cleaning is the next step. To ensure that data is consistent and accurate, you must review each column and check for errors, missing data values, etc.
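In pandas, the checks described above (wrong types, missing values, duplicates, inconsistent categories) look roughly like this; the table is invented for illustration:

```python
# Data cleaning sketch in pandas on an invented orders table.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.5", "20.0", "20.0", None],   # stored as strings, one missing
    "country": ["US", "us", "us", "DE"],        # inconsistent casing
})

df = df.drop_duplicates()                          # the repeated order row
df["amount"] = pd.to_numeric(df["amount"])         # string -> float
df["amount"] = df["amount"].fillna(df["amount"].median())  # impute the gap
df["country"] = df["country"].str.upper()          # normalize categories
print(df.isna().sum().sum())                       # no missing values remain
```

Median imputation is just one choice; depending on the column you might drop the row, use a domain default, or flag the gap with an indicator column instead.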

Making sure that your project and your data are compatible with data privacy standards is a key aspect of data preparation that should not be overlooked. Personal data privacy and protection are becoming increasingly crucial, and you should prioritize them immediately as you embark on your big data journey. You must consolidate all your data initiatives, sources, and datasets into one location or platform to facilitate governance and carry out privacy-compliant projects. 

Data Transformation and Manipulation

Now that the data is clean, it's time to modify it so you can extract useful information. Start by combining all of your various sources and grouping logs to focus your data on the most significant aspects. You can do this, for instance, by adding time-based attributes to your data, like:

Acquiring date-related elements (month, hour, day of the week, week of the year, etc.)

Calculating the variations between date-column values, etc.

Joining datasets is another way to improve data, which entails extracting columns from one dataset or tab and adding them to a reference dataset. This is a crucial component of any analysis, but it can become a challenge when you have many data sources.
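Both steps above, deriving date attributes and joining a reference dataset, can be sketched in pandas (the orders and customers tables are invented):

```python
# Transformation sketch: date-derived features plus a reference-table join.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "ordered_at": pd.to_datetime(["2023-01-02", "2023-03-15", "2023-03-18"]),
    "shipped_at": pd.to_datetime(["2023-01-05", "2023-03-16", "2023-03-25"]),
    "customer_id": [10, 11, 10],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["EU", "US"]})

# Date-related elements and a variation between date columns.
orders["month"] = orders["ordered_at"].dt.month
orders["weekday"] = orders["ordered_at"].dt.day_name()
orders["days_to_ship"] = (orders["shipped_at"] - orders["ordered_at"]).dt.days

# Enrich orders with columns from the reference dataset.
enriched = orders.merge(customers, on="customer_id", how="left")
print(enriched[["order_id", "month", "days_to_ship", "region"]])
```

The `how="left"` join keeps every order even if a customer is missing from the reference table, which is usually what you want when enriching a fact table.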

Visualize Your Data

Now that you have a decent dataset (or perhaps several), it would be wise to begin analyzing it by creating beautiful dashboards, charts, or graphs. The next stage of any data analytics project should focus on visualization because it is the most excellent approach to analyzing and showcasing insights when working with massive amounts of data.

Another method for enhancing your dataset and creating more intriguing features is to use graphs. For instance, by plotting your data points on a map, you may discover that some geographic regions are more informative than others.

Build Predictive Models Using Machine Learning Algorithms

Machine learning algorithms can help you take your big data project to the next level by providing you with more details and making predictions about future trends. You can create models to find trends in the data that were not visible in graphs by working with clustering techniques (also known as unsupervised learning). These organize relevant outcomes into clusters and more or less explicitly state the characteristic that determines these outcomes.

Advanced data scientists can use supervised algorithms to predict future trends. They discover features that have influenced previous data patterns by reviewing historical data and can then generate predictions using these features. 

Lastly, your predictive model needs to be operationalized for the project to be truly valuable. Deploying a machine learning model for adoption by all individuals within an organization is referred to as operationalization.

Repeat The Process

This is the last step in completing your big data project, and it's crucial to the whole data life cycle. One of the biggest mistakes individuals make when it comes to machine learning is assuming that once a model is created and implemented, it will always function normally. On the contrary, if models aren't updated with the latest data and regularly modified, their quality will deteriorate with time.

You need to accept that your model will never indeed be "complete" to accomplish your first data project effectively. You need to continually reevaluate, retrain it, and create new features for it to stay accurate and valuable. 

If you are a newbie to Big Data, keep in mind that it is not an easy field, but at the same time, remember that nothing good in life comes easy; you have to work for it. The most helpful way of learning a skill is through hands-on experience. Below is a list of Big Data analytics project ideas, along with an idea of the approach you could take to develop them, in the hope that this helps you learn more about Big Data and even kick-start a career in it.

Yelp Data Processing Using Spark And Hive Part 1

Yelp Data Processing using Spark and Hive Part 2

Hadoop Project for Beginners-SQL Analytics with Hive

Tough engineering choices with large datasets in Hive Part - 1

Finding Unique URL's using Hadoop Hive

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Orchestrate Redshift ETL using AWS Glue and Step Functions

Analyze Yelp Dataset with Spark & Parquet Format on Azure Databricks

Data Warehouse Design for E-commerce Environments

Analyzing Big Data with Twitter Sentiments using Spark Streaming

PySpark Tutorial - Learn to use Apache Spark with Python

Tough engineering choices with large datasets in Hive Part - 2

Event Data Analysis using AWS ELK Stack

Web Server Log Processing using Hadoop

Data processing with Spark SQL

Build a Time Series Analysis Dashboard with Spark and Grafana

GCP Data Ingestion with SQL using Google Cloud Dataflow

Deploying auto-reply Twitter handle with Kafka, Spark, and LSTM

Dealing with Slowly Changing Dimensions using Snowflake

Spark Project -Real-Time data collection and Spark Streaming Aggregation

Snowflake Real-Time Data Warehouse Project for Beginners-1

Real-Time Log Processing using Spark Streaming Architecture

Real-Time Auto Tracking with Spark-Redis

Building Real-Time AWS Log Analytics Solution

In this section, you will find a list of good big data project ideas for masters students.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive

Online Hadoop Projects -Solving small file problem in Hadoop

Airline Dataset Analysis using Hadoop, Hive, Pig, and Impala

AWS Project-Website Monitoring using AWS Lambda and Aurora

Explore features of Spark SQL in practice on Spark 2.0

MovieLens Dataset Exploratory Analysis

Bitcoin Data Mining on AWS

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

Spark Project-Analysis and Visualization on Yelp Dataset

Project Ideas on Big Data Analytics

Let us now begin with a more detailed list of good big data project ideas that you can easily implement.

This section will introduce you to a list of project ideas on big data that use Hadoop along with descriptions of how to implement them.

1. Visualizing Wikipedia Trends

Human brains tend to process visual data better than data in any other format. 90% of the information transmitted to the brain is visual, and the human brain can process an image in just 13 milliseconds. Wikipedia is a page that is accessed by people all around the world for research purposes, general information, and just to satisfy their occasional curiosity. 

Raw page data counts from Wikipedia can be collected and processed via Hadoop. The processed data can then be visualized using Zeppelin notebooks to analyze trends that can be supported based on demographics or parameters. This is a good pick for someone looking to understand how big data analysis and visualization can be achieved through Big Data and also an excellent pick for an Apache Big Data project idea. 

Visualizing Wikipedia Trends Big Data Project with Source Code.

2. Visualizing Website Clickstream Data

Clickstream data analysis refers to collecting, processing, and understanding all the web pages a particular user visits. This analysis benefits web page marketing, product management, and targeted advertisement. Since users tend to visit sites based on their requirements and interests, clickstream analysis can help to get an idea of what a user is looking for. 

Visualization of the same helps in identifying these trends. In this manner, advertisements can be generated specific to individuals. Ads on webpages provide a source of income for the webpage and help the business publishing the ad reach its customers and other internet users alike. This can be classified as a Big Data Apache project by using Hadoop to build it.

Big Data Analytics Projects Solution for Visualization of Clickstream Data on a Website

3. Web Server Log Processing

A web server log maintains a list of page requests and activities it has performed. Storing, processing, and mining the data on web servers can be done to analyze the data further. In this manner, webpage ads can be determined, and SEO (Search Engine Optimization) can also be improved. A better overall user experience can be achieved through web-server log analysis. This kind of processing benefits any business that heavily relies on its website for revenue generation or to reach out to its customers. The Apache Hadoop open source big data project ecosystem, with tools such as Pig, Impala, Hive, Spark, Kafka, Oozie, and HDFS, can be used for storage and processing.

Big Data Project using Hadoop with Source Code for Web Server Log Processing 

This section will provide you with a list of projects that utilize Apache Spark for their implementation.

4. Analysis of Twitter Sentiments Using Spark Streaming

Sentiment analysis is another interesting big data project topic that deals with the process of determining whether a given opinion is positive, negative, or neutral. For a business, knowing the sentiments or the reactions of a group of people to a new product launch or a new event can help determine the profitability of the product and can help the business have a more extensive reach by getting an idea of how customers feel. From a political standpoint, the sentiments of the crowd toward a candidate or a decision taken by a party can help determine what keeps a specific group of people happy and satisfied. You can use Twitter sentiments to predict election results as well. 

Sentiment Analysis Big Data Project

Sentiment analysis has to be done on a large dataset, since Twitter has over 180 million monetizable daily active users ( https://www.businessofapps.com/data/twitter-statistics/), and it has to be done in real time. Spark Streaming can gather data from Twitter as it arrives, while NLP (Natural Language Processing) models, trained on prior labeled datasets, perform the sentiment classification. Because of the NLP component, this is one of the more advanced projects showcasing the use of Big Data.
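To make the classification step concrete, here is a deliberately tiny lexicon-based scorer in plain Python. It stands in for the trained NLP model the project calls for; the word lists are illustrative, not a real sentiment lexicon:

```python
# Toy sentiment lexicon (illustrative only)
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "hate", "terrible", "awful", "poor"}

def classify(tweet):
    """Label a tweet positive/negative/neutral by counting lexicon hits."""
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("I love this product, it is great!"))  # positive
print(classify("Terrible launch, really bad."))       # negative
print(classify("The event starts at noon."))          # neutral
```

In the real pipeline, each micro-batch arriving from Spark Streaming would be mapped through a model like this (only far more capable).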

Access Big Data Project Solution to Twitter Sentiment Analysis

5. Real-time Analysis of Log-entries from Applications Using Streaming Architectures

If you are looking to get your hands dirty with a real-time big data project, this one belongs on your list. Whereas web server log processing handles data in batches, applications that stream data produce log files that must be processed in real time for better analysis. Real-time behavior analysis gives deeper insight into customers and helps surface content that keeps users engaged; it can also detect a security breach so that action can be taken immediately. Many social media networks rely on real-time analysis of the content their users stream. Spark Streaming can process such real-time data.
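The windowed aggregations a streaming engine maintains can be sketched in plain Python. The class below counts events (say, ERROR log lines) seen in the last 60 seconds, roughly what a Spark Streaming job would compute per micro-batch; timestamps here are simple integers for illustration:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events over the last `window` seconds, mimicking the
    state a streaming job keeps for a sliding-window aggregation."""
    def __init__(self, window=60):
        self.window = window
        self.events = deque()

    def add(self, timestamp):
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop events strictly older than the window
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()

w = SlidingWindowCounter(window=60)
for t in (0, 10, 50, 100):   # seconds at which an ERROR line arrived
    w.add(t)
print(w.count(now=110))  # 2  (events at t=50 and t=100 fall within the window)
```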

Access Big Data Spark Project Solution to Real-time Analysis of log-entries from applications using Streaming Architecture

6. Analysis of Crime Datasets

Analyzing crimes such as shootings, robberies, and murders can reveal trends that keep the police alert to the likelihood of crime in a given area. These trends can inform a more strategic, optimal approach to siting police stations and stationing personnel.

With access to CCTV surveillance in real-time, behavior detection can help identify suspicious activities. Similarly, facial recognition software can play a bigger role in identifying criminals. A basic analysis of a crime dataset is one of the ideal Big Data projects for students. However, it can be made more complex by adding in the prediction of crime and facial recognition in places where it is required.
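The basic trend-finding step can be as simple as ranking areas by incident count. A minimal Python sketch, using an invented incident schema:

```python
from collections import Counter

def hotspots(incidents, top=2):
    """Rank areas by incident count, a first step before any
    predictive modeling or patrol planning."""
    return Counter(i["area"] for i in incidents).most_common(top)

# Invented sample records; a real dataset (e.g., Chicago crime data)
# would have many more fields and millions of rows.
incidents = [
    {"type": "robbery",  "area": "Downtown"},
    {"type": "shooting", "area": "Downtown"},
    {"type": "robbery",  "area": "Riverside"},
    {"type": "murder",   "area": "Downtown"},
]
print(hotspots(incidents))  # [('Downtown', 3), ('Riverside', 1)]
```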

Big Data Analytics Projects for Students on Chicago Crime Data Analysis with Source Code


In this section, you will find big data projects that rely on cloud service providers such as AWS and GCP.

7. Build a Scalable Event-Based GCP Data Pipeline using DataFlow

Suppose you are running an eCommerce website and a customer places an order. You must inform the warehouse team to check stock availability and commit to fulfilling the order, after which the parcel is assigned to a delivery firm to be shipped to the customer. For such scenarios, schedule-driven batch integration is a poor fit, so event-based data integration is preferred.

This project will teach you how to design and implement an event-based data integration pipeline on the Google Cloud Platform by processing data using DataFlow .
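The fan-out behavior that makes event-based integration attractive can be illustrated without any cloud services. The toy publish/subscribe class below is an in-memory stand-in for Pub/Sub plus DataFlow, not the actual GCP APIs:

```python
class PubSub:
    """Minimal in-memory stand-in for a Pub/Sub topic: publishing an
    event fans it out to every subscribed handler."""
    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        # Each handler reacts to the event independently
        return [h(event) for h in self.handlers]

orders = PubSub()
orders.subscribe(lambda e: f"warehouse: reserve stock for {e['order_id']}")
orders.subscribe(lambda e: f"delivery: assign carrier for {e['order_id']}")

results = orders.publish({"order_id": "A-1001"})  # hypothetical order event
print(results)
```

On GCP, the handlers would be Cloud Functions or DataFlow jobs triggered by the topic rather than in-process callbacks.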

Scalable Event-Based GCP Data Pipeline using DataFlow

Data Description: For this project, you will use the Covid-19 dataset (COVID-19 Cases.csv) from data.world, which includes attributes such as:

people_positive_cases_count

county_name

data_source

Language Used: Python 3.7

Services: Cloud Composer , Google Cloud Storage (GCS), Pub-Sub , Cloud Functions, BigQuery, BigTable

Big Data Project with Source Code: Build a Scalable Event-Based GCP Data Pipeline using DataFlow  

8. Topic Modeling

The future is AI! You must have come across similar quotes about artificial intelligence (AI). Many people initially found them hard to believe, yet top multinational companies are now drifting toward automating tasks with machine learning tools.

Understand the reason behind this drift by working on one of our repository's most practical data engineering project examples .

Topic Modeling Big Data Project

Project Objective: Understand the end-to-end implementation of Machine learning operations (MLOps) by using cloud computing .

Learnings from the Project: This project will introduce you to various applications of AWS services. You will learn how to convert an ML application into a Flask application and deploy it using the Gunicorn web server. You will implement the project solution in AWS CodeBuild, and the project will help you understand ECS Cluster Task Definitions.

Tech Stack:

Language: Python

Libraries: Flask, gunicorn, scipy , nltk , tqdm, numpy, joblib, pandas, scikit_learn, boto3

Services: Flask, Docker, AWS, Gunicorn

Source Code: MLOps AWS Project on Topic Modeling using Gunicorn Flask

9. MLOps on GCP Project for Autoregression using uWSGI Flask

Here is a project that combines Machine Learning Operations (MLOps) and the Google Cloud Platform (GCP). As companies switch to automation using machine learning algorithms, they have realized that hardware plays a crucial role; thus, many cloud service providers have emerged to help such companies overcome their hardware limitations. We have added this project to our repository to assist you with the end-to-end deployment of a machine learning project.

Project Objective: Deploying the moving average time-series machine-learning model on the cloud using GCP and Flask.

Learnings from the Project: You will work with Flask and uWSGI model files in this project. You will learn about creating Docker images and Kubernetes architecture, explore different components of GCP and their significance, and see how to clone the git repository containing the source code. Flask and Kubernetes deployment will also be discussed.

Tech Stack: Language - Python

Services - GCP, uWSGI, Flask, Kubernetes, Docker
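The underlying model is simple enough to sketch directly. A minimal moving-average forecaster in plain Python (the demand series is made up):

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window

demand = [10, 12, 14, 16, 18]  # invented time series
print(moving_average_forecast(demand, window=3))  # 16.0
```

In the project, a model like this is wrapped in a Flask endpoint served by uWSGI, containerized with Docker, and deployed on Kubernetes in GCP.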


This section has good big data project ideas for graduate students enrolled in a master's course.

10. Real-time Traffic Analysis

Traffic is an issue in many major cities, especially during the busier hours of the day. If traffic is monitored in real time over popular and alternate routes, steps can be taken to reduce congestion on some roads. Real-time traffic analysis can also drive traffic-light programming at junctions: lights stay green longer on roads with heavier vehicular movement and for less time on quieter roads. It can likewise help businesses manage their logistics and help commuters plan their journeys. Concepts of deep learning can be used to analyze such datasets properly.
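The light-timing idea can be sketched as a proportional allocation: split a fixed signal cycle across roads according to observed vehicle counts, with a minimum floor so no road is starved. The counts below are invented:

```python
def green_times(counts, cycle=120, minimum=10):
    """Split a junction's signal cycle (seconds) across roads in
    proportion to observed vehicle counts, with a per-road floor."""
    roads = list(counts)
    flexible = cycle - minimum * len(roads)   # seconds left after floors
    total = sum(counts.values())
    return {
        r: round(minimum + flexible * counts[r] / total, 1)
        for r in roads
    }

print(green_times({"north": 60, "east": 30, "south": 10}))
# {'north': 64.0, 'east': 37.0, 'south': 19.0}
```

A real system would recompute this continuously from streaming sensor counts rather than from a static dictionary.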

11. Health Status Prediction

“Health is wealth” is a prevalent saying, and rightly so: there can be no wealth unless one is healthy enough to enjoy it. Many diseases have risk factors that may be genetic, environmental, or dietary, and that are more common in a particular age group, sex, race, or region. By gathering datasets of this information for particular diseases, e.g., breast cancer, Parkinson’s disease, and diabetes, the presence of multiple risk factors can be used to estimate the probability of onset.

Health Status Prediction Big Data Project

In cases where the risk factors are not already known, analysis of the datasets can be used to identify patterns of risk factors and hence predict the likelihood of onset accordingly. The level of complexity could vary depending on the type of analysis that has to be done for different diseases. Nevertheless, since prediction tools have to be applied, this is not a beginner-level big data project idea.
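One simple way to turn counted risk factors into an onset probability is a logistic model. The sketch below uses invented weights purely for illustration; a real project would fit them to the gathered datasets rather than hand-pick them:

```python
import math

def onset_probability(present_factors, weights, bias=-3.0):
    """Toy logistic model: each risk factor present contributes its weight,
    and the sigmoid maps the total to a probability of disease onset.
    Weights here are made up for illustration, not clinically derived."""
    z = bias + sum(weights.get(f, 0.0) for f in present_factors)
    return 1 / (1 + math.exp(-z))

weights = {"family_history": 1.5, "smoking": 1.0, "obesity": 0.8, "age_over_60": 1.2}
low  = onset_probability(["smoking"], weights)
high = onset_probability(["family_history", "smoking", "obesity", "age_over_60"], weights)
print(round(low, 3), round(high, 3))  # 0.119 0.818
```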

12. Analysis of Tourist Behavior

Tourism is a large sector that provides a livelihood for many people and significantly impacts a country's economy. Not all tourists behave alike, simply because individuals have different preferences. Analyzing their behavior based on decision-making, perception, choice of destination, and level of satisfaction can help both travelers and locals have a more wholesome experience. Behavior analysis, like sentiment analysis, is one of the more advanced project ideas in the Big Data field.

13. Detection of Fake News on Social Media


With the popularity of social media, a major concern is the spread of fake news on various sites; worse, this misinformation tends to spread even faster than factual information. According to Wikipedia, fake news can be visual-based, referring to images, videos, and even graphical representations of data, or linguistics-based, referring to fake news in the form of text or strings of characters, and different cues are used for each type to differentiate fake news from real. A site like Twitter has 330 million users, while Facebook has 2.8 billion; enormous amounts of data circulate on these sites and must be processed to determine each post's validity. Data models based on machine learning techniques and computational methods based on NLP will have to be used to build an algorithm that can detect fake news on social media.
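A few of the linguistics-based cues can be computed with nothing more than string operations. The feature extractor below is illustrative only; production systems rely on trained NLP models over far richer feature sets:

```python
def linguistic_cues(text):
    """Extract simple linguistic features of the kind a fake-news
    classifier might consume (sensationalism often correlates with
    heavy punctuation and all-caps words)."""
    words = text.split()
    return {
        "num_words": len(words),
        "exclamation_marks": text.count("!"),
        "all_caps_words": sum(1 for w in words if len(w) > 2 and w.isupper()),
    }

cues = linguistic_cues("SHOCKING!! You WONT believe this miracle cure!")
print(cues)  # {'num_words': 7, 'exclamation_marks': 3, 'all_caps_words': 2}
```

Features like these would be fed, alongside embeddings and source metadata, into the actual classifier.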

Access Solution to Interesting Big Data Project on Detection of Fake News

14. Prediction of Calamities in a Given Area

Certain calamities, such as landslides and wildfires, occur more frequently during a particular season and in certain areas. Using certain geospatial technologies such as remote sensing and GIS (Geographic Information System) models makes it possible to monitor areas prone to these calamities and identify triggers that lead to such issues. 

Calamity Prediction Big Data Project

If calamities can be predicted more accurately, steps can be taken to protect residents, contain the disasters, and perhaps even prevent them in the first place. Past landslide data has to be analyzed while, at the same time, on-site ground monitoring is done using remote sensing. The sooner a calamity can be identified, the easier it is to contain the harm. The need for knowledge and application of GIS adds to the complexity of this Big Data project.

15. Generating Image Captions

With the emergence of social media and the importance of digital marketing, it has become essential for businesses to upload engaging content. Catchy images are a requirement, but captions describing them must also be added, and hashtags and attention-drawing captions help reach the correct target audience a little better. Large datasets that correlate images with captions have to be handled.

Image Caption Generating Big Data Project

This involves image processing and deep learning to understand the image, and artificial intelligence to generate relevant yet appealing captions. Python can be used as the implementation language. Image caption generation cannot exactly be considered a beginner-level Big Data project idea, so it is probably better to get some exposure to simpler projects before attempting this one.


16. Credit Card Fraud Detection


The goal is to identify fraudulent credit card transactions so that a customer is not billed for an item they did not purchase. This can be challenging: the datasets are huge, detection must happen quickly so that fraudsters cannot keep purchasing, and the underlying data is largely private, which limits availability. Since the project involves machine learning, results improve with larger datasets. Fraud detection is valuable to a business because customers are more likely to trust companies with strong fraud detection, knowing they will not be billed for someone else's purchases. It is one of the most common Big Data project ideas for beginners and students.
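A first-pass detector can simply flag transactions that deviate sharply from a customer's spending history. The z-score check below is a simplified stand-in for the ML models the project would actually use (the amounts are invented):

```python
import statistics

def flag_outliers(amounts, threshold=2.5):
    """Flag transactions whose amount is more than `threshold` standard
    deviations from the customer's mean spend: a crude first-pass
    anomaly check before heavier machine learning models."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    return [a for a in amounts if stdev and abs(a - mean) / stdev > threshold]

history = [20, 25, 22, 19, 24, 21, 23, 20, 22, 5000]  # one suspicious charge
print(flag_outliers(history))  # [5000]
```

Real systems also weigh merchant, location, and timing features, and must score each transaction within milliseconds.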

If you are looking for big data project examples that are fun to implement, then do not miss out on this section.

17. GIS Analytics for Better Waste Management

Due to urbanization and population growth, large amounts of waste are being generated globally. Improper waste management is a hazard not only to the environment but also to us. Waste management involves the process of handling, transporting, storing, collecting, recycling, and disposing of the waste generated. Optimal routing of solid waste collection trucks can be done using GIS modeling to ensure that waste is picked up, transferred to a transfer site, and reaches the landfills or recycling plants most efficiently. GIS modeling can also be used to select the best sites for landfills. The location and placement of garbage bins within city localities must also be analyzed. 
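The routing idea can be prototyped with a greedy nearest-neighbour heuristic before bringing in real GIS tooling. Coordinates below are arbitrary plane points, not geographic data:

```python
import math

def route(depot, bins):
    """Greedy nearest-neighbour ordering of garbage-bin pickups, a crude
    stand-in for the GIS-based route optimisation described above."""
    remaining = list(bins)
    pos, order = depot, []
    while remaining:
        nearest = min(remaining, key=lambda b: math.dist(pos, b))
        order.append(nearest)
        remaining.remove(nearest)
        pos = nearest
    return order

stops = route((0, 0), [(5, 5), (1, 0), (2, 2)])
print(stops)  # [(1, 0), (2, 2), (5, 5)]
```

A production solver would use road-network distances from the GIS model and handle truck capacities and time windows.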

18. Customized Programs for Students

We all have different strengths and paces of learning. There are many kinds of intelligence, and the standard curriculum focuses on only a few. Data analytics can help tailor academic programs to nurture students better: programs can be designed around a student's attention span and adjusted to an individual's pace, which may differ by subject. For example, one student may grasp language subjects easily but struggle with mathematical concepts, while another breezes through math but not languages. Customized programs can boost students' morale and could also reduce dropout rates. Analyzing a student's strong subjects, monitoring their attention span, and recording their responses to specific topics in a subject can help build the dataset needed to create these customized programs.

19. Real-time Tracking of Vehicles

Transportation plays a significant role in many activities. Every day, goods have to be shipped across cities and countries; kids commute to school, and employees have to get to work. Some of these modes might have to be closely monitored for safety and tracking purposes. I’m sure parents would love to know if their children’s school buses were delayed while coming back from school for some reason. 

Vehicle Tracking Big Data Project

Taxi applications must keep track of their users to ensure the safety of both drivers and riders. Tracking has to be done in real time, as the vehicles are continuously on the move, producing a continuous stream of data. This data must be processed so that information on vehicle movement is available, both to improve routes where required and simply to know the general whereabouts of each vehicle.
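Processing a vehicle's ping stream usually starts with distance computations. The sketch below applies the haversine formula to an invented sequence of GPS pings:

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def total_distance(pings):
    """Distance covered by a vehicle given its ordered stream of GPS pings."""
    return sum(haversine_km(p, q) for p, q in zip(pings, pings[1:]))

# Hypothetical pings from one bus, oldest first
bus_pings = [(12.9716, 77.5946), (12.9750, 77.6000), (12.9800, 77.6050)]
print(round(total_distance(bus_pings), 2))
```

In a streaming setup, each new ping updates the running total (and flags stalls or route deviations) as it arrives.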

20. Analysis of Network Traffic and Call Data Records

Large amounts of data make the rounds in the telecommunications industry, yet very little of it is currently used to improve the business. According to a MindCommerce study: “An average telecom operator generates billions of records per day, and data should be analyzed in real or near real-time to gain maximum benefit.”

The main challenge is that these large amounts of data must be processed in real time. With big data analysis, telecom companies can make decisions that improve the customer experience by monitoring network traffic. Issues such as call drops and network interruptions must be closely monitored so they can be addressed. By evaluating customers' usage patterns, better service plans can be designed to meet their needs. The complexity and tools used could vary based on the usage requirements of this project.
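A representative aggregate is the call-drop rate per cell tower. A minimal Python sketch over an invented CDR schema:

```python
from collections import defaultdict

def drop_rate_by_cell(records):
    """Per-cell-tower call-drop rate from call data records (CDRs),
    the kind of aggregate an operator would watch in near real time."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [calls, drops]
    for r in records:
        totals[r["cell"]][0] += 1
        totals[r["cell"]][1] += r["dropped"]
    return {cell: drops / calls for cell, (calls, drops) in totals.items()}

cdrs = [
    {"cell": "T1", "dropped": 0},
    {"cell": "T1", "dropped": 1},
    {"cell": "T2", "dropped": 0},
    {"cell": "T2", "dropped": 0},
]
print(drop_rate_by_cell(cdrs))  # {'T1': 0.5, 'T2': 0.0}
```

At billions of records per day, the same group-by would run in Spark or Flink with the results feeding an alerting dashboard.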

This section covers open-source big data frameworks developed under the Apache Software Foundation.

21. Apache Hadoop

Apache Hadoop is an open-source big data processing framework that allows distributed storage and processing of large datasets across clusters of commodity hardware. It provides a scalable, reliable, and cost-effective solution for processing and analyzing big data.

22. Apache Spark

Apache Spark is an open-source big data processing engine that provides high-speed data processing capabilities for large-scale data processing tasks. It offers a unified analytics platform for batch processing, real-time processing, machine learning, and graph processing.

23. Apache Nifi 

Apache NiFi is an open-source data integration tool that enables users to easily and securely transfer data between systems, databases, and applications. It provides a web-based user interface for creating, scheduling, and monitoring data flows, making it easy to manage and automate data integration tasks.

24. Apache Flink

Apache Flink is an open-source big data processing framework that provides scalable, high-throughput, and fault-tolerant data stream processing capabilities. It offers low-latency data processing and provides APIs for batch processing, stream processing, and graph processing.

25. Apache Storm

Apache Storm is an open-source distributed real-time processing system that provides scalable and fault-tolerant stream processing capabilities. It allows users to process large amounts of data in real-time and provides APIs for creating data pipelines and processing data streams.

Does Big Data sound difficult to work with? Work on end-to-end solved Big Data Projects using Spark , and you will know how easy it is!

This section lists big data projects along with links to their source code on GitHub.

26. Fruit Image Classification

This project aims to build a mobile application that lets users take pictures of fruits and get details about them for fruit harvesting. It develops a data processing chain in a big data environment using Amazon Web Services (AWS) cloud tools, including steps like data preprocessing and dimensionality reduction, and implements a fruit image classification engine.

Fruit Image Classification Big Data Project

The project involves writing PySpark scripts and utilizing the AWS cloud to benefit from a Big Data architecture (EC2, S3, IAM) built on an EC2 Linux server. It also uses Databricks, since it is compatible with AWS.

Source Code: Fruit Image Classification

27. Airline Customer Service App

In this project, you will build a web application that uses machine learning and Azure Databricks to forecast travel delays using weather data and airline delay statistics. Planning a bulk data import operation is the first step, followed by preparation: cleaning the data and readying it for testing and for building your machine learning model.

Airline Customer Service App Big Data Project

This project will teach you how to deploy the trained model to Docker containers for on-demand predictions after storing it in Azure Machine Learning Model Management. It transfers data using Azure Data Factory (ADF) and summarizes data using Azure Databricks and Spark SQL. The project uses Power BI to visualize batch forecasts.

Source Code: Airline Customer Service App

28. Criminal Network Analysis

This fascinating big data project seeks to find patterns to predict and detect links in a dynamic criminal network. This project uses a stream processing technique to extract relevant information as soon as data is generated since the criminal network is a dynamic social graph. It also suggests three brand-new social network similarity metrics for criminal link discovery and prediction. The next step is to develop a flexible data stream processing application using the Apache Flink framework, which enables the deployment and evaluation of the newly proposed and existing metrics.
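The flavor of link prediction on such a graph can be shown with the classic common-neighbours score. Note these are not the three metrics the project proposes, just a well-known baseline, and the graph below is invented:

```python
def common_neighbors(graph, u, v):
    """Common-neighbours similarity: the more associates two people share,
    the more likely an undiscovered link exists between them."""
    return len(graph.get(u, set()) & graph.get(v, set()))

def predict_links(graph, top=1):
    """Score every unlinked pair and return the most likely hidden links."""
    nodes = sorted(graph)
    candidates = [
        (u, v, common_neighbors(graph, u, v))
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in graph[u]          # only score pairs not yet linked
    ]
    return sorted(candidates, key=lambda c: -c[2])[:top]

graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
print(predict_links(graph))  # [('A', 'D', 2)]
```

In the actual project, scores like this would be recomputed incrementally inside a Flink job as new edges stream in.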

Source Code- Criminal Network Analysis

Trying out the big data project ideas mentioned in this blog will help you get used to the popular tools in the industry. But these projects alone are not enough if you are planning to land a job in the big data industry. If you are curious about what else will get you closer to your dream job, we highly recommend you check out ProjectPro. ProjectPro hosts a repository of solved projects in Data Science and Big Data prepared by industry experts, offered as a subscription with guided videos and supporting documentation to help you understand each project end-to-end. So don't wait any longer: get your hands dirty with ProjectPro projects and subscribe to the repository today!


1. Why are big data projects important?

Big data projects are important as they will help you to master the necessary big data skills for any job role in the relevant field. These days, most businesses use big data to understand what their customers want, their best customers, and why individuals select specific items. This indicates a huge demand for big data experts in every industry, and you must add some good big data projects to your portfolio to stay ahead of your competitors.

2. What are some good big data projects?

Design a Network Crawler by Mining Github Social Profiles. In this big data project, you'll work on a Spark GraphX Algorithm and a Network Crawler to mine the people relationships around various Github projects.

Visualize Daily Wikipedia Trends using Hadoop - In this big data project, you'll process Wikipedia page-view data with Hadoop and visualize daily trends.

Modeling & Thinking in Graphs(Neo4J) using Movielens Dataset - You will reconstruct the movielens dataset in a graph structure and use that structure to answer queries in various ways in this Neo4j big data project.

3. How long does it take to complete a big data project?

A big data project might take a few hours to hundreds of days to complete. It depends on various factors such as the type of data you are using, its size, where it's stored, whether it is easily accessible, whether you need to perform any considerable amount of ETL processing on the data, etc. 

Access Solved Big Data and Data Science Projects

About the Author


ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, with over 270 reusable project templates in data science and big data with step-by-step walkthroughs.

© 2024 Iconiq Inc.


Capstone Projects

M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project at the beginning of the year and work on the project over the course of two semesters. 

Most projects are sponsored by an organization—academic, commercial, non-profit, and government—seeking valuable recommendations to address strategic and operational issues. Depending on the needs of the sponsor, teams may develop web-based applications that can support ongoing decision-making. The capstone project concludes with a paper and presentation.

Key takeaways:

  • Synthesizing the concepts you have learned throughout the program in various courses (this requires that the question posed by the project be complex enough to require the application of appropriate analytical approaches learned in the program and that the available data be of sufficient size to qualify as ‘big’)
  • Experience working with ‘raw’ data exposing you to the data pipeline process you are likely to encounter in the ‘real world’  
  • Demonstrating oral and written communication skills through a formal paper and presentation of project outcomes  
  • Acquisition of team building skills on a long-term, complex, data science project 
  • Addressing an actual client’s need by building a data product that can be shared with the client

Capstone projects have been sponsored by a variety of organizations and industries, including: Capital One, City of Charlottesville, Deloitte Consulting LLP, Metropolitan Museum of Art, MITRE Corporation, a multinational banking firm, The Public Library of Science, S&P Global Market Intelligence, UVA Brain Institute, UVA Center for Diabetes Technology, UVA Health System, U.S. Army Research Laboratory, Virginia Department of Health, Virginia Department of Motor Vehicles, Virginia Office of the Governor, Wikipedia, and more. 

Sponsor a Capstone Project  

View previous examples of capstone projects  and check out answers to frequently asked questions. 

What does the process look like?

  • The School of Data Science periodically puts out a Call for Proposals . Prospective project sponsors submit official proposals, vetted by the Associate Director for Research Development, Capstone Director, and faculty.
  • Sponsors present their projects to students at “Pitch Day” near the start of the Fall term, where students have the opportunity to ask questions.
  • Students individually rank their top project choices. An algorithm sorts students into capstone groups of approximately 3 to 4 students per group.
  • Adjustments are made by hand as necessary to finalize groups.
  • Each group is assigned a faculty mentor, who will meet groups each week in a seminar-style format.

What is the seminar approach to mentoring capstones?

We utilize a seminar approach to managing capstones to provide faculty mentorship and streamlined logistics. This approach involves one mentor supervising three to four loosely related projects and meeting with these groups on a regular basis. Project teams often encounter similar roadblocks and issues so meeting together to share information and report on progress toward key milestones is highly beneficial.

Do all capstone projects have corporate sponsors?

Not necessarily. Generally, each group works with a sponsor from outside the School of Data Science. Some sponsors are corporations, some are nonprofit and governmental organizations, and some are other departments at UVA.

One of the challenges we continue to encounter when curating capstone projects with external sponsors is appropriately scoping and defining a question that is of sufficient depth for our students, obtaining data of sufficient size, obtaining access to the data in sufficient time for adequate analysis to be performed and navigating a myriad of legal issues (including conflicts of interest). While we continue to strive to use sponsored projects and work to solve these issues, we also look for ways to leverage openly available data to solve interesting societal problems which allow students to apply the skills learned throughout the program. While not all capstones have sponsors, all capstones have clients. That is, the work is being done for someone who cares and has investment in the outcome. 

Why do we have to work in groups?

Because data science is a team sport!

All capstone projects are completed through group work. While this requires additional coordination, the collaborative component of the program reflects the way companies expect their employees to work. Building this skill is one of our core learning objectives for the program.

I didn’t get my first choice of capstone project from the algorithm matching. What can I do?

Remember that the point of the capstone projects isn’t the subject matter; it’s the data science. Professional data scientists may find themselves in positions in which they work on topics assigned to them, but they use methods they enjoy and still learn much through the process. That said, there are many ways to tackle a subject, and we are more than happy to work with you to find an approach to the work that most aligns with your interests.

Your ability to influence which project you work on is in the ranking process after “pitch day” and in encouraging your company or department to submit a proposal during the Call for Proposal process. At a minimum it takes several months to work with a sponsor to adequately scope a project, confirm access to the data and put the appropriate legal agreements into place. Before you ever see a project presented on pitch day, a lot of work has taken place to get it to that point!

Can I work on a project for my current employer?

Each spring, we put forward a public call for capstone projects. You are encouraged to share this call widely with your community, including your employer, non-profit organizations, or any entity that might have a big data problem that we can help solve. As a reminder, capstone projects are group projects so the project would require sufficient student interest after ‘pitch day’. In addition, you (the student) cannot serve as the project sponsor (someone else within your employer organization must serve in that capacity).

If my project doesn’t have a corporate sponsor, am I losing out on a career opportunity?

The capstone project will provide you with the opportunity to do relevant, high-quality work which can be included on a resume and discussed during job interviews. The project paper and your code on Github will provide more career opportunities than the sponsor of the project. Although it does happen from time to time, it is rare that capstones lead to a direct job offer with the capstone sponsor's company. Capstone projects are just one networking opportunity available to you in the program.

Capstone Project Reflections From Alumni  

Theo Braimoh, MSDS Online Graduate and Admissions Student Ambassador

"For my Capstone project, I used Python to train machine learning models for visual analysis – also known as computer vision. Computer vision helped my Capstone team analyze the ergonomic posture of workers at risk of developing musculoskeletal injuries. We automated the process, and hope our work further protects the health and safety of people working in the United States.” — Theophilus Braimoh, MSDS Online Program 2023, Admissions Student Ambassador

Haley Egan, MSDS Online 2023 and Admissions Student Ambassador

“My Capstone experience with the ALMA Observatory and NRAO was a pivotal chapter in my UVA Master’s in Data Science journey. It fostered profound growth in my data science expertise and instilled a confidence that I'm ready to make meaningful contributions in the professional realm.” — Haley Egan, MSDS Online Program 2023, Admissions Student Ambassador

Mina Kim, MSDS/PhD 2023

“Our Capstone projects gave us the opportunity to gain new domain knowledge and answer big data questions beyond the classroom setting.” — Mina Kim, MSDS Residential Program 2023, Ph.D. in Psychology Candidate

Capstone Project Reflections From Sponsors  

“For us, the level of expertise, and special expertise, of the capstone students gives us ‘extra legs’ and an extra push to move a project forward. The team was asked to provide a replicable prototype air quality sensor that connected to the Cville Things Network, a free and community supported IoT network in Charlottesville. Their final product was a fantastic example that included clear circuit diagrams for replication by citizen scientists.” — Lucas Ames, Founder, Smart Cville
“Working with students on an exploratory project allowed us to focus on the data part of the problem rather than the business part, while testing with little risk. If our hypothesis falls flat, we gain valuable information; if it is validated or exceeded, we gain valuable information and are a few steps closer to a new product offering than when we started.” — Ellen Loeshelle, Senior Director of Product Management, Clarabridge



Big Data Capstone Project

Further develop your knowledge of big data by applying the skills you have learned to a real-world data science project.

The Big Data Capstone Project will allow you to apply the techniques and theory you have gained from the four courses in this Big Data MicroMasters program to a medium-scale data science project.

Working with organisations and stakeholders of your choice on a real-world dataset, you will further develop your data science skills and knowledge.

This project will give you the opportunity to deepen your learning by giving you valuable experience in evaluating, selecting and applying relevant data science techniques, principles and theory to a data science problem.

This project will see you plan and execute a reasonably substantial project and demonstrate autonomy, initiative and accountability.

You’ll deepen your learning of social and ethical concerns in relation to data science, including an analysis of ethical concerns and ethical frameworks in relation to data selection and data management.

By communicating the knowledge, skills and ideas you have gained to other learners through online collaborative technologies, you will learn valuable communication skills, important for any career. You’ll also deliver a written presentation of your project design, plan, methodologies, and outcomes.

What you'll learn

The Big Data Capstone project will give you the chance to demonstrate practically what you have learned in the Big Data MicroMasters program including:

  • How to evaluate, select and apply data science techniques, principles and theory;
  • How to plan and execute a project;
  • How to work autonomously using your own initiative;
  • How to identify social and ethical concerns around your project;
  • How to develop communication skills using online collaborative technologies.

University pathways

Related degrees from the University of Adelaide


Master of Data Science

This program provides the necessary skills for entering the world of big data and data science.



Bachelor of Computer Science

Want to study at the cutting edge of the growing technological age and build a career that can change the world?


Environmental Monitoring, remote sensing, cyber-physical systems, Engineers for Exploration

E4E Microfaune Project

  • Group members: Jinsong Yang, Qiaochen Sun

Abstract: Human activities such as hunting and human-caused wildfires have become major drivers of biodiversity loss. To understand how anthropogenic activity affects wildlife populations, field biologists use neural-network-driven image classification to extract biodiversity information from camera images. For small animals such as insects and birds, however, cameras work poorly: they rarely capture the movement and activity of animals this small. Passive acoustic monitoring (PAM) has therefore become one of the most popular alternatives: sounds collected through PAM can be used to train machine learning models that track fluctuations in the biodiversity of these small animals (most of them birds). The overall program divides into many parts, and Jinsong and I focus on an intermediate step: generating subsets of audio recordings that have a higher probability of containing a vocalization of interest. This saves our labeling volunteers time and energy, and reduces the time and resources required to gather enough training data for species-level classifiers. This repository performs the same task as AID_NeurIPS_2021; only the data differs, since here we use the Peru data instead of the Coastal_Reserve data.
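As a sketch of the kind of pre-filter the abstract describes, the snippet below (our illustration, not the team's code; the frame size and threshold ratio are arbitrary) flags clips whose loudest frame stands well above the clip's noise floor:

```python
# A minimal energy-based pre-filter for passive acoustic monitoring clips.
# Frames whose RMS energy sits well above the clip's noise floor are more
# likely to contain a vocalization, so those clips get labeled first.

def rms(frame):
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def frames(samples, size=1024):
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def likely_has_vocalization(samples, ratio=3.0):
    energies = sorted(rms(f) for f in frames(samples))
    if not energies:
        return False
    noise_floor = energies[len(energies) // 10] or 1e-12  # ~10th percentile
    return energies[-1] / noise_floor >= ratio  # loudest frame vs. noise floor

def prioritize(clips, ratio=3.0):
    """Keep only the clips worth sending to labeling volunteers."""
    return [c for c in clips if likely_has_vocalization(c, ratio)]
```

A real pipeline would work on spectrogram bands rather than raw RMS, but the shape of the computation (score every clip cheaply, label only the top of the ranking) is the same.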

  • Group members: Harsha Jagarlamudi, Kelly Kong

Eco-Acoustic Event Detection: Classifying temporal presence of birds in recorded bird vocalization audio

  • Group members: Alan Arce, Edmundo Zamora

Abstract: Leveraging deep learning methods to classify the temporal presence of birds in recorded bird vocalization audio, using a hybrid CNN-RNN model trained on audio data, in the interest of benefiting wildlife monitoring and preservation.

Pyrenote - User Profile Design & Accessible Data

  • Group members: Dylan Nelson

Abstract: Pyrenote is a project in development by a growing group of student researchers at UCSD. Its primary purpose is to let anyone contribute to research by labeling data in an intuitive, accessible way. It is currently being used to develop a form of voice recognition for birds: an algorithm that can strongly label data (identifying where in a clip a bird is calling and which bird is making the call). Strong labeling requires a very large labeled dataset. I worked mostly on the user experience side, letting users interact with their labeling in new ways, such as tracking their progress and reaching goals. The User Profile page, developed iteratively as a whole new page for the site, was the primary means of surfacing this data.

Pyrenote Web Developer

  • Group members: Wesley Zhen

Abstract: The website, Pyrenote, is helping scientists track bird populations by identifying them using machine learning classifiers on publicly annotated audio recordings. I have implemented three features over the course of two academic quarters aimed at streamlining user experience and improving scalability. The added scalability will be useful for future projects as we start becoming more ambitious with the number of users we bring to the site.

Spread of Misinformation Online

Who is spreading misinformation and worry on Twitter?

  • Group members: Lehan Li, Ruojia Tao

Abstract: The spread of misinformation over social media poses challenges to daily information intake and exchange. Under the current COVID-19 pandemic especially, misinformation about the disease and vaccination threatens individual wellbeing and general public health, and people's worries grow alongside it, for example over rumored shortages of food and water. This project investigates the spread of misinformation over social media (Twitter) during the COVID-19 pandemic along two main directions. The first is the effect of bot users on the spread of misinformation: what role do bots play in spreading misinformation, and where are they located in the social network? The second is sentiment analysis of users' attitudes towards misinformation: how does sentiment spread through different parts of the network? We also combine the two directions, asking how bot users relate to positive and negative emotions. Since social media users form social networks, the project further examines the effect of network structure on both topics, including how the proportion of bot users and users' attitudes towards misinformation change as the network becomes more concentrated and tightly connected.

Misinformation on Reddit

  • Group members: Samuel Huang, David Aminifard

Abstract: As social media, and Reddit in particular, has grown in popularity, its use for rapidly sharing information by category or topic (subreddits) has had massive implications for how people are exposed to information and for the quality of the information they interact with. While Reddit has its benefits, such as providing instant access to nearly real-time, categorized information, it has possibly played a role in worsening divisions and spreading misinformation. Our results showed that the subreddits with the highest proportions of misinformation posts tend to lean towards politics and news. In addition, despite differences in the frequency of misinformation across subreddits, the average upvote ratio per submission was consistently high, which indicates that subreddits tend to be ideologically homogeneous.

The Spread of YouTube Misinformation Through Twitter

  • Group members: Alisha Sehgal, Anamika Gupta

Abstract: In our Capstone Project, we explore the spread of misinformation online. More specifically, we look at the spread of misinformation across Twitter and YouTube because of the large role these two social media platforms play in the dissemination of news and information. Our main objectives are to understand how YouTube videos contribute to spreading misinformation on Twitter, evaluate how effectively YouTube is removing misinformation and if these policies also prevent users from engaging with misinformation. We take a novel approach of analyzing tweets, YouTube video captions, and other metadata using NLP to determine the presence of misinformation and investigate how individuals interact or spread misinformation. Our research focuses on the domain of public health as this is the subject of many conspiracies, varying opinions, and fake news.

Particle Physics

Understanding Higgs boson particle jets with graph neural networks.

  • Group members: Charul Sharma, Rui Lu, Bryan Ambriz

Abstract: Building on last quarter's work (a deep sets neural network, a fully connected classifier, an adversarial deep sets model, and a designed decorrelated tagger, or DDT), we went further this quarter by trying different graph neural network layers such as GENConv and EdgeConv. Both layers play an important role in boosting the performance of our baseline GNN model. We evaluated performance using ROC (receiver operating characteristic) curves and their AUC (area under the curve). Based on experience from the first project and past work in the particle physics domain, we also added an exploratory data analysis section covering basic theory, bootstrapping, and general properties of our dataset. We have not yet produced final results: the EdgeConv work is finished, and in the coming weeks we will complete GENConv and may try other layers to see whether performance can be improved further.

Predicting a Particle's True Mass

  • Group members: Jayden Lee, Dan Ngo, Isac Lee

Abstract: The Large Hadron Collider (LHC) collides protons traveling near light speed to generate high-energy collisions. These collisions produce new particles and have led to the discovery of elementary particles such as the Higgs boson. One key piece of information to collect from a collision event is the structure of the particle jet, the collective spray of decaying particles traveling in the same direction, since accurately identifying the type of these jets (QCD or signal) plays a crucial role in the discovery of high-energy elementary particles like the Higgs. Several properties determine jet type, with jet mass being one of the strongest indicators in jet type classification. A previous approach to jet mass estimation, called "soft drop declustering," has been one of the most effective methods for making rough estimates of jet mass. With this in mind, we aim to apply machine learning to jet mass estimation through various neural network architectures. With data collected and processed by CERN, we implemented a model that improves jet mass prediction from jet features.

Mathematical Signal Processing (compression of deep nets, or optimization for data-science/ML)

Graph neural networks: graph-neural-network-based recommender systems for Spotify playlists.

  • Group members: Benjamin Becze, Jiayun Wang, Shone Patil

Abstract: With the rise of music streaming services on the internet in the 2010’s, many have moved away from radio stations to streaming services like Spotify and Apple Music. This shift offers more specificity and personalization to users’ listening experiences, especially with the ability to create playlists of whatever songs that they wish. Oftentimes user playlists have a similar genre or theme between each song, and some streaming services like Spotify offer recommendations to expand a user’s existing playlist based on the songs in it. Using Node2vec and GraphSAGE graph neural network methods, we set out to create a recommender system for songs to add to an existing playlist by drawing information from a vast graph of songs we built from playlist co-occurrences. The result is a personalized song recommender based not only on Spotify’s community of playlist creators, but also the specific features within a song.
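A minimal co-occurrence baseline can illustrate the graph the abstract describes (this toy sketch is ours, not the team's Node2vec/GraphSAGE pipeline; those methods learn embeddings of exactly this kind of graph):

```python
# Songs are nodes; an edge's weight is the number of playlists in which
# both songs appear together. Recommendations for a playlist are the
# unseen songs with the strongest total connection to its members.
from collections import defaultdict
from itertools import combinations

def build_graph(playlists):
    weight = defaultdict(int)
    for songs in playlists:
        for a, b in combinations(sorted(set(songs)), 2):
            weight[(a, b)] += 1
            weight[(b, a)] += 1
    return weight

def recommend(weight, playlist, k=3):
    scores = defaultdict(int)
    for (a, b), w in weight.items():
        if a in playlist and b not in playlist:
            scores[b] += w
    ranked = sorted(scores, key=lambda s: (-scores[s], s))  # ties: alphabetical
    return ranked[:k]
```

Embedding methods improve on this by also scoring songs that never directly co-occur with the playlist but sit nearby in the graph.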

Dynamic Stock Industry Classification

  • Group members: Sheng Yang

Abstract: Using graph-based analysis to re-classify stocks in China's A-share market and improve Markowitz portfolio optimization.

NLP, Misinformation

HDSI Faculty Exploration Tool

  • Group members: Martha Yanez, Sijie Liu, Siddhi Patel, Brian Qian

Abstract: The Halıcıoğlu Data Science Institute (HDSI) at the University of California, San Diego is dedicated to the discovery of new methods and the training of students and faculty to use data science to solve real-world problems. HDSI has several industry partners that often seek assistance with their daily activities and need experts in different domain areas. Currently, around 55 professors are affiliated with HDSI; they have diverse research interests and have written numerous papers in their own fields. Our goal was to create a tool that allows HDSI to select the best fit from its faculty, based on their published work, to aid industry partners in their specific endeavors. We did this with natural language processing (NLP), collecting the abstracts of the faculty's published work and organizing them by topic. We then obtained the proportion of each faculty member's papers associated with each topic and drew a relationship between researchers and their most-published topics. This allows HDSI to personalize recommendations of faculty candidates for an industry partner's particular job.
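The matching step might be sketched as follows (a toy illustration with made-up names and topics, not the HDSI tool itself; the real system derives topics from abstracts with NLP first):

```python
# Once each abstract has been assigned a topic, recommending a faculty
# member reduces to comparing per-faculty topic proportions.
from collections import Counter

def topic_proportions(papers):
    """papers: (faculty, topic) pairs -> {faculty: {topic: share of papers}}."""
    counts = {}
    for faculty, topic in papers:
        counts.setdefault(faculty, Counter())[topic] += 1
    return {f: {t: n / sum(c.values()) for t, n in c.items()}
            for f, c in counts.items()}

def best_match(proportions, topic):
    """Faculty member with the largest share of papers on the topic."""
    return max(proportions, key=lambda f: proportions[f].get(topic, 0.0))
```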

  • Group members: Du Xiang

AI in Healthcare, Deep Reinforcement Learning, Trustworthy Machine Learning

Improving robustness in deep fusion modeling against adversarial attacks.

  • Group members: Ayush More, Amy Nguyen

Abstract: Autonomous vehicles rely heavily on deep fusion models, which use multiple input sources for inference and decision making. By drawing on data from several inputs, a deep fusion model benefits from shared information, which is closely tied to robustness since the input sources can face different levels of corruption. It is therefore highly important that the deep fusion models used in autonomous vehicles be robust to corruption, especially in input sources that are weighted more heavily under certain conditions. We explore a different approach to training robustness into a deep fusion model: adversarial training. We fine-tune the model on adversarial examples and evaluate its robustness against single-source noise and other forms of corruption. Our experimental results show that adversarial training improved the robustness of a deep fusion object detector against adversarial noise and Gaussian noise while maintaining performance on clean data. The results also highlighted the lack of robustness of models not trained to handle adversarial examples. We believe this matters given the risks autonomous vehicles pose to pedestrians: the model's inferences and decisions must be robust to corruption, especially when it is intentional and comes from outside threats.

Healthcare: Adversarial Defense In Medical Deep Learning Systems

  • Group members: Rakesh Senthilvelan, Madeline Tjoa

Abstract: To combat adversarial attacks, deep learning models need robust training that protects against the methods these attacks use. In this paper, we look at the fast gradient sign method (FGSM) and projected gradient descent (PGD), two adversarial attack methods that maximize the loss function to push the affected system toward opposing predictions, and we train our models against them to achieve stronger accuracy when faced with adversarial examples.
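The two attack rules can be sketched on a toy linear scorer with an analytic input gradient (our illustration; real attacks backpropagate through a deep network, but the update rules are identical):

```python
# Toy model: loss(x) = (w.x - y)^2, so dLoss/dx_i = 2*(w.x - y)*w_i.
# FGSM takes one signed-gradient step; PGD iterates smaller steps and
# projects back into the eps-ball around the original input.

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def fgsm(x, w, y, eps):
    """One step in the direction that increases the loss fastest."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [xi + eps * sign(2 * err * wi) for xi, wi in zip(x, w)]

def pgd(x, w, y, eps, alpha, steps):
    """Iterated FGSM with step size alpha, clipped to the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(x_adv, w, y, alpha)
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv
```

Adversarial training then mixes examples produced this way into the training set so the model learns to predict correctly on them.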

Satellite image analysis

ML for finance, ML for healthcare, fair ML, ML for science, actionable recourse.

  • Group members: Shweta Kumar, Trevor Tuttle, Takashi Yabuta, Mizuki Kadowaki, Jeffrey Feng

Abstract: American society today encourages a constant reliance on credit, despite credit not being available to everyone as a legal right. Countless methods of evaluating an individual's creditworthiness are currently in practice. In an effort to regulate the selection criteria of different financial institutions, the Equal Credit Opportunity Act (ECOA) requires that applicants denied a loan receive an Adverse Action notice, a statement from the creditor explaining the reason for the denial. However, these notices are frequently unactionable and ineffective at providing feedback that gives an individual recourse: the ability to act upon a reason for denial to raise one's odds of being accepted for a loan. In our project, we explore whether it is possible to create an interactive interface that personalizes adverse action notices in alignment with personal preferences, so that individuals can gain recourse.

Social media; online communities; text analysis; ethics

Finding commonalities in misinformative articles across topics.

  • Group members: Hwang Yu, Maximilian Halvax, Lucas Nguyen

Abstract: To combat the large-scale distribution of misinformation online, we wanted to develop a way to flag news articles that are misinformative and could mislead the general public. Beyond flagging articles, we also wanted to find commonalities in the misinformation we found: do some topics in particular contain more misleading information than others, and how much overlap do these articles have when we break their content down with TF-IDF and examine which words carry the most importance in various misinformation-detection models? We trained our models on four topics: economics, politics, science, and a general dataset encompassing the other three. We found that the general dataset showed the most overlap overall, while the individual topics, though mostly distinct from one another, had models that still emphasized similar words, indicating a possible shared pattern of misinformative language in these articles. From these results, we believe a pattern can be found that directs further investigation into how misinformation is written and distributed online.
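The TF-IDF step the abstract mentions can be sketched in a few lines (toy tokens, not the project's data or models):

```python
# Weight each term by its frequency within an article and its rarity
# across the corpus, then compare top-weighted terms across topics.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> one {term: weight} dict per document."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

def top_terms(weights, k):
    """The k highest-weighted terms of one document."""
    return {t for t, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:k]}
```

Comparing `top_terms` sets across topic corpora is one simple way to measure the vocabulary overlap the project describes.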

The Effect of Twitter Cancel Culture on the Music Industry

  • Group members: Peter Wu, Nikitha Gopal, Abigail Velasquez

Abstract: Musicians often trend on social media for various reasons, but recent years have seen a rise in musicians being "canceled" for offensive or socially unacceptable behavior. Because social media is so widely accessible, the masses are able to hold musicians accountable for their actions through "cancel culture," a form of modern ostracism. Twitter has become a well-known platform for cancel culture, as users can easily spread hashtags and see what's trending, which also has the potential to facilitate the spread of toxicity. We analyze how public sentiment towards canceled musicians on Twitter changes with respect to the type of issue they were canceled for, their background, and the strength of their parasocial relationship with their fans. Through our research, we aim to determine whether cancel culture leads to an increase in toxicity and negative sentiment towards a canceled individual.

Analyzing single cell multimodality data via (coupled) autoencoder neural networks

Coupled autoencoders for single-cell data analysis.

  • Group members: Alex Nguyen, Brian Vi

Abstract: Historically, analysis of single-cell data has been difficult because data collection methods often destroy the cell in the process of measurement. An ongoing endeavor in biological data science is to analyze the different modalities, or forms, of the genetic information within a cell; doing so gives modern medicine a greater understanding of cellular functions and of how cells behave in the context of illness. The three modalities of DNA, RNA, and protein can now be measured safely, and because they carry the same information in different forms, analysis of one can be extrapolated to understand the cell as a whole. Previous research by Gala, R., Budzillo, A., Baftizadeh, F., et al. captured gene expression in neuron cells with a neural network called a coupled autoencoder, a framework that reconstructs its inputs, allows prediction of one input from another, and aligns the multiple inputs in the same low-dimensional representation. In our paper, we build on this coupled autoencoder with a dataset of cells taken from several sites of the human body, predicting protein from RNA information. We find that the autoencoder adequately clusters the cell types in its lower-dimensional representation and performs decently on the prediction task, showing that it is a powerful tool and may prove to be a valuable asset in single-cell data analysis.
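The coupled-autoencoder objective can be written schematically (our notation, a sketch rather than the exact loss from Gala et al.): one reconstruction term per modality plus a coupling term that pulls the two latent codes for the same cell together:

```latex
\mathcal{L}
= \underbrace{\bigl\lVert x_{\mathrm{rna}} - d_{\mathrm{rna}}\!\bigl(e_{\mathrm{rna}}(x_{\mathrm{rna}})\bigr)\bigr\rVert^{2}
+ \bigl\lVert x_{\mathrm{prot}} - d_{\mathrm{prot}}\!\bigl(e_{\mathrm{prot}}(x_{\mathrm{prot}})\bigr)\bigr\rVert^{2}}_{\text{reconstruction}}
+ \lambda\,\underbrace{\bigl\lVert e_{\mathrm{rna}}(x_{\mathrm{rna}}) - e_{\mathrm{prot}}(x_{\mathrm{prot}})\bigr\rVert^{2}}_{\text{coupling}}
```

Because the coupling term aligns both encoders in one latent space, RNA-to-protein prediction is simply the composition $d_{\mathrm{prot}}(e_{\mathrm{rna}}(x_{\mathrm{rna}}))$.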

Machine Learning, Natural Language Processing

On evaluating the robustness of language models with tuning.

  • Group members: Lechuan Wang, Colin Wang, Yutong Luo

Abstract: Prompt tuning and prefix tuning are two effective mechanisms for leveraging frozen language models to perform downstream tasks. Robustness reflects the resilience of a model's output to changes or noise in the input. In this project, we analyze the robustness of natural language models under various tuning methods with respect to a domain shift (i.e., training on one domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning to T5 models for reading comprehension (question answering) and GPT-2 models for table-to-text generation.

Activity Based Travel Models and Feature Selection

A tree-based approach to activity-based travel models and feature selection.

  • Group members: Lisa Kuwahara, Ruiqin Li, Sophia Lau

Abstract: In a previous study, Deloitte Consulting LLP developed a method of creating city simulations through cellular location and geospatial data. Using these simulations of human activity and traffic patterns, better decisions can be made regarding modes of transportation or road construction. However, the current commonly used method of estimating transportation mode choice is a utility model that involves many features and coefficients that may not necessarily be important but still make the model more complex. Instead, we used a tree-based approach - in particular, XGBoost - to identify just the features that are important for determining mode choice so that we can create a model that is simpler, robust, and easily deployable, in addition to performing better than the original utility model on both the full dataset and population subsets.
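The principle behind tree-based feature selection can be sketched with a single decision stump (our illustration, not XGBoost itself, which aggregates the same kind of split gain across many boosted trees):

```python
# A feature's importance is the best impurity reduction any single split
# on it can achieve; features whose best split barely reduces impurity
# are candidates for removal from the mode-choice model.

def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of positive labels
    return 2 * p * (1 - p)

def split_gain(xs, ys, threshold):
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(ys) - children

def importance(xs, ys):
    thresholds = sorted(set(xs))[:-1]  # a constant column has no valid split
    return max((split_gain(xs, ys, t) for t in thresholds), default=0.0)
```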

Explainable AI, Causal Inference

Explainable AI.

  • Group members: Jerry Chan, Apoorv Pochiraju, Zhendong Wang, Yujie Zhang

Abstract: Nowadays, algorithmic decision-making systems are very common in people's daily lives. Gradually, some algorithms have become too complex for humans to interpret, such as black-box machine learning models and deep neural networks. To assess the fairness of these models and make them better tools for different parties, we need explainable AI (XAI) to uncover the reasoning behind the predictions of black-box models. In our project, we focus on using techniques from causal inference and explainable AI to interpret various classification models across several domains. In particular, we are interested in three domains: healthcare, finance, and the housing market. Within each domain, we first train four binary classification models, and we have four general goals: 1) explaining black-box models both globally and locally with various XAI methods; 2) assessing the fairness of each learning algorithm with regard to different sensitive attributes; 3) generating recourse for individuals, a set of minimal actions that changes the prediction of a black-box model; and 4) evaluating the explanations from the XAI methods using domain knowledge.

AutoML Platforms

Deep learning transformer models for feature type inference.

  • Group members: Andrew Shen, Tanveer Mittal

Abstract: The first step AutoML software must take after loading data is to identify the feature types of the individual columns, which lets the software understand the data and preprocess it so machine learning algorithms can run on it. Project SortingHat of the ADA lab at UCSD frames this task of feature type inference as a multiclass classification problem. The machine learning models in the original SortingHat feature type inference paper use three sets of input features: (1) the name of the column, (2) five non-null sample values, and (3) descriptive numeric statistics about the column. The textual features are easy to access, but the descriptive statistics require a full pass through the data, which makes preprocessing less scalable. Our goal is to produce models that rely less on these statistics by better leveraging the textual features. As an extension of Project SortingHat, we experimented with deep learning transformer models and with varying the sample sizes used by random forest models. Our transformer models achieved state-of-the-art results on this task, outperforming all existing tools and ML models benchmarked against SortingHat's ML Data Prep Zoo. Our best model uses a pretrained Bidirectional Encoder Representations from Transformers (BERT) language model to produce word embeddings, which are then processed by a convolutional neural network (CNN). As a result of this project, we have published two BERT-CNN models through the PyTorch Hub API, so that software engineers can easily integrate our models, or train similar ones, for AutoML platforms and other automated data preparation applications. Our best model uses all the features defined above; the other uses only column names and sample values while offering comparable performance and much better scalability for all input data.
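The task framing can be sketched as follows (our toy hand-written features and rule standing in for the learned BERT-CNN model):

```python
# Infer a column's feature type from only its name and a few sample
# values, the cheap signals that avoid a full pass over the data.

def featurize(name, samples):
    # Character trigrams of the column name ("age", "price", "zip", ...)
    feats = {f"trigram:{name[i:i + 3].lower()}": 1
             for i in range(max(1, len(name) - 2))}
    feats["frac_numeric"] = (sum(s.replace(".", "", 1).lstrip("-").isdigit()
                                 for s in samples) / len(samples))
    feats["avg_len"] = sum(len(s) for s in samples) / len(samples)
    return feats

def guess_type(feats):
    # Stand-in for the learned classifier: numeric-looking samples win.
    return "numeric" if feats["frac_numeric"] > 0.5 else "categorical"
```

The real systems feed such signals (or raw text) into a trained multiclass model over many feature types; the point here is only that name plus samples, with no full-column statistics, already carry most of the signal.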

Exploring Noise in Data: Applications to ML Models

  • Group members: Cheolmin Hwang, Amelia Kawasaki, Robert Dunn

Abstract: In machine learning, models are commonly built to avoid what is known as overfitting: fitting the model exactly to the training data, which is generally expected to hurt accuracy on unseen examples. To generalize beyond a given training set, models are therefore built with techniques that avoid fitting the data exactly. However, overfitting does not always behave the way one might expect, as we show by fitting models to data with controlled levels of noise. Specifically, some models fit exactly to data with high levels of noise still produce results with high accuracy, whereas others are more prone to the expected failure mode of overfitting.
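A deterministic toy version of this phenomenon (our illustration, not the study's models): a 1-nearest-neighbor classifier interpolates its training labels exactly, noise included, yet still predicts correctly away from the corrupted point.

```python
# One-dimensional 1-NN: class 0 lives on [0, 0.9], class 1 on [1.0, 1.9],
# and a single training label is flipped to simulate label noise.

def one_nn(train, x):
    """Label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

clean = [(i / 10, 0) for i in range(10)] + [(1 + i / 10, 1) for i in range(10)]
noisy = [(x, 1 if x == 0.5 else y) for x, y in clean]  # flip one label
```

The interpolating model reproduces the noise at the corrupted point but remains accurate elsewhere, which is the "benign" side of overfitting the abstract alludes to.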

Group Testing for Optimizing COVID-19 Testing

COVID-19 group testing optimization strategies.

  • Group members: Mengfan Chen, Jeffrey Chu, Vincent Lee, Ethan Dinh-Luong

Abstract: The COVID-19 pandemic, which has persisted for more than two years, has been combated by efficient testing strategies that reliably identify positive individuals to slow the spread of disease. Unlike other pooling strategies in the domain, the methods described in this paper prioritize true negative samples over overall accuracy. In Monte Carlo simulations, both nonadaptive and adaptive testing strategies with random pool sampling reached accuracies of at least 95% across varying pool and population sizes while decreasing the number of tests given. A split tensor rank-2 method attempts to identify all infected samples among 961 samples, converging to 99 tests as the prevalence of infection converges to 1%.
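For contrast with the paper's split-tensor method, the simplest pooling scheme, classic two-stage Dorfman pooling, already shows why pooling saves tests at low prevalence (our sketch, assuming a perfectly accurate test):

```python
# Negative pools clear every member with a single test; positive pools
# are retested individually in a second stage.

def dorfman(statuses, pool_size):
    """statuses: bools (True = infected) -> (recovered statuses, tests used)."""
    tests, results = 0, []
    for i in range(0, len(statuses), pool_size):
        pool = statuses[i:i + pool_size]
        tests += 1                      # one test for the pooled sample
        if any(pool):
            tests += len(pool)          # stage two: test each member
            results.extend(pool)
        else:
            results.extend([False] * len(pool))
    return results, tests
```

At roughly 1% prevalence with pools of 31 over 961 samples, most pools test negative, so the total test count lands in the low hundreds instead of 961, while every status is still recovered exactly.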

Causal Discovery

Patterns of fairness in machine learning.

  • Group members: Daniel Tong, Anne Xu, Praveen Nair

Abstract: Machine learning tools are increasingly used for decision-making in contexts that have crucial ramifications. However, a growing body of research has established that machine learning models are not immune to bias, especially on protected characteristics. This has led to efforts to create mathematical definitions of fairness that could be used to estimate whether, given a prediction task and a certain protected attribute, an algorithm is being fair to members of all classes. But just as philosophical definitions of fairness vary widely, mathematical definitions of fairness vary as well, and fairness conditions can in fact be mutually exclusive. In addition, the choice of model to use to optimize fairness is a difficult decision we have little intuition for. Consequently, our capstone project centers on an empirical analysis of the relationships between machine learning models, datasets, and various fairness metrics. We produce a 3-dimensional matrix of the performance of a certain machine learning model, for a certain definition of fairness, on a certain dataset. Using this matrix on a sample of 8 datasets, 7 classification models, and 9 fairness metrics, we discover empirical relationships between model type and performance on specific metrics, in addition to correlations between metric values across different dataset-model pairs. We also offer a website and command-line interface for users to perform this experimentation on their own datasets.
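One cell of such a model-by-dataset-by-metric matrix can be sketched directly, for example demographic parity difference (our illustration; the project evaluates 9 metrics, many of which cannot all be satisfied at once):

```python
# Demographic parity difference: the gap in positive-prediction rates
# between two groups. Zero means the model predicts positives at the
# same rate for both groups.

def demographic_parity_diff(preds, groups):
    """preds: 0/1 predictions; groups: group id per example (two groups)."""
    def rate(g):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(members) / len(members)
    a, b = sorted(set(groups))
    return abs(rate(a) - rate(b))
```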

Causal Effects of Socioeconomic and Political Factors on Life Expectancy in 166 Different Countries

  • Group members: Adam Kreitzman, Maxwell Levitt, Emily Ramond

Abstract: This project examines causal relationships between various socioeconomic variables and life expectancy outcomes in 166 countries. It can account for new, unseen data and variables through an intuitive data pipeline with detailed instructions, and it uses the PC algorithm with updated code to handle missingness in the data. With this model and pipeline, questions such as "do authoritarian countries have a direct relation to life expectancy?" or "how does women's representation in government affect perceived social support?" can now be answered and understood. Through our own analysis, we found intriguing results, such as that a higher Perception of Corruption is distinctly related to a lower Life Ladder score, and that higher quality-of-life perceptions are related to lower economic inequality. These results aim to educate not only the general public but government officials as well.

Time series analysis in health

Time series analysis on the effect of light exposure on sleep quality.

  • Group members: Shubham Kaushal, Yuxiang Hu, Alex Liu

Abstract: The increase in artificial light exposure through the growing prevalence of technology has an effect on the human sleep cycle and circadian rhythm. The goal of this project is to determine how different colors and intensities of light exposure prior to sleep affect sleep quality, through the classification of time series data.

Sleep Stage Classification for Patients With Sleep Apnea

  • Group members: Kevin Chin, Yilan Guo, Shaheen Daneshvar

Abstract: Sleep is not uniform and consists of four stages: N1, N2, N3, and REM sleep. The analysis of sleep stages is essential for understanding and diagnosing sleep-related diseases such as insomnia, narcolepsy, and sleep apnea; however, sleep stage classification often does not generalize to patients with sleep apnea. The goal of our project is to build a sleep stage classifier specifically for people with sleep apnea and to understand how it differs from classification of normal sleep. We will then explore whether the inclusion and featurization of ECG data improves the performance of our model.

Environmental health exposures & pollution modeling & land-use change dynamics

Supervised classification approach to wildfire mapping in Northern California.

  • Group members: Alice Lu, Oscar Jimenez, Anthony Chi, Jaskaranpal Singh

Abstract: Burn severity maps are an important tool for understanding fire damage and managing forest recovery. We have identified several issues with current mapping methods used by federal agencies that affect the completeness, consistency, and efficiency of their burn severity maps. To address these issues, we demonstrate the use of machine learning as an alternative to traditional methods of producing severity maps, which rely on in-situ data and spectral indices derived from image algebra. We trained several supervised classifiers on sample data collected from 17 wildfires across Northern California and evaluated their performance at mapping fire severity.
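For context, the image-algebra baseline the team compares against typically derives severity from the differenced Normalized Burn Ratio (dNBR) of pre- and post-fire imagery. A minimal sketch; the class breakpoints follow the commonly cited USGS/Key–Benson ranges, though exact thresholds vary by study:

```python
import numpy as np

def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    return (nir - swir) / (nir + swir)

def dnbr_severity(pre_nir, pre_swir, post_nir, post_swir):
    """Classify per-pixel burn severity from differenced NBR.

    Returns integer classes 0..4: unburned, low, moderate-low,
    moderate-high, high severity (breakpoints are the common USGS ranges)."""
    dnbr = nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)
    bins = np.array([0.1, 0.27, 0.44, 0.66])
    return np.digitize(dnbr, bins)
```

A machine-learning classifier replaces the fixed thresholds above with boundaries learned from labeled fire-severity samples, which is the substitution this project evaluates.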

Network Performance Classification

Network signal anomaly detection.

  • Group members: Laura Diao, Benjamin Sam, Jenna Yang

Abstract: Network degradation occurs in many forms, and our project will focus on two common factors: packet loss and latency. Packet loss occurs when one or more data packets transmitted across a computer network fail to reach their destination. Latency can be defined as a measure of delay for data to transmit across a network. For internet users, high rates of packet loss and significant latency can manifest in jitter or lag, which are indicators of overall poor network performance as perceived by the end user. Thus, when issues arise in these two factors, it would be beneficial for internet service providers to know exactly when the user is experiencing problems in real time. In real world scenarios, situations or environments such as poor port quality, overloaded ports, network congestion and more can impact overall network performance. In order to detect some of these issues in network transmission data, we built an anomaly detection system that predicts the estimated packet loss and latency of a connection and detects whether there is a significant degradation of network quality for the duration of the connection.

Real Time Anomaly Detection in Networks

  • Group members: Justin Harsono, Charlie Tran, Tatum Maston

Abstract: Internet companies are expected to deliver the speed their customers have paid for. However, for various reasons such as congestion or connectivity issues, it is inevitable that users will perceive degradations in network quality. To ensure the customer remains satisfied, monitoring systems must be built to inspect the quality of the connection. Our goal is to build a model that can detect, in real time, these regions of network degradation, so that an appropriate recovery can be enacted to offset them. Our solution is a combination of two anomaly detection methods that successfully detects shifts in the data, based on a rolling window of data it has seen.
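A minimal sketch of the rolling-window idea on a single metric (e.g. latency samples): flag any point that deviates from the trailing window's mean by several standard deviations. This is only one detector; the project combines two methods, which are not reproduced here:

```python
import numpy as np

def rolling_zscore_anomalies(series, window=50, threshold=4.0):
    """Flag points deviating from a trailing-window mean by more than
    `threshold` standard deviations. The window excludes the point being
    tested, so a spike cannot mask itself."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        win = series[i - window:i]
        mu, sigma = win.mean(), win.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags
```

Because each decision uses only past samples, the same loop can run online as measurements stream in, which is what "real time" detection requires.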

System Usage Reporting

Intel telemetry: data collection & time-series prediction of app usage.

  • Group members: Srikar Prayaga, Andrew Chin, Arjun Sawhney

Abstract: Despite advancements in hardware technology, PC users continue to face frustrating app launch times, especially on lower-end Windows machines. The desktop experience differs vastly from the instantaneous app launches and optimized experience we have come to expect even from low-end smartphones. We propose a solution that preemptively runs Windows apps in the background based on the app usage patterns of the user. Our solution is two-step. First, we built telemetry collector modules in C/C++ to collect real-world app usage data from two of our personal Windows 10 devices. Next, we developed neural network models in Python, trained on the collected data, to predict app usage times and corresponding launch sequences. We achieved impressive results on selected evaluation metrics across different user profiles.

Predicting Application Use to Reduce User Wait Time

  • Group members: Sasami Scott, Timothy Tran, Andy Do

Abstract: Our goal for this project was to lower user wait time when loading programs by predicting the next application to be used. To obtain the needed data, we created data collection libraries. Using this data, we created a Hidden Markov Model (HMM) and a Long Short-Term Memory (LSTM) model, and the latter proved better. Using the LSTM, we can predict application use times and expand this concept to more applications. We created multiple LSTM models with varying results, and ultimately chose the model that reported 90% accuracy.

INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection

  • Group members: Jared Thach, Hiroki Hoshida, Cyril Gorlla

Abstract: As the power of modern computing devices increases, so too do user expectations for them. Despite advancements in technology, computer users are often faced with the dreaded spinning icon waiting for an application to load. Building upon our previous work developing data collectors with the Intel System Usage Reporting (SUR) SDK, we introduce INTELlinext, a comprehensive solution for next-app prediction for application preload to improve perceived system fluidity. We develop a Hidden Markov Model (HMM) for prediction of the k most likely next apps, achieving an accuracy of 64% when k = 3. We then implement a long short-term memory (LSTM) model to predict the total duration that applications will be used. After hyperparameter optimization leading to an optimal lookback value of 5 previous applications, we are able to predict the usage time of a given application with a mean absolute error of ~45 seconds. Our work constitutes a promising comprehensive application preload solution with data collection based on the Intel SUR SDK and prediction with machine learning.
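As a rough illustration of the next-app prediction task (not the authors' HMM or LSTM), even a first-order Markov baseline built from transition counts supports top-k prediction; the app names below are made up:

```python
from collections import Counter, defaultdict

def fit_transitions(app_sequence):
    """Count first-order transitions app -> next app."""
    trans = defaultdict(Counter)
    for cur, nxt in zip(app_sequence, app_sequence[1:]):
        trans[cur][nxt] += 1
    return trans

def predict_top_k(trans, current_app, k=3):
    """Return the k most frequently observed next apps after `current_app`."""
    return [app for app, _ in trans[current_app].most_common(k)]

# Hypothetical launch sequence collected by a telemetry module.
usage = ["browser", "editor", "browser", "mail", "browser", "editor",
         "terminal", "editor", "browser", "editor"]
model = fit_transitions(usage)
print(predict_top_k(model, "browser", k=2))
```

An HMM generalizes this by introducing hidden usage states (e.g. "work session" vs "break"), so the distribution over next apps depends on inferred context rather than only on the last app launched.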


NYU Center for Data Science

Harnessing Data’s Potential for the World

Master’s in Data Science

  • Industry Concentration
  • Admission Requirements
  • Capstone Project
  • Summer Research Initiative
  • Financial Aid
  • MS Admissions Ambassadors
  • Summer Initiative

CDS master’s students have a unique opportunity to solve real-world problems through the capstone course in the final year of their program. The capstone course is designed to put knowledge into practice and to develop critical skills such as problem-solving and collaboration.

Students are matched with research labs within the NYU community and with industry partners to investigate pressing issues, applying data science to the following areas:

  • Probability and statistical analyses
  • Natural language processing
  • Big Data analysis and modeling
  • Machine learning and computational statistics
  • Coding and software engineering
  • Visualization modeling
  • Neural networks
  • Signal processing
  • High dimensional statistics

Capstone projects present students with the opportunity to work in their field of interest and gain exposure to applicable solutions. Project sponsors, NYU labs, and external partners, in turn, receive the benefit of having a new perspective applied to their projects.

“Capstone is a unique opportunity for students to solve real world problems through projects carried out in collaboration with industry partners or research labs within the NYU community,” says capstone advisor and CDS Research Fellow Anastasios Noulas. “It is a vital experience for students ahead of their graduation and prior to entering the market, as it helps them improve their skills, especially in problem solving contexts that are atypical compared to standard courses offered in the curriculum. Cooperation within teams is another crucial skill built through the Capstone experience as projects are typically run across groups of 2 to 4 people.”

The Capstone Project offers organizations the opportunity to propose a project that our graduate students will work on as part of their curriculum for one semester. Information on the course, along with a questionnaire to propose a project, can be found on the Capstone Fall 2024 Project Submission Form. If you have any questions, please reach out to [email protected] .

Best Fall 2023 Capstone Posters


  • Multimodal NLP for M&A Agreements

Student Authors: Harsh Asrani, Chaitali Joshi, Tayyibah Khanam, Ansh Riyal | Project Mentors: Vlad Kobzar, Kyunghyun Cho


  • Partisan Bias and the US Federal Court System

Student Authors: Annabelle Huether, Mary Nwangwu, Allison Redfern | Project Mentors: Aaron Kaufman, Jon Rogowski

Best Fall 2023 Student Voted Posters


  • User-Centric AI Models for Assisting the Blind

Student Authors: Gail Batutis, Aradhita Bhandari, Aryan Jain, Mallory Sico | Project Mentors: Giles Hamilton-Fletcher, Chen Feng, Kevin C. Chan


  • Multi-Modal Foundation Models for Medicine

Student Authors: Yunming Chen, Harry Huang, Jordan Tian, Ning Yang | Project Mentors: Narges Razavian

Best Fall 2023 Student Voted Runner-Up Posters


  • Representational geometry of learning rules in neural networks

Student Authors: Ghana Bandi, Shiyu Ling, Shreemayi Sonti, Zoe Xiao | Project Mentors: SueYeon Chung, Chi-Ning Chou


  • Medical Data Leakage with Multi-site Collaborative Training

Student Authors: Christine Gao, Ciel Wang, Yuqi Zhang | Project Mentors: Qi Lei

Fall 2023 Capstone Project List

  • Segmentation of Metastatic Brain Tumors Using Deep Learning
  • Discovering misinformation narratives from suspended tweets using embedding-based clustering algorithms
  • Network Intrusion Detection Systems using Machine Learning
  • Knowledge Extraction from Pathology Reports Using LLMs
  • Building an Interactive Browser for Epigenomic & Functional Maps from the Viewpoint of Disease Association
  • Prediction of Acute Pancreatitis Severity Using CT Imaging and Deep Learning
  • User-centric AI models for assisting the blind
  • A machine learning model to predict future kidney function in patients undergoing treatment for kidney masses
  • Fine-Tuning of MedSAM for the Automated Segmentation of Musculoskeletal MRI for Bone Topology Evaluation and Radiomic Analysis
  • Online News Content Neural Network Recommendation Engine
  • Explanatory Modeling for Website Traffic Movements
  • Egocentric video zero-shot object detection
  • Leverage OncoKB’s Curated Literature Database to Build an NLP Biomarker Identifier
  • Improving Out-of-Distribution Generalization in Neural Models for Astrophysics and Cosmology
  • Preparing a Flood Risk Index for the State of Assam, India
  • Causal GANs
  • Bringing Structure to Emergent Taxonomies from Open-Ended CMS Tags
  • Social Network Analysis of Hospital Communication Networks
  • Multimodal Question Answering
  • Does resolution matter for transfer learning with satellite imagery?
  • Measuring Optimizer-Agnostic Hyperparameter Tuning Difficulty
  • Extracting causal political narratives from text.
  • Designing Principled Training Methods for Deep Neural Networks
  • Multimodal NLP for M&A Agreements
  • Using Deep Learning to Solve Forward-Backward Stochastic Differential Equations
  • OptiComm: Maximizing Medical Communication Success with Advanced Analytics
  • Automated assessment of epilepsy subtypes using patient-generated language data
  • Predicting cancer drug response of patients from their alteration and clinical data
  • Identify & Summarize top key events for a given company from News Data using ML and NLP Models
  • Developing predictive shooting accuracy metric(s) for First-Person-Shooter esports
  • Supporting Student Success through Pipeline Curricular Analysis
  • Transformers for Electronic Health Records
  • Build Models for Multilingual Medical Coding
  • Metadata Extraction from Spoken Interactions Between Mothers and Young Children
  • Uncertainty Radius Selection in Distributionally Robust Portfolio Optimization
  • Unveiling Insights into Employee Benefit Plans and Insurance Dynamics
  • Advanced Name Screening and Entity Linking Using large language models
  • What Keeps the Public Safe While Avoiding Excessive Use of Incarceration? Supporting Data-Centered Decisionmaking in a DA’s Office
  • Foundation Models for Brain Imaging
  • Housing Price Forecasting – Alternative Approaches
  • Evaluating the Capability of Large Language Models to Measure Psychiatric Functioning
  • Predicting year-end success using deep neural network (DNN) architecture

Best Fall 2022 Capstone Posters


  • Leveraging Computer Vision to Map Cell Tower Locations to Enhance School Connectivity

Student Authors: Lorena Piedras, Priya Dhond, and Alejandro Sáez | Mentors: Iyke Derek Maduako (UNICEF)


  • Neural Re-Ranking for Personalized Home Search

Student Authors: Giacomo Bugli, Luigi Noto, Guilherme Albertini | Mentors: Shourabh Rawat, Niranjan Krishna, and Andreas Rubin-Schwarz


  • Sequence Modeling for Query Understanding & Conversational Search

Student Authors: Lucas Tao, Evelyn Wang, Jun Wang, Cecilia Wu | Mentors: Amir Rahmani, Arun Balagopalan, Shourabh Rawat, and Najoung Kim


  • Solving challenging video games in human-like ways

Student Authors: Brian Pennisi, Jiawen Wu, Adeet Patel, and Sarvesh Patki | Mentors: Todd Gureckis (NYU)

Best Fall 2022 Student Voted Posters


  • Deep Learning Framework for Segmentation of Medical Images

Student Authors: Luoyao Chen, Mei Chen, Jinqian Pan | Mentors: Jacopo Cirrone (NYU)


  • Galaxy Dataset Distillation

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Julia Kempe (NYU)

Best Fall 2022 Runner-Up Posters


  • Dementia Detection from FLAIR MRI via Deep Learning

Student Authors: Jiawen Fan, Aiqing Li | Mentors: Narges Razavian (NYU Langone)


  • Ego4d NLQ: Egocentric Visual Learning of Representations and Episodic Memory

Student Authors: Dongdong Sun; Rui Chen; Ying Wang | Mentors: Mengye Ren (NYU)


  • Learning User Representations from Zillow Search Sessions using Transformer Architectures

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Shourabh Rawat (Zillow Group)


  • Methane Emission Quantification through Satellite Images

Student Authors: Alex Herron, Dhruv Saxena, Xiangyue Wang | Mentors: Robert Huppertz (orbio.earth)

Fall 2022 Capstone Project List

  • Data Science for Clinical Decision-making Support in Radiation Therapy
  • Using Voter File Data to Study Electoral Reform
  • Creating an Epigenomic Map of the Heart
  • Career Recommendation
  • Calibrating for Class Weights
  • Assigning Locations to Detected Stops using LSTM
  • Impact of YMCA Facilities on the Local Neighborhoods of Bronx
  • Powering SMS Product Recommendations with Deep Learning
  • Evaluation and Performance Comparison of Two Models in Classifying Cosmological Simulation Parameters
  • Crypto Anomaly Detection
  • Sequence Modeling for Query Understanding & Conversational Search
  • Multi-Modal Graph Inductive Learning with CLIP Embeddings
  • Multimodal Contract Segmentation
  • Extraction of Causal Narratives from News Articles
  • Detecting Erroneous Geospatial Data
  • Improving Speech Recognition Performance using Synthetic Data
  • Multi-document Summarization for News Events
  • Multi-task learning in orthogonal low dimensional parameter manifolds
  • Let’s Go Shopping: An Investigation Into a New Bimodal E-Commerce Dataset
  • Training AI to recognize objects of interest to the blind community
  • Classify Classroom Activities using Ambient Sound
  • Database and Dashboard for RII
  • Bitcoin Price Prediction Using Machine Learning Models
  • Context Driven Approach to Detecting Cross-Platform Coordinated Influence Campaigns
  • Invalid Traffic Detection Model Deployment
  • Recalled Experiences of Death: Using Transformers to Understand Experiences and Themes
  • Context-Based Content Extraction & Summarization from News Articles
  • Neural Learning to Rank for Personalized Home Search
  • Improve Speech Recognition Performance Using Unpaired Audio and Text
  • Data Normalization & Generalization to Population Metrics
  • Automated Judicial Case Briefing
  • Cyber Threat Detection for News Articles
  • MLS Fan Segmentation
  • Near Real-Time Estimation of Beef and Dairy Feedlot Greenhouse Gas Emissions
  • Do Better Batters Face Higher or Lower Quality Pitches?

Previous Capstone Projects

Best Fall 2021 Capstone Posters


  • Question Answering on Long Context

Student Authors: Xinli Gu, Di He, Congyun Jin | Project Mentor: Jocelyn Beauchesne (Hyperscience)


  • Multimodal Self-Supervised Deep Learning with Chest X-Rays and EHR Data

Student Authors: Adhham Zaatri, Emily Mui, Yechan Lew | Project Mentor: Sumit Chopra (NYU Langone)


  • Head and Neck CT Segmentation Using Deep Learning

Student Authors: Pengyun Ding, Tianyu Zhang | Project Mentor: Ye Yuan (NYU Langone)


  • 3D Astrophysical Simulation with Transformer

Student Authors: Elliot Dang, Tong Li, Zheyuan Hu | Project Mentor: Shirley Ho (Flatiron Institute)


  • Multimodal Representations for Document Understanding (Best Student Voted Poster)

Student Authors: Pavel Gladkevich, David Trakhtenberg, Ted Xie, Duey Xu | Project Mentor: Shourabh Rawat (Zillow Group)

2021 Capstone Project List

  • Accelerated Learning in the Context of Language Acquisition
  • Analysis of Cardiac Signals on Patients with Atrial Fibrillation
  • Applications of Neural Radiance Fields in Astronomy
  • Automatic Detection of Alzheimer’s Disease with Multi-Modal Fusion of Clinical MRI Scans
  • Automatic Transcription of Speech on SAYCam
  • Automatic Volumetric Segmentation of Brain Tumor Using Deep Learning for Radiation Oncology
  • Automatically Identify Applicants Who Require Physician’s Reports
  • Building a Question-Answer Generation Pipeline for The New York Times
  • Coupled Energy-Based Models and Normalizing Flows for Unsupervised Learning
  • Data Classification Processing for Clinical Decision-making Support in Radiation Therapy
  • Deep Active Learning for Protest Detection
  • Estimating Intracranial Pressure Using OCT Scans of the Eyeball
  • Graph Neural Networks for Electronic Health Record (EHR) Data
  • Head and Neck CT Image Segmentation
  • Head Movement Measurement During Structural MRI
  • Image Segmentation for Vestibular Schwannoma
  • Investigation into the Functionality of Key, Query, Value Sub-modules of a Transformer
  • Know Your Worth: An Analysis of Job Salaries
  • Machine learning-based computational phenotyping of electronic health records
  • Modeling the Speed Accuracy Tradeoff in Decision-Making
  • Multi-modal Breast Cancer Detection
  • Multi-Modal Deep Learning with Medical Images and EHR Data
  • Multimodal Representations for Document Understanding
  • Nematode Counting
  • News Clustering and Summarization
  • Post-surgical resection mapping in epilepsy using CNNs
  • Predicting Grandstanding in the Supreme Court through Speech
  • Predicting Probability of Post-Colectomy Hospital Readmission
  • Prediction of Total Knee Replacement Using Radiographs and Clinical Risk Factors
  • Reinforcement Learning for Option Hedging
  • Representation Learning Regarding RNA-RBP Binding
  • Self-Supervised Learning of Medical Image Representations Using Radiology Reports
  • The Study of American Public Policy with NLP
  • Topical Aggregation and Timeline Extraction on the NYT Corpus
  • Unsupervised Deep Denoiser for Electron-Microscope Data
  • Using Deep Learning and FBSDEs to Solve Option Pricing and Trading Problems
  • Vision Language Models for Real Estate Images and Descriptions

Featured 2020 Capstone Projects


Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

By Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, Edmilson Morais


Accented Speech Recognition Inspired by Human Perception

By Xiangyun Chu, Elizabeth Combs, Amber Wang, Michael Picheny


Diarization of Legal Proceedings: Identifying and Transcribing Judicial Speech from Recorded Court Audio

By Jeffrey Tumminia, Amanda Kuznecov, Sophia Tsilerides, Ilana Weinstein, Brian McFee, Michael Picheny, Aaron R. Kaufman

2020 Capstone Project List

  • 2D to 3D Video Generation for Surgery (Best Capstone Poster)
  • Action Primitive Recognition with Sequence to Sequence Models towards Stroke Rehabilitation
  • Applying Self-learning Methods on Histopathology Whole Slide Images
  • Applying Transformers Models to Scanned Documents: An Application in Industry
  • Beyond Bert-based Financial Sentimental Classification: Label Noise and Company Information
  • Bias and Stability in Hiring Algorithms (Best Capstone Poster)
  • Breast Cancer Detection using Self-supervised Learning Method
  • Catastrophic Forgetting: An Extension of Current Approaches (Best Capstone Poster)
  • ClinicalLongformer: Public Available Transformers Language Models for Long Clinical Sequences
  • Complication Prediction of Bariatric Surgery
  • Constraining Search Space for Hardware Configurations
  • D4J: Data for Justice to Advance Transparency and Fairness
  • Data-driven Diesel Insights
  • Deep Learning to Study Pathophysiology in Dermatomyositis
  • Detection Of Drug-Target Interactions Using BioNLP
  • Determining RNA Alternative Splicing Patterns
  • Developing a Data Ecosystem for Refugee Integration Insights
  • Diarizing Legal Proceedings
  • Estimating the Impact of the Home Health Value-Based Purchasing Model
  • Extracting economic sentiment from mainstream media articles
  • Food Trend Detection in Chinese Financial Market
  • Forecasting Biodiesel Auction Prices
  • Generative Adversarial Networks for Electron Microscope Image Denoising
  • Graph Embedding for Question Answering over Knowledge Graphs
  • Impact of NYU Wasserman Resources on Students’ Career Outcomes
  • Improving Accented Speech Recognition Through Multi-Accent Pre-Exposure
  • Improving Synthetic Image Generation for Better Object Detection
  • Learning-based Model for Super-resolution in Microscopy Imaging
  • Modeling Human Reading by a Grapheme-to-Phoneme Neural Network
  • Movement Classification of Macaque Neural Activity
  • New OXXO Store in Brazil and Revenue Prediction
  • Numerical Relativity Interpolations using Deep Learning
  • One Medical Passport: Predictive Obstructive Sleep Apnea Analysis
  • Online Student Pathways at New York University
  • Predicting YouTube Trending Video Project
  • Promotional Forecasting Model for Profit Optimization
  • Question Answering on Tabular Data with NLP
  • Raizen Fuel Demand Forecasting
  • Reach for the stars: detecting astronomical transients
  • Reverse Engineering the MOS 6502 Microprocessor
  • Selecting Optimal Training Sets
  • Synthesizing baseball data with event prediction pretraining
  • Train ETA Estimation for Rumo S.A.
  • Training a Generalizable End-to-End Speech-to-Intent Model
  • Utilizing Machine Learning for Career Advancement and Professional Growth

Best Fall 2019 Capstone Projects


  • Inferring the Topic(s) of Wikipedia Articles

By Marina Zavalina, Sarthak Agarwal, Chinmay Singhal, Peeyush Jain


Option Portfolio Replication and Hedging in Deep Reinforcement Learning

By Bofei Zhang, Jiayi Du, Yixuan Wang, Muyang Jin


Adversarial Attacks Against Linear and Deep-Learning Regressions in Astronomy

By Teresa Huang, Zacharie Martin, Greg Scanlon, Eva Wang | Mentors: Soledad Villar, David W. Hogg

2019 Capstone Project List

  • Adversarial Attacks Against Linear and Deep-learning Regressions in Astronomy
  • Automated Breast Cancer Screening
  • Automatic Legal Case Summaries
  • Cross-task Transfer Between Language Understanding Tasks in NLP
  • Dark Matter and Stellar Stream Detection using Deep Learned Clustering
  • Exploiting Google Street View to Generate Global-scale Data Sets for Training Next Generation Cyber-Physical Systems
  • Federated Incremental Learning
  • Fraud Detection in Monetary Transactions Between Bank Accounts
  • Guided Image Upsampling
  • Improving State of the Art Cross-Lingual Word-Embeddings
  • Latent Semantic Topics Distribution Over Web Content Corpus
  • Lease Renewal Probability Prediction
  • Machine Learning for Adaptive Fuzzy String Matching
  • Market Segmentation from Retailer Behavior
  • Modeling the Experienced Dental Curriculum from Student Data
  • Modelling NBA Games
  • Movie Preference Prediction
  • MRI Image Reconstruction
  • NLP Metalearning
  • Predict next sales office location

  • Predicting Stock Market Movements using Public Sentiment Data & Sequential Deep Learning Models

  • Predictive Maintenance Techniques
  • Reinforcement Learning for Replication and Hedging of Option
  • Self-supervised Machine Listening

  • Sentence Classification of TripAdvisor ‘Points-of-Interest’ Reviews

  • Simulating the Dark Matter Distribution of the Universe with Deep Learning
  • SMaPP2: Joint Embedding of User-content and Network Structure to Enable a Common Coordinate that Captures Ideology, Geography, and User Topic Spectrum
  • Sparse Deconvolution Methods for Microscopy Imaging Data Analysis
  • Stereotype and Unconscious Bias in Large Datasets
  • Structuring Exploring and Exploiting NIH’s Clinical Trials Database
  • The Analysis, Visualization, and Understanding of Big Urban Noise Data
  • Unsupervised and Self-supervised Learning for Medical Notes
  • Unsupervised Generative Video Dubbing
  • Using Deep Generative Models to de-noise Noisy Astronomical Data

Featured Academic Capstone Projects


Deep Learning for Breast Cancer Detection

By Jason Phang, Jungkyu (JP) Park, Thibault Fevry, Zhe Huang, The B-Team


Brain Segmentation Using Deep Learning

By Team 22/7 | Chaitra V. Hegde | Advisor: Narges Razavian


Predict Total Knee Replacement Using MRI With Supervised and Semi-Supervised Networks

By Team Glosy: Hong Gao, Mingsi Long, Yulin Shen, and Jie Yang

Featured Industry Capstone Projects


Determining where New York Life Insurance should open its next sales office


NBA Shot Prediction with Spatio-Temporal Analysis

Other Past Capstone Projects

  • Active Physical Inference via Reinforcement Learning
  • Deep Multi-Modal Content-User Embeddings for Music Recommendation
  • Fluorescent Microscopy Image Restoration
  • Learning Visual Embeddings for Reinforcement Learning
  • Offensive Speech Detection on Twitter
  • Predicting Movement Primitives in Stroke Patients using IMU Sensors
  • Recurrent Policy Gradients For Smooth Continuous Control
  • The Quality-Quantity Tradeoff in Deep Learning
  • Trend Modeling in Childhood Obesity Prediction
  • Twitter Food/Activity Monitor

University of California, San Diego


Big Data - Capstone Project

This course is part of Big Data Specialization

Taught in English



Instructors: Ilkay Altintas +1 more


Sponsored by University of California, San Diego

16,694 already enrolled

(393 reviews)


Build your subject-matter expertise

  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate


Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review


There are 7 modules in this course

Welcome to the Capstone Project for Big Data! In this culminating project, you will build a big data ecosystem using tools and methods from the earlier courses in this specialization. You will analyze a data set simulating big data generated from a large number of users who are playing our imaginary game "Catch the Pink Flamingo". During the five-week Capstone Project, you will walk through the typical big data science steps for acquiring, exploring, preparing, analyzing, and reporting. In the first two weeks, we will introduce you to the data set and guide you through some exploratory analysis using tools such as Splunk and Open Office. Then we will move into more challenging big data problems requiring the more advanced tools you have learned, including KNIME, Spark's MLlib, and Gephi. Finally, during the fifth and final week, we will show you how to bring it all together to create engaging and compelling reports and slide presentations. As a result of our collaboration with Splunk, a software company focused on analyzing machine-generated big data, learners with the top projects will be eligible to present to Splunk and meet Splunk recruiters and engineering leadership.

Simulating Big Data for an Online Game

This week we provide an overview of the Eglence, Inc. Pink Flamingo game, including the data the company has access to about the game and its users, and what we might be interested in finding out.

What's included

4 videos 4 readings

4 videos • Total 17 minutes

  • Welcome to the Big Data Capstone Project • 2 minutes • Preview module
  • Welcome from Splunk: Rob Reed World Education Evangelist • 3 minutes
  • A Summary of Catch the Pink Flamingo • 7 minutes
  • A Conceptual Schema for Catch the Pink Flamingo • 4 minutes

4 readings • Total 35 minutes

  • Planning, Preparation, and Review • 10 minutes
  • A Game by Eglence Inc. : Catch The Pink Flamingo • 10 minutes
  • Overview of the Catch the Pink Flamingo Data Model • 10 minutes
  • Overview of Final Project Design • 5 minutes

Acquiring, Exploring, and Preparing the Data

Next, we begin working with the simulated game data by exploring and preparing the data for ingestion into big data analytics applications.

6 readings 1 quiz 1 peer review

6 readings • Total 140 minutes

  • Downloading the Game Data and Associated Scripts • 10 minutes
  • Understanding the CSV Files Generated by the Scripts • 20 minutes
  • Optional Review of Splunk • 0 minutes
  • “Catch the Pink Flamingo” Data Exploration with Splunk • 45 minutes
  • Aggregate Calculations Using Splunk • 45 minutes
  • Filtering the Data With Splunk • 20 minutes

1 quiz • Total 30 minutes

  • Data Exploration With Splunk • 30 minutes

1 peer review • Total 60 minutes

  • Data Exploration Technical Appendix • 60 minutes
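Splunk performs these aggregate calculations interactively over the game's CSV files. Purely as an illustration of what such an aggregation computes, here is a plain-Python sketch; the column names (`userId`, `team`, `price`) and sample rows are invented, not the actual game schema:

```python
import csv
import io
from collections import defaultdict

# Invented sample of game-purchase rows (the real CSVs have their own schema):
data = io.StringIO("""userId,team,price
u1,alpha,1.0
u2,alpha,3.0
u3,beta,20.0
u1,alpha,2.0
""")

# Aggregate: total purchase amount per team, like a Splunk "stats sum(price) by team".
totals = defaultdict(float)
for row in csv.DictReader(data):
    totals[row["team"]] += float(row["price"])

print(dict(totals))  # {'alpha': 6.0, 'beta': 20.0}
```

The same pattern (group by a key, reduce with a sum) is what the later Hive, Impala, and SparkSQL exercises express in SQL.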

Data Classification with KNIME

This week we do some data classification using KNIME.

4 readings 1 peer review

4 readings • Total 45 minutes

  • Review: Classification Using Decision Tree in KNIME • 10 minutes
  • Review: Interpreting a Decision Tree in KNIME • 10 minutes
  • Workflow Overview for Building a Decision Tree in KNIME • 20 minutes
  • Description of combined_data.csv • 5 minutes

1 peer review • Total 240 minutes

  • Classifying in KNIME to identify big spenders in Catch the Pink Flamingo • 240 minutes
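KNIME builds the decision tree graphically, but the idea behind a single split (a "decision stump", the one-level case of a decision tree) can be sketched in a few lines of plain Python. This is an illustrative toy only: the spend values, labels, and the notion of a spend threshold below are invented, not taken from combined_data.csv.

```python
def best_stump(values, labels):
    """Find the threshold on one numeric feature that best separates two
    classes. Rows with value >= threshold are predicted positive."""
    best_threshold, best_accuracy = None, 0.0
    for t in sorted(set(values)):
        predictions = [v >= t for v in values]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Hypothetical in-game purchase totals and "big spender" labels:
spend = [0.0, 1.5, 2.0, 10.0, 12.5, 30.0]
big_spender = [False, False, False, True, True, True]

print(best_stump(spend, big_spender))  # (10.0, 1.0): a perfect split at 10.0
```

A full decision tree repeats this search recursively on each side of the chosen split; KNIME's Decision Tree Learner node does that (with information-gain criteria) without any code.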

Clustering with Spark

This week we do some clustering with Spark.

2 readings 1 peer review 3 discussion prompts

2 readings • Total 35 minutes

  • Informing business strategies based on client base • 5 minutes
  • Practice with PySpark MLlib Clustering • 30 minutes

1 peer review • Total 200 minutes

  • Recommending Actions from Clustering Analysis • 200 minutes

3 discussion prompts • Total 40 minutes

  • Is there only “one way” to cluster a client base? • 15 minutes
  • How many clusters? • 10 minutes
  • What kind of criteria might provide actionable information for Eglence Inc.? • 15 minutes

Graph Analytics of Simulated Chat Data With Neo4j

This week we apply what we learned from the 'Graph Analytics With Big Data' course to simulated chat data from Catch the Pink Flamingo using Neo4j. We analyze player chat behavior to find ways of improving the game.

2 readings 1 peer review

2 readings • Total 130 minutes

  • Understanding the Simulated Chat Data Generated by the Scripts • 10 minutes
  • Graph Analytics of Catch the Pink Flamingo Chat Data Using Neo4j • 120 minutes

1 peer review • Total 60 minutes

  • Graph Analytics With Chat Data Using Neo4j • 60 minutes
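Neo4j runs this analysis as Cypher queries over a property graph; the core "who chats with whom" question can be sketched in plain Python for intuition. The chat edges below are invented stand-ins, not the simulated data generated by the scripts:

```python
from collections import Counter

# Invented directed chat edges: (sender, receiver)
chats = [("amy", "bob"), ("amy", "cal"), ("bob", "amy"),
         ("cal", "amy"), ("dan", "amy"), ("amy", "bob")]

out_degree = Counter(sender for sender, _ in chats)     # messages sent
in_degree = Counter(receiver for _, receiver in chats)  # messages received

# The most-contacted player is a candidate "hub" of the chat graph:
print(in_degree.most_common(1))  # [('amy', 3)]
```

Degree counts like these are the simplest graph metric; the Neo4j exercises extend the same idea to paths, communities, and chattiest-team queries.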

Reporting and Presenting Your Work

1 video 1 reading

1 video • Total 1 minute

  • Week 5: Bringing It All Together • 1 minute • Preview module

1 reading • Total 7 minutes

  • Final project preparation • 7 minutes

Final Submission

1 video 1 reading 2 peer reviews

1 video • Total 1 minute

  • Congratulations! Some Final Words... • 1 minute • Preview module

1 reading • Total 10 minutes

  • Part 2: Help us connect your video to your LinkedIn profile • 10 minutes

2 peer reviews • Total 180 minutes

  • Optional 3-minute video: Splunk opportunity • 60 minutes
  • Final Project • 120 minutes

UC San Diego is an academic powerhouse and economic engine, recognized as one of the top 10 public universities by U.S. News and World Report. Innovation is central to who we are and what we do. Here, students learn that knowledge isn't just acquired in the classroom—life is their laboratory.

Learner reviews

Showing 3 of 393

393 reviews

Reviewed on Jul 7, 2020

Really interesting insights into the general overview of the big data specialization, with brain-teasing hands-on exercises and a look at how reporting on various big data analytics should be undertaken.

Reviewed on Mar 21, 2019

It's a very good course; it aggregates all the knowledge and information learned in the previous courses.

Reviewed on Dec 26, 2018

This is a great platform to enhance your skills through periodic learning, even with a busy schedule, and to keep pace with new IT.


Language Technologies Institute

School of Computer Science

Master of Computational Data Science

The Master of Computational Data Science (MCDS) program focuses on engineering and deploying large-scale information systems, and includes concentrations in Systems, Analytics, and Human-Centered Data Science.

Requirements

The MCDS program offers three majors: Systems, Analytics, and Human-Centered Data Science. All three require the same total number of course credits, split among required core courses, electives, data science seminar and capstone courses specifically defined for each major. The degree can also be earned in two different ways, depending on the length of time you spend working on it. Regardless of the timing option, all MCDS students must complete a minimum of 144 units to graduate.

Here are the options:

  • Standard Timing — a 16-month degree consisting of study for fall and spring semesters, a summer internship, and fall semester of study. Each semester comprises a minimum of 48 units. This timing is typical for most students. Students graduate in December.
  • Extended Timing — a 20-month degree consisting of study for fall and spring semesters, a summer internship, and a second year of fall and spring study. Each semester comprises a minimum of 36 units. Students graduate in May.

Core Curriculum

All MCDS students must complete 144 units of graduate study which satisfy the following curriculum:

  • Five (5) MCDS Core Courses (63 units)
  • Three (3) courses from one area of concentration curriculum (36 units)
  • Three (3) MCDS Capstone courses (11-635, 11-634 and 11-632) (36 units)
  • One (1) Elective: any graduate-level course numbered 600 or above in the School of Computer Science (12 units)

Area of Concentration

  • During the first two semesters in the program, all students take a set of five (5) required core courses: 11-637 Fundamentals of Computational Data Science, 15-619 Cloud Computing, 10-601 Machine Learning, 05-839 Interactive Data Science, and 11-631 Data Science Seminar.
  • By the end of the first semester, all students must select at least one area of concentration — Systems, Analytics, or Human-Centered Data Science — which governs the courses taken after the first semester.
  • To maximize your chances of success in the program, you should consider which concentration area(s) you are best prepared for, based on your educational background, work experience, and areas of interest as described in your Statement of Purpose.
  • You are strongly encouraged to review the detailed curriculum requirements for each concentration area, in order to determine the best fit given your preparation and background.

For a complete overview of the MCDS requirements read the  MCDS Handbook .

To earn an MCDS degree, students must pass courses in the core curriculum, the MCDS seminar, a concentration area, and electives. Students must also complete a capstone project in which they work on a research project at CMU or on an industry-sponsored project.

In total, students must complete 144 eligible units of study, including eight 12-unit courses, two 12-unit seminar courses, and one 24-unit capstone course. Students must choose at minimum five core courses. The remainder of the 12-unit courses with course numbers 600 or greater can be electives chosen from the SCS course catalog. Any additional non-prerequisite units taken beyond the 144 units are also considered electives.
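The unit arithmetic above can be sanity-checked directly (a trivial sketch of the stated breakdown, not part of the official requirements):

```python
# Unit breakdown stated above: eight 12-unit courses, two 12-unit seminar
# courses, and one 24-unit capstone course.
courses = 8 * 12
seminars = 2 * 12
capstone = 1 * 24
print(courses + seminars + capstone)  # 144 -- the stated graduation minimum
```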

Students who plan to select the Systems concentration may wish to enroll in 15-513 “Introduction to Computing Systems” during the summer session preceding their enrollment in the program; this course is a prerequisite for many advanced Systems courses, so it should be completed during Summer if you wish to enroll in advanced Systems courses in the Fall.

Click here to see the MCDS Course Map.

Some example courses of study are included below.

Example 1: Analytics Major, 16 Months

Example 2: Systems Major, 16 Months

Example 3: Human-Centered Data Science Major, 16 Months

Carnegie Mellon's School of Computer Science has a centralized  online application process . Applications and all supporting documentation for fall admission to any of the LTI's graduate programs must be received by the application deadline. Incomplete applications will not be considered.  The application period for Fall 2024 is now closed. Information about the Fall 2025 admissions cycle will be available in summer 2024.

Application Deadlines

Fee Waivers

Fee waivers may be available in cases of financial hardship, or for participants in select "pipeline" programs. For more information, please refer to the  School of Computer Science Fee Waiver page .

The School of Computer Science requires the following for all applications:

  • A GPA of 3.0 or higher.
  • GRE scores: These must be less than five years old. Our Institution Code is 2074; Department Code is 0402. (This requirement is waived for CMU undergrads.)
  • The GRE At Home test is accepted but we prefer you take the GRE at a test center if possible.
  • Unofficial transcripts from each university you have attended, regardless of whether you received your degree there.
  • Current resume.
  • Statement of Purpose.
  • Three letters of recommendation
  • A short (1-3 minutes) video of yourself. Tell us about you and why you are interested in the MCDS program. This is not a required part of the application process, but it is STRONGLY suggested.  
  • Proof of English Language Proficiency

Proof of English Language Proficiency: If you will be studying on an F-1 or J-1 visa, and English is not a native language for you (native language…meaning spoken at home and from birth), we are required to formally evaluate your English proficiency. We require applicants who will be studying on an F-1 or J-1 visa, and for whom English is not a native language, to demonstrate English proficiency via one of these standardized tests: TOEFL (preferred), IELTS, or Duolingo. We discourage the use of the "TOEFL ITP Plus for China," since speaking is not scored.

We do not issue waivers for non-native speakers of English. In particular, we do not issue waivers based on previous study at a U.S. high school, college, or university. We also do not issue waivers based on previous study at an English-language high school, college, or university outside of the United States. No amount of educational experience in English, regardless of which country it occurred in, will result in a test waiver.

Applicants applying to MCDS are required to submit scores from an English proficiency exam taken within the last two years. Scores taken before Sept. 1, 2021, will not be accepted regardless of whether you have previously studied in the U.S. For more information about English proficiency score policies, visit the MCDS admission website. Successful applicants will have a minimum TOEFL score of 100, IELTS score of 7.5, or Duolingo score of 120. Our Institution Code is 4256; the Department Code is 78. Additional details about English proficiency requirements are provided on the FAQ page.

Applications which do not meet  all  of these requirements by the application deadline (see above) will not be reviewed.

For more details on these requirements, please see the  SCS Master's Admissions page.

In addition to the SCS guidelines, the LTI requires:

  • Any outside funding you are receiving must be accompanied by an official award letter.

No incomplete applications will be eligible for consideration.

For specific application/admissions questions, please contact  Jennifer Lucas  or Caitlin Korpus .

Program Contact

For more information about the MCDS program, contact Jennifer Lucas or Caitlin Korpus

Jennifer Lucas

Caitlin Korpus

The School of Information is UC Berkeley’s newest professional school. Located in the center of campus, the I School is a graduate research and education community committed to expanding access to information and to improving its usability, reliability, and credibility while preserving security and privacy.


The School of Information offers four degrees:

The Master of Information Management and Systems (MIMS) program educates information professionals to provide leadership for an information-driven world.

The Master of Information and Data Science (MIDS) is an online degree preparing data science professionals to solve real-world problems. The 5th Year MIDS program is a streamlined path to a MIDS degree for Cal undergraduates.

The Master of Information and Cybersecurity (MICS) is an online degree preparing cybersecurity leaders for complex cybersecurity challenges.

Our Ph.D. in Information Science is a research program for next-generation scholars of the information age.


The School of Information's courses bridge the disciplines of information and computer science, design, social sciences, management, law, and policy. We welcome interest in our graduate-level Information classes from current UC Berkeley graduate and undergraduate students and community members.  More information about signing up for classes.



Research by faculty members and doctoral students keeps the I School on the vanguard of contemporary information needs and solutions.

The I School is also home to several active centers and labs, including the Center for Long-Term Cybersecurity (CLTC) , the Center for Technology, Society & Policy , and the BioSENSE Lab .


I School graduate students and alumni have expertise in data science, user experience design & research, product management, engineering, information policy, cybersecurity, and more — learn more about hiring I School students and alumni .



BigData Engineering Capstone Project with Tech-stack : Linux, MySQL, sqoop, HDFS, Hive, Impala, SparkSQL, SparkML, git

Subham2S/BigData-Engineering-Capstone-Project-1

BigData Engineering Capstone Project 1

🤖 Tech-Stack

A large corporation needed data engineering services for a decade's worth of employee data. All employee datasets from that period were provided in six CSV files. The first step of this project was to create an Entity Relationship Diagram and create a database in an RDBMS, with all the tables structuring and holding the data according to the relations between them. So, I imported the CSVs into a MySQL database, transferred them to HDFS/Hive in an optimized format, and analyzed them with Hive, Impala, SparkSQL, and SparkML. Finally, I created a Bash script to automate the end-to-end data pipeline and machine learning pipeline.

Importing data from the MySQL RDBMS to HDFS using Sqoop, creating Hive tables in a compressed file format (Avro), exploratory data analysis with Impala & SparkSQL, and building Random Forest Classifier and Logistic Regression models using SparkML.

Upload the Capstone_Inputs folder to the client home directory; it contains:

  • Capstone_P1.sh
  • CreateMySQLTables.sql
  • HiveTables.sql
  • capstone.py
  • departments.csv
  • dept_emp.csv
  • dept_manager.csv
  • employees.csv
  • salaries.csv

Run the Bash script Capstone_P1.sh in a terminal.

Wait a while, then download the Capstone_Outputs folder. After approx. 10-15 minutes, the Capstone_Outputs folder will be generated with all the output files:

  1. Cap_MySQLTables.txt: to check the MySQL tables.
  2. Cap_HiveDB.txt: to ensure that the Hive tables were created.
  3. Cap_ImpalaAnalysis.txt: all EDA output tables from Impala.
  4. Cap_HiveTables.txt: to check records in the Hive tables; dept_emp1 is created additionally to fix some duplicate rows present in dept_emp.
  5. Cap_SparkSQL_EDA_ML.txt: all EDA output tables from SparkSQL/pySpark and all the details of the models (both Random Forest and Logistic Regression).
  6. random_forest.model.zip
  7. logistic_regression.model.zip

🔍 Details of Capstone_P1.sh

Linux Commands

  • Removes the metadata of the tables in the root dir (created by the Sqoop command the last time the code was run)
  • Removes the Java MapReduce code in the root dir (created by the Sqoop command the last time the code was run)
  • Removes the current Capstone_Outputs folder and creates a new Capstone_Outputs dir, where all the outputs will be stored
  • Recursively copies everything to the root folder to avoid permission issues at a later point in time

MySQL ( .sql )

  • Creates the MySQL tables & inserts data. For more details, please check out CreateMySQLTables.sql
  • Removes & recreates the Warehouse/Capstone dir to avoid anomalies between same-named files
  • Imports the data & metadata of all the tables from the MySQL RDBMS into Hadoop using the Sqoop command

HDFS Commands

  • Transfers the metadata to HDFS for table creation in Hive

Hive ( .hql )

  • All the Hive tables are created in Avro format. In the HiveDB.hql file, the table location and its metadata (schema) location are specified separately.

Impala ( .sql )

  • Exploratory data analysis is done with Impala. For more details, please check out EDA.sql
  • Checks all the records of the Hive tables before moving to Spark. For more details, please check out HiveTables.sql

Spark ( .py )

  • This capstone.py does everything. First it loads the tables and creates Spark dataframes, then checks all the records again. After that, the same EDA is performed with the aid of SparkSQL & pySpark.
  • After EDA, it checks stats for the numerical & categorical variables, then proceeds to model building after creating the final dataframe by joining the tables and dropping irrelevant columns. With 'left' as the chosen target variable, the independent variables were divided into continuous and categorical variables; among the categorical variables, two columns were label-encoded manually and the rest were processed with one-hot encoding.
  • Then, based on the earlier EDA, both a Random Forest classification model and a Logistic Regression model were chosen for this dataset. The accuracies were 99% (RF) and 90% (LR). The models were fitted with a 70/30 train-test split and gave the same accuracy on both sets. Considering these good fits, both models were saved.
  • After that, a Pipeline was created and the same analysis was performed in a streamlined manner to rebuild these models. The accuracies of the standalone models and the pipeline models are very close. The slight difference arises because, in the earlier case, the train/test split was performed after fitting the assembler, whereas in the ML pipeline the assembler is one of the stages, so it fits on the split datasets separately as part of the pipeline. This is also clearly visible in the features column. So this was a good test of the pipeline models in terms of accuracy, and we can conclude that the ML pipeline is working properly.
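The README trains its models with SparkML. Purely as an illustration of what the logistic-regression step does, here is a minimal dependency-free sketch using per-example gradient descent; the feature values and labels are invented stand-ins for the scaled employee attributes and the 'left' target, not the real dataset:

```python
import math

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Logistic regression via per-example (stochastic) gradient descent.
    X: list of feature vectors, y: list of 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                     # gradient of log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z >= 0 else 0  # sigmoid(z) >= 0.5 iff z >= 0

# Invented (scaled) features, e.g. salary and tenure, with a 0/1 "left" label:
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
print([predict(w, b, xi) for xi in X])  # [0, 0, 1, 1] on this separable toy set
```

SparkML's LogisticRegression does the same optimization distributed over partitioned data, with the VectorAssembler supplying the feature vectors, which is why assembler placement relative to the train/test split matters.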

Collecting the Models

📚 Reference Files

The following files are added for your reference.

  • Capstone.ipynb
  • Capstone Project1.pptx , Capstone Project1.pdf
  • Capstone.zip
  • ERD_Data Model.jpg , ERD_Data Model.svg

7 Things Students Are Missing in a Data Science Resume

Adding these 7 key elements to your resume will improve your odds of getting an interview call. Remember, after graduating from the university, your full-time job is to find a job, so put some effort into preparing your resume.

As I reflect on my days as a student, I now realize that there were a few crucial elements that were missing from my data science resume. These shortcomings probably resulted in my being rejected for various job positions. Not only was I unable to present myself as a valuable asset to potential teams, but I also struggled to showcase my ability to solve data science problems. However, with time,  I got better and collaborated with multiple teams to figure out what I was missing and how I could do better if I had to start over.

In this blog, I will share the 7 things that students often overlook in their data science resumes, which can prevent hiring managers from calling them for interviews. 

1. Simple and Readable Resume

Complicating your resume with technical terms, too much information, or unconventional formats can lead to it being rejected right away. Your resume should be easy to read and understand, even by someone not deeply versed in data science. Use a clean, professional layout with clear headings, bullet points, and a standard font. Avoid dense blocks of text. Remember, the goal is to communicate your skills and experiences as quickly and effectively as possible to the hiring manager.

2. Quantifiable Achievements

When you are listing your previous work experiences or projects in the experience section, it is recommended to focus on quantifiable achievements rather than simply listing your responsibilities. 

For example, instead of stating "Developed machine learning models", you could write "Developed a machine learning model that increased sales by 15%." This will demonstrate the tangible impact of your work and showcase your ability to drive results.

3. Data Science Specific Skills

When creating a list of your technical skills, it's crucial to highlight the ones that are directly relevant to data science. Avoid including unrelated skills, such as graphic design or video editing. Keep your list of skills concise, and note the number of years of experience you have with each.

Make sure to mention programming languages like Python or R, data visualization tools like Tableau or Power BI, and data analysis tools like SQL or pandas. Additionally, it's worth mentioning your experience with popular machine learning libraries such as PyTorch or scikit-learn.

4. Soft Skills and Teamwork

Data science is not solely dependent on technical abilities. Collaboration and communication skills are crucial. Including experiences where you worked as part of a team, especially in multidisciplinary settings or instances where you communicated complex data insights to non-technical stakeholders, can demonstrate your soft skills.

5. Real World Experience

Employers value practical, hands-on experience in the field of data science. If you have completed internships, projects, or research in data science, be sure to highlight these experiences in your resume. Include details about the projects you worked on, the tools and technologies you used, and the results you achieved.

Students often underestimate the power of showcasing relevant projects. Whether it’s a class assignment, a capstone project, or something you built for fun, include projects that demonstrate your skills in data analysis, programming, machine learning, and problem-solving. Be sure to describe the project goal, your role, the tools and techniques used, and the outcome. Links to GitHub repositories or project websites can also add credibility.

6. Adaptability and Problem Solving Skills

The field of data science is continually evolving, and employers are seeking candidates who can adapt to new challenges and technologies. 

As a data scientist, you may find yourself jumping from being a data analyst to a machine learning engineer in just a few months. Your company may even ask you to deploy machine learning models in production and learn how to manage them. 

The role of a data scientist is fluid, and you have to be mentally prepared for the role changes. You can demonstrate your adaptability and problem-solving skills by highlighting any experiences in which you had to learn a new tool or technique quickly, or where you successfully tackled a complex problem.

7. Links to a Professional Portfolio

Creating an online portfolio and linking to it on your resume is extremely important. This enables hiring managers to quickly look at your previous projects and the tools you used to solve certain data problems. You can check out the top platforms for creating a data science portfolio for free: 7 Free Platforms for Building a Strong Data Science Portfolio

Failing to include a link to your GitHub repository or a personal website where you showcase your projects is a missed opportunity. 

Final Thoughts

One important thing to keep in mind while submitting your resume for job applications is to tailor it to the job requirements. Look for the skills required for the job and try to include them in your resume to increase your chances of getting an interview call. Beyond your resume, networking and LinkedIn can be very helpful in finding jobs and freelance projects. Consistently maintaining your LinkedIn profile and posting regularly can go a long way in establishing your professional presence.

Abid Ali Awan ( @1abidaliawan ) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


