Assignments


There will be one homework (HW) for each topical unit of the course, due about a week after we finish that unit.

These are intended to build your conceptual analysis skills as well as your implementation skills in Python.

  • HW0 : Numerical Programming Fundamentals
  • HW1 : Regression, Cross-Validation, and Regularization
  • HW2 : Evaluating Binary Classifiers and Implementing Logistic Regression
  • HW3 : Neural Networks and Stochastic Gradient Descent
  • HW4 : Trees
  • HW5 : Kernel Methods and PCA

After each unit, there will be a 20-minute quiz, taken online via Gradescope.

Each quiz is designed to assess your conceptual understanding of that unit.

Expect about 10 questions. Most will be true/false or multiple choice, with perhaps 1-3 short-answer questions.

Treat the conceptual questions in each unit's in-class demos/labs and homework as good practice for the corresponding quiz.

There will be three larger "projects" throughout the semester:

  • Project A: Classifying Images with Feature Transformations
  • Project B: Classifying Sentiment from Text Reviews
  • Project C: Recommendation Systems for Movies

Projects are meant to be open-ended and to encourage creativity. They are case studies applying the ML concepts from class to three "real world" use cases: image classification, text classification, and recommending movies to users.

Each project will be due approximately 4 weeks after being handed out. Start early! Do not wait until the last few days.

Projects will generally be centered around a particular methodology for solving a specific task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report (2-4 pages), describing your approach and providing several figures/tables to explain your results to the reader.

You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.

Assignments

Each homework will be due about one week after it is released. It is meant to test knowledge recently acquired in class, mostly through coding exercises but also through written questions that build mathematical and conceptual reasoning skills.

  • HW0: Basic Numeric Operations in Numpy
  • Can you divide datasets into train/test?
  • Can you perform nearest-neighbors search?
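
A minimal NumPy sketch of both operations on toy data (all names and data here are illustrative, not part of the assignment):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: 100 examples with 3 features each, binary labels.
    X = rng.normal(size=(100, 3))
    y = rng.integers(0, 2, size=100)

    # Divide into train/test with a random permutation (80/20 split).
    perm = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    X_train, X_test = X[perm[:n_train]], X[perm[n_train:]]
    y_train, y_test = y[perm[:n_train]], y[perm[n_train:]]

    # Nearest-neighbor search: for each test point, find the closest
    # training point by Euclidean distance, using broadcasting.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)   # index of each test point's nearest neighbor
    y_pred = y_train[nearest]        # 1-NN prediction
    print("1-NN accuracy:", (y_pred == y_test).mean())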

HW1: Practical Introduction to Regression

  • Head-to-head comparison of k-NN regression and linear regression
  • Using a validation set to tune model complexity (see the sketch below)
  • Understanding evaluation metrics and correspondence with training metrics
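
A sketch of that workflow with sklearn on a synthetic dataset (purely illustrative; the actual homework data and interfaces may differ):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.neighbors import KNeighborsRegressor

    # Toy 1-D regression problem.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)

    # Split into train / validation.
    X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

    # Tune model complexity (k) on the validation set.
    best_k, best_mse = None, np.inf
    for k in [1, 3, 5, 10, 25, 50]:
        knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
        mse = mean_squared_error(y_va, knn.predict(X_va))
        if mse < best_mse:
            best_k, best_mse = k, mse
    print("best k:", best_k, "validation MSE:", best_mse)

    # Head-to-head comparison with linear regression.
    lin = LinearRegression().fit(X_tr, y_tr)
    print("linear regression validation MSE:",
          mean_squared_error(y_va, lin.predict(X_va)))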

HW2: Regularized Regression

  • L1 vs L2 penalties, using Ridge and Lasso from sklearn
  • Exploration of basis functions
  • Selecting hyperparameters by cross-validation (see the sketch below)
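
A sketch of selecting the penalty strength by cross-validation with sklearn, combined with a polynomial basis expansion (illustrative only; the homework's data and parameter grids will differ):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 1))
    y = X[:, 0] ** 3 - X[:, 0] + 0.5 * rng.normal(size=200)

    for Model, name in [(Ridge, "ridge (L2)"), (Lasso, "lasso (L1)")]:
        # Basis functions (here polynomials) followed by a penalized linear model.
        pipe = make_pipeline(PolynomialFeatures(degree=8), Model(max_iter=10000))
        step = pipe.steps[-1][0]    # step name: "ridge" or "lasso"
        search = GridSearchCV(pipe,
                              param_grid={step + "__alpha": np.logspace(-4, 2, 7)},
                              cv=5, scoring="neg_mean_squared_error")
        search.fit(X, y)
        print(name, "best alpha:", search.best_params_)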

HW3: Binary Classifiers & Evaluation

  • Using logistic regression and decision tree classifiers from sklearn
  • Creating and comparing confusion matrices and ROC curves (see the sketch below)
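
A sketch with sklearn on synthetic data (illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in [LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(max_depth=4)]:
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]             # P(y=1), used for ranking
        print(type(model).__name__)
        print(confusion_matrix(y_te, model.predict(X_te)))  # rows = true class
        fpr, tpr, _ = roc_curve(y_te, proba)                # points for an ROC plot
        print("AUC:", roc_auc_score(y_te, proba))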

HW4: Multi-Layer Perceptrons

  • Backpropagation (see the sketch below)
  • Choosing activation functions
  • Choosing optimization algorithms
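
A from-scratch sketch of a one-hidden-layer MLP for regression, trained by stochastic gradient descent with backpropagation (a minimal illustration, not the homework's required architecture):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(256, 1))
    y = np.sin(X[:, 0:1])

    H, lr = 16, 0.05                              # hidden width, learning rate
    W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)
    W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

    for epoch in range(200):
        for i in rng.permutation(len(X)):         # stochastic updates
            x, t = X[i:i+1], y[i:i+1]
            h = np.tanh(x @ W1 + b1)              # forward pass, tanh activation
            out = h @ W2 + b2
            d_out = out - t                       # gradient of 0.5*(out - t)^2
            d_h = (d_out @ W2.T) * (1 - h**2)     # tanh'(a) = 1 - tanh(a)^2
            W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
            W1 -= lr * x.T @ d_h;   b1 -= lr * d_h.sum(0)

    print("final MSE:", np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))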

HW5: Random Forests and SVMs

HW6: Dimensionality Reduction and Clustering

Projects are meant to be open-ended, simulate case studies found in "the real world", and encourage creativity.

Each project will usually be due 2-3 weeks after being handed out. Projects will generally be centered around a particular methodology and task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report. You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.

  • Project 1: From-Scratch Implementation of Logistic Regression
  • Project 2: Text Sentiment Classifiers for Online Reviews
  • Project 3: Recommendation Systems

COS 402: Artificial Intelligence

Dean's date reminder:  Unlike the other homeworks, this one must be turned in on time.  This final homework is due on "dean's date," the latest possible due date allowed by university policy.  As per university rules, this also means that you will need written permission from the appropriate dean to turn it in late.

Part I: Written Exercises

The written exercises are available here in pdf.

Part II:  Programming

The topic of this assignment is machine learning for supervised classification problems.  Here are the main components of the assignment:

  • Implementation of the machine learning algorithm of your choice.
  • Comparison of your learning algorithm to those implemented by your fellow students on a small set of benchmark datasets.
  • A systematic experiment of your choice using your algorithm.
  • A short written report describing and discussing what you did and what results you got.

For this assignment, you may choose to work individually or in pairs.  You are encouraged to work in pairs since you are likely to learn more, have more fun and have an easier time overall.  (The written exercises should still be done individually though.)

Note that part of this assignment must be turned in by Thursday, January 8.  See "What to turn in" below.  Also, be sure to plan your time carefully as the systematic experiment may take hours or days to run (depending on what you decide to do for this part).

A machine learning algorithm

The first part of the assignment is to implement a machine learning algorithm of your choice.  We have discussed several algorithms including naive Bayes, decision trees, AdaBoost, SVM's and neural nets.  R&N discuss others including decision stumps and nearest neighbors.  There are a few other algorithms that might be appropriate for this assignment such as the (voted) perceptron algorithm and bagging.  You may choose any of these to implement.  More details of these algorithms are given below, in some cases with pointers for further reading.  For several of these algorithms, there are a number of design decisions that you will need to make; for instance, for decision trees, you will need to decide on a splitting criterion, pruning strategy, etc.  In general, you are encouraged to experiment with these algorithms to try to make them as accurate as you can.  You are welcome to try your own variants, but if you do, you should compare to the standard vanilla version of the algorithm as well.

If you are working individually, you should implement one algorithm.  If you are working with a partner, the two of you together should implement two algorithms.  You are welcome to implement more algorithms if you wish.

If it happens that you have previously implemented a learning algorithm for another class or independent project, you should choose a different one for this homework.

For this assignment, you may wish to do outside reading about your chosen algorithm, but by no means are you required to do so.  Several books on machine learning have been placed on reserve at the Engineering Library.   Although outside reading is allowed, as usual, copying, borrowing, looking at, or in any way making use of actual code that you find on-line or elsewhere is not permitted.  Please be sure to cite all your outside sources (other than lecture and R&N) in your report.

Notes on debugging:   It can be very hard to know if a machine learning program is actually working.  With a sorting program, you can feed in a set of numbers and see if the result is a sorted list.  But with machine learning, we usually do not know what the "correct" output should be.  Here are some suggestions for debugging your program:

  • Run your program on small, hand-built datasets where you know exactly what the answer should be.
  • Compare your results to those of fellow classmates who are implementing the same algorithm.  You can do this directly with them, or you can compare to results appearing on the course website (see below).  Of course, this can be tricky since there may be algorithmic differences that are not indicative of bugs.
  • If your code is broken down into multiple methods or modules (as it should be), check each separately.  If you are working as a pair, take turns checking each other's code. 
  • If you have time, implement the same algorithm in two different ways, for instance, once in java and once in matlab.  Compare the results.
  • Some algorithms come with theoretical guarantees (for instance, we discussed a theoretical upper bound on the training error of AdaBoost).  Check to be sure that your implementation conforms with the theory.  In general, keep your eye out for behavior that seems unreasonable.

Comparison on benchmark datasets

We have set up a mechanism whereby you will be able to compare the performance of your program to that of your fellow students.  The idea is to simulate what happens in the real world where you need to build a classifier using available labeled data, and then you use that classifier on new data that you never got to see or touch during training.  Here is how it works:

We are providing four benchmark datasets described below.  Each dataset includes a set of labeled training examples and another set of unlabeled test examples.  Once you have your learning algorithm written, debugged and working, you should try training your algorithm on the training data and producing a set of predictions on all of the test examples.  These predictions can then be submitted using moodle.  If you then press the "Run Script" button, your submitted predictions will be compared to the correct test labels and the resulting test error rate will be posted here where everyone can see how well everyone else's programs are performing.  The website will show, for each such submission, the date submitted, the author(s) of the program, a short description that you provide of the learning algorithm used, and the test error rate achieved.  The name listed as "author" will be a name that you provide.  So, if you wish to remain anonymous on the website, you can do so by using a made-up name of your choice, or even a random sequence of letters, but something that will allow you to identify your own entry.  (However, please refrain from using a name that might be considered offensive or inappropriate.)

The "description" you provide in submitting your test predictions should clearly describe the algorithm you are using, and any important design decisions you made (such as parameter settings).  This one or two sentence description should be as understandable as possible to others in the class.  For instance, try to avoid cryptic abbreviations.  (In contrast, the "author" you provide in submitting your test predictions can be any name you wish.)  While being brief, try to give enough information that a fellow classmate might have a reasonable chance of reproducing your results.

Once you have seen your results on test data, you may wish to try to improve your algorithm, or you may wish to try another algorithm altogether (although the assignment does not require you to do so).  Once you have done this, you may submit another set of test predictions.  However, to avoid the test sets becoming overused (leading to statistically meaningless results), each student will be limited to submitting three sets of predictions for each benchmark dataset.  Note that this limit is per student, not per team; in other words, if you are working as a pair, then together you can submit up to six sets of predictions per dataset.

Your grade will not depend on how accurate a classifier you are able to produce relative to the rest of the class.  Although you are encouraged to do the best you can, this should not be regarded as anything more than a fun (I hope) communal experiment exploring various machine learning algorithms.  This also means that, in choosing an algorithm to implement, it is more important to choose an algorithm that interests you than to choose one that you expect will give the best accuracy.  The greater a variety of algorithms that are implemented, the more interesting will be the results of our class experiment, even if some of those algorithms perform poorly.

A systematic experiment

The third part of this assignment is to run a single systematic experiment.  For instance, you might want to produce a "learning curve" such as the one shown in R&N Figure 18.11a.  In such a curve, the accuracy (or error) is measured as the training set size is varied over some range.  To be more specific, here is how you might produce such a curve.  The provided test datasets cannot be used here since they are unlabeled, and since you are limited to making only three sets of predictions on each.  Instead, you can split the provided training set into two subsets, one for training, and the other for measuring performance.  For instance, if you have 2000 training examples, you might hold out 1000 of those examples for measuring performance.  You can then run your learning algorithm on successively larger subsets of the remaining 1000 examples, say of size 20, 50, 100, 200, 500, 1000.  Each run of your algorithm will generate a classifier whose error can be measured on the held-out set of 1000 examples.  The results can then be plotted using matlab, gnuplot, excel, etc.  Note that such an experiment, though requiring multiple runs of the same algorithm, can be programmed to run entirely automatically, thus reducing the amount of human effort required, and also substantially reducing the chance of making mistakes.
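
To make the procedure concrete, here is the experimental logic as a short Python sketch (the course's provided code is Java, but the structure is identical; train_classifier and error_rate are hypothetical stand-ins for whatever algorithm you implement):

    import numpy as np

    def learning_curve(X, y, train_classifier, error_rate, sizes, n_holdout, seed=0):
        """Hold out n_holdout examples, train on nested subsets of the rest,
        and measure held-out error at each training-set size."""
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(X))
        hold, rest = perm[:n_holdout], perm[n_holdout:]
        curve = []
        for m in sizes:                    # e.g. [20, 50, 100, 200, 500, 1000]
            clf = train_classifier(X[rest[:m]], y[rest[:m]])
            curve.append((m, error_rate(clf, X[hold], y[hold])))
        return curve                       # plot size vs. error to get the curve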

This is just one possible experiment you might wish to run.  There are many other possibilities.  For instance, if you are using neural nets, you might want to plot accuracy as a function of the number of epochs of training.  Or if you are using boosting, you might plot accuracy as a function of the number of rounds.  Another possibility is to compare the accuracy of two different variants of the same algorithm, for instance, decision trees with and without boosting, or decision trees with two different splitting criteria.

This general approach of holding out part of the training set may also be helpful for improving the performance of your learning algorithm without using the "real" test set.  For instance, if your algorithm has a parameter (like the learning rate in neural nets) that needs to be tuned, you can try different settings and see which one seems to work the best on held out data.  (This could also count as a systematic experiment.)  You might then use this best setting of the learning rate to train on the entire training set and to generate test predictions that you submit for posting on the class website.

In general, such held-out sets should consist of about 500-1000 examples for reliable results.

Note that systematic experiments of this kind can take a considerable amount of computation time to complete, in some cases, many hours or even days, depending on the experiment and the algorithm being used.  Therefore, it is very important that you start on this part of the assignment as early as possible.

If you are working as a pair, it is okay to do just one experiment.  However, whatever experiments you do should involve at least two of the algorithms that you implemented.  For instance, you might produce learning curves for both.

A written report

The fourth part of this assignment is to write up your results clearly but concisely in a short report.  Your report should include all of the following (numbers in brackets indicate roughly how many paragraphs you might want to write for each bullet):

  • [1-2] A description of the algorithm(s) that you implemented.  The description should include enough implementation details that a motivated classmate would be able to reproduce your results.
  • [1] A brief description of what strategies you used to test that your program is working correctly since, as noted above, it can be difficult to know if a machine learning program is working.
  • [1] A description of the experiment(s) that you carried out, again, with enough detail for a fellow classmate to reproduce your results.
  • [1] The results of your experiment, possibly summarized by a figure.
  • [1] The accuracy of your algorithm on the provided test sets, and a comparison to other methods used by others in the class.
  • [1-3] A discussion of your results.  For instance, what do your results tell us about the learning algorithm(s) you studied?  Were the results in any way surprising, or were they what you expected, and why?  How do they fit with theory and intuition?  Can you conclude anything about what kind of algorithms might be better for what kind of problems?

If you are working as a pair, you only need to submit a single report (in which case, your report might be slightly longer than indicated by the numbers above).

The code we are providing

We are providing a class DataSet for storing a dataset, and for reading one in from data files that we provide or that you generate yourself for testing.  Each dataset is described by an array of training examples, an array of labels and an array of test examples.  Each example is itself an array of attribute values.  There are two kinds of attributes: numeric and discrete.  Numeric attributes have numeric values, such as age, height, weight, etc.  Discrete attributes can only take on values from a small set of discrete values, for instance, sex (male, female), eye color (brown, blue, green), etc.  Below, we also refer to binary attributes; these are numeric attributes that happen to only take the two values 0 and 1.

Numeric attributes are stored by their actual value as an integer (for simplicity, we don't allow floating point values).  Discrete attributes are stored by an index (an integer) into a set of values.  The DataSet class also stores a description of each attribute including its name, and, in the case of discrete attributes, the list of possible values.  Labels are stored as integers which must be 0 or 1 (we will only consider two-class problems).  The names of the two classes are also stored as part of the DataSet class.

A dataset is read in from three files named <stem>.names, <stem>.train and <stem>.test.  The first contains a description of the attributes and classes.  The second and third contain the labeled training examples and unlabeled test examples.  A typical <stem>.names file looks like the following:

yes  no
age        numeric
eye-color  brown  blue  green

The first line must contain the names of the two classes, which in this case are called "yes" and "no".  After this follows a list of attributes.  In this case, the second line of the file says that the first attribute is called "age", and that this attribute takes numeric values.  The next line says that the second attribute is called "eye-color", and that it is a discrete attribute taking the three values "brown", "blue" and "green".

A typical <stem>.train file might look like this:

33   blue   yes
15   green  no
25   green  yes

There is one example per line, consisting of a list of attribute values (corresponding to those described in the <stem>.names file), followed by the class label.

A <stem>.test file has exactly the same format except that the label is omitted, such as the following:

33   green
19   blue

The DataSet class has a constructor taking a file-stem as an argument that will read in all three files and set up the public fields of the class appropriately.  The .train and .names files are required, but not the .test file (if no .test file is found, a non-fatal warning message will be printed and an empty test set established).

Working with several different kinds of attributes can be convenient when coding a dataset but a nuisance when writing a machine learning program.  For instance, neural nets prefer all of the data to be numeric, while decision trees are simplest to describe when all attributes are binary.  For this reason, we have provided additional classes that will read in a dataset and convert all of the attributes so that they all have the same type.  This should make your job much, much simpler.  Each of these classes is in fact a subclass of DataSet (see the references listed on the course home page for an explanation of subclasses and how to use them), and each has a constructor taking a file-stem as its argument.  The three classes are NumericDataSet, BinaryDataSet and DiscreteDataSet, which convert any dataset into data that is entirely numeric, binary or discrete.  (In addition, BinaryDataSet is a subclass of NumericDataSet.)  So, for instance, if you are using neural nets and want your data to be entirely numeric, simply load the data using a command like this:

ds = new NumericDataSet(filestem);

Using these subclasses inevitably has the effect of changing the attributes.  When converting discrete attributes to numeric or binary, a new binary attribute is created for each value.  For instance, the eye-color attribute will become three new binary attributes: eye-color=brown, eye-color=blue and eye-color=green; if eye-color is blue on some example, then eye-color=blue would be given the value 1, and the others the value 0.  A numeric (non-binary) attribute is converted to binary by creating new binary attributes in a similar fashion.  Thus, the numeric attribute age would be replaced by the binary attributes age>=19, age>=25, age>=33.  If age actually is 25, then age>=19 and age>=25 would be set to 1, while age>=33 would be set to 0.  When converting a numeric (including binary) attribute to discrete, we simply regard it as a discrete attribute that can take on the possible values of the original numeric attribute.  Thus, in this example, age would now become a discrete attribute that can take on the values "15", "19", "25" and "33".  Note that all ordering information has been lost between these values.
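
To make these conversions concrete, here is a hypothetical NumPy illustration of the same transformations (this is only a sketch of what the provided Java subclasses do, not the provided code):

    import numpy as np

    # Discrete -> binary: one new 0/1 attribute per value (e.g. eye-color=blue).
    eye_color = np.array(["blue", "green", "blue", "brown"])
    values = ["brown", "blue", "green"]
    eye_binary = np.stack([(eye_color == v).astype(int) for v in values], axis=1)
    # columns: eye-color=brown, eye-color=blue, eye-color=green

    # Numeric -> binary: one threshold attribute per observed value above the minimum.
    age = np.array([33, 15, 25, 19])
    thresholds = sorted(set(age))[1:]      # 19, 25, 33
    age_binary = np.stack([(age >= t).astype(int) for t in thresholds], axis=1)
    # columns: age>=19, age>=25, age>=33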

If you produce your own dataset, it is important to know that the provided code assumes that numeric attributes can only take on a fairly small number of possible values.  If you try this code out with a numeric attribute that takes a very large number of values, you probably will run into memory and efficiency issues.  All of the provided datasets have been set up so that this should not be a problem.

The DataSet class also includes a method printTestPredictions that will print the predictions of your classifier on the test examples in the format required for submission.  The output of this method should be stored in a file called <stem>.testout and submitted using moodle.

We also are providing an interface called Classifier that your learning program and the computed classifier (hypothesis) should adhere to.  This interface has three methods: predict, which will compute the prediction of the classifier on a given example; algorithmDescription, which simply returns a very brief but understandable description of the algorithm you are using for inclusion on the course website; and author, which returns the "author" of the program as you would like it to appear on the website (can be your real name or a pseudonym).  A typical class implementing this interface will also include a constructor where the actual learning takes place.

A very simple example of a class implementing the Classifier interface is provided in BaselineClassifier.java which also includes a simple main for loading a dataset, training the classifier and printing the results on test data.

All code and data files can be obtained from this directory, or all at once from this zip file.  Data is included in the data subdirectory.

Documentation on the provided Java classes is available here.

The datasets we are providing

We are providing four datasets, all consisting of real-world data suitably cleaned and simplified for this assignment.

The first two datasets consist of optical images of handwritten digits.  Some examples are shown in R&N Figure 20.29 (the data we are providing actually comes from the same source, although ours have been prepared somewhat differently).  Each image is a 14x14 pixel array, with 4 pixel-intensity levels.  The goal is to recognize the digit being represented.  In the first and easier dataset, with file-stem ocr17, the goal is to distinguish 1's from 7's.  In the second and harder dataset, with file-stem ocr49, the goal is to distinguish 4's from 9's.

The third dataset consists of census information.  Each example corresponds to a single individual, with attributes such as years of education, age, race, etc.  The goal is to predict whether this individual has an income above or below $50,000.  The file-stem is census.

The fourth dataset consists of DNA sequences of length 60.  The goal is to predict whether the site at the center of this window is a "splice" or "non-splice" site.  The file-stem is dna.

The DNA dataset consists of 1000 training examples and 2175 test examples.  All of the other datasets consist of 2000 training examples and 4000 test examples.

It might happen that you think of ways of figuring out the labels of the test examples, for instance, manually looking at the OCR data to see what digit is represented, or finding these datasets on the web.  Please do not try anything of this kind, as it will ruin the spirit of the assignment.  The test examples should not be used for any purpose other than generating predictions that you then submit.  You should pretend that the test examples will arrive in the future after the classifier has already been built and deployed.

The code that you need to write

Your job is to create one or more Classifier classes implementing your learning algorithm and the generated classifier.  Since we do not intend to do automatic testing of your code, you can do this however you wish.  You also will probably need to write some kind of code to carry out your systematic experiment.

Because we will not be relying on automatic testing, we ask that you make an extra effort to document your code well to make it as readable as possible.

What to turn in

For the purposes of submitting to moodle, we have divided this assignment into two parts.

On moodle, under the assignment called "A7: Machine Learning (test predictions)", you should turn in the following:

  • For each of the four datasets, a file called <stem>.testout generated by printTestPredictions and containing author, algorithm description and predictions on all test examples.  These should be submitted (and the "Run Script" button pushed) as early as possible so that others can compare their results to yours.  At the latest, you (or your partner, if working as a pair) should submit a first round of test predictions by Thursday, January 8 (although you can continue to submit more rounds of test predictions after this date).  Keep in mind that you may not submit more than three sets of predictions per student and per dataset.  (Moodle will automatically prevent you from doing so.)

In addition, under the assignment called "A7: Machine Learning (code)", you should turn in the following:

  • Any java code that you wrote in a form that will compile and run, should the TA's wish to try it out.
  • A readme.txt file explaining briefly how your code is organized, what data structures you are using, or anything else that will help the TA's understand how your code works overall.

Finally, your program report should be submitted in hard copy as described on the assignments page.

If you are working with a partner, the two of you together only need to submit your code once, and you only need to prepare and turn in a single written report.  Be sure that it is clear who your partner is.  In all cases, the written exercises should be completed and turned in individually.

You do not need to submit any code or anything in writing by Thursday, January 8.  The only thing you need to submit by that date is a single round of predictions on each of the test sets.  The reason this part is due before the rest of the assignment is so that you will have time to compare your results to those of your fellow classmates when you write up your report.  You can continue to submit test predictions (up to three rounds, including the one due on January 8), up until the assignment due date (Tuesday, January 13).

What you will be graded on

You will be graded on completing each of the components of this assignment, as described above.  More emphasis will be placed on your report than on the code itself.  You have a great deal of freedom in choosing how much work you want to put into this assignment, and your grade will in part reflect how great a challenge you decide to take on.  Creativity and ingenuity will be one component of your grade.  Here is a rough breakdown, with approximate point values in brackets, of how much each component is worth:

  • [20] The learning algorithm (correct and complete implementation; debugging and testing; adequate documentation).
  • [10] Submitting test predictions via moodle and comparing to results of others.
  • [20] A systematic experiment.
  • [10] The overall presentation of the report (should be clear, concise and well written).
  • [15] The discussion of results (should be thoughtful, perceptive and insightful).
  • [10] Overall creativity, ingenuity and ambitiousness.

As noted above, your grade will not at all depend on how well your algorithm actually worked, provided that its poor performance is not due to an incorrect or incomplete implementation.

Full credit for this assignment will be worth around 85 points.  However, exceptional effort will receive extra credit of 5-20 points.

Algorithms you can choose from

Here is a list of machine learning algorithms you can choose from for the programming assignment.  Most of these are described further in the books requested for reserve at the Engineering Library (see below).  A few additional pointers are also provided below.  Be sure to take note of the R&N errata detailed on the written exercises for this homework.

Decision trees

These were discussed in class, and also in R&N Section 18.3.  To implement them, you will need to decide what splitting criterion to use, when to stop growing the tree and what kind of pruning strategy to use.

AdaBoost

This algorithm was discussed in class, and also in R&N Section 18.4.  AdaBoost is an algorithm that must be combined with another "weak" learning algorithm, so you will need to implement at least one other algorithm (which might work well if you are working as a pair).  Natural candidates for weak learning algorithms include decision stumps or decision trees.  You also will need to decide how the weak learner will make use of the weights D_t on the training examples.  One possibility is to design a weak learner that directly minimizes the weighted training error.  The other option is to select a random subset of the training examples on each round of boosting by resampling according to the distribution D_t.  This means repeatedly selecting examples from the training set, each time selecting example i with probability D_t(i).  This is done "with replacement", meaning that the same example may appear in the selected subset several times.  Typically, this procedure will be repeated N times, where N is the number of training examples.

Finally, you will need to decide on the number of rounds of boosting.  This is usually in the 100's to 1000's.
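
A minimal NumPy sketch of the resampling option (illustrative; weak_learner is a hypothetical stand-in):

    import numpy as np

    def resample_by_weights(X, y, D_t, rng):
        """Select N examples with replacement, choosing example i with
        probability D_t[i], so an unweighted weak learner can be used."""
        N = len(X)
        idx = rng.choice(N, size=N, replace=True, p=D_t)
        return X[idx], y[idx]

    # Inside one boosting round, something like:
    #   Xs, ys = resample_by_weights(X_train, y_train, D_t, np.random.default_rng(0))
    #   h_t = weak_learner.fit(Xs, ys)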

See also this overview paper, or this survey.

Support-vector machines (SVM's)

This algorithm was discussed in class, and also in R&N Section 20.6.  Even so, we did not describe specific algorithms for implementing it, so if you are interested, you will need to do some background reading.  One of the books on reserve is all about kernel machines (including SVM's).  You can also have a look at this tutorial paper, as well as the references therein and some of the other resources and tutorials at www.kernel-machines.org.  The SMO algorithm is a favorite technique for computing SVM's.

Since implementing SVM's can be pretty challenging, you might instead want to implement the very simple (voted) perceptron algorithm, another "large margin" classifier described below which also can be combined with the kernel trick, and whose performance is substantially similar to that of SVM's.

Neural networks

This algorithm was discussed in class, and also in R&N Section 20.5 in considerable detail.  You will need to choose an architecture, and you have a great deal of freedom in doing so.  You also will need to choose a value for the "learning rate" parameter, and you will need to decide how long to train the network for.  You might want to implement just a single-layer neural network, or you might want to experiment with larger multi-layer networks.  You can also try maximizing likelihood rather than minimizing the sum of squared errors as described in R&N Eq. (20.13) and the surrounding text.  It can be proved that taking this approach with a single-layer net has the important advantage that gradient ascent can never get stuck in a local maximum.  (This amounts to a tried and true statistical method called logistic regression.)

Naive Bayes

We discussed this algorithm in class much earlier in the semester, but it can be used as a very simple algorithm for classification learning.  It is described in R&N at the very end of Section 13.6, and also in the middle of Section 20.2.  Although simple, and although the naive independence assumptions underlying it are usually wrong, this algorithm often works better than expected.  In estimating probabilities, you will probably want to work with log probabilities and use Laplace ("add-one") smoothing as in HW#5.  This algorithm works best with discrete attributes.
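
A minimal NumPy sketch of training with log probabilities and add-one smoothing, assuming integer-coded discrete attributes and 0/1 labels (illustrative only):

    import numpy as np

    def train_naive_bayes(X, y, n_values):
        """X: (N x d) integer-coded discrete attributes; y: 0/1 labels;
        n_values[j]: number of possible values of attribute j."""
        N, d = X.shape
        log_prior = np.log(np.bincount(y, minlength=2) / N)
        # log_cond[c][j][v] = log P(attribute j = v | class c), add-one smoothed
        log_cond = [[np.log((np.bincount(X[y == c, j], minlength=n_values[j]) + 1)
                            / ((y == c).sum() + n_values[j]))
                     for j in range(d)] for c in (0, 1)]
        return log_prior, log_cond

    def predict(x, log_prior, log_cond):
        scores = [log_prior[c] + sum(log_cond[c][j][v] for j, v in enumerate(x))
                  for c in (0, 1)]
        return int(np.argmax(scores))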

Decision stumps

This is probably the simplest classifier on this list (and the least challenging to implement).  They are briefly touched upon in R&N Section 18.4.  A decision stump is a decision tree consisting of just a single test node.  Given data, it is straightforward to search through all possible choices for the test node to build the decision stump with minimum training error.  These make good, truly weak, weak hypotheses for AdaBoost.
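
A minimal NumPy sketch of that exhaustive search, assuming 0/1 attributes and 0/1 labels, with each leaf predicting the majority label of the examples reaching it (illustrative only):

    import numpy as np

    def best_stump(X, y):
        """Return ((attribute, predict_if_1, predict_if_0), training error)."""
        best_err, best = np.inf, None
        for j in range(X.shape[1]):
            mask = X[:, j] == 1
            p1 = int(y[mask].mean() >= 0.5) if mask.any() else 0
            p0 = int(y[~mask].mean() >= 0.5) if (~mask).any() else 0
            err = np.mean(np.where(mask, p1, p0) != y)
            if err < best_err:
                best_err, best = err, (j, p1, p0)
        return best, best_err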

Nearest neighbors

We did not discuss this algorithm in detail in class, but it is discussed in R&N Section 20.4.  The idea is simple: during training, all we do is store the entire training set.  Then given a test example, we find the training example that is closest to it, and predict that the label of the test example is the same as the label of its closest neighbor.  As described, this is the 1-nearest neighbor algorithm.  In the k -nearest neighbor algorithm, we find the k closest training examples and predict with the majority vote of their labels.  In either case, it is necessary to choose a distance function for measuring the distance between examples.
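
A minimal NumPy sketch of the k-nearest neighbor prediction rule, with Euclidean distance as the choice of distance function (illustrative only):

    import numpy as np

    def knn_predict(X_train, y_train, x, k):
        """Predict the label of x by majority vote of its k nearest
        training examples under Euclidean distance."""
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argpartition(d, k)[:k]            # indices of the k closest
        return np.bincount(y_train[nearest]).argmax()  # majority label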

(Voted) perceptron algorithm

The perceptron algorithm (not to be confused with the algorithm given in R&N Figure 20.21) is one of the oldest learning algorithms, and also a very simple one.  Like SVM's, the algorithm's purpose is to learn a separating hyperplane defined by a weight vector w.  Starting with an initial guess for w, the algorithm proceeds to cycle through the examples in the training set.  If example x is on the "correct" side of the hyperplane defined by w, then no action is taken.  Otherwise, y·x is added to w.  The algorithm has some nice theoretical properties, and can be combined with the kernel trick.  Also, there is a version of the algorithm in which the average of all of the weight vectors computed along the way is used to define the final weight vector of the output hypothesis.  All this is described in this paper.  [Unfortunately, this paper has some annoying typos in Figure 1.  The initialization line should read: "Initialize: k := 1, v_1 := 0, c_1 := 0."  Also, the line that reads "If ŷ = y then..." should instead read "If ŷ = y_i then..."]
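
A minimal NumPy sketch of the update rule, including the simple averaged variant (illustrative; assumes labels in {-1, +1}):

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        """Standard perceptron updates; returns the final weight vector and
        the average of all weight vectors computed along the way."""
        w = np.zeros(X.shape[1])
        w_sum, n_steps = np.zeros_like(w), 0
        for _ in range(epochs):
            for x, label in zip(X, y):
                if label * (w @ x) <= 0:    # x is on the wrong side of the hyperplane
                    w = w + label * x       # add y*x to w
                w_sum += w
                n_steps += 1
        return w, w_sum / n_steps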

Bagging

This is an "ensemble" method similar to boosting, somewhat simpler though not quite as effective overall.  As in boosting, we assume access to a "weak" or base learning algorithm.  This base learner is run repeatedly on different subsets of the training set.  Each subset is chosen by selecting N of the training examples with replacement from the training set, where N is the number of training examples.  This means that we select one of the training examples at random, then another, then another and so on N times.  Each time, however, we are selecting from the entire training set, so that some examples will appear more than once, and some won't appear at all.  The base learner is trained on this subset, and the entire procedure is repeated some number of times (usually around 100).  These "weak" or base hypotheses are then combined into a single hypothesis by a simple majority vote.  For more detail, see this paper.  This algorithm works best with an algorithm like decision trees as the base learner.

If you are interested in implementing some other algorithm not listed here, please contact me first.

Books on reserve at the Engineering Library

  • Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone. Classification and regression trees. Wadsworth, 1984. 
  • J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
  • Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000. 
  • Christopher M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

Bloomberg ML EDU presents:

Foundations of Machine Learning

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.

The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist. The course includes a complete set of homework assignments, each containing a theoretical element and an implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.

Please fill out this short online form to register for access to our course's Piazza discussion board. Applications are processed manually, so please be patient. You should receive an email directly from Piazza when you are registered. Common questions from this and previous editions of the course are posted in our FAQ.

The first lecture, Black Box Machine Learning, gives a quick start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Highlights and Distinctive Features of the Course Lectures, Notes, and Assignments

  • Geometric explanation for what happens with ridge, lasso, and elastic net regression in the case of correlated random variables.
  • Investigation of when the penalty (Tikhonov) and constraint (Ivanov) forms of regularization are equivalent.
  • Concise summary of what we really learn about SVMs from Lagrangian duality.
  • Proof of representer theorem with simple linear algebra, emphasizing it as a way to reparametrize certain objective functions.
  • Guided derivation of the math behind the classic diamond/circle/ellipsoids picture that "explains" why L1 regularization gives sparsity (Homework 2, Problem 5)
  • From-scratch (in numpy) implementation of almost all major ML algorithms we discuss: ridge regression with SGD and GD (Homework 1, Problems 2.5, 2.6, page 4), lasso regression with the shooting algorithm (Homework 2, Problem 3, page 4), kernel ridge regression (Homework 4, Problem 3, page 2), kernelized SVM with Kernelized Pegasos (Homework 4, 6.4, page 9), L2-regularized logistic regression (Homework 5, Problem 3.3, page 4), Bayesian Linear Regression (Homework 5, Problem 5, page 6), multiclass SVM (Homework 6, Problem 4.2, p. 3), classification and regression trees (without pruning) (Homework 6, Problem 6), gradient boosting with trees for classification and regression (Homework 6, Problem 8), multilayer perceptron for regression (Homework 7, Problem 4, page 3)
  • Repeated use of a simple 1-dimensional regression dataset, so it's easy to visualize the effect of various hypothesis spaces and regularizations that we investigate throughout the course.
  • Investigation of how to derive a conditional probability estimate from a predicted score for various loss functions, and why it's not so straightforward for the hinge loss (i.e. the SVM) (Homework 5, Problem 2, page 1)
  • Discussion of numerical overflow issues and the log-sum-exp trick (Homework 5, Problem 3.2); see the sketch after this list
  • Self-contained introduction to the expectation maximization (EM) algorithm for latent variable models.
  • Develop a general computation graph framework from scratch, using numpy, and implement your neural networks in it.
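
For reference, the log-sum-exp trick itself is tiny; a minimal NumPy sketch (not the homework's support code):

    import numpy as np

    def logsumexp(a):
        """Compute log(sum(exp(a))) stably by factoring out m = max(a):
        log sum_i exp(a_i) = m + log sum_i exp(a_i - m)."""
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    a = np.array([1000.0, 1001.0, 1002.0])
    print(logsumexp(a))   # ~1002.41, while np.log(np.sum(np.exp(a))) overflows to inf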

Prerequisites

The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment, which is a preview of some of the math concepts that show up in the first part of the course.

  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • (HTF) refers to Hastie, Tibshirani, and Friedman's book The Elements of Statistical Learning
  • (SSBD) refers to Shalev-Shwartz and Ben-David's book Understanding Machine Learning: From Theory to Algorithms
  • (JWHT) refers to James, Witten, Hastie, and Tibshirani's book An Introduction to Statistical Learning

Assignments

GD, SGD, and Ridge Regression

Lasso Regression

SVM and Sentiment Analysis

Kernel Methods

Probabilistic Modeling

Multiclass, Trees, and Gradient Boosting

Computation Graphs, Backpropagation, and Neural Networks


Other tutorials and references

  • Carlos Fernandez-Granda's lecture notes provide a comprehensive review of the prerequisite material in linear algebra, probability, statistics, and optimization.
  • Brian Dalessandro's iPython notebooks from DS-GA-1001: Intro to Data Science
  • The Matrix Cookbook has lots of facts and identities about matrices and certain probability distributions.
  • Stanford CS229: "Review of Probability Theory"
  • Stanford CS229: "Linear Algebra Review and Reference"
  • Math for Machine Learning by Hal Daumé III


David S. Rosenberg


Stanford University

Stanford Engineering Everywhere: CS229 Machine Learning

Course Description

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs; VC theory; large margins); reinforcement learning and adaptive control. The course will also discuss recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.

Students are expected to have the following background:

  • Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program.
  • Familiarity with basic probability theory. (Stat 116 is sufficient but not necessary.)
  • Familiarity with basic linear algebra. (Any one of Math 51, Math 103, Math 113, or CS 205 would be much more than necessary.)

  • DOWNLOAD All Course Materials


Ng also works on machine learning algorithms for robotic control, in which rather than relying on months of human hand-engineering to design a controller, a robot instead learns automatically how best to control itself. Using this approach, Ng's group has developed by far the most advanced autonomous helicopter controller, which is capable of flying spectacular aerobatic maneuvers that even experienced human pilots often find extremely difficult to execute. As part of this work, Ng's group also developed algorithms that can take a single image, and turn the picture into a 3-D model that one can fly through and see from different angles.


Machine Learning Homework Assignments


All homework assignments consist of two parts, a written section (due Tuesdays) and a programming section (due Thursdays). The instructions for both sections are included in the assignment zip files.

Programming assignments will be distributed through svn. See the zip file for additional instructions.

All assignments are due at 11:59 AM Central Time (just before noon). See the syllabus for a more detailed schedule of due dates.

Assignments

  • Assignment 1: Introduction + Python — Design by Colin, Review by Yucheng
  • Assignment 2: Linear Regression — Design by Raymond, Review by Jyoti
  • Assignment 3: Binary Classification — Design by Youjie, Review by Jyoti
  • Assignment 4: Support Vector Machine — Design by Raymond, Review by Ishan
  • Assignment 5: Multiclass Classification — Design by Yucheng, Review by Safa
  • Assignment 6: Deep Neural Networks — Design by Safa, Review by Yuan-Ting
  • Assignment 7: Structured Prediction — Design by Colin, Review by Yucheng
  • Assignment 8: k-Means — Design by Jyoti, Review by Youjie
  • Assignment 9: Gaussian Mixture Models — Design by Ishan, Review by Colin
  • Assignment 10: Variational Autoencoder — Design by Yuan-Ting, Review by Raymond
  • Assignment 11: Generative Adversarial Network — Design by Ishan, Review by Yuan-Ting
  • Assignment 12: Q-learning — Design by Safa, Review by Youjie

How to submit

Written assignments are submitted through Gradescope (self-enrollment code 96P5BZ). When submitting, you will be asked to assign each problem to one or multiple pages in your solution. Make sure you link each problem to the corresponding pages when submitting; otherwise your solution may not be graded.

Programming Section:

Programming assignments will be distributed in an svn repository. We will grade the program that is checked into svn at 11:59 AM Central Time on the due date. Any updates after the deadline will not be graded.

Please take note of running time when implementing your solution; we will terminate the autograder if a solution runs more than 5 times slower than our implementation.

Additionally, you are required to follow the pycodestyle coding convention. Failure to follow the style guide may result in deduction of points for the programming section.

We will report grades for both the written section and the programming section through Compass 2G.

Academic Honesty

  • Feel free to discuss the assignments at the concept level with other students, but do not discuss specifics.
  • All solutions should be written individually.
  • Do not show other students your homework. (This includes Piazza. Do not post partial code or solutions.)
  • Be sure to acknowledge references you used.
  • Copying from other students or online sources, or letting other students copy your work, will result in a 0 for the assignment. A second instance of cheating will result in a grade of F for the entire course.

Late Homework Policy

Homework will not be accepted after the due date. The lowest scoring homework will be dropped.

Helpful Resources

  • Python: Python Tutorial
  • NumPy: NumPy User Guide
  • NumPy: NumPy for Matlab users
  • SVN: SVN Tutorial

Machine Learning DS-GA 1003 · Spring 2019 · NYU Center for Data Science

About this course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build. A tentative syllabus can be found here.

This course was designed as part of the core curriculum for the Center for Data Science's Masters degree in Data Science. Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA-1001 Intro to Data Science, which covers some important, fundamental data science topics that may not be explicitly covered in this DS-GA class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza, where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers. An anonymized version of our Piazza board is also available.

Other information:

  • Course details can be found in the syllabus.
  • The Course Calendar contains all class meeting dates.
  • All course materials are stored in a GitHub repository. Check the repository to see when something was last updated.
  • For registration information, please contact Kathryn Angeles.
  • The course conforms to NYU’s policy on academic integrity for students.

Prerequisites

  • DS-GA-1001: Intro to Data Science or its equivalent
  • DS-GA-1002: Statistical and Mathematical Methods or its equivalent
  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate calculus (primarily differential calculus), probability theory, and statistics. (The coverage in the 2015 version of DS-GA 1002, linked above, is sufficient.)
  • Python programming required for most homework assignments.
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • Recommended: At least one advanced, proof-based mathematics course
  • Some prerequisites may be waived with permission of the instructor
  • You can also self-assess your preparation by filling out the Prerequisite Questionnaire

Grading

Homework (40%) + Midterm Exam (30%) + Final Exam (30%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.

Important Dates

  • Midterm Exam (100 min) Tuesday, March 12th, 5:20–7pm.
  • Final Exam (100 min) Thursday, May 16th, 6-7:50pm (confirmed).
  • See Assignments section for homework-related deadlines.


Assignments

Late Policy: Homeworks are due at 11:59 PM on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the names of any person with whom you discussed the problem—this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope. If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

  • Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  • Where homework assignments are divided into sections, please begin each section on a new page.
  • You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope to get your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYU Classes.

Typesetting your homework

Due: January 1st, 11:59 PM

Due: February 9th, 11:59 PM

Due: February 18th, 11:59 PM

Due: February 25th, 11:59 PM

Due: March 8th, 11:59 PM

Due: April 5th, 11:59 PM

Due: April 29th, 11:59 PM

Due: May 10th, 11:59 PM

Instructors


Julia Kempe


Julia is the Director of the NYU Center for Data Science (CDS). She is a professor of Computer Science and Mathematics at CDS and the NYU Courant Institute.


David Rosenberg


David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP.

Section Leaders


Sreyas Mohan (Head TA)


Sreyas is a second year PhD student in the Data Science Program at CDS working with Prof. Carlos Fernandez-Granda and Prof. Eero Simoncelli.


Xintian Han


Xintian is a second year PhD student in the Data Science Program at CDS working with Prof. Rajesh Ranganath.


Sanyam Kapur (Head Grader)


Sanyam is a Masters Student in Computer Science at NYU Courant. He is currently working towards improving Markov Chain Monte Carlo methods.


Aakash Kaku

Aakash is a second-year Masters student in the Data Science program at NYU. He is interested in solving problems in the healthcare domain using machine learning.


Mingsi Long

Mingsi is a second year student in the Data Science Program at NYU CDS.

Mihir Rana

Mihir is a Master's student in Data Science at the NYU Center for Data Science, interested in computer vision, reinforcement learning, and natural language understanding.


Tingyan Xiang

Tingyan is a second-year Masters student in the Data Science program at NYU.

Yi Zhou

Yi is a second-year student in the CS department at NYU Tandon.

Introduction to Machine Learning


Course overview

This class is an introductory undergraduate course in machine learning. The class will briefly cover topics in regression, classification, mixture models, neural networks, deep learning, ensemble methods and reinforcement learning.

Prerequisites: You should understand basic probability and statistics (STA 107, 250), and college-level algebra and calculus. For example, it is expected that you know about standard probability distributions (Gaussians, Poisson), and also how to calculate derivatives. Knowledge of linear algebra is also expected, and knowledge of the mathematics underlying probability models (STA 255, 261) will be useful. For the programming assignments, you should have some background in programming (CSC 270), and it would be helpful if you know Matlab or Python. Some introductory material for Matlab will be available on the course website as well as in the first tutorial.


  • October 6th: New PDF and code for Assignment 1 with fixes to the problems. Due date Oct. 19 at noon.
  • October 2nd: Assignment 1 posted at the bottom of the page.
  • September 27th: Don't forget to complete your form at https://docs.google.com/forms/d/1O6xRNnKp87GrDM74tkvOMhMIJmwz271TgWdYb6ZitK0/viewform?usp=send_form
  • September 27th: The slide on precision and recall in lecture 3 has been corrected.
  • September 18th: Piazza for this course can be found at https://piazza.com/utoronto.ca/fall2015/csc411/home
  • August 25th: Creation of webpage.

Course information

Lectures: Monday, Wednesday 12-1 (section 1), 3-4 (section 2), Thursday 6-8 (section 3)

Lecture Room: MP134 (section 1), SS2106 (section 2), BA1200 (section 3)

Instructor: Raquel Urtasun (section 1 and 2), Ruslan Salakhutdinov (section 3)

Tutorials: Friday 12-1 (section 1), 3-4 (section 2), Thursday 8-9 (section 3)

Tutorial Room: MP134 (section 1), SS2106 (section 2), BA1200 (section 3)

Office Hours: Raquel Urtasun: Monday 4:10-5:40, Pratt Building, Room 290E. Ruslan Salakhutdinov: Thursdays 1-2pm, Pratt Building, Room 290F. Additionally, you can also ask questions about the course to the CSC2515 instructor, Rich Zemel: Thursday 4-5, Pratt Building, Room 290D.

Requirements

The format of the class will be lecture, with some discussion. We strongly encourage interaction and questions. There are assigned readings for each lecture that are intended to prepare you to participate in the class discussion for that day.

Detailed Requirements

Homework assignments.

Collaboration on the assignments is not allowed. Each student is responsible for his or her own work. Discussion of assignments and programs should be limited to clarification of the handout itself, and should not involve any sharing of pseudocode or code or simulation results. Violation of this policy is grounds for a semester grade of F, in accordance with university regulations.

The schedule of assignments is included in the syllabus. Assignments are due at the beginning of class/tutorial on the due date. Because they may be discussed in class that day, it is important that you have completed them by that day. Assignments handed in late but before 5 pm of that day will be penalized by 5% (i.e., total points multiplied by 0.95); a late penalty of 10% per day will be assessed thereafter. Extensions will be granted only in special situations, and you will need a Student Medical Certificate or a written request approved by the instructor at least one week before the due date.

There will be a midterm in class (date TBA), which will be a closed-book exam on all material covered up to that point in the lectures, tutorials, required readings, and assignments.

The final will not be cumulative, except insofar as concepts from the first half of the semester are essential for understanding the later material.

We expect students to attend all classes and all tutorials. This is especially important because we will cover material in class that is not included in the textbook. Also, the tutorials will be used not only for review and answering questions; new material will be covered as well.

There is no required textbook for this course. There are several recommended books. We will recommend specific chapters from two books: Introduction to Machine Learning by Ethem Alpaydin, and Pattern Recognition and Machine Learning by Chris Bishop. We will also recommend other readings.


Tentative Syllabus

Schedule (tentative), assignments.

  • Assignment 1: document and code. Due date Oct. 19 at noon.
  • Assignment 2: document and code. Due date Nov. 16 at noon.
  • Assignment 3: document. Due date Dec. 3 at midnight.


This repository contains links to machine learning exams, homework assignments, and exercises that can help you test your understanding.

fatosmorina/machine-learning-exams

Carnegie Mellon University (CMU)

  • The fall 2009 10-601 midterm ( midterm and solutions )
  • The spring 2009 10-601 midterm ( midterm and solutions )
  • The fall 2010 10-601 midterm ( midterm , solution )
  • The 2001 10-701 midterm ( midterm ,  solutions )
  • The 2002 10-701 midterm ( midterm ,  solutions )
  • The 2003 10-701 midterm ( midterm ,  solutions )
  • The 2004 10-701 midterm ( midterm ,  solutions )
  • The 2005 spring 10-701 midterm ( midterm and solutions )
  • The 2005 fall 10-701 midterm ( midterm and solutions )
  • The 2006 fall 10-701 midterm ( midterm ,  solutions )
  • The 2007 spring 10-701 midterm ( midterm ,  solutions )
  • The 2008 spring 10-701 midterm ( midterm and solutions )
  • Additional midterm examples ( questions ,  solutions )
  • The 2001 final ( final ,  solutions )
  • The 2002 final ( final with some figs missing ,  solutions )
  • The 2003 final ( final ,  solutions )
  • The 2004 final ( solutions )
  • The 2006 fall final ( final ,  solutions )
  • The 2007 spring final ( final ,  solutions )
  • The 2008 fall final ( final ,  solutions )
  • The 2009 spring 701  midterm ,  final
  • The 2010 fall 601  midterm
  • The 2012 fall 601  midterm
  • The 2012 spring 701  final
  • May 2015 final
  • March 2015 midterm
  • The 2012 fall midterm
  • The 2015 fall 701 midterm ,  solutions
  • The 2011 fall midterm
  • The spring 2014 midterm
  • The spring 2013 final
  • The 2007 10-701 spring ( final ,  solutions )
  • The 2008 10-701 fall ( final ,  solutions )
  • The 2012 10-701 spring final (  final and solutions )

Stanford University

CS230 Deep Learning

  • Fall Quarter 2018 Midterm Exam Solution
  • Spring Quarter 2018 Midterm Exam Solution
  • Winter Quarter 2018 Midterm Exam Solution
  • Fall Quarter 2019 Midterm Exam Solution
  • Winter Quarter 2019 Midterm Exam Solution
  • Fall Quarter 2020 Midterm Exam Solution
  • Winter Quarter 2020 Midterm Exam Solution
  • Winter Quarter 2021 Midterm Exam Solution
  • Spring Quarter 2021 Midterm Exam Solution

CS224N Natural Language Processing with Deep Learning

  • Winter 2017 Midterm Exam

Introduction to Machine Learning

  • The 2020 final exam , solutions

University of Texas

Machine Learning

  • [Midterm](https://www.cs.utexas.edu/~dana/MLClass/practice-midterm-2.pdf)

University of Toronto

Neural Networks and Deep Learning

  • 2019 Midterm

Technical University of Munich

IN2346: Introduction to Deep Learning (I2DL)

  • 2020 Mock exam , solutions

University of Pennsylvania

CIS 520 Machine Learning

  • 2016 Midterm Exam

University of Washington

10-701 Machine Learning

  • 2007 Autumn Midterm:  [Exam]   [Solutions]
  • 2009 Autumn Midterm:  [Exam]
  • 2013 Spring Midterm (CSE446):  [Exam]
  • 2013 Autumn Midterm:  [Exam]
  • 2013 Autumn Final:  [Exam]
  • 2014 Autumn Midterm:  [Exam]   [Solutions]

University of Edinburgh

Machine Learning and Pattern Recognition (MLPR) Tutorials, Autumn 2018

  • Tutorial 1, week 3,  html ,  pdf
  • Tutorial 2, week 4,  html ,  pdf
  • Tutorial 3, week 5,  html ,  pdf
  • Tutorial 4, week 6,  html ,  pdf
  • Tutorial 5, week 7,  html ,  pdf
  • Tutorial 6, week 8,  html ,  pdf
  • Tutorial 7, week 9,  html ,  pdf

Contributions

  • Spread the word
  • Open pull requests with improvements and new links


CS 4/5780: Introduction to Machine Learning

Instructors: Anil Damle and Kilian Weinberger

Contact: [email protected] and [email protected]

Office hours: Anil (typically Monday 3:15 pm - 4:15 pm and Wednesday 10:30 am - 11:30 am) and Kilian

Lectures: Tuesdays and Thursdays from 11:25 am till 12:40 pm in Statler Hall 185 (Statler Auditorium). Lecture recordings are available on Canvas

Course overview: The course provides an introduction to machine learning, focusing on supervised learning and its theoretical foundations. Topics include regularized linear models, boosting, kernels, deep networks, generative models, online learning, and ethical questions arising in ML applications.

Prerequisites: probability theory (e.g. BTRY 3080, ECON 3130, MATH 4710, ENGRD 2700), linear algebra (e.g. MATH 2940), calculus (e.g. MATH 1920), and programming proficiency (e.g. CS 2110).

Course Staff

Office Hours: Calendar link

For enrolled students the companion Canvas page serves as a hub for access to Ed Discussions (the course forum), Vocareum (for course projects), Gradescope (for HWs), and quizzes (for the placement exam and paper comprehension quizzes). If you are enrolled in the course you should automatically have access to the site. Please let us know if you are unable to access it.

News and important dates

  • November 9 — Homework 5 due
  • November 9 — Project 5 due

Homework, projects, and exams

Your grade in this course comprises three components: homework, exams, and projects. Please also read through the given references in concert with the lectures.

Students enrolled in this course at the graduate level (i.e., enrolled in 5780) are required to read assigned research papers and complete the associated online quiz. Papers will be assigned roughly once every one to two weeks.

There will be two exams for this class, an evening prelim and a final exam.

  • Midterm: 7:30 pm on Thursday, October 21 in Statler 185 and 196
  • Final: 2:00 pm on Sunday, December 12

Final grades are based on homework assignments, programming projects, and the exams. For the 5780 level version of the course the research comprehension quizzes will also factor in.

  • Projects: 40%
  • Homework: 10%
  • Projects: 35%
  • Paper comprehension: 10%

Regardless of which grading scheme you are targeting (or ends up being the maximizer), homework must be completed. Homework (starting with homework 2) will be graded for correctness, and these scores will be used to compute your overall homework grade. Provided you make a good faith effort (as specified in class) on the homework, it does not factor into your final grade under the first scheme above. However, failure to provide a good faith effort on any homework assignment will result in a 5% penalty per missing assignment.

Undergraduates enrolled in 4780 may choose to do the paper comprehension assignments; if completed you will receive the higher of your two grades between the above schemes.

A tentative schedule follows, and includes the topics we will be covering and relevant reference material (More details are given in the references section).

Core references

While this course does not explicitly follow a specific textbook, there are several that are very useful references to supplement the course.

  • Machine Learning: A Probabilistic Perspective by Murphy. We will provide section numbers to this text alongside many of the lectures (abbreviated as MLaPP in the schedule). This text is available digitally through the Cornell University Library. [ Cornell library ]
  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. This text provides a comprehensive introduction to statistical learning and provides in-depth discussion of many of the topics in this course (abbreviated as ESL in the schedule). The book is available directly from the authors. [ Book website ]

Additional references

  • An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. This book provides a good overview of some methods in statistical learning, some of which we will discuss. The book is available online through the book's website and via the Cornell Library. [ Book website ]
  • Patterns, Predictions, and Actions by Hardt and Recht. A very nice new book that covers many of the topics we do in this class (abbreviated as PPA in the schedule). The book is available directly from the authors. [ Book website ]
  • Fairness and Machine Learning by Barocas, Hardt, and Narayanan. While a work in progress, this text provides insight into fairness as a central tenet of machine learning. In particular, it highlights ethical challenges that arise in the practice of machine learning. The current version of this book is available directly from the authors. [ Book website ]

Background references

  • Linear Algebra by Khan Academy. Relive the basics of linear algebra. Everybody loves Khan Academy. [ Linear Algebra (Khan Academy) ]
  • Linear algebra course by Strang. Portions of this course will utilize your knowledge of linear algebra. If you feel you need additional preparation, or would like to revisit the topic, you may find Gilbert Strang's linear algebra course quite useful. [ MIT Open Courseware ]
  • Matrix Methods in Data Analysis, Signal Processing, and Machine Learning by Strang. A subsequent course to the above by Strang covers some of the same topics we will (particularly for the linear algebra part of the course), and you may find the videos a useful additional resource. [ MIT Open Courseware ]

Course policies

Inclusiveness and COVID-19 considerations

We understand that the ongoing global health pandemic impacts all of you in varied and profound ways. Therefore, flexibility is important as we continue to navigate the current state of affairs. While many aspects of this course are built with flexibility in mind, if situations arise that may require additional accommodations please reach out to the instructors to discuss potential arrangements.

Mental health resources

Cornell University provides a comprehensive set of mental health resources, and the student group Body Positive Cornell has put together a flyer outlining the resources available.

Participation

You are encouraged to actively participate in class. This can take the form of asking questions in class, responding to questions to the class, and actively asking/answering questions on the online discussion board.

Collaboration policy

Students are free to share code and ideas within their stated project/homework group for a given assignment, but should not discuss details about an assignment with individuals outside their group. The midterm and final exam are individual assignments and must be completed by yourself.

Academic integrity

The Cornell Code of Academic Integrity applies to this course.

Accommodations

In compliance with Cornell University policy and equal access laws, we are available to discuss appropriate academic accommodations that may be required for students with disabilities. Requests for academic accommodations are to be made during the first three weeks of the semester, except for unusual circumstances, so arrangements can be made. Students are encouraged to register with Student Disability Services to verify their eligibility for appropriate accommodations.

IntroML-Fall2021

CSC 2515 Fall 2021: Introduction to Machine Learning


Machine learning (ML) is a set of techniques that allow computers to learn from data and experience, rather than requiring humans to specify the desired behaviour manually. ML has become increasingly central both in AI as an academic field, and in industry. This course introduces the main concepts and ideas in machine learning, and provides an overview of many commonly used machine learning algorithms. It also serves as a foundation for more advanced ML courses.

By the end of this course, the students will learn about (roughly categorized)

  • Machine Learning Problems: Supervised (regression and classification), Unsupervised (clustering, dimension reduction), Reinforcement Learning
  • Models: Linear and Nonlinear (Basis Expansion and Neural Networks)
  • Loss functions: Squared Loss, Cross Entropy, Hinge, Exponential, etc.
  • Regularizers: $\ell_1$ and $\ell_2$
  • Probabilistic viewpoint: Maximum Likelihood Estimation, Maximum A Posteriori, Bayesian inference
  • Bias and Variance Tradeoff
  • Ensemble methods: Bagging and Boosting
  • Optimization technique in ML: Gradient Descent and Stochastic Gradient Descent

The students are expected to learn the intuition behind many machine learning algorithms and the mathematics behind them. Through homework assignments, they will also learn how to implement these methods and use them to solve simple machine learning problems. More details can be found in the syllabus (to be posted) and on Piazza.

Announcements:

Teaching staff:

  • Email: [email protected]
  • Office Hour: Thursdays, 1-2PM (on Zoom)
  • Office hours: Check the schedule of homework assignments as they vary. (on Zoom)

Time & Location:

  • Tuesdays, 11AM-1PM: Room MS 2172 + Virtual (virtual preferred, to reduce the chance of COVID)
  • Thursdays, 2-3PM: Virtual

Suggested Reading

No required textbooks. Suggested reading will be posted after each lecture (See lectures below).

  • (ESL) Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2009.
  • (PRML) Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006.
  • (RLIntro) Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2018.
  • (LNRL) Amir-massoud Farahmand, Lecture Notes on Reinforcement Learning, 2021. (draft)
  • (DL) Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, 2016.
  • (MLPP) Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, 2013.
  • (ISL) Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, Introduction to Statistical Learning, 2017.
  • (SSBN) Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, 2014.
  • (ITIL) David MacKay, Information Theory, Inference, and Learning Algorithms, 2003.

This schedule is very tentative, especially the parts related to the tutorials, and it will change.

Note on videos: The videos will be publicly available on YouTube . If you don’t feel comfortable being recorded, make sure to turn off your camera when asking questions (though I really prefer to see all your faces when presenting a lecture).

Note on slides: Many have contributed to the design of these slides. Credit goes to many members of the ML Group at the U of T, and beyond, including (recent past, as far as I know): Roger Grosse, Murat Erdogdu, Richard Zemel, Juan Felipe Carrasquilla, Emad Andrews, and myself.

Assignments and Coursework

These are the main components of the course. The details are described below. You need to use MarkUs to submit your solutions.

Note: This is tentative

  • Four homework assignments (40%): 10% each
  • Take-home Test (10%): Release date: Dec 4, Due date: Dec 6
  • Project (30%)
  • Reading Assignments (10%): Due date: Dec 10
  • Questions & Answers (10%)
  • Bonus (5%): Finding typos in the slides, active class participation, evaluating the class, etc.

Homework Assignments

This is a tentative schedule of the homework assignments. We plan to release them on Tuesday evenings, and they will be due in 10 days (the Monday two weeks after), but if we haven’t covered the topic of the homework yet, we will postpone it accordingly.

Research Project

Read the instructions carefully!

  • Proposal (5%): Nov 15
  • Project Report (25%): Dec 13

Note: If there are too many teams (more than 15-20) or we are late on the main content of the class, we may skip the presentation and add this 5% to the Project Report (making it 25%). Update: Since the number of students is very large, there will not be a presentation.

Reading Assignments

The following papers are a combination of seminal papers in ML, topics that we didn’t cover in lectures, and active research areas. You need to choose five (5) of these papers, depending on your interests. We will post the papers as the course progresses (so check here often). Please read them and try to understand them as much as possible. It is not important that you completely understand a paper or go into the details of the proofs (if any), but you should put some effort into it.

After reading each paper:

  • You should summarize it in a short paragraph (100-200 words). Highlight the main points of the paper. Ignore the less interesting aspects.
  • Try to come up with one or two suggestions on how the method/idea described in the paper can be used or extended.

Note that this is an incomplete and biased list. I have many favourite papers that are not included in this short list.

  • Ming Yuan and Yi Lin, “Model selection and estimation in regression with grouped variables,” Journal of Royal Statistical Society (B), 2006. PDF
  • Ali Rahimi and Benjamin Recht, “Random Features for Large-Scale Kernel Machines,” NIPS, 2007. PDF
  • Leon Bottou and Olivier Bousquet, “The Tradeoffs of Large Scale Learning,” Advances in Neural Information Processing Systems (NeurIPS), 2007. PDF
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems (NeurIPS), 2012. PDF
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition,” CVPR, 2016. PDF
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al., “Human-level control through deep reinforcement learning,” Nature, 2015. PDF
  • Martin Zinkevich, “Online Convex Programming and Generalized Infinitesimal Gradient Ascent,” ICML, 2003. PDF
  • Ian Goodfellow, Jonathan Shlens, and Christian Szegedy, “Explaining and Harnessing Adversarial Examples,” ICLR, 2015. PDF
  • Niranjan Srinivas, Andreas Krause, Sham Kakade, Matthias Seeger, “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design,” ICML, 2010. PDF
  • Moritz Hardt, Eric Price, and Nathan Srebro, “Equality of opportunity in supervised learning,” Advances in Neural Information Processing Systems (NeurIPS), 2016. PDF
  • Ronan Collobert and Jason Weston, “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” International Conference on Machine Learning (ICML), 2008. PDF
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning, “GloVe: Global Vectors for Word Representation,” Empirical Methods in Natural Language Processing (EMNLP), 2014. PDF


Questions & Answers

The goal is to encourage you to reflect on the content of each lecture. You do this by writing down one or two questions based on the content of that lecture. You also need to write down your thoughts on the answer for those questions. You do not need to answer them successfully or completely. It is enough to show that you have seriously thought about them. You have until 5PM of the Monday after the lecture is finished to submit your Q&A.

Computing Resources

For the homework assignments, we will use Python, and libraries such as NumPy , SciPy , and scikit-learn . You have two options:

The easiest option is probably to install everything yourself on your own machine.

If you don’t already have Python, install it. We recommend using Anaconda. You can also install Python directly if you know how.

  • Optionally, create a virtual environment for this class and step into it. If you have a conda distribution, run the following commands: conda create --name csc2515, then source activate csc2515
  • Use pip to install the required packages: pip install scipy numpy autograd matplotlib jupyter scikit-learn (note that the PyPI package for sklearn is named scikit-learn)

All the required packages are already installed on the Teaching Labs machines.


List of project topics

Please find a tentative list here. Some more links might be added in the next few days. Feel free to pick a topic outside this list. Remember, the deadline for *choosing* a project (and letting me know, along with partner information) is Monday, 27th March.

Homework problems

We will have a stream of homework problems, following every class. Since this is an advanced graduate level class, solving these problems right after class will (hopefully) help you understand the material better.

Part I: Foundations of Learning Theory

Problem 1.    Consider the problem of classifying points in the two-dimensional plane, i.e., $\mathcal{X} = \mathbb{R}^2$. Suppose that the (unknown) true label of a point $(x, y)$ is given by sign$(x)$ (we define sign$(0) = +1$, for convenience). Suppose the input distribution $\mathcal{D}$ is the uniform distribution over the unit circle centered at the origin.

(a) Consider the hypothesis $h$ as shown in the figure below ($h$ classifies all the points on the right of the line as $+1$ and all the points to the left as $-1$). Compute the risk $L_{\mathcal{D}}(h)$, as a function of $\theta$ (which is, as is standard, given in radians).

[Figure: the hypothesis $h$, a halfspace whose linear boundary makes angle $\theta$ with the true decision boundary $x = 0$.]

(b) Suppose we obtain $1/\theta$ (which is given to be an integer $\ge 2$) training samples (i.e., samples from $\mathcal{D}$, along with their true labels). What is the probability that we find a point whose label is "inconsistent" with $h$? Can you bound this probability by a constant independent of $\theta$?

(c) Give an example of a distribution $\mathcal{D}$ under which $h$ has risk zero.
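A quick Monte Carlo sanity check can make part (a) concrete. The sketch below assumes one natural reading of the (omitted) figure, namely that $h$'s boundary is the true boundary $x = 0$ rotated by $\theta$; under that reading the disagreement region is two wedges of angle $\theta$, so the risk should come out to $\theta/\pi$.

```python
# Monte Carlo check for Problem 1(a). Assumption (ours, since the figure is
# not reproduced): h(x, y) = sign(cos(theta) * x + sin(theta) * y).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
phi = rng.uniform(0.0, 2 * np.pi, size=1_000_000)  # uniform on the unit circle
x, y = np.cos(phi), np.sin(phi)

true = np.where(x >= 0, 1, -1)                      # sign(x), with sign(0) = +1
pred = np.where(np.cos(theta) * x + np.sin(theta) * y >= 0, 1, -1)

print((pred != true).mean(), theta / np.pi)         # both approximately 0.0955
```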

Problem 2.    Suppose $A_1, A_2, \dots, A_n$ are events in a probability space.

(a) Suppose $\Pr[A_i] = \frac{1}{2n}$ for all $i$. Then, show that the probability that none of the $A_i$'s occur is at least $1/2$.

(b) Give a concrete example of events $A_i$ for which $\Pr[A_i] = \frac{1}{n-1}$ for all $i$, and the probability that none of them occur is zero.

(c) Suppose $n \ge 3$, and $\Pr[A_i] = \frac{1}{n-1}$, but the events are all independent . Show that the probability that none of them occur is $\ge 1/8$.

Problem 3.    In our proof of the no-free-lunch theorem, we assumed the algorithm $A$ to be deterministic. Let us now see how to allow randomized algorithms. Let $A$ be a randomized map from set $X$ to set $Y$. Formally, this means that for every $x \in X$, $A(x)$ is a random variable that takes values in $Y$. Suppose $|X| < c |Y|$, for some constant $c<1$.

(a) Show that there exists a $y \in Y$ such that $\max_{x \in X} \Pr[ A(x) = y] \le c$.

(b) Show that this implies that for any distribution $\mathcal{D}$ over $X$, $\Pr_{x \sim \mathcal{D}} [A(x) = y] \le c$ (for the $y$ shown to exist in part (a)).

Problem 4.    Recall that the VC dimension of a hypothesis class $\mathcal{H}$ is the size of the largest set that it can "shatter".

(a) Consider the task of classifying points on a 2D plane, and let $\mathcal{H}$ be the class of axis parallel rectangles (points inside the rectangle are "+" and points outside are "-"). Prove that the VC dimension of $\mathcal{H}$ is 4.

(b) This time, let $\mathcal{X} = \mathbb{R}^d \setminus \{0\}$ (origin excluded), and let $\mathcal{H}$ be the set of all hyperplanes through the origin (points on one side are "+" and the other side are "-"). Prove that the VC dimension of $\mathcal{H}$ is $\le d$. (HINT: consider any set of $d+1$ points. They need to be linearly dependent. Now, could it happen that $u, v$ are "+", but $\alpha u + \beta v$ is "-" for $\alpha, \beta \ge 0$? Can you generalize this?)

(c) (BONUS) Let $\mathcal{X}$ be the points on the real line, and let $\mathcal{H}$ be the class of hypotheses of the form $\text{sign}(p(x))$, where $p(x)$ is a polynomial of degree at most $d$ (for convenience, define sign$(0) = +1$). Prove that the VC dimension of this class is $d+1$. (HINT: the tricky part is the upper bound. Here, suppose $d=2$, and suppose we consider any four points $x_1 < x_2 < x_3 < x_4$. Can the sign pattern $+,-,+,-$ arise from a degree $2$ polynomial?)

Problem 5.    In the examples above (and in general), a good rule of thumb for the VC dimension of a function class is the number of parameters involved in defining a function in that class. However, this is not universally true, as illustrated in this problem: let $\mathcal{X}$ be the points on the real line, and define $\mathcal{H}$ to be the class of functions of the form $h_\theta := \text{sign}( \sin \theta x)$, for $\theta \in \mathbb{R}$. Note that each hypothesis is defined by the single parameter $\theta$.

Prove that the VC dimension of $\mathcal{H}$ is infinity.

So where does the "complexity" of the function class come from? (BONUS) prove that if we restrict $\theta$ to be a rational number whose numerator and denominator have at most $n$ bits, then the VC dimension is $O(n)$.

***** This concludes the problems for HW1. The five problems above are due on Wednesday Feb 15 (in class). *****

Problem 6.    (Convexity basics) For this problem, let $f$ be a convex function defined over a convex set $K$, and suppose the diameter of $K$ is $1$.

(a) Let $x \in K$, and suppose $f(x) = 2$ and $\lVert \nabla f(x)\rVert = 1$. Give a lower bound on $\min_{z \in K} f(z)$.

(b) Let $x^*$ be the minimizer of $f$ over $K$ (suppose it is unique), and let $x$ be any other point. The intuition behind gradient descent is that the vector: $- \nabla f(x)$ points towards $x^*$. Prove that this is indeed true, in the sense that $\langle \nabla f(x), x - x^* \rangle \ge 0$ (i.e., the negative gradient makes an acute angle with the line to the optimum).

(c) Suppose now that the function $f$ is strictly convex, i.e., $f(\lambda x + (1-\lambda) y) < \lambda f(x) + (1-\lambda) f(y)$ for all $x \ne y$ and $\lambda \in (0, 1)$. Prove that all the maximizers of $f$ over $K$ lie on the boundary of $K$. [Hint: You may want to use the definition that a point $x$ is not on the boundary iff there exist points $y, z \in K$ such that $x = (y+z)/2$.]

Problem 7.    (Gradient descent basics)

(a) Give an example of a function defined over $\mathbb{R}$, for which for any step-size $\eta > 0$ (no matter how small), gradient descent with step size $\eta$ oscillates around the optimum point (i.e., never gets to distance $< \eta/4$ to it), for some starting point $x \in \mathbb{R}$.

(b) Consider the function $f(x, y) = x^2 + \frac{y^2}{4}$, and suppose we run gradient descent with starting point $(1,1)$, and $\eta = 1/4$. Do we get arbitrarily close to the minimum? Experimentally, find the threshold for $\eta$, beyond which gradient descent starts to oscillate.

(c) Why is the behavior similar to that in part (a) (oscillation for every $\eta$) not happening in part (b)?
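Parts (b) and (c) are easy to explore numerically. A minimal sketch (ours, not part of the assignment): on $f(x, y) = x^2 + y^2/4$ the coordinates decouple, and the $x$-update is $x \leftarrow (1 - 2\eta)x$, so sign flips (oscillation) begin once $\eta > 1/2$ and divergence once $\eta > 1$.

```python
# Gradient descent on f(x, y) = x^2 + y^2 / 4 from (1, 1), grad f = (2x, y/2).
# We count sign flips of the x-coordinate to detect oscillation.
import numpy as np

def run(eta, steps=50):
    x, y = 1.0, 1.0
    xs = [x]
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * y / 2
        xs.append(x)
    flips = sum(a * b < 0 for a, b in zip(xs, xs[1:]))
    return np.hypot(x, y), flips

for eta in (0.25, 0.4, 0.5, 0.6, 0.9, 1.1):
    dist, flips = run(eta)
    print(f"eta={eta:4.2f}  |final point|={dist:.3e}  sign flips in x: {flips}")
```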

Problem 8.    (Stochastic gradient descent) Suppose we have points $(a_1, b_1), (a_2, b_2), \dots, (a_n, b_n)$ in the plane, and suppose that $|a_i| \le 1$, and $|b_i| \le 1$ for all $i$. Let $f(x, y) = \frac{1}{n} \sum_{i=1}^n f_i(x, y)$, where $f_i(x, y) = (x - a_i)^2 + (y - b_i)^2$.

(a) What is the point $(x, y)$ that minimizes $f(x, y)$?

(b) Suppose we perform gradient descent (on $f$) with step size $0 < \eta< 1$. Give a geometric interpretation for one iteration.

(c) Now suppose we perform stochastic gradient descent with fixed step size $0 < \eta < 1$, where each step picks a uniformly random $i$ and takes a gradient step on $f_i$ alone. Do we converge to the minimizer? [Hint: Remember $\eta$ is fixed.]

(d) Pick $n=100$ random points in $[-1,1]^2$ (uniformly), and run SGD for fixed $\eta = 1/2$, as above. Write down what the distance to optimum is, after T = 10, T=100, and T=1000 iterations (if you want to be careful, you should average over 5 random choices for the initialization.) Now consider a dropping step size $\eta_t = 1/t$, and write down the result for $T$ as above.
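A sketch for part (d), under our reading that one SGD step samples a single random $i$ and steps along $\nabla f_i(x, y) = 2\big((x, y) - (a_i, b_i)\big)$; recall from part (a) that the minimizer is the centroid.

```python
# SGD on f = (1/n) sum_i ||z - p_i||^2 for n = 100 random points in [-1, 1]^2.
# With the fixed step eta = 1/2 each update is z <- p_i (it jumps to the
# sampled point), so the iterate never settles; eta_t = 1/t averages instead.
import numpy as np

pts = np.random.default_rng(0).uniform(-1, 1, size=(100, 2))
opt = pts.mean(axis=0)  # the centroid minimizes f

def sgd(T, step, seed):
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1, 1, size=2)          # random initialization
    for t in range(1, T + 1):
        p = pts[rng.integers(len(pts))]
        z = z - step(t) * 2 * (z - p)       # grad f_i(z) = 2 (z - p_i)
    return np.linalg.norm(z - opt)

for T in (10, 100, 1000):
    fixed = np.mean([sgd(T, lambda t: 0.5, s) for s in range(5)])
    decay = np.mean([sgd(T, lambda t: 1.0 / t, s) for s in range(5)])
    print(f"T={T:5d}  eta=1/2: {fixed:.3f}   eta_t=1/t: {decay:.3f}")
```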

Problem 9. (Numeric accuracy in MW updates) Consider the randomized experts setting we saw in class (we maintain a distribution over experts at each time, and the loss of the algorithm at that time is the expected loss over the distribution). Consider a setting where the experts predict $0/1$, and the loss is either $0$ or $1$ for each expert. We saw how to update the probabilities (multiply by $e^{-\eta}$ if an expert makes a mistake, keep unchanged otherwise, and renormalize).

One of the common issues here is that numeric errors in such computations tend to compound if not done carefully. Suppose we have $N$ experts, and we start with a uniform distribution over them. Let $p_t^{(i)}$ denote the probability of expert $i$ at time $t$, for the ``true'' (infinite precision) multiplicative weight algorithm, and let $q_t^{(i)}$ denote the probabilities that the `real life' algorithm uses (due to precision limitations).

(a) One simple way to deal with limited precision is to zero out weights that are ``too small''. Specifically, suppose we set $q_t^{(i)} = 0$ whenever the ratio $q_t^{(i)}/ \max_j q_t^{(j)}$ drops below the available precision (renormalizing afterwards). Show that this algorithm can end up doing much worse than the true multiplicative weights algorithm. [Hint: in this case, we are ``losing'' all information about an expert.]

(b) A simple way to overcome this is to avoid storing probabilities, and instead to maintain the number of mistakes $m_t^{(i)}$. Show that this suffices to recover the probabilities $p_t^{(i)}$ (assuming infinite-precision arithmetic).
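Part (b) in code: since $p_t^{(i)} \propto e^{-\eta m_t^{(i)}}$, the distribution can be recovered from the mistake counts alone, and a standard log-sum-exp shift keeps the computation numerically safe. A minimal sketch (ours):

```python
# Recover the multiplicative-weights distribution from mistake counts:
# p_t^{(i)} = exp(-eta * m_t^{(i)}) / sum_j exp(-eta * m_t^{(j)}).
import numpy as np

def mw_probs(mistakes, eta):
    logits = -eta * np.asarray(mistakes, dtype=float)
    logits -= logits.max()   # shift so the largest weight is exp(0) = 1
    w = np.exp(logits)
    return w / w.sum()

# Huge mistake counts underflow to exactly 0 instead of corrupting the rest.
print(mw_probs([0, 10, 10_000], eta=0.1))
```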

(c) It turns out that we can use the idea in part (b) to come up with a distribution $q_t$ such that (1) $q_t$ differs from $p_t$ by $ < \epsilon$ in the $\ell_1$ norm, i.e., $\sum_{i} |p_t^{(i)} - q_t^{(i)}| < \epsilon$, and (2) $q_t$ can be represented using finite precision arithmetic (the precision depends on $\epsilon$).

Now, supposing we use $q_t$ to sample (as a proxy for $p_t$), show that the expected number of mistakes of the resulting algorithm is bounded by $(1+\eta) \min_i m_T^{(i)} + O(\log N/\eta) + \epsilon T$.

(d) The bound above is not great if there is an expert who makes a very small number of mistakes, say a constant (because we think of $\epsilon$ as a constant, and $T$ as tending to infinity). Using the hint that we are dealing with binary predictions, and say by setting $\eta = 1/10$, can you come up with a way to run the algorithm so that it uses computations of ``word size'' only $O(\log (NT))$, and obtains a mistake bound of $(1 + 1/5) \min_i m_T^{(i)} + O(\log N)$?

(NOTE: using word size $k$ means that every variable used is represented using $k$ bits; e.g., the C++ "double" uses 64 bits, and so if all probabilities are declared as doubles, you are using a word size of 64 bits.)

Since many of you had trouble with this, here is the (SOLUTION).

***** This concludes the problems for HW2. The four problems above are due on Monday Mar 27 (in class). *****

Problem 10. Consider the simple experts setting: we have $n$ experts $E_1, \dots, E_n$, and each one makes a $0/1$ prediction each morning. Using these predictions, we need to make a prediction each morning, and at the end of the day we get a loss of $0$ if we predicted right, and $1$ if we made a mistake. This goes on for $T$ days.

Consider an algorithm that, at every step, goes with the prediction of the `best' expert so far (i.e., the one with the fewest mistakes so far). Suppose that ties are broken by picking the expert with the smaller index. Give an example in which this strategy can be really bad -- specifically, the number of mistakes made by the algorithm is roughly a factor $n$ worse than that of the best expert in hindsight.
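The standard adversarial construction (our illustration, not the requested written example) is easy to simulate: on each day exactly one expert is wrong, namely the current leader, so the algorithm errs every day while the mistakes are spread evenly among the experts.

```python
# Follow-the-leader on the bad instance: each day the current leader (fewest
# mistakes, ties to the smaller index) is the unique wrong expert.
import numpy as np

n, T = 5, 1000
mistakes = np.zeros(n, dtype=int)
algo = 0
for _ in range(T):
    leader = int(np.argmin(mistakes))  # np.argmin breaks ties by smaller index
    algo += 1                          # the algorithm copies the leader: wrong
    mistakes[leader] += 1
print("algorithm:", algo, " best expert:", mistakes.min())  # 1000 vs 200
```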

Problem 11. We saw in class a proof that the VC dimension of the class of $n$-node, $m$-edge threshold neural networks is $O((m+n)\log n)$. Let us give a ``counting'' proof, assuming the weights are binary ($0/1$). (This is often the power given by VC dimension based proofs -- they can `handle' continuous parameters that cause problems for counting arguments).

(a) Specifically, how many ``network layouts'' can there be with $n$ nodes and $m$ edges? Show that $\binom{n(n-1)/2}{m}$ is an upper bound.

(b) Given a network layout, argue that the number of `possible networks' is at most $2^m (n+1)^n$. [HINT: what can you say about the potential values for the thresholds?]

(c) Use these to show that the VC dimension of the class of binary-weight, threshold neural networks is $O((m+n) \log n)$.

Problem 12. (Importance of random initialization) Consider a neural network consisting of (resp.) the input layer $x$, hidden layer $y$, hidden layer $z$, followed by the output node $f$. Suppose that all the nodes in all the layers compute a `standard' sigmoid. Also suppose that every node in a layer is connected to every node in the next layer (i.e., each layer is fully connected).

Now suppose that all the weights are initialized to 0, and suppose we start performing SGD using backprop, with a fixed learning rate. Show that at every time step, all the edge weights in a layer are equal.
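The symmetry is easy to observe numerically. Below is a minimal illustration (ours, not the requested proof), using squared loss and sigmoid activations as in the problem: starting from all-zero weights, the rows of each weight matrix remain identical after any number of backprop/SGD steps.

```python
# With all-zero initialization, every hidden unit in a layer computes the
# same value and receives the same gradient, so the rows of each weight
# matrix stay identical under SGD + backprop.
import numpy as np

rng = np.random.default_rng(0)
sig = lambda u: 1.0 / (1.0 + np.exp(-u))

d, h1, h2, lr = 4, 5, 3, 0.1
W1, W2, W3 = np.zeros((h1, d)), np.zeros((h2, h1)), np.zeros((1, h2))

for _ in range(100):
    x = rng.normal(size=(d, 1))
    t = float(rng.integers(2))              # random 0/1 target
    a1 = sig(W1 @ x)
    a2 = sig(W2 @ a1)
    out = sig(W3 @ a2)
    d3 = (out - t) * out * (1 - out)        # squared-loss delta at the output
    d2 = (W3.T @ d3) * a2 * (1 - a2)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    W3 -= lr * d3 @ a2.T
    W2 -= lr * d2 @ a1.T
    W1 -= lr * d1 @ x.T

for W in (W1, W2, W3):
    assert np.allclose(W, W[:1, :])         # all rows identical
print("per-layer weight symmetry preserved after 100 SGD steps")
```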

Problem 13. Let us consider networks in which each node computes a rectified linear (ReLU) function (described in the next problem), and show how they can compute very `spiky' functions of the input variables. For this exercise, we restrict attention to one variable.

(a) Consider a single (real valued) input $x$. Show how to compute a ``triangle wave'' using one hidden layer (constant number of nodes) connected to the input, followed by one output $f$. Formally, we should have $f(x) = 0$ for $x \le 0$, $f(x) = 2x$ for $0 \le x \le 1/2$, $f(x) = 2(1-x)$ for $1/2 \le x \le 1$, and $f(x) = 0$ for $x \ge 1$. [HINT: choose the thresholds and coefficients for the ReLU's appropriately.] [HINT2: play with a few ReLU networks, and try to plot the output as a function of the input.]
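One valid choice of coefficients for part (a) (a sketch; the coefficients are ours, and many other choices work): $f(x) = 2\,\text{relu}(x) - 4\,\text{relu}(x - 1/2) + 2\,\text{relu}(x - 1)$, i.e., one hidden layer of three ReLUs feeding the output.

```python
# Triangle wave from three ReLUs (one hidden layer). The sum is nonnegative
# everywhere, so applying a final ReLU at the output node leaves it unchanged.
import numpy as np

relu = lambda u: np.maximum(0.0, u)
f = lambda x: relu(2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0))

xs = np.array([-1.0, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0])
print(f(xs))   # -> [0, 0, 0.5, 1, 0.5, 0, 0], matching the target shape
```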

(b) What happens if you stack the network on top of itself? (Describe the function obtained). [Formally, this means the output of the network you constructed above is fed as the input to an identical network, and we are interested in the final output function.]

(c) Prove that there is a ReLU network with one input variable $x$, $2k+O(1)$ layers, all coefficients and thresholds being constants, that computes a function that has $2^k$ ``peaks'' in the interval $[0,1]$.

(The function above can be shown to be impossible to approximate using a small depth ReLU network, without an exponential blow-up in the width.)

Problem 14. In this exercise, we make a simple observation that width isn't as "necessary" as depth. Consider a network in which each node computes a rectified linear (ReLU) unit -- specifically the function at each node is of the form $\max \{0, a_1 y_1 + a_2 y_2 + \dots + a_m y_m + b\}$, for a node that has inputs $y_1, \dots, y_m$. Note that different nodes could have different coefficients and offsets ($b$ above is called the offset).

Consider a network with one (real valued) input $x$, connected to $n$ nodes in a hidden layer, which are in turn connected to the output node, denoted $f$. Show that one can construct a depth $n + O(1)$ network, with just 3 nodes in each layer, to compute the same $f$. [HINT: three nodes allow you to "carry over" the input; ReLU's are important for this.]

***** This concludes the problems for HW3. The four problems above are due on Wednesday, Apr 19 (in class) *****

*** The following problems can be used to make up for problems in the past HWs. You need to submit them by May 6 in order to get credit. ***

Problem 15. Let us consider the $k$-means problem, where we are given a collection of points $x_1, x_2, \dots, x_n$, and the goal is to find $k$ means $\mu_1, \dots, \mu_k$, so as to minimize $\sum_i \lVert x_i - \mu_{c(i)} \rVert^2$, where $c(i)$ is the mean closest to $x_i$. Consider Lloyd's algorithm (seen in class), which starts with an arbitrary initial clustering, and in each step "improves" it, by finding the cluster centers, and then re-mapping the points to the closest current center (the re-mapping forms the new clustering, used in the next iteration).

(a) Show that the objective value is non-increasing in this process.

(b) We saw in class that Lloyd's algorithm is sensitive to initialization. Construct an example with $k=3$, with points on a line, in which the solution to which Lloyd's algorithm converges has a cost that is a 10 factor worse than the best clustering. (HINT: to avoid the issues of "to what does the Lloyd's algorithm converge?", start with an initialization in which Lloyd's algorithm makes no change to the clustering.)
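A compact sketch of Lloyd's algorithm (ours) for experimenting with part (b). With the initialization below, the algorithm makes no change to the clustering, exactly as the hint suggests, yet the resulting cost is far from optimal.

```python
# Lloyd's algorithm: alternate between assigning points to the nearest
# center and recomputing each center as the mean of its cluster.
import numpy as np

def lloyd(X, centers, iters=50):
    mu = centers.astype(float).copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for j in range(len(mu)):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return mu, ((X - mu[assign]) ** 2).sum()

# Six points on a line; the "obvious" pairing is {0,1}, {10,11}, {20,21}.
X = np.array([[0.0], [1.0], [10.0], [11.0], [20.0], [21.0]])
mu, cost = lloyd(X, centers=np.array([[0.0], [1.0], [15.5]]))
print(mu.ravel(), cost)   # stuck at cost 101.0; the pairing costs only 1.5
```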

Problem 16. (Random projection vs SVD) Try the following and report your findings: Let RandPoint() be a function that returns a random point in $[0,1]^n$, i.e., it outputs a point in $\mathbb{R}^n$, each of whose coordinates is uniformly random in $[0,1]$. Set n=50. First, generate a point $A$ using RandPoint(). Now, generate 400 points using RandPoint(). To the first 200 points, add $A$. The resultant 400 points will form our dataset (denoted by a $50 \times 400$ matrix) whose columns are the points.

(a) First take projections onto three "random directions", i.e., generate three random $\pm 1$ vectors $v_1, v_2, v_3$ (every entry of each of these vectors is 1 or -1) in $\mathbb{R}^{50}$, and for each point $x$ in the dataset, consider the 3-D vector $(\langle x, v_1 \rangle, \langle x, v_2 \rangle, \langle x, v_3 \rangle )$. Plot these points in 3D.

(b) Next, compute the SVD (using Matlab/numpy). Let $v_1, v_2, v_3$ be the "top 3" left singular vectors. Again, for every point $x$ in the dataset, consider the vector $(\langle x, v_1 \rangle, \langle x, v_2 \rangle, \langle x, v_3 \rangle )$ and plot these points.

Report the differences (they should jump out, if you did things correctly). Can you explain why this happens?
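A sketch of the experiment setup (ours; plotting is left to matplotlib):

```python
# Problem 16: 400 points in R^50, the first 200 shifted by a common random
# vector A, compared under random +/-1 projections vs the top-3 SVD directions.
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.uniform(0, 1, size=(n, 1))
X = rng.uniform(0, 1, size=(n, 400))
X[:, :200] += A                          # first 200 columns shifted by A

V = rng.choice([-1.0, 1.0], size=(n, 3)) # (a) three random +/-1 directions
proj_rand = X.T @ V                      # 400 x 3 coordinates to plot

U, s, Vt = np.linalg.svd(X, full_matrices=False)
proj_svd = X.T @ U[:, :3]                # (b) top-3 left singular directions

# Scatter the two 3-D point clouds (e.g., mpl_toolkits.mplot3d); the SVD
# projection should separate the shifted and unshifted groups clearly.
```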

Problem 17. (Stochastic block model) I mentioned briefly in class that SVD can be used in surprising settings. We will now see an example of this. Construct an undirected graph as follows: let $n = 400$ be the number of nodes. Now, pick 200 of these nodes at random and form the set $A$, and let the rest of the nodes be set $B$. Now, add edges randomly as follows: iterate over all pairs $(i, j)$ with $i < j$, and (a) if $i, j$ are both in $A$ or both in $B$, add the edge with probability 0.5, and (b) otherwise, add the edge with probability 0.2. The graph is undirected, so considering $i < j$ suffices to describe the graph.

Now, consider the adjacency matrix of the graph (this is an $n \times n$ symmetric matrix, where the $i,j$'th entry is $1$ if there is an edge between $i$ and $j$ and $0$ otherwise). Consider the top two eigenvectors of this matrix, and call them $u$ and $v$. Now, for every vertex $i$, consider the two-D "point" $(u[i], v[i])$. Plot these points on the plane. What do you observe?

(This is the idea behind the so-called "spectral clustering" of graphs. We can find the top few eigenvectors of the adjacency matrix, represent each vertex now as a point in a suitable space, and cluster the points. The representation is called an "embedding" of the graph into a small dimensional space.)
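A sketch of the construction (ours), using np.linalg.eigh for the top eigenvectors of the symmetric adjacency matrix:

```python
# Stochastic block model: within-block edge probability 0.5, across 0.2.
import numpy as np

rng = np.random.default_rng(0)
n = 400
in_A = np.zeros(n, dtype=bool)
in_A[rng.permutation(n)[:200]] = True           # random 200-node block A

P = np.where(in_A[:, None] == in_A[None, :], 0.5, 0.2)
upper = np.triu(rng.random((n, n)) < P, k=1)    # sample each pair i < j once
adj = (upper | upper.T).astype(float)           # symmetric, zero diagonal

w, V = np.linalg.eigh(adj)                      # eigenvalues in ascending order
u, v = V[:, -1], V[:, -2]                       # top two eigenvectors
# Plot the points (u[i], v[i]); the two blocks should form two visible
# clusters, separated essentially along the second eigenvector.
```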


Aditya Bhaskara

Office: MEB 3120. Email: bhaskara AT cs DOT utah DOT edu


This course will be offered once every two years, in the spring semester.

CS 335: Machine Learning

Lectures: Tues, Thurs 11:30am-12:45pm
Fourth Hour: Fri 8:30am-9:20am
Room: Clapp Laboratory 206
Office hours: Tues 1-3pm, Thurs 9:15-11:15am, Clapp 200
Piazza: https://www.piazza.com/mtholyoke/spring2020/cs335/home
Gradescope: https://www.gradescope.com/courses/76996
Moodle: https://moodle.mtholyoke.edu/course/view.php?id=17913

Learning Goals

  • Understand the general mathematical and statistical principles that allow one to design machine learning algorithms.
  • Identify, understand, and implement specific, widely-used machine learning algorithms.
  • Learn how to apply and evaluate the performance of machine learning algorithms.
  • Derive analytical solutions for mathematical fundamentals of ML (probability, matrix and vector manipulation, partial derivatives, basic optimization, etc.).
  • Derive and implement learning algorithms.
  • Identify and evaluate when an algorithm is overfitting and the relationships between regularization, training size, training accuracy, and test accuracy.
  • Identify real-world problems where machine learning can have impact.
  • Implement machine learning tools on real data and evaluate performance.
  • Produce proficient oral and written communication of technical ideas and procedures.
Grading

  • Homeworks (4) — 40%
  • "Celebrations of learning" (2) — 20%
  • Project — 30%
      • Idea proposal — 2%
      • Paper and group selection — 2%
      • Literature review — 5%
      • Weekly reports (4) — 8%
      • Final report — 13%
  • Class engagement — 10%

Course schedule

  • An Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani: an accessible undergraduate machine learning textbook with statistics focus.
  • Course handouts from Stanford CS 229 by Andrew Ng
  • Google's Python class
  • Norm Matloff’s Fast Lane to Python
  • Stanford CS 231 Python Numpy Tutorial
  • Stanford CS 231 IPython tutorial

Academic Honesty

Permitted:

  • Organize study groups.
  • Clarify ambiguities or vague points in class handouts, textbooks, assignments, and labs.
  • Discuss assignments at a high level to understand what is being asked for, and to discuss related concepts and the high-level approach.
  • Refine high-level ideas/concepts for projects (i.e. brainstorming).
  • Outline solutions to assignments with others using diagrams or pseudocode, but not actual code.
  • Walk away from the computer or write-up to discuss conceptual issues if you get stuck.
  • Get or give help on how to operate the computer, terminal, or course software.
  • Get or give limited debugging help. Debugging includes identifying a syntax or logical error but not helping to write or rewrite code.
  • Submit the result of collaborative coding work if and only if group work is explicitly permitted (or required).
Not permitted:

  • Look at another student’s solutions.
  • Use solutions to same or similar problems found online or elsewhere.
  • Search for homework solutions online.
  • Turn in any part of someone else's work as your own (with or without their knowledge).
  • Share your code or written solutions with another student.
  • Share your code or snippets of your own code online.
  • Save your work in a public place, such as a public github repository.
  • Allow someone else to turn in your work as their own. (Be sure to disconnect your network drive when you logout and remove any printouts promptly from printers.)
  • Collaborate while writing programs or solutions to written problems. (But see above about specific ways to give or get debugging help.)
  • Write homework assignments together unless it is specified as a group assignment.
  • Collaborate with anyone outside your group for a group assignment.
  • Use resources during a quiz or exam beyond those explicitly allowed in the quiz/exam instructions. (If it is not listed, don’t use it. Ask if you are unsure.)
  • Submit the same or similar work in more than one course. (Always ask the instructor if it is OK to reuse any part of a different project in their course.).

Inclusion and Equity

Accommodations, communication policy, acknowledgments.

Mathematics of Machine Learning: Assignments (MIT OpenCourseWare)

Instructor: Prof. Philippe Rigollet
Department: Mathematics
Topics: Algorithms and Data Structures, Artificial Intelligence, Data Mining, Applied Mathematics, Discrete Mathematics, Probability and Statistics

Probabilistic Machine Learning for Mechanics

Lecture notes.

Introduction to ML and Review of probability and statistics

Review of Bayesian Statistics - Part 2  

Review of Bayesian Statistics - Part 3  

Prior modeling, Conjugate Prior, Exponential Family

Bayesian linear regression - Part 1 

Bayesian linear regression - Part 2

Bayesian inference using sampling methods - Part 1 

Bayesian inference using sampling methods - Part 2

Bayesian inference using sampling methods - Part 3

Bayesian inference using sampling methods - Part 4

Bayesian inference using sampling methods - Part 5

Bayesian inference using sampling methods - Part 6

Bayesian inference using sampling methods - Part 7

Approximate methods for Bayesian inference - Part 1

Approximate methods for Bayesian inference - Part 2

Sparse linear regression - Part 1

Sparse linear regression - Part 2

Gaussian process - Part 1 

Gaussian process - Part 2 

Sparse Gaussian Process - Part 1

Sparse Gaussian Process - Part 2 

Factor analysis, Probabilistic PCA, Dual Probabilistic PCA, and GP-LVM

Deep Gaussian Process

Invertible neural network - Part 1

Invertible neural network - Part 2

Diffusion Model - Part 1

Diffusion Model - Part 2

Review of the course and the way ahead

Introduction to Statistical Computing and Probability and Statistics

Sum and Product Rules, Conditional Probability, Independence, PDF and CDF, Bernoulli, Categorical and Multinomial Distributions, Poisson, Student’s T, Laplace, Gamma, Beta and Pareto distribution.

Generative Models; Bayesian concept learning, Likelihood, Prior, Posterior, Posterior predictive distribution, Plug-in Approximation

Bayesian Model Selection (continued) and Prior Models, Hierarchical Bayes, Empirical Bayes

Bayesian linear regression

Introduction to Monte Carlo Methods, Sampling from Discrete and Continuum Distributions, Reverse Sampling, Transformation Methods, Composition Methods, Accept-Reject Methods, Stratified/Systematic Sampling

Importance sampling, Gibbs sampling, MCMC, Metropolis-Hastings algorithm

Sequential importance sampling, Sequential Monte Carlo

Latent variable model, probabilistic PCA, Expectation maximization

Gaussian process and variational inference

Some advanced topics in probabilistic ML: Bayesian neural network, Invertible neural network , Diffusion model

Lecture notes and references will be provided on the course web site. The following books are recommended:

Bishop, C.M. Pattern recognition and Machine learning, Springer, 2007.

Murphy, K.P. “Machine learning: A Probabilistic Perspective”, MIT press, 2022.

Rasmussen, Carl Edward. Gaussian processes in machine learning, In Summer school on machine learning, pp. 63-71. Springer, Berlin, Heidelberg, 2003

Homework-1: Bayesian linear regression, Sampling method [ homework ]

Homework-2: Gaussian Process, Approximate methods for Bayesian inference [ homework ]

Homework-3: Gaussian Process [ homework ]

Homework-4: Unsupervised learning and generative modeling [ homework ]

Practical-0: Introduction to statistical computing [ QP ]

Practical-1: Effect of prior in Bayesian linear regression [ QP ]

Practical-2: Sampling method in Bayesian linear regression [ QP ]

Practical-3: Approximate inference in Bayesian linear regression [ QP ]

Practical-4: Equation discovery using ML [ QP ]

Each student will have to complete a term project as part of this course.

Course info

Credit: 4 units (3-0-2)

Timing: Lecture - Monday and Thursday (9:30 am - 11:00 am), Practical - To be decided

Venue: To be announced

Instructor: Dr. Souvik Chakraborty

Teaching Assistants: Navaneeth. N, Shailesh Garg, Tapas Tripura

Course Objective: In this course, the students will be introduced to the fundamentals of probabilistic machine learning and its application in computational mechanics. Students are expected to learn different probabilistic machine learning algorithms and their applications in solving mechanics problems. The course will emphasize the mathematical foundations of these concepts along with applications. The course is particularly designed for PG, Ph.D., and senior UG students.

Intended audience: Senior UG, PG, and Ph.D. students


EspalomaCharge: Machine Learning-Enabled Ultrafast Partial Charge Assignment

Yuanqing Wang

† Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York 10065, United States

‡ Simons Center for Computational Chemistry and Center for Data Science, New York University, New York, New York 10004, United States

Iván Pulido

Kenichiro Takaba

§ Pharmaceutical Research Center, Advanced Drug Discovery, Asahi Kasei Pharma Corporation, Shizuoka 410-2321, Japan

Benjamin Kaminow

∥ Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, Cornell University, New York, New York 10065, United States

Jenke Scheen

⊥ Open Molecular Sciences Foundation, Davis, California 95618, United States

John D. Chodera



Atomic partial charges are crucial parameters in molecular dynamics simulation, dictating the electrostatic contributions to intermolecular energies and thereby the potential energy landscape. Traditionally, the assignment of partial charges has relied on surrogates of ab initio semiempirical quantum chemical methods such as AM1-BCC and is expensive for large systems or large numbers of molecules. We propose a hybrid physical/graph neural network-based approximation to the widely popular AM1-BCC charge model that is orders of magnitude faster while maintaining accuracy comparable to differences in AM1-BCC implementations. Our hybrid approach couples a graph neural network to a streamlined charge equilibration approach in order to predict molecule-specific atomic electronegativity and hardness parameters, followed by analytical determination of optimal charge-equilibrated parameters that preserve total molecular charge. This hybrid approach scales linearly with the number of atoms, enabling for the first time the use of fully consistent charge models for small molecules and biopolymers for the construction of next-generation self-consistent biomolecular force fields. Implemented in the free and open source package EspalomaCharge , this approach provides drop-in replacements for both AmberTools antechamber and the Open Force Field Toolkit charging workflows, in addition to stand-alone charge generation interfaces. Source code is available at https://github.com/choderalab/espaloma-charge .

Introduction

Molecular mechanics (MM) force fields abstract atoms as point charge-carrying particles, with their electrostatic energy ($U_e$) calculated by Coulomb’s law [6]

$$U_e = k_e \sum_{i < j} \frac{q_i q_j}{r_{ij}}$$

(or some modified form), where $k_e$ is the Coulomb constant (energy · distance² / charge²) and $r_{ij}$ is the interatomic distance. In fixed-charge MM force fields, the partial charges $q_i$ are treated as constant, static parameters, independent of instantaneous geometry. As such, partial charge assignment—the manner in which partial charges are assigned to each atom in a given system based on their chemical environments—plays a crucial role in molecular dynamics (MD) simulation, determining the electrostatic energy ($U_e$) at every step and shaping the energy landscape.

Traditionally, Partial Charges Have Been Derived from Expensive Ab Initio or Semiempirical Quantum Chemical Approaches

In the early stages of development of molecular mechanics (MM) force fields, ab initio methods were used to generate electrostatic potentials (ESP) on molecular surfaces, from which restrained ESP (RESP) charge fits were derived. 2 This process proved to be expensive, especially for large molecules or large numbers of molecules (e.g., in virtual screening, where data sets now approach 10^9 molecules 11 ). This led to the development of the AM1-bond charge correction (BCC) charge scheme, 16,17 a method for approximating RESP fits at the HF/6-31G* level of theory by first calculating population charges using the much less expensive AM1 semiempirical level of theory and subsequently correcting charges via BCCs. This approach has been widely adopted by the MM community in force fields such as GAFF 28 and the open force fields. 26

Despite this progress, there are still multiple drawbacks with AM1-BCC. First, the computation is dependent on the generation of one or more conformers, which contributes to the discrepancy among the results of different chemoinformatics toolkits. While conformer ensemble selection methods such as ELF10 a attempt to minimize these geometry-dependent effects, they do not fully eliminate them, and significant discrepancies between toolkits can remain.

Machine Learning Approaches to Charge Assignment Have Recently Been Proposed but Face Challenges in Balancing Generalization with the Ability to Preserve Total Molecular Charge

The rising popularity of machine learning has led to a desire to exploit new approaches to rapidly predict partial atomic charges. For example, recent work from Bleiziffer et al. 4 employed a random forest approach to assign charges based on atomic features but faced the issue of needing to preserve total molecular charge while making predictions on an atomic basis; they distribute the difference between predicted and reference charge evenly among atoms. Similarly, Metcalf et al. 22 preserve the total charge by allowing only charge transfer in message-passing form, resulting in zero net-charge change. A more classical approach by Gilson et al. 10 tackles the charge constraint problem in a clever manner: instead of directly predicting charges, it predicts atomic electronegativity and electronic hardness, from which a simple constrained optimization problem inspired by physical charge equilibration (QEq) 27 can be solved analytically to yield partial charges that satisfy total molecular charge constraints. In spite of its experimental success, its ability to reproduce quantum-chemistry-based charges is heavily dependent upon the discrete atom typing scheme used to classify and group atoms by their chemical environments. Additionally, charges have been considered in new deep machine learning potential models, 20 and machine learning has also been employed to derive electrostatic parameters for Drude oscillator force fields. 21

Recently, Wang 29 and Wang et al. 31 designed a graph neural network-based atom typing scheme, termed Espaloma (extensible surrogate potential optimized by message-passing algorithms), to replace the human expert-derived, discrete atom types with continuous atom embeddings (Figure 1). This allows atoms with subtle chemical environment differences to be distinguished by the model without the need to painstakingly specify heuristics.

[Figure 1]

EspalomaCharge Generates AM1-BCC ELF10 Quality Charges in an Ultrafast Manner Using Machine Learning

Theory: Espaloma Graph Neural Networks for Chemical Environment Perception, QEq, and EspalomaCharge

Espaloma Uses Graph Neural Networks to Perceive Atomic Chemical Environments

Espaloma 31 uses graph neural networks (GNNs) 1,9,14,19,30,34 to assign continuous latent representations of chemical environments to atoms that replace human expert-derived discrete atom types. These continuous atom representations are subsequently used to assign symmetry-preserving parameters for atomic, bond, angle, torsion, and improper force terms.

$$a_v = \rho^{e \to v}\left(\left\{\, h_e : v \in e \,\right\}\right)$$

where the edges e incident to a node v pool their embeddings h_e to form the aggregated neighbor embedding a_v; finally, a node update is applied:

$$h_v \leftarrow \phi^{v}\left(h_v, a_v\right)$$
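
A parameter-free toy sketch of one such pooling-and-update round (the functional forms below are illustrative stand-ins for the learned networks, not the Espaloma architecture itself):

```python
import numpy as np

def message_passing_round(h, edges):
    """One illustrative round: form edge embeddings from endpoint nodes,
    pool them into aggregated neighbor embeddings a_v, then update nodes.

    h: (N, D) array of node embeddings; edges: iterable of (u, v) pairs.
    """
    a = np.zeros_like(h)
    for u, v in edges:
        h_e = h[u] + h[v]   # symmetric edge embedding (stand-in for a learned phi^e)
        a[u] += h_e         # edges incident to each node pool into a_v
        a[v] += h_e
    return np.tanh(h + a)   # node update (stand-in for a learned phi^v)

# Toy three-atom chain with 2-dimensional embeddings
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h = message_passing_round(h, edges=[(0, 1), (1, 2)])
```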

QEq Is a Physically Inspired Model for Computing Partial Charges while Maintaining Total Molecular Charge

$$\sum_{i} q_i = Q$$

We adopt the method proposed by Gilson et al., 10 in which we predict the electronegativity e_i and hardness s_i of each atom i, defined as the first- and second-order derivatives of the potential energy with respect to the atomic partial charge in QEq approaches 27

$$e_i = \frac{\partial U}{\partial q_i}, \qquad s_i = \frac{\partial^2 U}{\partial q_i^2}$$

Next, we minimize the second-order Taylor expansion of the charging potential energy contributed by these terms, neglecting interatomic electrostatic interactions

$$\hat{q} = \operatorname*{arg\,min}_{q} \sum_i \left( e_i q_i + \frac{1}{2} s_i q_i^2 \right) \quad \text{subject to} \quad \sum_i q_i = Q$$

which, as it turns out, has an analytical solution given by Lagrange multipliers

$$\hat{q}_i = -\frac{e_i}{s_i} + \frac{Q + \sum_j e_j/s_j}{s_i \sum_j 1/s_j}$$

We thus use the Espaloma framework to predict the unconstrained atomic electronegativity (e) and hardness (s) parameters used in the analytical solution above to assign partial charges in a manner that ensures that the total molecular charge sums to Q. It is worth noting that, by the equivalence analysis proposed in Wang et al., 31 the tabulated atom typing scheme Gilson et al. 10 use amounts to a model working analogously to a Weisfeiler-Lehman test 33 with a hand-written kernel, whereas here we replace this with an end-to-end differentiable GNN model to greatly expand its resolution and its ability to optimize against reference charges.
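
The analytical solution is simple enough to state as code; a minimal NumPy sketch (the function name and toy values are ours):

```python
import numpy as np

def qeq_charges(e, s, Q=0.0):
    """Charges minimizing sum_i (e_i q_i + 0.5 s_i q_i^2) s.t. sum_i q_i = Q.

    The Lagrange multiplier lam yields q_i = (lam - e_i) / s_i.
    """
    e, s = np.asarray(e, dtype=float), np.asarray(s, dtype=float)
    lam = (Q + np.sum(e / s)) / np.sum(1.0 / s)
    return (lam - e) / s

q = qeq_charges(e=[0.3, -0.1, 0.2], s=[1.0, 1.5, 2.0], Q=0.0)
print(q, q.sum())  # the charges sum to Q up to floating-point error
```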

EspalomaCharge Has Linear Time Complexity in the Number of Atoms

Experiments: EspalomaCharge Accurately Reproduces AM1-BCC Charges at a Fraction of Its Cost

We show, in this section, that the discrepancy between EspalomaCharge and the OpenEye toolkit is comparable to or smaller than that between AmberTools 5 and OpenEye. EspalomaCharge is fast and scalable to larger systems, taking seconds to parameterize a biopolymer with 100 residues on CPU.

SPICE Data Set Covers Biochemically and Biophysically Interesting Chemical Space

To curate a data set representing the chemical space of interest for biophysical modeling of biomolecules and drug-like small molecules, we use the SPICE 8 data set, enumerating reasonable protonation and tautomeric states with the OpenEye Toolkit. We generated AM1-BCC ELF10 charges for each of these molecules using the OpenEye Toolkit and trained EspalomaCharge (Figure 1) to reproduce the partial atomic charges with a squared loss function. This model, with its parameters distributed with the code, is used in all of the characterization results hereafter.

EspalomaCharge Is Accurate, Especially on Chemical Spaces Where Training Data Is Abundant

First, upon training on the 80% training split of SPICE, we test on the 10% held-out test set to benchmark the in-distribution (similar chemical species) performance of EspalomaCharge (Table 1, first half). Notably, the discrepancy [measured by charge root-mean-square error (RMSE)] between EspalomaCharge and OpenEye is comparable with or smaller than that between AmberTools 5 and OpenEye, two popular chemoinformatics toolkits for assigning AM1-BCC charges to small molecules. Since it is common practice in the community to use these two toolkits essentially interchangeably, we argue that the discrepancy between them can be established as a baseline below which the error is no longer meaningful.

We prepare several out-of-distribution external data sets to test the generalizability of EspalomaCharge to other molecules of significance to chemical and biophysical modeling, including a filtered list of FDA-approved drugs, a subset of the ZINC 12,15 purchasable chemical space, and finally the FreeSolv 23 data set consisting of molecules with experimental and computationally estimated solvation free energies. The discrepancy between EspalomaCharge and OpenEye is lower than or comparable with that between AmberTools and OpenEye, demonstrating that the high performance of EspalomaCharge is generalizable, at least within chemical spaces frequently used in chemical modeling and drug discovery.

To pinpoint the source of the error for EspalomaCharge, we stratified the molecules by the number of atoms and by total molecular charge, computing the errors on each subset (Figure 2). Compared to the error baseline, EspalomaCharge is most accurate where data was abundant in the training set. This is especially apparent in the stratification by net molecular charge, since extrapolation from small systems to larger systems is already encoded in the inductive biases of GNNs. Given the performance of well-sampled charge bins, it seems likely that the poor performance for molecules with the more exotic −4 and −5 net charges will be resolved once the data set is enriched with more examples of these states.

Figure 2. EspalomaCharge shows smaller average charge RMSE than AmberTools on well-represented regions of chemical space. SPICE test set performance stratified by total charge (left panel) and molecule size (right panel). To illustrate the effect of limited training data on stratified performance, the number of test (upper number) and training (lower number) molecules falling into each category is annotated, with the test set distribution plotted as a histogram.

The performance benchmark experiments above were generated using the unified application programming interface (API) integrated into the Open Force Field toolkit (Listing 3). Additionally, a command-line interface (CLI) is provided for seamless integration of EspalomaCharge into Amber workflows (Listing 4); sketches of both follow.
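
The listings themselves are not reproduced in this version. The following sketches are consistent with the espaloma-charge repository README as we understand it (class, method, and flag names should be checked against the current release):

```python
# Listing 3 (sketch): assigning charges through the Open Force Field Toolkit
from openff.toolkit.topology import Molecule
from espaloma_charge.openff_wrapper import EspalomaChargeToolkitWrapper

toolkit_registry = EspalomaChargeToolkitWrapper()
molecule = Molecule.from_smiles("CCO")
molecule.assign_partial_charges("espaloma-am1bcc", toolkit_registry=toolkit_registry)
print(molecule.partial_charges)
```

```bash
# Listing 4 (sketch): CLI charging, then feeding the charges to antechamber
espaloma_charge -i in.mol2 -o in.crg
antechamber -fi mol2 -fo mol2 -c rc -cf in.crg -i in.mol2 -o out.mol2
```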

EspalomaCharge Is Fast, Even on Large Biomolecular Systems

Apart from its accuracy, a drastic difference in parameterization speed is also observed in the benchmarking experiments. For the small molecule data sets in Table 1, EspalomaCharge is 300–3000 times faster than AmberTools and 15–75 times faster than OpenEye.

Figure 3. EspalomaCharge is fast, even for large systems. Wall time required to assign charges to ACE-ALA_n-NME peptides with different toolkits, shown on a log plot, illustrating that EspalomaCharge on the CPU or GPU is orders of magnitude faster than semiempirical-based charging methods for larger molecules or biopolymers and is practical even for assigning charges to proteins of practical size. Fluctuation in the traces is due to stochasticity in the timing trials.

Batching many molecules into a single charging calculation can provide significant speed benefits when parameterizing large virtual libraries by making maximum use of hardware parallelism. EspalomaCharge provides a seamless way to achieve these speedups when a sequence of molecules, rather than a single molecule at a time, is given as the input to the charge function in the API (Listing 5; see the sketch below). In this case, the molecular graphs are batched with their adjacency matrices concatenated block-diagonally, processed by the GNN and QEq models, and subsequently unbatched to yield the result. For instance, the wall time needed to parameterize all 100 ACE-ALA_n-NME molecules for n = 1, ..., 100 depicted in Figure 3 at once, in batch mode, is 7.11 s on CPU, only marginally longer than the time required to parameterize the largest molecule in the data set, indicating that hardware resources are barely saturated at this point.
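
In code, batch mode amounts to passing a list rather than a single molecule, a sketch under the same caveat that exact API names should be checked against the current release:

```python
from rdkit import Chem
from espaloma_charge import charge

# One GNN/QEq pass over the block-diagonally batched molecular graphs
mols = [Chem.MolFromSmiles(s) for s in ("CCO", "c1ccccc1", "CC(=O)O")]
charges = charge(mols)
```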

Error from Experiment in Explicit Solvent Hydration Free Energies Is Not Statistically Significantly Different between EspalomaCharge, AmberTools, and OpenEye Implementations of AM1-BCC

While the charge deviations between EspalomaCharge and other toolkit implementations of AM1-BCC are comparable to the deviation between toolkits, it is unclear how the magnitude of these charge deviations translates into deviations of observable condensed-phase properties (such as free energies) from experiment. To assess this, we carried out explicit solvent hydration free energy calculations, which serve as an excellent gauge of the impact of parameter perturbations, 24 as the result is heavily dependent upon the small-molecule charges. We use each set of charges to calculate the hydration free energies for the molecules in FreeSolv 7 (see Detailed Methods section in Supporting Information), a standard curated data set of experimental hydration free energies. In Figure 4, we compare the computed explicit solvent hydration free energies with experimental measurements and quantify the impact of the charge model on both deviation statistics (RMSE) and correlation statistics (R^2) with experiment. We note that EspalomaCharge provides statistically indistinguishable performance compared to AmberTools 5 and the OpenEye toolkit on both metrics. This encouraging result suggests that any discrepancy introduced by EspalomaCharge is unlikely to significantly alter the qualitative behavior of MD simulations in terms of ensemble averages or free energies.

Figure 4. EspalomaCharge introduces little error into explicit hydration free energy prediction. Calculated-vs-experimental explicit solvent hydration free energies computed with AM1-BCC charges provided by EspalomaCharge, AmberTools, and the OpenEye Toolkit, respectively. Simulations used the GAFF 2.11 small molecule force field 28 and TIP3P water 18 with particle mesh Ewald electrostatics (see Detailed Methods section in Supporting Information). The RMSE and R^2 between calculation and experiment are annotated with bootstrapped 95% confidence intervals. See also Appendix Figure S3 for a comparison among computed hydration free energies.

EspalomaCharge Assigns High-Quality Conformation-Independent AM1-BCC Charges Using a Modern Machine Learning Infrastructure That Supports Accelerated Hardware

Ability to Assign Topology-Driven, Conformation-Independent, Self-Consistent Charges to Small Molecules and Biopolymers Prepares the Community for Next-Generation Unified Force Fields

EspalomaCharge Provides a Simple API and CLI for Facile Integration into Popular Workflows

EspalomaCharge is a pip-installable (Listing 1) open software package (see the Detailed Methods section in Supporting Information), making it easy to integrate into existing workflows with minimal complexity. Assigning charges to molecules using the EspalomaCharge Python API is simple and straightforward (Listing 2); a sketch of both listings follows. A GPU can be used automatically, and entire libraries can be rapidly parameterized in batch mode (Listing 5). EspalomaCharge provides both a Python API and a convenient CLI, allowing EspalomaCharge to be effortlessly integrated into popular MM and MD workflows such as the OpenForceField toolkit (Listing 3) and Amber (Listing 4).
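
For reference, the installation and basic-usage listings follow this pattern in the package README, as we understand it (again a sketch, not the authoritative listings from the paper):

```bash
# Listing 1 (sketch): installation
pip install espaloma_charge
```

```python
# Listing 2 (sketch): basic Python API usage
from rdkit import Chem
from espaloma_charge import charge

molecule = Chem.MolFromSmiles("N#N")
print(charge(molecule))  # per-atom partial charges, summing to the formal charge (0)
```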

One-Hot Embedding Cannot Generalize to Rare or Unseen Elements

One-hot element encoding is used in the architecture, making the model unable to perceive elemental similarities. This compromises per-node performance for rare elements and prevents the model from being applied to unseen elements. Possible ways to mitigate this limitation include encoding elemental physical properties as node inputs.

Future Expansions of the Training Set Could Further Mitigate Errors

As shown in Figure 2, the generalization error is heavily dependent on the data abundance within the relevant stratification of the training set: bins containing more training data show higher accuracy. Future work could aim to systematically identify underrepresented regions of chemical space and expand training data sets to reduce error for uncommon chemistries and exotic charge states, either with larger static training sets or using active learning techniques.

Multiobjective Fitting Could Enhance Generalizability

Though EspalomaCharge produces an accurate surrogate for AM1-BCC charges, these small errors in charges can translate to larger deviations in ESP (see Supporting Information Figure S2). Since the function mapping charges (together with conformations) to ESPs is simple and differentiable, one can easily incorporate ESP as a target in the training process, using ESPs derived either from reference charges or, as in the original RESP, 2 from quantum chemical calculations. A multiobjective strategy that includes multiple targets (such as charges and ESPs), potentially with additional charge regularization terms (as in RESP 2 ), could result in more generalizable models with lower ESP discrepancies. Furthermore, similar observables can be incorporated into the training process to improve the utility of the model for real condensed-phase systems: for instance, condensed-phase properties such as densities or dielectric constants, other quantum chemical properties, or even experimentally measured binding free energies.
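
Because the ESP is linear in the charges, such a multiobjective loss is straightforward to write down; a minimal sketch (the names, the precomputed inverse-distance matrix, and the weighting are illustrative assumptions):

```python
import numpy as np

def esp_from_charges(q, inv_dists):
    """ESP at grid points from point charges: V_k = sum_i q_i / |r_k - x_i|.

    inv_dists: (K, N) matrix of precomputed inverse grid-atom distances.
    """
    return inv_dists @ np.asarray(q, dtype=float)

def multiobjective_loss(q_pred, q_ref, inv_dists, lam=1.0):
    """Squared charge error plus a lam-weighted squared ESP error."""
    dq = np.asarray(q_pred, dtype=float) - np.asarray(q_ref, dtype=float)
    # Linearity of the ESP in q means the ESP error is the ESP of dq
    return np.sum(dq**2) + lam * np.sum(esp_from_charges(dq, inv_dists) ** 2)
```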

Acknowledgments

Research reported in this publication was supported by the National Institute for General Medical Sciences of the National Institutes of Health under award numbers R01GM132386 and R01GM140090. YW acknowledges funding from NIH grant R01GM132386 and the Sloan Kettering Institute, as well as the Schmidt Science Fellowship, in partnership with the Rhodes Trust. JDC acknowledges funding from NIH grants R01GM132386 and R01GM140090. The authors would like to thank the Open Force Field consortium for providing constructive feedback, especially Christopher Bayly, OpenEye; David Mobley, UC Irvine; and Michael Gilson, UC San Diego.

Special Issue

Published as part of The Journal of Physical Chemistry A virtual special issue “Recent Advances in Simulation Software and Force Fields”.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jpca.4c01287 .

  • Detailed methods—from code implementation, data set curation, to model training—used to produce the results in this paper ( PDF )

The authors declare no competing financial interest.

JDC is a current member of the Scientific Advisory Board of OpenEye Scientific Software, Redesign Science, Ventus Therapeutics, and Interline Therapeutics and has equal interest in Redesign Science and Interline Therapeutics. The Chodera laboratory receives or has received funding from multiple sources, including the National Institutes of Health, the National Science Foundation, the Parker Institute for Cancer Immunotherapy, Relay Therapeutics, Entasis Therapeutics, Silicon Therapeutics, EMD Serono (Merck KGaA), AstraZeneca, Vir Biotechnology, Bayer, XtalPi, Interline Therapeutics, the Molecular Sciences Software Institute, the Starr Cancer Consortium, the Open Force Field Consortium, Cycle for Survival, a Louis V. Gerstner Young Investigator Award, and the Sloan Kettering Institute. A complete funding history for the Chodera lab can be found at http://choderalab.org/funding . YW has limited financial interest in Flagship Pioneering, Inc. and its subsidiaries.

a ELF10 denotes that the ELF (“electrostatically least-interacting functional groups”) conformer selection process was used to generate 10 diverse conformations from the lowest energy 2% of conformers. Electrostatic energies are assessed by computing the sum of all Coulomb interactions in vacuum using the absolute values of MMFF charges assigned to each atom. 13 AM1-BCC charges are generated for each conformer and then averaged.

References

  • Battaglia P. W.; Hamrick J. B.; Bapst V.; Sanchez-Gonzalez A.; Zambaldi V.; Malinowski M.; Tacchetti A.; Raposo D.; Santoro A.; Faulkner R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261 (preprint).
  • Bayly C. I.; Cieplak P.; Cornell W.; Kollman P. A. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem. 1993, 97 (40), 10269–10280. 10.1021/j100142a004.
  • Berman H. M.; Westbrook J.; Feng Z.; Gilliland G.; Bhat T. N.; Weissig H.; Shindyalov I. N.; Bourne P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235–242. 10.1093/nar/28.1.235.
  • Bleiziffer P.; Schaller K.; Riniker S. Machine learning of partial charges derived from high-quality quantum-mechanical calculations. J. Chem. Inf. Model. 2018, 58 (3), 579–590. 10.1021/acs.jcim.7b00663.
  • Case D.; Belfon K.; Ben-Shalom I.; Brozell S.; Cerutti D.; Cheatham T., III; Cruzeiro V.; Darden T.; Duke R.; et al. Amber 2020, 2020.
  • Coulomb C. Premier-[troisième] mémoire sur l’électricité et le magnétisme. Nineteenth Century Collections Online (NCCO): Science, Technology, and Medicine: 1780–1925; Académie Royale des Sciences, 1785.
  • Duarte Ramos Matos G.; Kyu D. Y.; Loeffler H. H.; Chodera J. D.; Shirts M. R.; Mobley D. L. Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database. J. Chem. Eng. Data 2017, 62 (5), 1559–1569. 10.1021/acs.jced.7b00104.
  • Eastman P.; Behara P. K.; Dotson D. L.; Galvelis R.; Herr J. E.; Horton J. T.; Mao Y.; Chodera J. D.; Pritchard B. P.; Wang Y.; et al. SPICE, a Dataset of Drug-Like Molecules and Peptides for Training Machine Learning Potentials; Nature Publishing Group, 2022.
  • Gilmer J.; Schoenholz S. S.; Riley P. F.; Vinyals O.; Dahl G. E. Neural message passing for quantum chemistry. arXiv 2017, arXiv:1704.01212 (preprint).
  • Gilson M. K.; Gilson H. S. R.; Potter M. J. Fast assignment of accurate partial atomic charges: an electronegativity equalization method that accounts for alternate resonance forms. J. Chem. Inf. Comput. Sci. 2003, 43 (6), 1982–1997. 10.1021/ci034148o.
  • Glaser J.; Vermaas J. V.; Rogers D. M.; Larkin J.; LeGrand S.; Boehm S.; Baker M. B.; Scheinberg A.; Tillack A. F.; Thavappiragasam M.; et al. High-throughput virtual laboratory for drug discovery using massive datasets. Int. J. High Perform. Comput. Appl. 2021, 35 (5), 452–468. 10.1177/10943420211001565.
  • Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4 (2), 268–276. 10.1021/acscentsci.7b00572.
  • Halgren T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17 (5–6), 490–519. 10.1002/(sici)1096-987x(199604)17:5/6<490::aid-jcc1>3.0.co;2-p.
  • Hamilton W.; Ying Z.; Leskovec J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1024–1034.
  • Irwin J. J.; Shoichet B. K. ZINC: a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005, 45 (1), 177–182. 10.1021/ci049714+.
  • Jakalian A.; Bush B. L.; Jack D. B.; Bayly C. I. Fast, efficient generation of high-quality atomic charges. AM1-BCC model: I. Method. J. Comput. Chem. 2000, 21 (2), 132–146. 10.1002/(SICI)1096-987X(20000130)21:2<132::AID-JCC5>3.0.CO;2-P.
  • Jakalian A.; Jack D. B.; Bayly C. I. Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J. Comput. Chem. 2002, 23 (16), 1623–1641. 10.1002/jcc.10128.
  • Jorgensen W. L.; Chandrasekhar J.; Madura J. D.; Impey R. W.; Klein M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983, 79 (2), 926–935. 10.1063/1.445869.
  • Kipf T. N.; Welling M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907 (preprint).
  • Ko T. W.; Finkler J. A.; Goedecker S.; Behler J. A fourth-generation high-dimensional neural network potential with accurate electrostatics including non-local charge transfer. Nat. Commun. 2021, 12 (1), 398. 10.1038/s41467-020-20427-2.
  • Kumar A.; Pandey P.; Chatterjee P.; MacKerell A. D. Deep neural network model to predict the electrostatic parameters in the polarizable classical Drude oscillator force field. J. Chem. Theory Comput. 2022, 18 (3), 1711–1725. 10.1021/acs.jctc.1c01166.
  • Metcalf D. P.; Jiang A.; Spronk S. A.; Cheney D. L.; Sherrill C. D. Electron-passing neural networks for atomic charge prediction in systems with arbitrary molecular charge. J. Chem. Inf. Model. 2021, 61 (1), 115–122. 10.1021/acs.jcim.0c01071.
  • Mobley D. L.; Bannan C. C.; Rizzi A.; Bayly C. I.; Chodera J. D.; Lim V. T.; Lim N. M.; Beauchamp K. A.; Shirts M. R.; Gilson M. K.; et al. Open Force Field Consortium: escaping atom types using direct chemical perception with SMIRNOFF v0.1. bioRxiv 2018, 286542. 10.1101/286542.
  • Mobley D. L.; Dumont E.; Chodera J. D.; Dill K. A. Comparison of charge models for fixed-charge force fields: small-molecule hydration free energies in explicit solvent. J. Phys. Chem. B 2007, 111 (9), 2242–2254. 10.1021/jp0667442.
  • Paszke A.; Gross S.; Chintala S.; Chanan G.; Yang E.; DeVito Z.; Lin Z.; Desmaison A.; Antiga L.; Lerer A. Automatic Differentiation in PyTorch; OpenReview, 2017.
  • Qiu Y.; Smith D. G.; Boothroyd S.; Jang H.; Hahn D. F.; Wagner J.; Bannan C. C.; Gokey T.; Lim V. T.; Stern C. D.; et al. Development and benchmarking of Open Force Field v1.0.0, the Parsley small-molecule force field. J. Chem. Theory Comput. 2021, 17 (10), 6262–6280. 10.1021/acs.jctc.1c00571.
  • Rappe A. K.; Goddard W. A., III. Charge equilibration for molecular dynamics simulations. J. Phys. Chem. 1991, 95 (8), 3358–3363. 10.1021/j100161a070.
  • Wang J.; Wolf R. M.; Caldwell J. W.; Kollman P. A.; Case D. A. Development and testing of a general AMBER force field. J. Comput. Chem. 2004, 25 (9), 1157–1174. 10.1002/jcc.20035.
  • Wang Y. Graph Machine Learning for (Bio)Molecular Modeling and Force Field Construction. Ph.D. Thesis; Weill Medical College of Cornell University, 2023.
  • Wang Y.; Chodera J. D. Spatial attention kinetic networks with E(n)-equivariance. arXiv 2023, arXiv:2301.08893 (preprint).
  • Wang Y.; Fass J.; Kaminow B.; Herr J. E.; Rufa D.; Zhang I.; Pulido I.; Henry M.; Bruce Macdonald H. E.; Takaba K.; Chodera J. D. End-to-end differentiable construction of molecular mechanics force fields. Chem. Sci. 2022, 13, 12016–12033. 10.1039/D2SC02739A.
  • Wang Y.; Fass J.; Stern C. D.; Luo K.; Chodera J. Graph nets for partial charge prediction. arXiv 2019, arXiv:1909.07903 (preprint).
  • Weisfeiler B.; Leman A. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series 2 1968, 12.
  • Xu K.; Hu W.; Leskovec J. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826 (preprint).

Survey of machine learning techniques for Arabic fake news detection

  • Open access
  • Published: 28 May 2024
  • Volume 57 , article number  157 , ( 2024 )


Ibtissam Touahri and Azzeddine Mazroui

Social media platforms have emerged as primary information sources, offering easy access to a wide audience. Consequently, a significant portion of the global population relies on these platforms for updates on current events. However, fraudulent actors exploit social networks to disseminate false information, either for financial gain or to manipulate public opinion. Recognizing the detrimental impact of fake news, researchers have turned their attention to automating its detection. In this paper, we provide a thorough review of fake news detection in Arabic, a low-resource language, to contextualize the current state of research in this domain. In our research methodology, we recall fake news terminology, provide examples for clarity, particularly in Arabic contexts, and explore its impact on public opinion. We discuss the challenges in fake news detection, outline the used datasets, and provide Arabic annotation samples for label assignment. Likewise, preprocessing steps for Arabic language nuances are highlighted. We also explore features from shared tasks and their implications. Lastly, we address open issues, proposing some future research directions like dataset improvement, feature refinement, and increased awareness to combat fake news proliferation. We contend that incorporating our perspective into the examination of fake news aspects, along with suggesting enhancements, sets this survey apart from others currently available.


1 Introduction

The advent of Web 2.0 has facilitated real-time human interaction and the rapid dissemination of news. Alongside traditional news outlets, individuals increasingly rely on online platforms to express themselves and gather information. Social networks serve as hubs for a plethora of data, including opinions, news, rumors, and fake news generated by internet users. While these platforms offer instant access to information, they also facilitate the dissemination of unchecked data, which can inadvertently mislead users. The proliferation of fake news on social media, fueled by sensational and inflammatory language that aims at maximizing engagement, is a growing concern. Additionally, social media platforms often employ fear-based and persuasive language in their content, further amplifying the impact of misinformation. Satirical content, in particular, poses a unique challenge as it can skew public perception and be exploited for political and commercial gain.

The dissemination of misleading or fake statements, often appealing to emotions, can significantly influence public opinion and lead individuals to disregard factual information. The 2016 US presidential campaign, widely reported to have been influenced by fake news, brought heightened awareness to the detrimental impact of misinformation (Bovet and Makse 2019). Furthermore, during the coronavirus pandemic, claims regarding COVID-19 often circulated without credible references (Shahi et al. 2021). Indeed, many studies have reported that on social networks the pandemic was accompanied by a large amount of fake and misleading news about the virus that spread faster than the facts (Yafooz et al. 2022; Alhindi et al. 2021). For example, fake news claimed that COVID-19 is caused by 5G technology, which led to a misunderstanding of the pandemic among the public (Touahri and Mazroui 2020). Hence, fake news has attracted attention in all countries and cultures, from the US elections to the Arab Spring (Rampersad and Althiyabi 2020). Extensive research related to these claims has been conducted for the English language (Zhou et al. 2020), but little research has focused on the Arabic language, which has specific characteristics (Shahi et al. 2021; Saeed et al. 2018, 2021).

The study of fake news is a multidisciplinary endeavor, bringing together experts from computer and information sciences, as well as political, economic, journalistic, and psychological fields. This collaborative approach is essential for comprehensive understanding and effective solutions.

Online fake news encompasses various aspects, including the individuals or entities creating the news, the content itself, those disseminating it, the intended targets, and the broader social context in which it circulates (Zhou and Zafarani 2020; Wang et al. 2023). The primary sources of information vary in trustworthiness, with government communication platforms generally being the most trusted, followed by local news channels, while social media platforms are typically viewed with lower levels of trust (Lim and Perrault 2020). People's political orientations can influence their perception of the accuracy of both genuine and fake political information, potentially leading to an overestimation of accuracy based on their ideological beliefs (Haouari et al. 2019). News can be classified as genuine or fake; there are also multi-label datasets and multi-class levels of classification (Shahi et al. 2021). Fake news differs from the truth in content quality, style, and sentiment, while containing similar levels of cognitive and perceptual information (Ali et al. 2022; Al-Ghadir et al. 2021; Ayyub et al. 2021). Moreover, fake news is often characterized by shorter words and longer sentences (Zhou et al. 2020).

Detecting fake news in Arabic presents several unique challenges compared to English. Here are some ways in which Arabic fake news detection differs:

Language structure : Arabic morphology is complex, since a single inflected Arabic word can form a complete syntactic structure. For example, the word “فأعطيناكموه” /f>ETynAkmwh/ (and we gave it to you) contains a proclitic, a verb, a subject, and two objects. The linguistic complexity of Arabic, with its complex morphology and rich vocabulary, poses challenges for natural language processing (NLP) tasks, including fake news detection.

Dialectal variations : Even though Modern Standard Arabic (MSA) is the official language in Arab countries, many social media users express themselves in dialect. Arabic encompasses numerous dialects across different regions, each with its own vocabulary, grammar, and expressions. This diversity makes it challenging to develop models that can effectively identify fake news across various Arabic dialects. Moreover, besides the varieties spoken across countries, written Arabic on the Web is affected by frequent code-switching: Internet users switch between languages and writing systems such as Arabizi, Franco-Arabic, and MSA, producing expressions composed of several languages. Some Arabic studies on fake news detection account for the presence of dialect in the tweets analyzed. Considering the dialects of North Africa and the Middle East (Ameur and Aliane 2021; Yafooz et al. 2022), it has been shown that fake news detection systems can perform less well when dialect data is not processed (Alhindi et al. 2021).

Cultural nuances : Arabic-speaking communities have distinct cultural norms, beliefs, and sensitivities that influence how information is perceived and shared. Understanding these cultural nuances is essential for accurately detecting fake news in Arabic.

Data availability : English fake news detection benefits from high-performing systems built on large resources and advanced approaches. Arabic can borrow these methodologies to build systems or custom approaches for fake news detection; however, this is hindered by the scarcity of Arabic resources and by the language's complex morphology and varieties (Nassif et al. 2022; Himdi et al. 2022; Awajan 2023). Compared to English, there is relatively little labeled data available for training fake news detection models in Arabic. This scarcity makes it challenging to develop robust and accurate detection algorithms.

Socio-political context : The socio-political landscape in Arabic-speaking regions differs from that of English-speaking countries. Fake news may serve different purposes and target different socio-political issues, requiring tailored approaches for detection.

In summary, Arabic fake news detection requires specialized techniques that account for the language's unique characteristics, dialectal variations, cultural nuances, data availability, and socio-political context. Building effective detection systems in Arabic necessitates interdisciplinary collaboration and a deep understanding of the language and its socio-cultural context. This raises the need for thorough studies to address Arabic fake news detection.

In the following, we define our research methodology. We then delineate the terminologies pertinent to fake news and its processes, providing illustrative examples to aid comprehension, particularly within the context of the Arabic language. We explore the interplay between fake news and public opinion orientation, highlighting overlapping domains and key challenges in detection. Representative datasets and their applications in various studies are outlined, with Arabic annotation samples to illustrate label assignment considerations based on language, context, topic, and information dissemination. We delve into the preprocessing steps, emphasizing the unique characteristics of the Arabic language. Additionally, we discuss the potential features extractable from shared tasks, presenting their implications and main findings. Finally, we address open issues in fake news detection, proposing avenues for future research, including dataset enhancement, feature extraction refinement, and increased awareness to mitigate fake news proliferation.

2 Research methodology

In this section, we define the main research questions based on which our study is performed. Then, we describe the whole research process and we discuss the scope of our research.

2.1 Research questions

We established a set of questions to address the purpose of our research. They range from broad to more specific questions that help in describing, defining and explaining the main aspects of a fake news detection system.

RQ1 : What is fake news and how does it affect people and society?

RQ2 : What are the criteria for a fake news detection process?

RQ3 : What are the main sources from which data are extracted?

RQ4 : What are the main annotations for the retrieved claims?

RQ5 : How to create a pertinent model for detecting fake news?

RQ6 : Is automatic or manual detection of fake news sufficient given the scale at which information spreads?

RQ7 : How to prevent the spread of fake news?

We base the research process on the established questions. During this process, we aim to select papers that discuss Arabic fake news detection.

2.2 Search process

Since we are looking for relevant papers in the domain of Arabic fake news detection, we started by querying Google Scholar using ("Fake" OR "misinformation" OR "disinformation" OR "deception" OR "satirical hoaxes" OR "serious fabrication" OR "clickbait" OR "information pollution" OR "deceptive content" OR "rumors" OR "propaganda") AND ("Arabic" OR "Language"). Applying these search terms resulted in a large number of articles from which we selected those that contained relevant information. Indeed, we have used exclusion criteria to keep only those that align with the scope of our research. We thus collected 75 articles. The search process and the covered aspects are depicted in Fig.  1 .

Figure 1. Search process.

2.3 Scope of the study

After selecting the articles that align with the scope of our research, we attempted to answer the previous research questions (RQs). Among these articles, some authors constructed the datasets and corpora used in their research, detailing the various stages of data construction. Others utilized existing datasets and applied diverse machine learning techniques, including classical methods, deep neural networks, and transformers. Additionally, certain articles focused on strategies to curb the dissemination of fake news.

The general framework of fake news detection and its related components are depicted in Fig. 2. The construction of basic knowledge is the main step in developing a fake news detection system; it requires careful source selection and the definition of annotation levels. Multiple annotation levels help deal with variations in the style of claims. Moreover, the detection model must address the main characteristics of the corpus, and its usefulness must be studied when its application is generalized. Awareness techniques, in turn, are described to make people conscious of fake news.

Figure 2. General framework of our study.

In the following, we define the fake news terminology, and we present the fake news detection processes by describing their approaches and illustrating them with examples.

3 Fake news terminology

Fake news is a common term employed to describe fake content spread on the Web (Saadany et al. 2020). Digital communication has generated a set of concepts related to fake news that are often used interchangeably, namely misinformation, disinformation, deception, satirical hoaxes, serious fabrication, clickbait, information pollution, and deceptive content (Elsayed et al. 2019; Touahri and Mazroui 2018). They can mislead users' opinions, since they include misleading information, rumors, propaganda, and techniques that influence people's mindsets (Touahri and Mazroui 2020; Shahi et al. 2021; Barron-Cedeno et al. 2020; Baly et al. 2018). These categories differ depending on factors such as targeted audience, genre, domain, and deceptive intent (Da San Martino et al. 2019).

The emergence of fake news on the Web has motivated domain interested researchers to perform various tasks and develop automated systems that support multiple languages (Alhindi et al. 2021 ) in order to detect fake news and prevent its disastrous effects from occurring. Among these tasks, we have:

Check-worthiness estimation determines whether a claim is worth checking (Haouari et al. 2019). It is a ranking task in which systems are asked to score sentences according to check-worthiness. Check-worthiness estimation is the first step in determining the relevance of a claim to be checked.

Stance detection is a fake news detection subtask that searches documents for evidence and identifies the documents that support a claim and those that contradict it (Touahri and Mazroui 2019; Ayyub et al. 2021). Stance detection aims to judge a claim's factuality according to the supporting information. Related information can be annotated as discussing, agreeing with, or disagreeing with a specific claim. Stance detection differs from fake news detection in that it assesses consistency rather than veracity. It is therefore insufficient on its own to predict claim veracity, since a major share of documents may support false claims (Touahri and Mazroui 2019; Elsayed et al. 2019; Alhindi et al. 2021; Hardalov et al. 2021).

Fact-checking is a task that assesses the truthfulness of claims, including those made by public figures (Khouja 2020). A claim is judged trustworthy or not based on the credibility of its source, content, and spreader. Factuality detection identifies whether a claim is fake or true; the terms genuine, true, real, and not fake can be used interchangeably.

Sentiment analysis is the task of extracting emotions, for example from customer reviews of products. The goal is not the objective verification of a claim; rather, it aims to detect opinions so that they are not treated as facts, thereby preventing their misleading effects (Touahri and Mazroui 2018, 2020; Saeed et al. 2020, 2021; Ayyub et al. 2021).

We exemplify these concepts using the statement "حماية أجهزة أبل قوية بحيث لا تتعرض للفيروسات" (Protection for Apple devices is strong so that they are not exposed to viruses). In Table 1, the first sentence aligns with the claim, while the second contradicts it. Specifically, "قوية" (strong) contradicts "ليست قوية" (not strong), and "لا تتعرض" (are not exposed) contrasts with "تتعرض" (are exposed). Consequently, when the fact-checking system encounters conflicting sentences, it labels the claim as false; otherwise, it deems it true.

Several steps of fake news detection were covered by Barrón-Cedeño et al. (2020), who discussed tasks such as determining the check-worthiness of claims as well as their veracity. Stance detection between a claim-document pair (supported, refuted, not-enough-information; agree, disagree, discuss, unrelated) has been studied (Baly et al. 2018), as has defining claim factuality as fake or real (Ameur and Aliane 2021).

4 Challenges

Fact-checkers cannot keep up with the old-fashioned manual pace of fake news detection, given the need to verify the veracity of claims as they appear (Touahri and Mazroui 2019). Truth often cannot be assessed by computers alone, hence the need for collaboration between human experts and technology. Automatic fake news detection is technically challenging for several reasons:

Data diversity : Online information is diverse, covering various subjects, which complicates the fake news detection task (Khalil et al. 2022; Najadat et al. 2022). The data may come from different sources and domains, which complicates their processing (Zhang and Ghorbani 2020). The Arabic language itself adds further difficulty through its complex morphology.

Momentary detection : Fake news is written to deceive readers. It spreads rapidly, and its generation mode changes from moment to moment, making existing detection algorithms ineffective or inapplicable. To improve information reliability, systems that detect fake news in real time should be built (Brashier et al. 2021). Momentary detection of fake news on social media seeks to identify it in newly emerged events; one therefore cannot rely on news propagation information, as it may not yet exist. Most existing approaches that learn claim-specific features can hardly handle newly emerged events, since those features cannot be transferred to unseen events (Haouari et al. 2021).

Lack of information context: The information context is important for detecting fake news (Himdi et al. 2022). In some cases, retrieving this context is not straightforward, since it requires a laborious search process to find the context and the real spreader. Moreover, data extraction ethics may differ from one social media platform to another, which may affect whether the data is sufficient for detecting fake news.

Misinformation: Sometimes fake information is spread by web users unintentionally, and because of the spreaders' credibility, the fake news may be considered true (Hardalov et al. 2021; Sabbeh and Baatwah 2018).

An example of fake news is depicted in Fig. 3. An account owner denies a claim spread by The Atlas Times page on Twitter. The post has many likes and retweets, as well as comments that support the denial by mocking the claim. The claim can therefore be considered false.

Figure 3. Example of the spread of Arabic fake news.

5 Datasets

In this section, we delve into the datasets curated for Arabic fake news detection. We provide illustrative examples of annotated tweets from prior investigations alongside the methods used for their annotation. Subsequently, we outline their sources, domains, and sizes. Additionally, we explore the research endeavors that have utilized these datasets (Table 2).

Given the limited availability of resources for Arabic fake news detection, numerous studies have focused on developing linguistic assets and annotating them using diverse methodologies, including manual, semi-supervised, or automatic annotation techniques.

5.1 Manual annotation

The study (Alhindi et al. 2021) presented AraStance, an Arabic stance detection dataset of 4,063 news articles containing true and false claims from the politics, sports, and health domains, among which 1,642 are true. Each claim-article pair has a manual stance label: agree, disagree, discuss, or unrelated. Khouja (2020) constructed an Arabic News Stance (ANS) corpus related to international news, culture, the Middle East, economy, technology, and sports, collected from BBC, Al Arabiya, CNN, Sky News, and France24. The corpus was labeled by 3 to 5 annotators who selected true news titles and generated fake/true claims from them through crowdsourcing. It contains 4,547 Arabic news claims annotated as true or false, among which 1,475 are fake. The annotators used the labels paraphrase, contradiction, and other/not enough information to associate 3,786 pairs with their evidence. Himdi et al. (2022) introduced an Arabic fake news articles dataset for different genres, composed through crowdsourcing. An Arabic dataset related to COVID-19 was constructed by Alqurashi et al. (2021); its tweets are labeled manually as containing misinformation or not, with 1,311 misinformation tweets out of 7,475. The study (Ameur and Aliane 2021) presented the manually annotated multi-label dataset AraCOVID19-MFH for fake news and hate speech detection. The dataset contains 10,828 Arabic tweets annotated with 10 different labels: hate, talk about a cure, give advice, rise moral, news or opinion, dialect, blame and negative speech, factual, worth fact-checking, and contains fake information. The corpus contains 459 tweets labeled as fake news, while for 1,839 tweets the annotators were unable to decide which tag to assign. Ali et al. (2021) introduced AraFacts, a publicly available Arabic dataset for fake news detection. Collected from 5 Arabic fact-checking websites, it consists of 6,222 claims along with manual factual labels as true or false.

Information such as fact-checking article content, topics, and links to web pages or posts spreading the claim is also available. To target the topics most concerned by rumors, Alkhair et al. (2019) constructed a fake news corpus of 4,079 YouTube items related to the deaths of personalities, which yielded 3,435 fake news items after keyword-based annotation and pretreatment, among which 793 are rumors. Al Zaatari et al. (2016) constructed a dataset of 175 blog posts, with 100 posts annotated as credible, 57 as fairly credible, and 18 as non-credible; of the 2,708 tweets related to these posts, 1,570 were manually annotated as credible. Haouari et al. (2021) introduced ArCOV19-Rumors, an Arabic Twitter dataset for misinformation detection composed of 138 verified claims related to COVID-19. The 9,414 tweets relevant to those claims identified by the authors were manually annotated by veracity to support research on misinformation detection, one of the major problems faced during a pandemic. Among the annotated tweets, 1,753 are fake, 1,831 true, and 5,830 other. ArCOV19-Rumors covers many domains: politics, social, entertainment, sports, and religious. Besides the aforementioned annotation approaches, true content can be manually altered to generate fake claims about the same topic (Khouja 2020).

5.2 Semi-supervised and automatic annotation

Statistical approaches face limitations due to the absence of labeled benchmark datasets for fake news detection. Deep learning methods have shown superior performance but demand large volumes of annotated data for model training. However, the dynamics of online news render annotated samples quickly outdated. Manual annotation is costly and time-intensive, prompting a shift towards automatic and semi-supervised methods for dataset generation. To bolster fact-checking systems, fake news datasets are automatically generated or extended using diverse approaches, including automatic annotation. Several papers present such approaches. Mahlous and Al-Laith (2021) relied on the Agence France-Presse and the Saudi Anti-Rumors Authority fact-checkers to extract a corpus that was manually annotated into 835 fake and 702 genuine tweets; automatic annotation was then performed using the best-performing classifier. Elhadad et al. (2021) automatically annotated the bilingual (Arabic/English) COVID-19-FAKES Twitter dataset using 13 different machine learning algorithms and 7 feature extraction techniques, based on reliable information from different official Twitter accounts. Nakov et al. (2021) collected 606 Arabic and 2,589 English tweets from Qatar about COVID-19 vaccines and analyzed them according to factuality, propaganda, harmfulness, and framing; automatic annotation of the Arabic tweets yielded 462 factual tweets and 144 not factual. The study (Saadany et al. 2020) introduced datasets concerned with political issues related to the Middle East: a fake news dataset of 3,185 articles scraped from the Arabic satirical news websites ‘Al-Hudood’ and ‘Al-Ahram Al-Mexici’, and a dataset of 3,710 real news articles collected from the official news sites ‘BBC-Arabic’, ‘CNN-Arabic’, and ‘Al-Jazeera news’. The websites from which the data were scraped specialize in publishing true and fake news, respectively. Nagoudi et al. (2020) presented AraNews, a POS-tagged news dataset constructed with a novel method for the automatic generation of manipulated Arabic news, using online news data as seeds for the generation model; it contains 10,000 articles annotated with true and false tags. Moreover, Arabic fake news can be generated by translating fake news from English into Arabic (Nakov et al. 2018).

We summarize in Table  3 the main datasets by specifying their sources, their sizes, the domains concerned, the tags adopted, and the labeling way.

6 Fake news detection

6.1 Preprocessing

The content posted online is often chaotic and marked by considerable ambiguity. Therefore, before proceeding to feature extraction, it is imperative to conduct a preprocessing phase. Below, we outline the general procedures along with those tailored specifically to the Arabic language; a small code sketch combining several of these steps follows the lists.

General steps

Special characters removal : special characters such as {*, @, %, &} are not cues for detecting fake news and are not specific to any language; removing them helps clean the text (Alkhair et al. 2019).

Punctuation removal: punctuation marks are considered insignificant for detecting fake news (Al-Yahya et al. 2021).

URL links removal : URLs in the raw text may be considered noise, although they may point to a page with important content (Alkhair et al. 2019).

Duplicated comments removal : retweets and duplicate comments are deleted, since it is sufficient to process a piece of text just once (Alkhair et al. 2019).

Balancing data : data imbalance can mislead the classification process. Therefore, it is essential to balance the data to represent each factual or false class equally (Jardaneh et al. 2019 ).

Reducing repeated letters, characters, and multiple spaces (Al-Yahya et al. 2021): since letters are not legitimately repeated more than twice in a word, and words in a sentence are separated by single spaces, eliminating extra repetitions recovers the correct form of a word or a sentence.

Tokenization helps in splitting sentences into word sequences using delimiters such as space or punctuation marks. This step precedes converting texts into features (Oshikawa et al. 2018 ) .

Stemming, lemmatization, and rooting are language related steps that help to cover a large set of words by representing them with their common stems, lemmas and roots (Oshikawa et al. 2018 ).

Normalization and standardization : normalization gives data a consistent representation. In Arabic, some letters may be replaced with others, for example normalizing the alef variants أ, إ, and آ to ا (Jardaneh et al. 2019).

Arabic specific steps

Foreign language words removal: they don’t belong to the processed language (Alkhair et al. 2019 ).

Non-Arabic letter removal: transliterated text can be removed, since it is rare within the studied corpora (Alkhair et al. 2019).

Replacing hashtags and emojis with their relevant signification (Al-Yahya et al. 2021): for example, ☺ may be replaced with سعيد (happy).

Removing stop words: stop words such as أنت, لكن, and ما are considered insignificant for detecting fake news (Al-Yahya et al. 2021). Stop words are specific to each language.

Diacritics removal: since diacritic marks do not appear consistently across terms, removing them helps normalize term representations (Al-Yahya et al. 2021).
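
To make these steps concrete, the following is a minimal Python sketch chaining several of them for a raw Arabic post. The regular expressions, their ordering, and the Unicode ranges are our own illustrative assumptions rather than a pipeline taken from any of the cited studies:

    import re

    # Arabic diacritics (tashkeel); the Unicode ranges are an assumption for illustration
    DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")

    def preprocess_arabic(text: str) -> str:
        text = re.sub(r"https?://\S+", " ", text)               # URL removal
        text = DIACRITICS.sub("", text)                          # diacritics removal
        text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)          # drop special characters, punctuation, non-Arabic letters
        text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # normalize alef variants to bare alef
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # reduce letters repeated more than twice
        return re.sub(r"\s+", " ", text).strip()                 # collapse repeated spaces

Tokenization, stemming, and stop-word removal would then be applied to the cleaned text, typically with an Arabic-aware toolkit.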

6.2 Feature extraction

Previous studies on fake news detection have relied on various features extracted from labeled datasets to represent information. In this section, we detail the most commonly utilized features in fake news detection:

Source features: These check whether a specific user account is verified and determine its location, creation date, activity, user details, and metadata, including the job, affiliation, and political party of the user and whether the account is under a real name (Jardaneh et al. 2019; Sabbeh and Baatwah 2018). The account's real name and details are important for assessing the news creator's credibility. It is also important to know whether the news spreader belongs to opposing parties, whose claims tend to be fake. The creation date matters as well, since an account created shortly before a specific event may be considered fake compared to a longer-established one. Source features are very helpful in determining the credibility of a news creator; however, fake information may be spread unintentionally by an account with high credibility.

Temporal features: These capture the temporal spread of information and the overlap between the posting times of user comments. Fake news may be retweeted rapidly, so it is important to capture the temporal information of comments and whether it overlaps with a specific event that pushes the spreader to publish fake information. Temporal features are among the most important for detecting fake news; however, on their own they are not sufficient to deal with the strength of the phenomenon.

Content features: Content is more likely to be true if it contains pictures, hashtags, or URLs, since these may lead to a trustworthy source of information that proves the factuality of a claim. Likewise, if the content is retweeted by trusted accounts or has positive comments from users, it can be considered true (Sabbeh and Baatwah 2018). Content features cover the information and its references; however, the credibility of those references still needs to be checked.

Lexical features: These include character- and word-level features extracted from text (Sabbeh and Baatwah 2018). Analyzing claims at the term level is crucial for determining their sentiment (positive or negative) and verifying their factual accuracy. Identifying common lexical features among claims is also valuable. However, while lexical features are significant, they should be complemented with additional features to effectively identify fake claims.

Linguistic features: Analyzing the linguistic features of a claim can help determine its veracity without considering external factual information. Term frequency, bag-of-words, n-grams, POS tagging, and sentiment scores are some of the main features used for fake news detection (Khouja 2020). Linguistic features categorize data based on language and highlight its defining elements, relying on lexical, semantic, and structural characteristics. While they aid in content analysis and representation, the absence of contextual information can lead to misidentification during fake news detection.

Semantic features: These capture the semantic aspects of a text, useful for extracting the meaning of data (Sabbeh and Baatwah 2018) and its variations according to context. Semantic features identify the meaning of a claim but not its veracity.

Sentiment features: Sentiment analysis may improve fake news prediction accuracy (Jardaneh et al. 2019), since a highly sentimental comment may be fake, as it does not rest on facts. Because opinions influence people's behavior, sentiment analysis has numerous real-life applications, such as in politics, marketing, and social media (Ayyub et al. 2021). Sentiment features are important for distinguishing between opinions and facts; however, a fact may also be expressed through an opinion, such as 'I like the high quality of Apple smartphones.'

The mentioned features are complementary, so no single feature type can be relied on without the others. Each has characteristics that make it indispensable for fake news detection.

6.3 Classification approaches

In this section, we outline various studies conducted on Arabic fake news detection, detailing the features employed, the models developed, and the achieved performances. We categorize these studies into three main approaches: those based on classical machine learning, deep learning, or transformers.

6.3.1 Classical machine learning

Classical machine learning for fake news detection applies traditional algorithms to analyze and classify textual data in order to discern authentic from fabricated news articles. These methods typically rely on feature engineering, where relevant characteristics of the text are extracted and used to train models such as support vector machines (SVM), logistic regression (LR), decision trees (DT), and random forests (RF). Features can include linguistic patterns, sentiment scores, lexical and syntactic features, and metadata associated with the news articles. The trained models then classify new articles as genuine or fake based on the patterns learned from the data. Researchers typically collect a dataset of Arabic news articles labeled as fake or genuine, preprocess the text, extract relevant features, and train a classifier on the labeled dataset; the classifier can then predict the authenticity of new Arabic news articles. Arabic satirical news has distinguishing lexico-grammatical features (Saadany et al. 2020). Based on this claim, a set of machine learning models for identifying satirical fake news was tested, reaching an accuracy of up to 98.6% on a dataset containing 3,185 fake and 3,710 real articles. Alkhair et al. (Alkhair et al. 2019) used a dataset of 4,079 news items, of which 793 are rumors, training on 70% of the data and testing on the remainder. They classified comments as rumor or non-rumor using the most frequent words as features and three machine learning classifiers, namely SVM, DT, and Multinomial Naïve Bayes (MNB), attaining a 95.35% accuracy rate with SVM. Sabbeh and Baatwah (Sabbeh and Baatwah 2018) utilized a dataset of 800 news items sourced from Twitter and devised a machine learning model for assessing the credibility of Arabic news. They incorporated topic- and user-related features to evaluate news credibility more precisely; by verifying content and analyzing the polarity of user comments, they classified credibility using various classifiers, including decision trees, achieving an accuracy of 89.9%. Mahlous and Al-Laith (Mahlous and Al-Laith 2021) extracted n-gram TF-IDF features from a dataset containing 835 fake and 702 genuine tweets, achieving an F1-score of 87.8% with logistic regression. Thaher et al. (Thaher et al. 2021) extracted bag-of-words, content, user-profile, and word-based features from a Twitter dataset comprising 1,862 tweets (Al Zaatari et al. 2016). Their results showed that a logistic regression classifier with TF-IDF features achieved the highest scores among the tested models. They reduced dimensionality using the binary Harris Hawks Optimizer (HHO) algorithm as a wrapper-based feature selection approach; the proposed model attained an F1-score of 0.83, a 5% improvement over previous work on the same dataset. Al-Ghadir et al. (Al-Ghadir et al. 2021) evaluated a stance detection model based on TF-IDF features and variants of K-nearest neighbors (KNN) and SVM on the SemEval-2016 Task 6 benchmark, reaching a macro F-score of 76.45%. Gumaei et al. (Gumaei et al. 2022) conducted experiments on a public dataset containing rumor and non-rumor tweets, building a model on topic-based, content-based, and user-based features with an XGBoost-based approach that achieved an accuracy of 97.18%. Jardaneh et al. (Jardaneh et al. 2019) extracted 46 content and user-related features from 1,862 tweets on topics covering the Syrian crisis and employed sentiment analysis to generate new features. They identified fake news with a supervised classification model built on random forest (RF), decision tree, AdaBoost, and logistic regression classifiers. The results revealed that sentiment analysis improved the prediction accuracy of their system, which filters out fake news with an accuracy of 76%.
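
To ground this family of methods, here is a minimal scikit-learn sketch of the n-gram TF-IDF plus logistic regression setup that several of the studies above report. The variable names, split ratio, and n-gram range are illustrative assumptions, not the exact configuration of any cited system:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    # texts: list of preprocessed Arabic posts; labels: 1 = fake, 0 = genuine (assumed inputs)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=0)

    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # word uni- and bi-gram TF-IDF features
        ("clf", LogisticRegression(max_iter=1000)),      # linear classifier over sparse features
    ])
    model.fit(X_train, y_train)
    print("F1:", f1_score(y_test, model.predict(X_test)))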

6.3.2 Deep learning

Deep neural approaches for fake news detection use deep learning models, based on neural network architectures, to automatically learn and extract relevant features from textual data and distinguish genuine from fabricated news articles. These approaches typically rely on architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and, more recently, transformer-based models like BERT and GPT. The models are trained on large volumes of labeled data, learning to represent the underlying patterns and relationships within the text. They suit Arabic fake news detection since they automatically learn text representations and capture complex patterns and relationships. Researchers may build a deep neural architecture tailored to Arabic text, for example a CNN for text classification, where convolutional layers learn to identify important features in the text; trained on a large dataset of labeled Arabic news articles, such a model learns to distinguish fake from genuine news. Yafooz et al. (Yafooz et al. 2022) proposed a model to detect fake news about the COVID-19 vaccine in the Middle East on YouTube videos; based on sentiment analysis features and a deep learning approach, it reached an accuracy of 99%. Harrag and Djahli (Harrag and Djahli 2022) used a balanced Arabic corpus to build a model that unifies stance detection, relevant document retrieval, and fact-checking, proposing a CNN-based deep neural approach to classify fake and real news; trained on selected attributes, the model reached an accuracy of 91%. Alqurashi et al. (Alqurashi et al. 2021) exploited FastText and word2vec word embedding models on more than two million Arabic tweets related to COVID-19. Helwe et al. (Helwe et al. 2019) extracted content and user-related features from a dataset containing 12.8K annotated political news statements along with their metadata. Their initial model, based on TF-IDF features and an SVM classifier, achieved an F1-score of 0.57; word-level, character-level, and ensemble-based CNN models yielded F1-scores of 0.52, 0.54, and 0.50, respectively. To address the limited training data, they introduced a deep co-learning approach, a semi-supervised method utilizing both labeled and unlabeled data; by training multiple weak deep neural network classifiers in a semi-supervised manner, they achieved a significant improvement, reaching an F1-score of 0.63.
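
As an illustration of the word-level CNN classifiers discussed above, the following Keras sketch shows the typical architecture: an embedding layer, a one-dimensional convolution over word positions, and max pooling. The vocabulary size, sequence length, and layer widths are illustrative assumptions:

    import tensorflow as tf

    VOCAB_SIZE, MAX_LEN = 20000, 100  # assumed vocabulary size and padded sequence length

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,)),                    # integer word-id sequences
        tf.keras.layers.Embedding(VOCAB_SIZE, 128),          # learn dense word vectors
        tf.keras.layers.Conv1D(128, 5, activation="relu"),   # detect local n-gram patterns
        tf.keras.layers.GlobalMaxPooling1D(),                # keep the strongest response per filter
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),      # probability that the item is fake
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])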

6.3.3 Transformer-based approaches

Transformer approaches for fake news detection use transformer-based models, a deep learning architecture that has gained prominence in NLP and become the foundation of many state-of-the-art NLP models. In this context, models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) analyze and classify textual data. They process large amounts of text by leveraging self-attention mechanisms, which capture contextual relationships between words and phrases. Transformer-based models are pre-trained on massive unlabeled corpora and then fine-tuned on specific tasks such as fake news detection; during fine-tuning, the model learns to classify news articles as genuine or fake based on the patterns and relationships learned during pre-training. These approaches have shown promising results thanks to their ability to capture semantic meaning, context, and long-range dependencies within textual data. Researchers may fine-tune pre-trained transformer models on Arabic fake news detection datasets, for instance using AraBERT as a base model and fine-tuning it on labeled Arabic news articles so that it captures linguistic nuances and patterns indicative of fake news in Arabic text. Nagoudi et al. (Nagoudi et al. 2020) measured the human ability to detect machine-manipulated Arabic text on a corpus of 10,000 articles, reporting that changing a given POS does not automatically flip a sentence's veracity; their system for Arabic fake news detection reached an F1-score of 70.06, and their data and models are publicly available. Khouja (Khouja 2020) explored textual entailment and stance prediction to detect fake news in a dataset of 4,547 Arabic news items, of which 1,475 are fake, constructing models based on pretrained BERT; the system predicts stance with an F1-score of 76.7 and verifies claims with an F1-score of 64.3. Al-Yahya et al. (Al-Yahya et al. 2021) compared neural network and transformer-based language models for Arabic fake news detection on the ArCOV19-Rumors (Haouari et al. 2021) and COVID-19-FAKES (Elhadad et al. 2021) datasets, reporting that transformer-based models perform best, with an F1-score of 0.95.
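
The fine-tuning recipe described above can be sketched with the Hugging Face transformers library as follows. The AraBERT checkpoint name, the hyperparameters, and the train_texts/train_labels inputs are assumptions for illustration, not the setup of any cited study:

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    class NewsDataset(torch.utils.data.Dataset):
        """Wraps tokenized texts and labels for the Trainer API."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    # train_texts, train_labels: an assumed labeled corpus (1 = fake, 0 = genuine)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3),
        train_dataset=NewsDataset(train_texts, train_labels),
    )
    trainer.train()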

6.3.4 Approach distinction for Arabic fake news

We may differentiate between machine learning (ML), deep learning (DL), and transformer-based approaches in terms of their methodology, capabilities, and suitability for Arabic fake news detection based on the following criteria:

Classical machine learning: ML models are effective when the features are well-defined and the dataset is not too large. They can handle relatively small datasets and are interpretable, making it easier to understand why a particular prediction was made. ML approaches are suitable for Arabic fake news detection when the features can effectively capture linguistic patterns indicative of fake news in Arabic text. They may be less effective in capturing complex semantic relationships and context compared to DL and transformer-based models.

Deep learning : DL models excel at learning hierarchical representations of data and can handle large volumes of text data. They can automatically learn features from raw text, making them suitable for tasks where feature engineering may be challenging. DL approaches are suitable for Arabic fake news detection when the dataset is large and diverse, and the linguistic patterns indicative of fake news are complex. They may outperform ML approaches in capturing subtle linguistic cues and context.

Transformer-based approaches : Transformer-based models are state-of-the-art in natural language understanding tasks and excel at capturing context and semantics in text. They can capture bidirectional relationships between words and are highly effective in capturing nuanced linguistic features. Transformer-based approaches are highly suitable for Arabic fake news detection, especially when the dataset is large and diverse. They can effectively capture complex semantic relationships and context in Arabic text, making them well-suited for tasks where understanding linguistic nuances is crucial.

In summary, classical ML approaches are suitable for Arabic fake news detection when the features can effectively capture linguistic patterns, while DL and transformer-based approaches excel at capturing complex semantic relationships and context in Arabic text, making them highly effective for detecting nuanced linguistic cues indicative of fake news. These three families of approaches can interact and complement each other in various ways:

Feature engineering and representation: ML methods often require handcrafted features extracted from the text, such as word frequencies, n-grams, and syntactic features. DL methods can automatically learn features from raw text data, making them suitable for tasks where feature engineering may be challenging. Transformer-based models, such as BERT, leverage pre-trained representations of text that capture rich semantic information. These representations can be fine-tuned for specific tasks, including fake news detection.

Model complexity and performance: ML methods are generally simpler and more interpretable compared to DL and transformer-based models. They may be suitable for tasks where transparency and interpretability are important. DL methods, with their ability to learn hierarchical representations of data, can capture complex patterns and relationships in the text. They may outperform ML methods on tasks that require understanding subtle linguistic cues and context. Transformer-based models, with their attention mechanisms and contextual embeddings, have achieved state-of-the-art performance on various NLP tasks, including fake news detection. They excel at capturing fine-grained semantic information and context.

Ensemble learning: ML, DL, and transformer-based models can be combined in ensemble learning approaches to leverage the strengths of each method. Ensemble methods combine predictions from multiple models to make a final prediction, which can improve performance and robustness, especially when individual models have complementary strengths and weaknesses (a minimal sketch follows this list). In (Noman Qasem et al. 2022), several standalone and ensemble machine learning methods were applied to the ArCOV-19 dataset, which contains 1,480 rumor and 1,677 non-rumor tweets, from which user and tweet features were extracted. The experiments showed a notable accuracy of 92.63%.

Progression and evolution: There is a progression from traditional ML methods to more advanced DL and transformer-based approaches in NLP tasks, including fake news detection. As the field of NLP continues to evolve, researchers are exploring novel architectures, pre-training techniques, and fine-tuning strategies to improve the performance of models on specific tasks, such as fake news detection.
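
As a minimal sketch of the ensemble idea mentioned above, the following combines three scikit-learn classifiers with soft voting over engineered features; the choice of estimators and the X/y inputs are illustrative assumptions, not the configuration of the cited study:

    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression

    # X, y: assumed feature matrix (e.g., user- and tweet-based features) and rumor labels
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("gb", GradientBoostingClassifier()),
        ],
        voting="soft",  # average predicted class probabilities across the three models
    )
    ensemble.fit(X, y)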

In practice, these approaches are often used in parallel, with researchers and practitioners selecting the method or combination of methods that best suit the task requirements, data characteristics, and computational resources available. The choice of approach may depend on factors such as dataset size, complexity of linguistic patterns, interpretability requirements, and performance goals.

Table 4 recalls the various studies carried out on the detection of Arabic fake news. These studies employed various datasets, features, models, and evaluation metrics; the primary metrics include accuracy, precision, recall, F1-score, and AUC. The studies aimed to identify fake news using approaches ranging from classical machine learning algorithms to deep learning models.

Table 4 shows a wide range of achieved accuracies, spanning from 76% to over 99%, attributable to differences in datasets and underlying knowledge bases. However, a pertinent question arises when applying the best-performing models to other datasets, as this often results in reduced accuracy. To address this issue and enable fair comparison of proposed approaches, some shared-task organizers have made datasets publicly available and proposed common tasks. These initiatives aim to mitigate model sensitivity to training data and enhance overall system efficiency.

7 Fake news shared tasks

The organizers of the CLEF-2019 CheckThat! Lab (Elsayed et al. 2019) proposed a task revolving around the automatic verification of claims, organized into two primary tasks. The first focuses on identifying the fact-check-worthiness of claims in political debates. The second follows a multi-step process: the first step ranks web pages according to their utility for fact-checking a claim, where systems achieved an nDCG@10 below 0.55, the score of the baseline that keeps the original search-result ranking; the second step classifies web pages by their degree of usefulness, with the best-performing system reaching an F1 of 0.31; and the third step extracts useful passages from the useful pages, where the best model reached an F1 of 0.56. These steps support the final step, automatic fact-checking, which uses the useful pages to predict a claim's factuality; the best system, which used textual entailment with embedding-based representations for classification, reached an F1 of 0.62. The organizers released datasets in English and Arabic to enable research on check-worthiness estimation and automatic claim verification. The CheckThat! Lab 2021 task (Shahi et al. 2021) focuses on multi-class fake news detection, covering Arabic, English, Spanish, Turkish, and Bulgarian; the best-performing systems achieved macro F1-scores between 0.84 and 0.88 for English. Al-Qarqaz et al. (Al-Qarqaz et al. 2021) describe NLP4IF, the Arabic shared task on checking COVID-19 disinformation; the best-ranked model for Arabic is based on pre-trained transformer language models, an ensemble of AraBERT-Base, Asafya-BERT, and ARBERT, and achieved an F1-score of 0.78. Rangel et al. (Rangel et al. 2020) presented an overview of the Author Profiling shared task at PAN 2020, which focused on identifying potential spreaders of fake news from the authors of Twitter comments and highlighted challenges related to the lack of domain specificity in news. The best results were obtained in Spanish, with an accuracy of 82%, using combinations of character and word n-grams with an SVM. The task attracted 66 participants, whose systems were evaluated by the organizers.

8 Discussion

Arabic fake news detection systems have achieved satisfactory results. However, given the ongoing generation of content, existing datasets struggle to encompass the diversity of that content. Datasets vary in size, sources, and the hierarchical annotation steps used to detect fake news. Human annotation remains challenging, as multiple aspects must be considered before labeling a claim based on its content. Therefore, semi-supervised and automatic annotation methods have been explored to alleviate the burden of manual annotation.

Detecting fake news requires further effort to be successful, especially in terms of real-time detection, which remains challenging due to the absence of comprehensive detection aspects such as information spread. For instance, information shared by a reputable individual may be perceived as true. Improving public literacy is crucial since individuals need to be educated to discern factual content from misinformation.

The spread of fake news may have disastrous effects on people and society; hence, detection must take place before data are allowed to spread, especially on social media, which is characterized by vast amounts of data. Moreover, social media users should agree to ethical guidelines, and penalties should apply to those who spread fake data. Trusted sources must be consulted when seeking factual information. Exploring new models may also be useful, for example an approach that scores social media users based on their trustworthiness, displayed every time a new post is created. The score is decreased at each violation until the account is marked as untrustworthy, which can help prevent the spread of fake information or lead to suspension of the offending account. Specific identifying information may also be requested when creating an account, so that a given person cannot create more than one account.
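
One minimal way to realize the user-scoring idea sketched above, purely as an illustration; the initial score, penalty, and threshold values are assumptions, not proposals from the surveyed literature:

    class TrustScore:
        """Illustrative user-trust tracker; all values are assumptions, not from the survey."""
        def __init__(self, initial=1.0, penalty=0.2, threshold=0.2):
            self.score, self.penalty, self.threshold = initial, penalty, threshold

        def report_fake_post(self):
            # decrease the score each time the user is caught spreading fake content
            self.score = max(0.0, self.score - self.penalty)

        @property
        def untrustworthy(self):
            # once the score falls below the threshold, the account is flagged
            return self.score <= self.threshold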

In the following, we describe some future directions that may be helpful in detecting fake news and preventing the spread of its negative effects.

9 Future directions for Arabic fake news detection

Researchers have employed a variety of features, including source, context, and content features, to enhance fake news detection. Source features aid in targeting the analysis and are often complemented by content features for improved accuracy. Linguistic analysis has helped identify content characteristics, with lexical and semantic features pinpointing relevant terms and sentiment. Temporal features capture data spread and event relationships, though content features may lose effectiveness across different contexts. Sentiment alone may not reliably indicate fake news, as it can accompany both genuine and fake information. Additionally, the absence of typos may signal attackers' efforts to enhance content credibility. Profile and graph-based features, used to assess source credibility from network connections, can conversely provide attackers with valuable information for planning long-term attacks.

The presented data and results motivate a detailed treatment of the open issues of fake news detection. Consequently, there is a need for research tasks that:

Differentiate fake news from other related concepts based on content, intention and authenticity;

Enhance content features by non-textual data;

Investigate the importance of the automatically annotated corpora, lexical features, hand-crafted rules and pretrained models with the aim to facilitate fake news detection and improve its accuracy;

Analyze in depth the performance of current fake news detection models, and how well their accuracy holds up as the application domain or attack style varies;

Improve detection by adding pertinent features, since existing ones can be exploited by attackers to make users believe that fake news is true;

Propose new techniques to raise Internet users' awareness of fake news and the devastating effect of this phenomenon.

The aforementioned points highlight some directions and open issues for fake news detection. Besides these points, which are shared with other languages, Arabic faces its dialectal varieties and complex morphology, which reflect its challenging nature. It is therefore important for future research on Arabic fake news to explore these points from different angles to improve existing detection approaches and results.

Arabic can also benefit from studies on other languages to create and expand datasets, improve annotation and classification models, and develop customized fake news awareness techniques.

9.1 Datasets

The Arabic datasets are mainly related to politics and the recently emerged COVID-19 pandemic. Hence, in further studies, Arabic can benefit from foreign-language datasets, either through translation or by direct collection. Many datasets can be explored in this context, as they are characterized by considerable size and domain variety. Wang (2017) manually labeled LIAR, which contains 12,800 short English statements about U.S. politics covering domains such as elections, the economy, healthcare, and education. The datasets of (Sahoo and Gupta 2021; Zhang et al. 2020; Shu et al. 2017; Kaur et al. 2020; Wang et al. 2020) also relate to English-language politics, with sizes varying between 4,048 and 37,000 tweets. The datasets presented in (Shu et al. 2017; Karimi et al. 2018) exceed 22,140 news articles related to politics, celebrity reports, and entertainment stories. Moreover, Arabic studies need to explore dataset balancing to reduce errors in differentiating between fake and genuine news (Jones-Jang et al. 2021), as well as multimodal fake news detection (Haouari et al. 2021).

The training of deep learning models requires a large amount of annotated data. Moreover, due to the dynamic nature of online news, annotated samples may quickly become outdated and thus unrepresentative of newly emerged events. Manual annotation cannot be the only annotation method, since it is expensive and time-consuming; automatic and semi-supervised approaches must therefore be used to generate labeled datasets. To increase the robustness of fact-checking systems, available fake news datasets can be generated or extended automatically through various approaches, among them the Generative Enhanced Model (Niewinski et al. 2019), reinforced weakly supervised fake news detection (Wang et al. 2020), and the alteration of genuine content to generate fake claims about the same topic (Khouja 2020).

Improving existing datasets for Arabic fake news detection involves several strategies aimed at enhancing the quality, diversity, and representativeness of the data. Here are some ways to improve existing datasets:

Data annotation and labeling : Invest in rigorous and consistent annotation and labeling processes to ensure accurate classification of news articles as fake or genuine. Use multiple annotators to mitigate bias and improve inter-annotator agreement. Include diverse perspectives and expertise in the annotation process to capture nuances in fake news detection.

Data augmentation : Augment existing datasets by generating synthetic examples of fake news articles using techniques such as back-translation, paraphrasing, and text summarization. This can help increase the diversity of the dataset and improve model generalization.

Balancing class distribution : Ensure that the dataset has a balanced distribution of fake and genuine news articles to prevent classifier bias towards the majority class. Use techniques such as oversampling, undersampling, or synthetic sampling to balance the class distribution and improve classifier performance (a minimal sketch follows this list).

Multimodal data integration : Integrate additional modalities such as images, videos, and metadata (e.g., timestamps, sources) into the dataset to provide richer contextual information for fake news detection. Multimodal datasets can capture subtle cues and patterns that may not be apparent in text alone.

Fine-grained labeling : Consider incorporating fine-grained labels or sub-categories of fake news (e.g., clickbait, propaganda, satire) to provide more detailed insights into the nature and characteristics of fake news articles. Fine-grained labeling can enable more nuanced analysis and model interpretation.

Cross-domain and cross-lingual datasets : Collect and incorporate data from diverse domains and languages to improve model robustness and generalization. Cross-domain and cross-lingual datasets expose models to a wider range of linguistic and contextual variations, enhancing their ability to detect fake news across different domains and languages.

Continuous updating and evaluation : Regularly update and evaluate existing datasets to reflect evolving trends, emerging fake news techniques, and changes in language use. Incorporate feedback from users and domain experts to iteratively improve dataset quality and relevance.

Open access and collaboration : Foster an open-access culture and encourage collaboration within the research community to share datasets, tools, and resources for fake news detection. Open datasets facilitate reproducibility, benchmarking, and model comparison, leading to advancements in the field.

Ethical considerations: Adhere to ethical guidelines and data privacy regulations when collecting and using data, ensuring the protection of individuals' privacy and rights.

By implementing these strategies, researchers and practitioners can enhance the quality and effectiveness of existing datasets for Arabic fake news detection, leading to more robust and reliable detection models.
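
As a concrete illustration of the class-balancing strategy above, the following scikit-learn sketch oversamples the minority class by resampling with replacement; the dataframe columns and label encoding are assumptions:

    import pandas as pd
    from sklearn.utils import resample

    # df: assumed dataframe with a "text" column and a "label" column (1 = fake, 0 = genuine)
    fake, genuine = df[df["label"] == 1], df[df["label"] == 0]
    minority, majority = (fake, genuine) if len(fake) < len(genuine) else (genuine, fake)

    # Oversample the minority class with replacement up to the majority class size
    upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, upsampled]).sample(frac=1, random_state=0)  # shuffle rows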

9.2 Feature extraction

Many features should be explored to develop more sophisticated linguistic and semantic features specific to Arabic, covering its morphology, syntax, and semantics. Indeed, analyzing source-credibility features, the number of authors, their affiliations, and their history as authors of press articles can play an important role in fake news detection. Additionally, word counts, lexical, syntactic, and semantic levels, discourse-level news sources (Shu et al. 2020; Elsayed et al. 2019; Sitaula et al. 2020), and publishing historical records (Wang et al. 2018) can also contribute to detection. Temporal features and the hierarchical propagation network on social media must be explored as well (Shu et al. 2020; Ruchansky et al. 2017), and studies can be enhanced by extracting event-invariant features (Wang et al. 2018).

9.3 Classification

Besides the existing classification approaches, Arabic models need to be aware of the domain and of the nature of the content. Improving existing models for Arabic fake news detection can involve the following approaches:

Model architecture enhancement : Explore advanced neural network architectures and techniques tailored to Arabic text, for example by enhancing attention mechanisms and memory networks and enlarging existing pretrained models to improve fake news detection performance (Khan et al. 2021; Ahmed et al. 2021).

Multimodal learning : Incorporate multimodal information, such as images, videos, and metadata, in addition to textual content, to improve the model's understanding and detection of fake news.

Semi-supervised learning : Leverage semi-supervised learning techniques to make more efficient use of limited labeled data by combining it with the large amounts of unlabeled data that are often abundant in real-world scenarios (a minimal sketch follows this list).

Domain adaptation : Investigate domain adaptation methods to transfer knowledge learned from other languages or domains to Arabic fake news detection tasks. This includes exploring multi-source, multi-class, and multi-lingual fake news detection (Karimi et al. 2018; Wang 2017).

Ensemble methods : Combine predictions from multiple models or model variants to enhance the robustness and generalization ability of the overall system.

Continuous evaluation and updating : Regularly evaluate model performance on new data and fine-tune the model parameters or architecture based on feedback to ensure adaptability to evolving fake news detection challenges.
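
As one concrete route to the semi-supervised strategy above, the sketch below uses scikit-learn's SelfTrainingClassifier, which iteratively pseudo-labels unlabeled examples the base model is confident about; the inputs and the confidence threshold are assumptions:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    # labeled_texts/labels and unlabeled_texts: an assumed partially labeled Arabic corpus
    X = TfidfVectorizer().fit_transform(list(labeled_texts) + list(unlabeled_texts))
    y = np.concatenate([labels, -np.ones(len(unlabeled_texts))])  # -1 marks unlabeled samples

    self_training = SelfTrainingClassifier(
        LogisticRegression(max_iter=1000),
        threshold=0.9,  # only pseudo-label predictions made with at least 90% confidence
    )
    self_training.fit(X, y)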

9.4 Fake news awareness techniques

Researchers have investigated the repercussions of fake news on various fronts, proposing methods to counter its influence without relying solely on identification systems. They advocate for raising awareness among individuals and propose alternative detection strategies. To summarize, the awareness techniques encompass the following points:

Investigating the influence of culture and demographics on the spread of fake news via social media, since culture has the most significant impact on that spread (Rampersad and Althiyabi 2020).

Studying the impact of fake news on consumer behavior through an empirical methodological approach (Visentin et al. 2019), and identifying the key elements of fake news, i.e., misleading content intended to cause reputational harm (Jahng et al. 2020).

Sensitizing older adults, since they are the most targeted by fake news and share the most misinformation, a phenomenon that could intensify in years to come (Rampersad and Althiyabi 2020; Brashier and Schacter 2020).

Boosting resilience to misinformation, which may make people more immune to it (Lewandowsky and van der Linden 2021).

Increasing fake news identification by fostering information literacy (Jones-Jang et al. 2021).

Preventing misinformation that hinders the widespread adoption of health-protective behaviors in the population (Yafooz et al. 2022), in particular for COVID-19.

Improving the ability to spot misinformation through online games that train players to detect fake news (Basol et al. 2020).

10 Conclusion

This survey was structured to help researchers in the field define their roadmaps based on the presented information. We introduced the terminologies related to automatic fake news detection, highlighted the impact of fake news on the orientation of public opinion, and stressed the importance of distinguishing facts from opinions. We then presented recent Arabic benchmark datasets and addressed the features that can be extracted, along with their categories. We described various studies, with their approaches and experimental results, compared the systems' results, and proposed recommendations for future approaches. Based on the compiled findings, fake news detection continues to confront numerous challenges, with ample opportunities for enhancement across facets including feature extraction, model development, and classifier selection. Addressing the open issues and future research directions involves distinguishing fake news from related concepts like satire, as well as identifying check-worthy content within extensive datasets. Constructing pre-trained models that are invariant to changes of domain, topic, source, or language also remains a challenge, as does building models that can detect newly emergent data to which the system is not accustomed. Furthermore, systems must be able to explain why news items are classified as fake or genuine, to enhance the existing models. While our survey comprehensively covers contemporary aspects of fake news detection, its scope is constrained by the dynamic nature of fake news research, preventing us from incorporating real-time updates on research advancements.

http://www.qamus.org/transliteration.htm

Ahmed B, Ali G, Hussain A, Baseer A, Ahmed J (2021) Analysis of text feature extractors using deep learning on fake news. Eng Technol Appl Sci Res 11:7001–7005. https://doi.org/10.48084/etasr.4069


Al Zaatari A, El Ballouli R, ELbassouni S, El-Hajj W, Hajj H, Shaban K, Habash N, Yahya E (2016) Arabic corpora for credibility analysis. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp 4396–4401)

Al-Ghadir AI, Azmi AM, Hussain A (2021) A novel approach to stance detection in social media tweets by fusing ranked lists and sentiments. Inf Fusion 67:29–40. https://doi.org/10.1016/j.inffus.2020.10.003

Alhindi T, Alabdulkarim A, Alshehri A, Abdul-Mageed M, Nakov P (2021) AraStance: a multi-country and multi-domain dataset of Arabic stance detection for fact checking. arXiv preprint arXiv:2104.13559

Ali K, Li C, Muqtadir SA (2022) The effects of emotions, individual attitudes towards vaccination, and social endorsements on perceived fake news credibility and sharing motivations. Comput Hum Behav 134:107307

Ali ZS, Mansour W, Elsayed T, Al‐Ali A (2021) AraFacts: the first large Arabic dataset of naturally occurring claims. In Proceedings of the sixth Arabic natural language processing workshop (pp 231–236)

Alkhair M, Meftouh K, Smaïli K, Othman N (2019) An Arabic corpus of fake news: collection, analysis and classification. In: Smaïli K (ed) Arabic language processing: from theory to practice, communications in computer and information science. Springer International Publishing, Cham, pp 292–302. https://doi.org/10.1007/978-3-030-32959-4_21


Al-Qarqaz A, Abujaber D, Abdullah MA (2021) R00 at NLP4IF-2021 fighting COVID-19 infodemic with transformers and more transformers. In: Proceedings of the fourth workshop on NLP for internet freedom: censorship, disinformation, and propaganda, online. pp 104–109. https://doi.org/10.18653/v1/2021.nlp4if-1.15

Alqurashi S, Hamoui B, Alashaikh A, Alhindi A, Alanazi E (2021) Eating garlic prevents COVID-19 infection: detecting misinformation on the Arabic content of twitter. arXiv preprint arXiv:2101.05626

Al-Yahya M, Al-Khalifa H, Al-Baity H, AlSaeed D, Essam A (2021) Arabic fake news detection: comparative study of neural networks and transformer-based approaches. Complexity 2021:1–10. https://doi.org/10.1155/2021/5516945

Ameur MSH, Aliane H (2021) AraCOVID19-MFH: Arabic COVID-19 multi-label fake news and hate speech detection dataset. arXiv preprint arXiv:2105.03143

Awajan A (2023) Enhancing Arabic fake news detection for Twitter's social media platform using shallow learning techniques. J Theor Appl Inf Technol 101(5):1745–1760

Ayyub K, Iqbal S, Nisar MW, Ahmad SG, Munir EU (2021) Stance detection using diverse feature sets based on machine learning techniques. J Intell Fuzzy Syst 40(5):9721–9740

Baly R, Mohtarami M, Glass J, Màrquez L, Moschitti A, Nakov P (2018) Integrating stance detection and fact checking in a unified corpus. arXiv preprint arXiv:1804.08012

Barron-Cedeno A, Elsayed T, Nakov P, Martino GDS, Hasanain M, Suwaileh R, Haouari F, Babulkov N, Hamdan B, Nikolov A, Shaar S, Ali ZS (2020) Overview of CheckThat! 2020: automatic identification and verification of claims in social media. arXiv preprint arXiv:2007.07997

Barrón-Cedeño A, Elsayed T, Nakov P, Da San Martino G, Hasanain M, Suwaileh R, Haouari F, Babulkov N, Hamdan B, Nikolov A, Shaar S (2020) Overview of CheckThat! 2020: automatic identification and verification of claims in social media. In: International conference of the cross-language evaluation forum for European languages. Springer, Cham, pp 215–236

Basol M, Roozenbeek J, Van der Linden S (2020) Good news about bad news: gamified inoculation boosts confidence and cognitive immunity against fake news. J Cogn 3:2. https://doi.org/10.5334/joc.91

Bovet A, Makse HA (2019) Influence of fake news in Twitter during the 2016 US presidential election. Nat Commun 10:7. https://doi.org/10.1038/s41467-018-07761-2

Brashier NM, Schacter DL (2020) Aging in an era of fake news. Curr Dir Psychol Sci 29:316–323. https://doi.org/10.1177/0963721420915872

Brashier NM, Pennycook G, Berinsky AJ, Rand DG (2021) Timing matters when correcting fake news. Proc Natl Acad Sci 118:e2020043118. https://doi.org/10.1073/pnas.2020043118

Da San Martino G, Seunghak Y, Barrón-Cedeno A, Petrov R, Nakov P (2019) Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp 5636–5646

Elhadad MK, Li KF, Gebali F (2021) COVID-19-FAKES: a twitter (Arabic/English) dataset for detecting misleading information on COVID-19. In: Barolli L, Li KF, Miwa H (eds) Advances in intelligent networking and collaborative systems, advances in intelligent systems and computing. Springer International Publishing, Cham, pp 256–268. https://doi.org/10.1007/978-3-030-57796-4_25

Elsayed T, Nakov P, Barrón-Cedeno A, Hasanain M, Suwaileh R, Da San Martino G, Atanasova P (2019) Overview of the CLEF-2019 CheckThat! Lab: automatic identification and verification of claims. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019, Proceedings 10. Springer International Publishing, pp 301–321

Gumaei A, Al-Rakhami MS, Hassan MM, De Albuquerque VHC, Camacho D (2022) An effective approach for rumor detection of arabic tweets using extreme gradient boosting method. ACM Trans Asian Low-Resour Lang Inf Process 21:1–16. https://doi.org/10.1145/3461697

Haouari F, Ali ZS, Elsayed T (2019) bigIR at CLEF 2019: automatic verification of arabic claims over the Web. In CLEF (working notes)

Haouari F, Hasanain M, Suwaileh R, Elsayed T (2021) ArCOV19-rumors: Arabic COVID-19 twitter dataset for misinformation detection. arXiv preprint arXiv:2010.08768

Hardalov M, Arora A, Nakov P, Augenstein I (2021) A survey on stance detection for mis- and disinformation identification. arXiv preprint arXiv:2103.00242

Harrag F, Djahli MK (2022) Arabic fake news detection: a fact checking based deep learning approach. ACM Trans Asian Low-Resour Lang Inf Process 21:1–34. https://doi.org/10.1145/3501401

Helwe C, Elbassuoni S, Al Zaatari A, El-Hajj W (2019) Assessing arabic weblog credibility via deep co-learning. In: Proceedings of the Fourth Arabic natural language processing workshop. Presented at the proceedings of the fourth Arabic natural language processing workshop. Association for Computational Linguistics, Florence. pp 130–136. https://doi.org/10.18653/v1/W19-4614

Himdi H, Weir G, Assiri F, Al-Barhamtoshy H (2022) Arabic fake news detection based on textual analysis. Arab J Sci Eng 47(8):10453–10469

Jahng MR, Lee H, Rochadiat A (2020) Public relations practitioners’ management of fake news: exploring key elements and acts of information authentication. Public Relat Rev 46:101907. https://doi.org/10.1016/j.pubrev.2020.101907

Jardaneh G, Abdelhaq H, Buzz M, Johnson D (2019) Classifying Arabic tweets based on credibility using content and user features. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT). Presented at the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT). IEEE, Amman. pp 596–601. https://doi.org/10.1109/JEEIT.2019.8717386

Jones-Jang SM, Mortensen T, Liu J (2021) Does media literacy help identification of fake news? Information literacy helps, but other literacies don’t. Am Behav Sci 65:371–388. https://doi.org/10.1177/0002764219869406

Karimi H, Roy P, Saba-Sadiya S, Tang J (2018) Multi-source multi-class fake news detection. In Proceedings of the 27th international conference on computational linguistics, pp 1546–1557

Kaur S, Kumar P, Kumaraguru P (2020) Automating fake news detection system using multi-level voting model. Soft Comput 24:9049–9069. https://doi.org/10.1007/s00500-019-04436-y

Khalil A, Jarrah M, Aldwairi M, Jaradat M (2022) AFND: Arabic fake news dataset for the detection and classification of articles credibility. Data Brief 42:108141

Khan JY, Khondaker MdTI, Afroz S, Uddin G, Iqbal A (2021) A benchmark study of machine learning models for online fake news detection. Mach Learn Appl 4:100032. https://doi.org/10.1016/j.mlwa.2021.100032

Khouja J (2020) Stance prediction and claim verification: an Arabic perspective. arXiv preprint arXiv:2005.10410

Lewandowsky S, van der Linden S (2021) Countering misinformation and fake news through inoculation and prebunking. Eur Rev Soc Psychol:1–38. https://doi.org/10.1080/10463283.2021.1876983

Lim G, Perrault ST (2020) Perceptions of news sharing and fake news in Singapore. arXiv preprint arXiv:2010.07607

Mahlous AR, Al-Laith A (2021) Fake news detection in arabic tweets during the COVID-19 pandemic. Int J Adv Comput Sci Appl 12. https://doi.org/10.14569/IJACSA.2021.0120691

Mohammad S, Kiritchenko S, Sobhani P, Zhu X, Cherry C (2016) SemEval-2016 task 6: detecting stance in tweets, proceedings of the 10th international workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, pp 31–41. https://doi.org/10.18653/v1/S16-1003

Nagoudi EMB, Elmadany A, Abdul-Mageed M, Alhindi T, Cavusoglu H (2020) Machine generation and detection of Arabic manipulated and fake news. arXiv preprint arXiv:2011.03092

Najadat H, Tawalbeh M, Awawdeh R (2022) Fake news detection for Arabic headlines-articles news data using deep learning. Int J Elec Comput Eng (2088–8708) 12(4):3951

Nakov P, Barrón-Cedeno A, Elsayed T, Suwaileh R, Màrquez L, Zaghouani W, Atanasova P, Kyuchukov S, Da San Martino G (2018) Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims. In Experimental IR meets multilinguality, multimodality, and interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, Proceedings 9. Springer International Publishing, pp 372–387

Nakov P, Alam F, Shaar S, Martino GDS, Zhang Y (2021) A second pandemic? Analysis of fake news about COVID-19 vaccines in Qatar. arXiv preprint arXiv:2109.11372

Nassif AB, Elnagar A, Elgendy O, Afadar Y (2022) Arabic fake news detection based on deep contextualized embedding models. Neural Comput Appl 34(18):16019–16032

Niewinski P, Pszona M, Janicka M (2019) GEM: generative enhanced model for adversarial attacks. Proceedings of the second workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics, Hong Kong, pp 20–26. https://doi.org/10.18653/v1/D19-6604

Noman Qasem S, Al-Sarem M, Saeed F (2022) An ensemble learning based approach for detecting and tracking COVID19 rumors. Comput Mater Contin 70:1721–1747. https://doi.org/10.32604/cmc.2022.018972

Oshikawa R, Qian J, Wang WY (2018) A survey on natural language processing for fake news detection. arXiv preprint arXiv:1811.00770

Rampersad G, Althiyabi T (2020) Fake news: acceptance by demographics and culture on social media. J Inf Technol Polit 17:1–11. https://doi.org/10.1080/19331681.2019.1686676

Rangel F, Giachanou A, Ghanem BHH, Rosso P (2020) Overview of the 8th author profiling task at pan 2020: profiling fake news spreaders on twitter. In CEUR workshop proceedings. Sun SITE Central Europe, (vol. 2696, pp 1–18)

Ruchansky N, Seo S, Liu Y (2017) CSI: a hybrid deep model for fake news detection. Proceedings of the 2017 ACM on conference on information and knowledge management. Singapore, pp 797–806. https://doi.org/10.1145/3132847.3132877

Saadany H, Mohamed E, Orasan C (2020) Fake or real? A study of Arabic satirical fake news. arXiv preprint arXiv:2011.00452

Sabbeh SF, Baatwah SY (2018) Arabic news credibility on twitter: an enhanced model using hybrid features. J Theor Appl Inf Technol 96(8)

Saeed NM, Helal NA, Badr NL, Gharib TF (2020) An enhanced feature-based sentiment analysis approach. Wiley Interdiscip Rev: Data Min Knowl Disc 10(2):e1347


Saeed RM, Rady S, Gharib TF (2021) Optimizing sentiment classification for Arabic opinion texts. Cogn Comput 13(1):164–178

Saeed NM, Helal NA, Badr NL, Gharib TF (2018) The impact of spam reviews on feature-based sentiment analysis. In 2018 13th Int Conf Comput Eng Sys (ICCES) IEEE, pp 633–639

Sahoo SR, Gupta BB (2021) Multiple features based approach for automatic fake news detection on social networks using deep learning. Appl Soft Comput 100:106983. https://doi.org/10.1016/j.asoc.2020.106983

Shahi GK, Struß JM, Mandl T (2021) Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF

Shu K, Wang S, Liu H (2017) Exploiting tri-relationship for fake news detection. arXiv preprint arXiv:1712.07709

Shu K, Mahudeswaran D, Wang S, Liu H (2020) Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the international AAAI conference on web and social media (vol. 14, pp 626–637)

Sitaula N, Mohan CK, Grygiel J, Zhou X, Zafarani R (2020) Credibility-based fake news detection. In: Disinformation, Misinformation, and fake news in social media. Springer, Cham, pp 163–182

Thaher T, Saheb M, Turabieh H, Chantar H (2021) Intelligent detection of false information in arabic tweets utilizing hybrid harris hawks based feature selection and machine learning models. Symmetry 13:556. https://doi.org/10.3390/sym13040556

Touahri I, Mazroui A (2018) Opinion and sentiment polarity detection using supervised machine learning. In 2018 IEEE 5th Int Congr Inf Sci Technol (CiSt) IEEE, pp 249–253

Touahri I, Mazroui A (2019) Automatic verification of political claims based on morphological features. In CLEF (working notes)

Touahri I, Mazroui A (2020) Evolution team at CLEF2020-CheckThat! lab: integration of linguistic and sentimental features in a fake news detection approach. In CLEF (working notes)

Visentin M, Pizzi G, Pichierri M (2019) Fake news, real problems for brands: the impact of content truthfulness and source credibility on consumers’ behavioral intentions toward the advertised brands. J Interact Mark 45:99–112. https://doi.org/10.1016/j.intmar.2018.09.001

Wang Y, Yang W, Ma F, Xu J, Zhong B, Deng Q, Gao J (2020) Weak supervision for fake news detection via reinforcement learning. Proc AAAI Conf Artif Intell 34:516–523. https://doi.org/10.1609/aaai.v34i01.5389

Wang Y, Ma F, Jin Z, Yuan Y, Xun G, Jha K, Su L, Gao J (2018) Eann: event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, pp 849–857

Wang J, Makowski S, Cieślik A, Lv H, Lv Z (2023) Fake news in virtual community, virtual society, and metaverse: a survey. IEEE Trans Comput Soc Sys

Wang WY (2017) "Liar, liar pants on fire": a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648

Yafooz W, Emara AHM, Lahby M (2022) Detecting fake news on COVID-19 vaccine from YouTube videos using advanced machine learning approaches. In: Combating fake news with computational intelligence techniques. Springer, Cham, pp 421–435

Zhang J, Dong B, Yu PS (2020) FakeDetector: effective fake news detection with deep diffusive neural network. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). Presented at the 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, Dallas, pp 1826–1829. https://doi.org/10.1109/ICDE48307.2020.00180

Zhang X, Ghorbani AA (2020) An overview of online fake news: characterization, detection, and discussion. Inf Process Manage 57(2):102025

Zhou X, Zafarani R (2020) A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput Surv (CSUR) 53(5):1–40

Zhou X, Jain A, Phoha VV, Zafarani R (2020) Fake news early detection: a theory-driven model. Digit Threats Res Pract 1:1–25. https://doi.org/10.1145/3377478


Author information

Authors and affiliations

Department of Computer Science, Superior School of Technology, University Moulay Ismail, Meknes, Morocco

Ibtissam Touahri

Department of Mathematics and Computer Science, Faculty of Sciences, Mohamed First University, Oujda, Morocco

Azzeddine Mazroui


Contributions

The corresponding author wrote the manuscript and prepared the tables. The second author reviewed the manuscript. Both authors made corrections.

Corresponding author

Correspondence to Ibtissam Touahri .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Touahri, I., Mazroui, A. Survey of machine learning techniques for Arabic fake news detection. Artif Intell Rev 57, 157 (2024). https://doi.org/10.1007/s10462-024-10778-3

Download citation

Accepted : 24 April 2024

Published : 28 May 2024

DOI : https://doi.org/10.1007/s10462-024-10778-3


Keywords

  • Natural language processing
  • Fake news detection
  • Machine learning
  • Deep learning


RELATED RESOURCES

  1. CSE 446: Machine Learning Assignment 1 (PDF)

    Homework Template and Files to Get You Started: The homework zip file contains the skeleton code and data sets that you will require for this assignment. Please read through the documentation provided in ALL files before starting the assignment. Citing Your Sources: Any sources of help that you consult while completing this assignment (other ...

  2. denikn/Machine-Learning-MIT-Assignment

    This repository contains the exercises, lab work, and homework assignments for the Introduction to Machine Learning online class taught by Professor Leslie Pack Kaelbling, Professor Tomás Lozano-Pérez, Professor Isaac L. Chuang, and Professor Duane S. Boning from the Massachusetts Institute of Technology.

  3. All notes and materials for the CS229: Machine Learning course by Stanford University

    All lecture notes, slides, and assignments for the CS229: Machine Learning course by Stanford University. The videos of all lectures are available on YouTube. Useful links: CS229 Summer 2019 edition.

  4. CS 402: HW#7, Machine learning

    Part II: Programming. The topic of this assignment is machine learning for supervised classification problems. Here are the main components of the assignment: Implementation of the machine learning algorithm of your choice. Comparison of your learning algorithm to those implemented by your fellow students on a small set of benchmark datasets.

  5. Foundations of Machine Learning

    Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (Aurélien Géron) This is a practical guide to machine learning that corresponds fairly well with the content and level of our course. While most of our homework is about coding ML from scratch with numpy, this book makes heavy use of scikit-learn and TensorFlow.

  6. Machine Learning 10-601: Homework

    Late homework policy: late homework will be penalized according to the following policy. Homework is worth full credit at the beginning of class on the due date, half credit for the next 48 hours, and zero credit after that. Turn in hardcopies of all late homework assignments to Sharon Cavlovich.

  7. Stanford Engineering Everywhere

    Ng's research is in the areas of machine learning and artificial intelligence. He leads the STAIR (STanford Artificial Intelligence Robot) project, whose goal is to develop a home assistant robot that can perform tasks such as tidying up a room, loading/unloading a dishwasher, fetching and delivering items, and preparing meals in a kitchen.

  8. Intro to Machine Learning 10-701

    Machine Learning is concerned with computer programs that automatically improve their performance through experience (e.g., programs that learn to recognize human faces, recommend music and movies, and drive autonomous robots). ... Some of the homework assignments used in this class may have been used in prior versions of this class, or in ...

  9. Assignments

    The assignments section provides problem sets, solutions, and supporting files from the course. Ali Mohammad and Rohit Singh prepared the problem sets and solutions.

  10. 10-315 Introduction to Machine Learning: Homework 6 (PDF)

    (b) [8 pts] Complete the function update_assignments(X, C). The input X is the n × d data matrix; C is the k × d matrix of current centers. The function returns a vector of current cluster assignments, an array of length n. That is, C[i] is the center for cluster i, and the jth data point X[j] is assigned to cluster assignments[j]. (A NumPy sketch of this assignment step appears after this list.)

  11. Homework Assignments

    Homework. All homework assignments consist of two parts, a written section (due Tuesdays) and a programming section (due Thursdays). The instructions for both sections are included in the assignment zip files. Programming assignments will be distributed through svn. See the zip file for additional instructions.

  12. DS-GA 1003 / CSCI-GA 2567: Machine Learning, Spring 2019

    Math for Machine Learning by Hal Daumé III. Software: NumPy is "the fundamental package for scientific computing with Python." Our homework assignments will use NumPy arrays extensively. scikit-learn is a comprehensive machine learning toolkit for Python. We won't use this for most of the homework assignments, since we'll be coding things from ...

  13. CSC 411: Introduction to Machine Learning

    Homework assignments. The best way to learn about a machine learning method is to program it yourself and experiment with it. So the assignments will generally involve implementing machine learning algorithms and experimenting to test your algorithms on some data. You will be asked to summarize your work, and analyze the results, in brief (3 ...

  14. fatosmorina/machine-learning-exams

    This repository contains links to machine learning exams, homework assignments, and exercises that can help you test your understanding. Carnegie Mellon University (CMU): the fall 2009 10-601 midterm (midterm and solutions); the spring 2009 10-601 midterm (midterm and solutions).

  15. CS 4/5780: Introduction to Machine Learning

    Final grades are based on homework assignments, programming projects, and the exams. For the 5780-level version of the course, the research comprehension quizzes will also factor in. ... Machine Learning: A Probabilistic Perspective by Murphy. We will provide section numbers to this text alongside many of the lectures (abbreviated as MLaPP in the ...

  16. CSC 2515 Fall 2021: Introduction to Machine Learning

    Reading Assignments (10%): due Dec 10. Questions & Answers (10%). Bonus (5%): finding typos in the slides, active class participation, evaluating the class, etc. Homework Assignments: this is a tentative schedule of the homework assignments. We plan to release them on Tuesday evenings, and they will be due in 10 days (Monday of two weeks ...

  17. Assignments: Theory of Machine Learning

    Homework problems. We will have a stream of homework problems, following every class. Since this is an advanced graduate level class, solving these problems right after class will (hopefully) help you understand the material better. Part I: Foundations of Learning Theory. Problem 1.

  18. Mathematics of Machine Learning Assignment 1

    This resource contains information regarding Mathematics of Machine Learning assignment 1 (PDF, 129 kB). Instructor: Prof. Philippe Rigollet.

  19. CS 335: Machine Learning

    Understand the general mathematical and statistical principles that allow one to design machine learning algorithms. Identify, understand, and implement specific, widely used machine learning algorithms. ... Therefore, start homework assignments and projects early to give yourself enough time to ask questions and receive answers.

  20. MEEG-54403

    MEEG-44403/54403 Machine Learning for Mechanical Engineers at the University of Arkansas (https://ned3.uark.edu/teaching/). This course includes four homework assignments to practice the application of different machine learning algorithms to specific mechanical engineering problems, and a project assignment that gives the students the ...

  21. ML ASSIGNMENT 1.docx

    Data plays a significant role in the machine learning process. One of the significant issues that machine learning professionals face is the absence of good-quality data; unclean and noisy data can make the whole process extremely hard. c) Data security: protecting digital data, such as those in a database, from destructive forces and from the unwanted actions of unauthorized ...

  22. Assignments

    This section provides three assignments for the course along with solutions: Assignment 2 (PDF), Assignment 2 Solution (PDF), Assignment 3 (PDF), and Assignment 3 Solution (PDF) (courtesy of William Perry; used with permission).

  23. For the purposes of this assignment you will develop ...

    Question: For the purposes of this assignment you will develop both a bagging and a boosting ensemble learning model of your choice to produce two dry-bean classification models using Python. You will then compare the performance of the chosen ensemble learning models with one another, and to the performance of a single machine learning model ... (A scikit-learn sketch of such a comparison appears after this list.)

  24. CSCCM@IITD

    Homework 1: Bayesian linear regression, sampling methods [homework]. Homework 2: Gaussian processes, approximate methods for Bayesian ... Students are expected to learn different probabilistic machine learning algorithms and their applications in solving mechanics problems. The course will emphasize the mathematical foundations of these concepts along ...

  25. EspalomaCharge: Machine Learning-Enabled Ultrafast Partial Charge Assignment

    E = k_e Σ_{i<j} q_i q_j / r_ij (or some modified form), where k_e is the Coulomb constant (energy · distance² / charge²) and r_ij is the interatomic distance. In fixed-charge MM force fields, the partial charges q_i are treated as constant, static parameters, independent of instantaneous geometry. As such, partial charge assignment—the manner in which partial charges are assigned to each atom in a given system based on their ... (A small NumPy sketch of this pairwise energy term appears after this list.)

  26. Tracing Student Activity Patterns in E-Learning Environments ...

    In distance learning educational environments like Moodle, students interact with their tutors, their peers, and the provided educational material through various means. Due to advancements in learning analytics, students' transitions within Moodle generate digital trace data that outline learners' self-directed learning paths and reveal information about their academic behavior within a ...

  27. Survey of machine learning techniques for Arabic fake news detection

    We may differentiate between machine learning (ML), deep learning (DL), and transformer-based approaches in terms of their approach, capabilities, and suitability for Arabic fake news detection based on the following criteria: Classical machine learning: ML models are effective when the features are well-defined and the dataset is not too large ... (A sketch of such a classical pipeline appears after this list.)
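
For the k-means assignment step described in item 10, the following is a minimal NumPy sketch. It assumes X is an n × d array and C a k × d array as in the excerpt; the vectorized distance computation is one possible implementation, not the course's reference solution.

```python
import numpy as np

def update_assignments(X, C):
    """Assign each of the n rows of X (n x d) to the nearest of the
    k centers in C (k x d); returns an integer array of length n."""
    # Squared Euclidean distance from every point to every center, shape (n, k).
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # assignments[j] is the index of the center closest to X[j].
    return dists.argmin(axis=1)
```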
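
Item 23 asks for a head-to-head comparison of bagging and boosting ensembles against a single model. Below is a minimal scikit-learn sketch of one such comparison; the Dry Bean dataset is not bundled with scikit-learn, so X and y are assumed to be loaded beforehand (e.g., from the UCI CSV), and the particular learners are illustrative choices rather than the assignment's required ones.

```python
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """Report 5-fold cross-validated accuracy for a single tree,
    a bagged ensemble, and a boosted ensemble (X, y assumed loaded)."""
    models = {
        "single tree": DecisionTreeClassifier(random_state=0),
        # In scikit-learn < 1.2 this keyword is base_estimator instead.
        "bagging": BaggingClassifier(estimator=DecisionTreeClassifier(),
                                     n_estimators=100, random_state=0),
        "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validated accuracy puts the three models on an equal footing; any other scorer accepted by cross_val_score would slot in the same way.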
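
The fixed-charge electrostatic term quoted in item 25 can be evaluated directly from the charges and interatomic distances. A small sketch follows, assuming q, r, and k_e are supplied in mutually consistent units; the function name and arguments are mine, not the paper's.

```python
import numpy as np

def coulomb_energy(q, r, k_e=1.0):
    """Pairwise Coulomb energy E = k_e * sum_{i<j} q_i * q_j / r_ij.

    q   : (n,) array of partial charges
    r   : (n, n) symmetric array of interatomic distances
    k_e : Coulomb constant in units consistent with q and r (assumed 1 here)
    """
    i, j = np.triu_indices(len(q), k=1)  # all index pairs with i < j
    return k_e * np.sum(q[i] * q[j] / r[i, j])
```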
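
The survey excerpt in item 27 argues that classical ML suits problems with well-defined features and modest data sizes. As one concrete illustration, here is a minimal sketch of such a classical pipeline for text classification; TF-IDF features with a linear SVM are a common baseline of this kind, not a method prescribed by the survey, and texts/labels are assumed to be a list of news strings with fake/real labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def classical_baseline(texts, labels):
    """5-fold cross-validated accuracy of a TF-IDF + linear-SVM baseline
    (texts: list of article strings; labels: fake/real classes, assumed)."""
    clf = make_pipeline(
        # Word unigrams/bigrams; rare terms dropped to keep features well-defined.
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LinearSVC(),
    )
    return cross_val_score(clf, texts, labels, cv=5).mean()
```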