“You’ve Been Caught!” A Computer Algorithm For Detecting Academic Dishonesty In School Written Assignments

In this project, I propose a Natural Language Processing-based computer algorithm with lexical, semantic and syntactic features to detect cheating in junior high school written assignments.
Adrian Balajadia
Grade 9


Academic Dishonesty, A Growing Problem In Schools

The recent COVID-19 pandemic has forced students and staff into online learning. Even though students are expected to complete assignments with the utmost honesty and integrity, there has been a rise in cases of academic misconduct through contract cheating and the sharing and copying of files.

Dishonesty in education is growing. In an interview published by CTV News Calgary, Sarah Elaine Eaton, an associate professor at the University of Calgary’s Werklund School of Education, said that universities across the country are seeing increases of up to 38% in academic misconduct cases [1]. During the summer of 2020, fourteen University of Calgary students were accused of sharing answers in an online chat room [2]. Eaton also wrote in The Conversation Canada that cases of academic dishonesty could be underreported across the country [3], possibly because of the absence of research on this subject. She highlights that research in educational integrity is severely under-funded and, therefore, not as well understood. According to one of her research papers, this lack of understanding inhibits teachers and staff from properly identifying dishonesty in essays and other written assignments [4]. Sources such as these underline the expanding problem of cheating throughout the education system, in university, high school and junior high school settings.

As a student who has participated twice in the science fair, I want to present a project that will help minimize academic dishonesty in junior high school. I believe that providing this kind of tool will not only help teachers find cheating among students faster but will also help correct this kind of attitude as early as junior high school. "Natural Language Processing" (NLP) is the domain that merges linguistics with computer science to better understand human interaction and text. Using basic principles and algorithms of NLP, such as word segmentation and syntactic analysis, I would like to detect dishonest work in essays and other written assignments. This paper describes my cheat-detection algorithm, which uses 13 cheat indicator features, and evaluates its overall accuracy on a variety of examples of copied and plagiarised documents.


This "method" section is a summary of the research and development I performed to create my algorithm. It also describes the final architecture of my program. The source code is provided here  and can be opened in Google Colaboratory:'s%20Cheat%20Detect%20Code%20NLP.ipynb

What The Program Is Supposed To Detect

Since my project's scope is the identification of academic dishonesty at the junior high school level, I informally interviewed my English, French, Social and Science teachers. From these interviews, two commonly observed forms of document dishonesty emerged: copying sentences and copying concepts/ideas. While the first form mainly concerns sentence constructs and vocabulary, detecting the second form needs a more complex type of algorithm and, therefore, more time to develop. The methodology used in this study focuses mostly on the first form of cheating, while idea-copying detection is touched on lightly using some commonly used libraries and methods from stylometry.

A Review of Manual Methods

Before researching and utilizing computer science tools and concepts, I decided it was logical to review existing methods of detection, even though they are manual. For this section, I consulted my teachers, the University of Toronto's academic integrity website [7], and a paper published in the International Journal for Educational Integrity [8]. Each source presented very similar criteria:

  • The sentence structure was strange
  • The student text was too highly developed
  • Irregular vocabulary was used
  • The text was not following the student's writing style

While developing my program, I tried my best to incorporate features that reflected this information.

Program Architecture

I was planning to implement a machine learning-oriented approach, but in the interest of time and because of my limited knowledge, I instead created a program that captures and analyzes text using two layers of detection. Each document goes through the first layer, which produces a list of students who have similar documents. This list then goes to the second layer for a more detailed feature comparison. The result is a list of paired students who are suspected of copying.

First Layer

We begin by taking one document (we will denote this person as "Student A") as a specimen and applying Burrows' Delta Method [9]. The Delta function compares it with the rest of the documents and narrows down the students who might have copied from Student A, or from whom Student A might have copied.

Burrows' Delta Method is commonly used in stylometry (the study of writing styles) and authorship attribution. This mathematical function compares Student A's document to all the other documents in the class and produces a "delta score" for each classmate. The delta score is computed using the frequencies of the words in a text as its features. A low score means that a document is more similar to Student A's in terms of writing style. After computing delta scores, the program takes the 4 documents with the lowest delta scores (we will denote these documents as Students B, C, D and E) and proceeds to the second layer. I chose to take 4 documents, as this already represents 16% of students in a typical junior high class of 25.
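
To make the first layer more concrete, here is a minimal sketch of a Burrows' Delta comparison in plain Python. It is an illustration under stated assumptions, not the program's actual code: the function name `burrows_delta`, the choice of the 30 most frequent words as features, and the toy documents are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(docs, target, n_words=30):
    """Return {name: delta score} for every document compared with `target`.

    docs maps a (hypothetical) student name to a list of lowercase tokens.
    """
    # Features: the n most frequent words in the whole class corpus.
    corpus = Counter(tok for toks in docs.values() for tok in toks)
    vocab = [word for word, _ in corpus.most_common(n_words)]

    # Relative frequency of each feature word in each document.
    freqs = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        total = len(toks) or 1
        freqs[name] = [counts[word] / total for word in vocab]

    # Standardise each word's frequency across the class (z-scores).
    z = {name: [] for name in docs}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in docs]
        mu, sigma = mean(column), pstdev(column)
        for name in docs:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)

    # Delta = mean absolute difference between the two z-score profiles.
    return {
        name: mean(abs(a - b) for a, b in zip(z[target], z[name]))
        for name in docs
        if name != target
    }


# Toy class of three documents (invented for illustration).
docs = {
    "A": "the cat sat on the mat and the dog sat too".split(),
    "B": "the cat sat on the mat and the dog sat too".split(),
    "C": "a bird flew over a tall tree while a fox slept".split(),
}
scores = burrows_delta(docs, "A")
# Keep the (up to) 4 documents with the lowest delta scores for the second layer.
shortlist = sorted(scores, key=scores.get)[:4]
```

In this toy example the identical document B receives the lowest score, so it is the first name on the shortlist passed to the second layer.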

Second Layer

In this layer, the top 4 results are then analyzed and evaluated on 13 lexical (related to words), syntactic (related to sentences) and semantic (related to meaning) features. I chose a wide variety of features to analyze not only writing styles but the content and word use. Here, the program implements a point system, where Students B, C, D and E gain points if they represent a similarity reading from each feature. If Students B, C, D or E exceed 4 points, their name is placed on a "watch list" along with Student A. Below is a description of each feature used, with some features requiring prior processing of data.

Pre-processing Functions

In Natural Language Processing (NLP), text pre-processing is an important task that sanitizes the corpus by removing unnecessary information, such as words that provide little to no meaning. On each document, the program applies the steps listed below. In some instances, the program also performs keyword extraction:

  • Stop word removal (e.g. the, a, and)
  • Removal of punctuation
  • Removal of accents
  • Lowercasing all words
  • Removal of numbers
  • Lemmatization: In this function, the program changes all terms to their base form in respect of what type of word it is. A visualization of this is described below.
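
The steps in this list can be sketched in a few lines of Python. This is only an illustration: the stop word set is a tiny made-up sample, and the `LEMMAS` lookup table is a hypothetical stand-in for a real lemmatizer (such as the one a library like NLTK provides), included just to show what "changing terms to their base form" looks like.

```python
import re
import string
import unicodedata

# Illustrative subsets only; a real program would use much fuller resources.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
LEMMAS = {"mice": "mouse", "running": "run", "were": "be"}  # hypothetical table


def strip_accents(text):
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    text = strip_accents(text).lower()            # accents, lowercasing
    text = re.sub(r"\d+", " ", text)              # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # ASCII punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]   # stop words
    return [LEMMAS.get(tok, tok) for tok in tokens]                   # base forms


tokens = preprocess("The mice were running to the café in 2020!")
# tokens is ["mouse", "be", "run", "cafe"]
```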


Common Part-of-Speech Sequences - Syntactic Feature


This feature requires identifying the part-of-speech (POS) tag of each word in a sentence, then determining the four most common sequences of three consecutive POS tags (trigrams) found in a document. This method helps provide information on a student's writing style. The choice of trigrams was based on their successful use in Charles Hollingworth's stylometry paper [10], in work with word trigrams by a research group from the University of Wolverhampton [11], and in John Braunlin's Medium article [12].


In my program, if 2 out of 4 of Student A's most common POS trigrams are the same as those of Student B, C, D or E, then the program adds points to the final counter of that student pair (A and B, A and C, A and D, or A and E). I chose to set the threshold to two, as students may write in similar ways and a single shared POS trigram can be purely coincidental. To identify the POS tags of a document, I used the Natural Language Toolkit (NLTK) library in my program.
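
As a rough sketch of this comparison, the snippet below assumes the POS tags have already been produced by a tagger and are passed in as plain lists of tag strings. The function names and the toy tag sequences are hypothetical.

```python
from collections import Counter


def top_pos_trigrams(tags, k=4):
    """Return the k most common sequences of three consecutive POS tags."""
    trigrams = zip(tags, tags[1:], tags[2:])
    return [tri for tri, _ in Counter(trigrams).most_common(k)]


def shares_pos_style(tags_a, tags_b, k=4, min_shared=2):
    """True if at least `min_shared` of the top-k trigrams are common to both."""
    shared = set(top_pos_trigrams(tags_a, k)) & set(top_pos_trigrams(tags_b, k))
    return len(shared) >= min_shared


# Toy tag sequences (invented); real tags would come from a POS tagger.
tags_a = ["DT", "NN", "VBD", "DT", "NN", "VBD", "IN", "DT", "NN"]
tags_b = list(tags_a)                                  # same structure as A
tags_c = ["PRP", "MD", "VB", "RB", "JJ", "CC", "JJ"]   # different structure

same_style = shares_pos_style(tags_a, tags_b)
different_style = shares_pos_style(tags_a, tags_c)
```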


Term Frequency-Inverse Document Frequency (TF-IDF) - Semantic and Lexical Feature

TF-IDF is a statistical computation that computes the importance of words in a document and utilizes the concept of word embeddings. Word embedding is a common concept in NLP and is the act of translating documents into vectors or numbers for the computer to better interpret the information [15] [16] [19]. This method is frequently used for text similarity and was used to detect cheating in college exams in a paper in 2012 [13].

To compute TF-IDF, the program counts the occurrences of a word in a document (term frequency), then scales this count by a logarithmic factor (inverse document frequency) that reduces the importance of words appearing in many documents [14] [16]. After computing the TF-IDF values of each term, the program obtains a vector representation of the document and then applies the cosine similarity metric. The output produced is a number between zero (representing 0% similarity) and one (100% similarity) [19]. If the cosine similarity is beyond a certain threshold, the program adds points to the final counter. I applied this computation along with pre-processing on entire documents. Pre-processing is required here to ensure the program is analyzing relevant and correct terms instead of just frequently seen words.
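
The computation can be sketched in plain Python as follows. This is a simplified illustration rather than the program's implementation: the smoothed IDF formula is one common variant chosen here as an assumption, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    ["photosynthesis", "plants", "sunlight", "energy"],
    ["photosynthesis", "plants", "sunlight", "energy"],
    ["volcano", "lava", "eruption", "magma"],
]
vecs = tfidf_vectors(docs)
copied = cosine(vecs[0], vecs[1])      # identical documents, close to 1
unrelated = cosine(vecs[0], vecs[2])   # no shared terms, exactly 0
```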


Jaccard Index on Entire Documents and Sentences - Lexical and Syntactic Feature

The Jaccard Index is another commonly used similarity metric. It measures the common and distinct words of two documents and, analogous to cosine similarity, produces an output between zero and one. Jaccard similarity divides the number of words common to both documents by the total number of distinct words across the two documents [17] [18] [19]. In my implementation, I applied the index to entire documents and to sentences. On the scale of the entire document, if the output exceeds a threshold, then the program adds points to the final counter. Likewise, if two sentences have a high similarity, the program saves the two sentences and adds points.
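
A minimal sketch of both uses of the index is shown below. The sentence splitter is a deliberately naive regular expression, the helper names are hypothetical, and the 0.3 threshold is used only as an example value.

```python
import re


def jaccard(tokens_a, tokens_b):
    """Shared distinct words divided by all distinct words of both inputs."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_sentences(text):
    # Naive splitter: break on '.', '!' or '?'.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def suspicious_sentences(text_a, text_b, threshold=0.3):
    """Return (sentence_a, sentence_b, score) for every pair above the threshold."""
    pairs = []
    for sa in split_sentences(text_a):
        for sb in split_sentences(text_b):
            score = jaccard(sa.lower().split(), sb.lower().split())
            if score > threshold:
                pairs.append((sa, sb, score))
    return pairs


doc_score = jaccard("the cat sat".split(), "the cat ran".split())  # 2 / 4
pairs = suspicious_sentences(
    "Plants use sunlight. I like pizza.",
    "Plants use sunlight. Dogs bark loudly.",
)
```

Here only the sentence "Plants use sunlight" is saved as a suspicious pair, since the other sentences share no words.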



Total Number of Sentences, Words and Unique Words - Lexical and Syntactic Feature

The assumption behind this concept is that if direct copying was committed, then the totals of each feature would be identical, thus having a low difference. Having a low difference would result in adding points to the counter. Unique words are defined as words that are not stopwords (words that provide no meaning). These features aim to observe if a student's writing style is similar to another and to discover evidence of direct copying among the class. 


Standard Deviation, Average and Mean Length of Sentences - Syntactic Features

These features are mostly syntactic and provide information on a student's writing style. In another implementation, they were used to detect writing style changes in a text [20]. As with the previous features, if a student genuinely copied from another student, then the difference between the features would be low. In that case, the program adds points to the final counter. The unit of measure is the number of words per sentence.
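
The count-based and sentence-length features can be illustrated together. In this sketch the sentence splitter is naive, and the tolerance values in `TOLERANCE` are hypothetical placeholders for "a low difference", not the program's calibrated settings.

```python
import re
from statistics import mean, pstdev


def style_stats(text):
    """Sentence count, word count, and mean / standard deviation of sentence length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]  # words per sentence
    if not lengths:
        return {"sentences": 0, "words": 0, "mean_len": 0.0, "std_len": 0.0}
    return {
        "sentences": len(lengths),
        "words": sum(lengths),
        "mean_len": mean(lengths),
        "std_len": pstdev(lengths),
    }


# Hypothetical tolerances: how small a difference counts as "low".
TOLERANCE = {"sentences": 1, "words": 20, "mean_len": 1.0, "std_len": 1.0}


def similar_features(text_a, text_b, tolerance=TOLERANCE):
    """Names of the features whose difference is within the tolerance."""
    a, b = style_stats(text_a), style_stats(text_b)
    return [name for name in tolerance if abs(a[name] - b[name]) <= tolerance[name]]


stats = style_stats("One two three. Four five.")
matches = similar_features("One two three. Four five.", "One two three. Four five.")
```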


Function Words Frequencies - Lexical Feature

Function words are similar to stopwords, as they provide little to no meaning. Despite this, they are known to be a reliable stylometric feature. Because they are hard to consciously control, occur frequently, and are independent of topic, they are well suited to detecting writing style. Their effectiveness has been shown in three papers [21] [22] [23].


My program extracts the 5 most frequent function words of Student A. If any of these are also among the 5 most frequent function words of another student, the program computes the difference between the two students' frequencies of each common function word. Whenever the difference is low, the program adds one point to the "word counter". If the word counter is larger than a threshold, the program adds points to the final counter of that student pair. I decided to perform these measures to separate coincidental cases from suspicious ones.

The list of function words I used was from the 30 most frequent function words found through testing in the "Authorship Attribution Through Function Word Adjacency Networks" research paper [23].
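
A simplified sketch of this word counter is given below. The `FUNCTION_WORDS` set is a short illustrative sample and not the 30-word list from [23], and `max_diff` is a hypothetical stand-in for what counts as a "low" difference.

```python
from collections import Counter

# Illustrative sample only; not the 30-word list from the cited paper.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}


def top_function_words(tokens, k=5):
    """Relative frequencies of the k most frequent function words in a text."""
    counts = Counter(tok for tok in tokens if tok in FUNCTION_WORDS)
    total = len(tokens) or 1
    return {word: count / total for word, count in counts.most_common(k)}


def word_counter(tokens_a, tokens_b, max_diff=0.01):
    """Count the shared top function words whose frequencies are close."""
    top_a = top_function_words(tokens_a)
    top_b = top_function_words(tokens_b)
    shared = top_a.keys() & top_b.keys()
    return sum(1 for w in shared if abs(top_a[w] - top_b[w]) <= max_diff)


text_a = "the cat and the dog sat in the sun".split()
text_b = "the cat and the dog sat in the sun".split()
text_c = "birds fly south".split()

copied_count = word_counter(text_a, text_b)     # shares "the", "and", "in"
unrelated_count = word_counter(text_a, text_c)  # no function words in common
```

The same counting pattern would apply to punctuation marks if the tokens were punctuation characters instead of words.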

Punctuation Frequencies - Lexical Feature


The process of computing this feature is identical to how the program computed the function word feature. And just like function words, it helps provide information on writing style. Similar to the previous feature, we calculate the difference between the frequencies of the common types of punctuation. If they are low, the program adds a point to a "punctuation count". The computer adds points to the final counter of a student pair if the punctuation count is higher than an assigned threshold. I incorporated this feature after reading the same research paper where they detected changes of writing style in text [20].

TextRank Keyword Extraction - Semantic Feature

The purpose of this feature is to capture a portion of a text's meaning by extracting important keywords. Here, the program evaluates whether the content is similar to another student's by using keywords obtained from executing the TextRank algorithm [26]. In TextRank, we represent all words of a document as nodes in a graph. We then compute the weight or "importance" of each node/word by analyzing the number of edges (links) that are associated with each node, and produce a list of keywords for each document. This required pre-processing before execution.

After the program obtains the list of keywords from a document, it compares it to other documents' lists. If a pair has two common keywords, the program adds points to their counter.
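
The graph idea behind TextRank can be sketched without any library. The version below is a simplified illustration (a co-occurrence graph scored with a PageRank-style update), not Gensim's implementation; the window size, damping factor, iteration count and toy tokens are all assumptions.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=30):
    """Rank words by a PageRank-style score on a word co-occurrence graph."""
    # Link every word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Repeatedly share each node's score equally among its neighbours.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Pre-processed toy tokens (invented for illustration).
tokens = "plants sunlight energy plants water energy plants growth".split()
keywords = textrank_keywords(tokens, top_k=2)

# Comparing two documents then reduces to counting shared keywords.
common = set(keywords) & set(textrank_keywords(tokens, top_k=2))
```

In the toy text, "plants" and "energy" are linked to the most other words, so they come out as the two keywords.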


Latent Dirichlet Allocation (LDA) Topic Modelling - Semantic Feature

Topic modelling is the domain of NLP that concerns the creation of learned topics from a corpus (list of documents). In this example, I used the LDA model to check if two students possessed the same topics.

The LDA model, first introduced in 2003 [27], analyzes texts and generates a series of topics represented by keywords. In LDA, we assume that documents have a distribution of topics inside of them and these topics have a distribution of keywords [30]. By providing the desired number of "hidden" topics we hope to learn as a parameter (hence, the word latent in the name), the algorithm rearranges all these distributions to obtain accurate representative keywords for each topic [28] [29] [30]. To ensure I receive the best possible results, I applied pre-processing and removal of stopwords.

In real case scenarios, if a student copied from another student and LDA is applied on both of their works, they both would possess the same dominant topic. If they are found to have the same topics, then the program would add points to their final counter.

To evaluate the quality of the learned topics, I also calculated their "coherence scores" to see if the keywords of the topics are semantically (in meaning) similar or related [31]. For iterations (or epoch), I changed the "number of topics" parameter by 3 and found that setting it to 5 produces 0.5 to 0.6 coherence scores which are reasonable values.


Word2vec with Word Mover's Distance (WMD) - Semantic Feature

As mentioned in the TF-IDF section, word embedding is a common and important task in NLP, since it translates words into numerical data that a computer can process. In this feature, the program again uses this concept to learn semantic information from student texts. This is done by transforming the words in documents into vectors in a vector space, then calculating the travelling distance required for all of Student A's word vectors to arrive at Student B's word vectors. Pre-processing was not performed, as it would remove essential context words.

To learn vector representation of words in texts, I implemented a pre-existing model titled, "Word2vec" [32]. This model trains a very shallow neural network and uses the learned weights as the word vector. It uses 3 main hyperparameters [33] [34]:

  • Epochs: Number of times the computer passes through the documents. I kept this at the default setting of 5 after consulting Gensim's official documentation page and doing testing [33] [34].
  • Vector size: The dimension of the word vectors. I set this to 100 after running tests and finding that anything higher or lower than this number either caused large fluctuations in accuracy or produced the same results.
  • Window size: This parameter sets how big the context window will be. According to papers and Stack Exchange queries, having a bigger window size may yield better results [38] [39].

Additionally, there are also two variants of Word2vec that a scientist may use [32] [35] [36]:

  • The "Continuous Bag Of Words" (CBOW) model, which is used primarily for capturing syntactic information. It takes the context of a word as input and predicts the word itself.

  • The "Skip-gram" model, which is used for learning semantic information. It takes a specific word as input and produces a list of context words as output. I used this model instead of the former, as it is more semantically oriented and is more accurate in cases where the data is small [32] [37].

Once Word2vec has produced word vectors for Student A's document, my program learns word vectors for the documents of Students B, C, D and E. Afterwards, it applies a distance metric called Word Mover's Distance (WMD) to compute the travelling distance for Student B's (or C's, D's or E's) word vectors to arrive at Student A's word vectors [40] [41]. WMD primarily involves computing Euclidean distances. Unlike similarity metrics, with distance metrics a smaller value means a more similar document. If the WMD of a student pair is lower than an assigned threshold, the program assigns points.
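
To give a feel for the "travelling distance" idea, the sketch below computes a relaxed Word Mover's Distance, a simpler lower bound in which every word travels entirely to its nearest word in the other document. It is not the exact WMD used by the program, and the two-dimensional word vectors are hand-made toy values rather than vectors learned by Word2vec.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(words_a, words_b, vectors):
    """Relaxed WMD: a lower bound on the exact Word Mover's Distance."""

    def one_way(src, dst):
        src = [w for w in src if w in vectors]
        dst = [w for w in dst if w in vectors]
        if not src or not dst:
            return float("inf")
        # Each source word moves all of its weight to its nearest target word.
        return sum(
            min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src
        ) / len(src)

    # The larger of the two one-directional costs is the tighter bound.
    return max(one_way(words_a, words_b), one_way(words_b, words_a))


# Hand-made toy vectors (assumed values, not learned embeddings).
vectors = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}

same = relaxed_wmd(["cat"], ["cat"], vectors)      # identical text: distance 0
near = relaxed_wmd(["cat"], ["kitten"], vectors)   # related words: small distance
far = relaxed_wmd(["cat"], ["car"], vectors)       # unrelated words: large distance
```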





How Tests Were Conducted

I evaluated the overall performance of my model using the test scenarios described below and made major modifications accordingly. Each test scenario included a source text, and the test documents were labelled "non-plagiarized" or "plagiarized". In some test scenarios, plagiarized examples were also tagged as "near copied", "copied with light revision" or "copied with heavy revision". The objective is for the program to exclude non-plagiarised documents from the final list and to detect plagiarised ones. To measure accuracy, each scenario was assigned a score (an 8/8 would mean 100% accuracy). By testing and modifying certain threshold values and the point system, I was able to improve the output quality of the program from ~24% to ~64%.

Document Sources

Unfortunately, due to the confidentiality policies covering school documents, I was not able to test my program using only real junior high texts, since the teachers could give me only one sample document. To compensate for the lack of sample data, I researched and collected samples of plagiarised texts on the internet. All the documents used are mentioned in the attachments.

Test 1


  • SCENARIO A: 8 documents from task A (score out of 7)
    • 1 source text, 4 copied/plagiarised documents from that source and 3 non-copied/plagiarised
    • List of tagged documents
  • SCENARIO B: 8 documents from the University of Oxford's plagiarism examples (score out of 7)
    • 1 source text, 5 copied/plagiarised documents from that source and 2 non-copied/plagiarised
  • SCENARIO C: 4 documents from Bowdoin plagiarism examples (score out of 2)
    • 1 Source text A, 1 source text B, 1 copied document from source A and 1 copied document from source B

Threshold and Point Values

All point values for each feature (points that are awarded when there is a similarity reading) were set to +1 and the point threshold of the final counter was 4.

Overall Findings

The following tables are the results from each scenario. Some documents might appear as "Copy of source A2" instead of "Copy of source A". The extra numbers do not represent anything and were only implemented to debug saving problems in my code.


  • SCENARIO A = 57% (4/7)

  • Carl, Tyler and Mike were non plagiarised examples. Since they were inaccurately detected as "copied" or "plagiarised", 3 marks were deducted


  • SCENARIO B = 71% (5/7)

  • Since non plagiarised examples were indicated as copied, 2 marks were deducted



  • SCENARIO C = 0% (0/2) 

  • In this case, it was a 0/2 as the copied sources A and B were giving similarity readings to both source texts.


OVERALL ACCURACY FROM TEST 1 (average of all) = 24%

In this first test, it is clear that the model is very inaccurate. It succeeds at detecting plagiarised examples but fails at distinguishing the non plagiarized texts. But after analyzing the tables and graphs produced by the program, I discovered one main reason why the program was failing and what change was needed:

Some features are better indicators for cheating, and therefore, features should have "biases" or "weights" 

In figure A, you can see that most of the documents possessed similar word counts, sentence deviations and lengths. Even though they possess these similar properties, they are NOT the same. Sources A, B and the copied text from Source A do possess the same number of sentences, but that does not mean that they copied from each other. During analysis, I observed that the +1 point value assigned to every feature made the program mistakenly flag the non-plagiarised docs, thus lowering the overall accuracy. It makes the scores too high (see figure D).

Additionally, I also found that TF-IDF similarity, Jaccard similarity and WMD were better indicators of cheating. In figure B, Carl has a non-plagiarized document and was being compared to plagiarised examples. TF-IDF and Jaccard similarity best reflected this information compared to other syntactic and lexical features. In figure C, the distance between Source A and its copied text is low, therefore it is correct.

Based on this observation, I adjusted my program to add +1 point values only to more indicative features such as TF-IDF, Jaccard, WMD and +0.5 for syntactic, and lexical features like the number of sentences, punctuation and function word counts.


Figure A from SCENARIO C

They possess similar features but are different documents.

Figure B from SCENARIO A

All TF-IDF and Jaccard similarity scores are low for Carl, as he is tagged as a non-plagiarized document.



Figure C from SCENARIO C

The distance between Source A and its copy is below 0.1, therefore low. With a low WMD, the program would correctly identify this pair. In my program, the threshold for this feature was set to 0.1 


Figure D from SCENARIO A

Mike is also considered "non-plagiarised" in scenario A. Reviewing these table values, the total scores for Mike are excessive.


Test 2 - Changed weighted values of features


I reused the same scenarios that were performed in test 1.

  • SCENARIO A: 8 documents from task A (score out of 7)
    • 1 source text, 4 copied/plagiarised documents from that source and 3 non-copied/plagiarised
    • List of tagged documents
  • SCENARIO B: 8 documents from the University of Oxford's plagiarism examples (score out of 7)
    • 1 source text, 5 copied/plagiarised documents from that source and 2 non-copied/plagiarised
  • SCENARIO C: 4 documents from Bowdoin plagiarism examples (score out of 2)
    • 1 Source text A, 1 source text B, 1 copied document from source A and 1 copied document from source B

Threshold and Point Values

All point values were kept the same except for the following features, whose point values were changed from +1 to +0.5:

  • Punctuation frequencies
  • Function word frequencies
  • Total number of unique words
  • Total number of words
  • Total number of sentences
  • Standard deviation length of sentences
  • Mean length of sentences
  • Average sentence length

Other modifications related to features included:

  • If the document possessed suspicious sentences it will add a +1.5 to their final counter instead of a +1
  • Setting the threshold of Jaccard and TF-IDF to 30% (0.3). This change was necessary, as dissimilar documents and sentences always presented readings below 30%.

Overall Findings

  • SCENARIO A = 71% (5/7)

  • A slight improvement was evident. Tyler was excluded but Carl and Mike were still detected as plagiarised documents.

  • SCENARIO B = 100 % (7/7)

  • All non plagiarised documents were separated from the output list.


  • SCENARIO C = 100% (2/2)

  • All copied texts were successfully attributed to their source.


In this test, the program performed significantly better in all three scenarios, though it was still slightly faulty. In figure E, again, Carl should not have any similarity readings. Compared to the previous test, the total scores were lower, but not low enough to declare the program precise. For the next test, I will weigh down the values once again.

Figure E from SCENARIO A. The final scores are more normalized, but again too high. Carl's final results should not all be "true" (considered copied). You can also see that a different number of keywords was extracted for each document; therefore, setting a threshold of 3 common keywords would not be precise.

Additionally, I also found that I needed to change certain values from the total number of words and unique words feature. Instead of adding points whenever the difference is three words, I modified it to be twenty words. I made this change because, in the case of long documents (such as the ones seen in SCENARIO A), it would be possible to detect a similarity reading for the partially plagiarized texts. A threshold of 3 words would detect direct plagiarism/copying, but would not detect partial copying which is the more widely used form of cheating.

Another aspect I observed in figure E was the number of keywords extracted varied between each document. I initially set the threshold for this feature to add points to the final counter if there were three common keywords but this produced inaccurate results since the number of keywords extracted was proportional to the document length. Fortunately in Gensim's TextRank algorithm, I discovered that there is a parameter that can extract a fixed amount of keywords, therefore, I set this to five. This enabled me to still keep the threshold of common keywords to three.

Furthermore, I observed that topic modelling may contribute to the inaccuracies of the model. Since all SCENARIO A documents are on the subject of "object-oriented programming", it makes sense that they would possess common LDA topics. For the next test, I would like to reduce the weight of this feature from +1 to +0.5.

These tests with the "weighted" features also revealed that WMD is sometimes inaccurate. In figure F, the distance between Ben's work and the original document should have been lower, although WMD was accurate in detecting Bill, John and Adi, since they were all below 0.1. It may have been inaccurate because the documents used different vocabulary, which made the distance large. For the next test, I decided to lower the importance or weight of this feature.

Figure F from SCENARIO A. The WMD should have been significantly lower for Ben.


Final Test -  More weighted values improvements


  • SCENARIO A: 8 documents from task A (score out of 7)
    • 1 source text, 4 copied/plagiarised documents from that source and 3 non-copied/plagiarised
  • SCENARIO B: 8 documents from the University of Oxford's plagiarism examples (score out of 7)
    • 1 source text, 5 copied/plagiarised documents from that source and 2 non-copied/plagiarised
  • SCENARIO C: 4 documents from Bowdoin plagiarism examples (score out of 2)
    • 1 Source text A, 1 source text B, 1 copied document from source A and 1 copied document from source B

I also tested this set of thresholds on three additional test scenarios:

  • SCENARIO D: 8 documents from task B (score out of 7)
    • 1 source text, 5 copied/plagiarised documents from that source and 2 non-copied/plagiarised
  • SCENARIO E: 3 documents provided to me by the teacher (score out of 1)
    • 1 entire news article, 1 certain part of the news article, 1 copied text that copied only that certain part
  • SCENARIO F: 2 documents from a paraphrased text (score out of 1)
    • 1 source document, 1 paraphrased text.

Final Threshold and Point Values

After reviewing the results, I applied more biases and weights to the features for this last test. In addition, I raised the final counter threshold to 5, meaning that to be included in the output list, pairs must have values higher than 5. The following are the final point values of each feature.

  • POS trigrams  = +0.25
  • Function Words and Punctuation frequencies  = +0.25
  • Total number of words, sentences and unique words = +0.25
  • Standard deviation and mean length of sentences = +0.25
  • TF-IDF with cosine similarity = +1.5
  • Jaccard similarity on entire documents = +1.5
  • Suspicious and similar sentences using Jaccard = +1.5
  • WMD = +1
  • LDA topic modelling = +0.5
  • TextRank keywords = +1
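
The final point system can be written out as a small table and a sum. This sketch assumes that each feature grouped on one line above (for example, function words and punctuation) earns its own +0.25; the feature keys and the student pairs are hypothetical names used only for illustration.

```python
# Final point values, assuming grouped features each score separately.
FEATURE_POINTS = {
    "pos_trigrams": 0.25,
    "function_words": 0.25,
    "punctuation": 0.25,
    "total_words": 0.25,
    "total_sentences": 0.25,
    "unique_words": 0.25,
    "std_sentence_length": 0.25,
    "mean_sentence_length": 0.25,
    "tfidf_cosine": 1.5,
    "jaccard_document": 1.5,
    "jaccard_sentences": 1.5,
    "wmd": 1.0,
    "lda_topics": 0.5,
    "textrank_keywords": 1.0,
}
THRESHOLD = 5.0  # pairs must score higher than this to be listed


def pair_score(triggered):
    """Sum the points of every feature that gave a similarity reading."""
    return sum(FEATURE_POINTS[name] for name in triggered)


def watch_list(pair_features):
    """Return the student pairs whose total score exceeds the threshold."""
    return [
        pair
        for pair, triggered in pair_features.items()
        if pair_score(triggered) > THRESHOLD
    ]


# Hypothetical readings for two student pairs.
readings = {
    ("A", "B"): {"tfidf_cosine", "jaccard_document", "jaccard_sentences", "wmd"},
    ("A", "C"): {"tfidf_cosine", "jaccard_document", "wmd", "textrank_keywords"},
}
flagged = watch_list(readings)
```

With these readings, the pair (A, B) scores 5.5 and is listed, while (A, C) scores exactly 5 and is not, because a pair must score higher than the threshold.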



SCENARIO A = 85% (6/7)

One mark was deducted as it still identified Mike as plagiarized.

SCENARIO B = 100% (7/7)

SCENARIO C = 100% (2/2)


SCENARIO D = 100% (7/7)

SCENARIO E = 0% (0/2) - Detected neither the partial nor the full news article

SCENARIO F = 0% (0/1) - Did not detect the paraphrase example


In SCENARIO A, the program failed to exclude Mike from the output. After further analysis, I discovered that the reason for this misidentification was a bug in the detection of suspicious sentences: in some cases, a sentence in a text might be flagged as similar to another sentence in that same text. In the interest of time, I have not been able to resolve it yet. This bug added an extra +1.5 points to the Mike and Ben student pair. The final score of Mike and Ben was 6.5; without the bug, it should have been 5, and the pair would have been excluded from the output list. Luckily, this bug only affected this student pair and not the other documents and test scenarios. Therefore, despite the bug, the performed tests are still considered reliable.

In SCENARIO E, the documents that I used were the entire news article and a real junior high student text that took information from a certain part of that article. Since the student text only copied one section, I tested whether the program could identify plagiarism by comparing the student text both to the entire article and to the specific section (hence, I had one full source text and one partial source text). The program flagged neither comparison. The reason could be that 80% of the full news article's information was not copied in the student's text; therefore, despite similarity readings for semantic features, there was a lack of points from syntactic and lexical features for the partial examples.

In SCENARIO F, I selected a source text from the internet and a paraphrased form of it. Although the program used more refined point values, it did not detect the paraphrased example. This can be attributed to the paraphrased form being too short and its word usage being very different from the original.


Summary of Testing Results

Despite having weighed and modified the thresholds and point values for each feature in my program, I cannot say that it is ready to be used in real school scenarios yet. Because of the lack of sample documents, I believe that the prototype still requires more testing and still has some inaccuracies. Due to these inaccuracies, the program misidentified some honest test documents as dishonest.

Further Suggestions and Proposed Improvements

After performing tests on my current program, I devised a list of major suggestions and improvements. If I had more time and resources, these are the aspects I would like to add, test or explore:

  • Consider training a machine learning (ML) model on a student's previous texts to learn that student's writing style. Afterwards, I would evaluate new documents against the learned style.
    • Recurrent Neural Networks (RNNs), Support Vector Machines (SVMs) and Naive Bayes classifiers are common machine learning models that have been researched and tested for plagiarism detection and paraphrase identification [43] [44].
    • I believe that if I can vectorize features from students' previous works, I could train an ML model to recognize each student's writing style. With such a model, I could perhaps evaluate a text to see if it reflects that student's writing style and whether its style is similar to that of other students.
  • Debug and thoroughly examine every section of my code.
    • As mentioned in the "Method" section, I would like to resolve the bug that contributed to the faulty detection of a student text.
    • If I can resolve that issue, would my program become more accurate?
  • Obtain more test documents and examples of cheating from actual school cases. Additionally, perform more analysis and research on each feature's importance and quality as an indicator.
    • I believe a major contributor to my program's inaccuracies could have been the lack of test documents, samples and testing in general.
    • Given more test examples and more time, I am curious whether I could analyze the results to better calibrate the features of my program.
  • Further research of previous works and mathematical concepts
    • If I can perform more research to better understand the mathematics and science behind previous works, I wonder if I can better understand the reasons for my shortcomings.
    • Additionally, with further research, would I be able to discover new concepts and features that I could incorporate to better improve the accuracy of my program?
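
The first suggestion in the list above (learning a student's writing style from previous texts) could start from something like the following minimal sketch. It is a small Naive Bayes classifier written in plain Python. The class name, the use of raw word counts as features and the training sentences are all invented for illustration; a real version would more likely vectorize the program's own cheat indicator features and use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesAuthor:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # author -> word -> count
        self.doc_counts = Counter()              # author -> number of texts
        self.vocab = set()

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            tokens = tokenize(text)
            self.word_counts[author].update(tokens)
            self.doc_counts[author] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Log-probability score of the text under each author's style."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for author in self.doc_counts:
            counts = self.word_counts[author]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[author] / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[author] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical previous works of two students.
texts = [
    "i really think that the experiment was really fun",
    "i really liked how the plant grew",
    "moreover the results indicate a significant increase",
    "furthermore the data indicate a clear trend",
]
authors = ["student_a", "student_a", "student_b", "student_b"]

model = NaiveBayesAuthor().fit(texts, authors)
guess_1 = model.predict("i really think the plant was fun")
guess_2 = model.predict("furthermore the results indicate a trend")
```

If a text handed in by one student is scored as much closer to another student's style, that could be treated as one more cheat indicator rather than as proof of dishonesty.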


This project has given me great insight into a new branch of computer science. The domain of Natural Language Processing is very interesting, and I enjoyed learning new concepts and principles whilst creating a unique science fair project to present my understanding of the subject. Although my program is not yet accurate, I believe that if I perform more testing and implement the proposed improvements and suggestions, I could further enhance the quality of the current program. This year's science fair was just as enjoyable as before, and I am glad that I was able to learn something new.

Thank you for taking the time to read this paper, and I look forward to what next year's science fair holds for me!



  • [1] Sidhu, Ina. “University of Calgary Researcher Suggests Cheating on the Rise with Move to Online Learning.” Calgary, CTV News, 6 Oct. 2020.
  • [2] Edwardson, Lucie. “14 University of Calgary Students Accused of Misconduct for Sharing Answers in Chatroom.” CBCnews, CBC/Radio Canada, 20 June 2020.
  • [3] Eaton, S.E. “Cheating May Be under-Reported across Canada's Universities and Colleges.” News, 21 Jan. 2021.
  • [4] Eaton, S.E., Edino, R.I. Strengthening the research agenda of educational integrity in Canada: a review of the research literature and call to action. Int J Educ Integr 14, 5 (2018).
  • [5] “Examples of Plagiarism.” Dean of Students.
  • [6] White, Mary Gormandy. “Examples of Plagiarism in Different Types of Texts.” Example Articles & Resources.
  • [7] “Detecting Plagiarism.” Detecting Plagiarism – University of Toronto Academic Integrity.
  • [8] Rogerson, A.M. Detecting contract cheating in essay and report submissions: process, patterns, clues and conversations. Int J Educ Integr 13, 10 (2017).
  • [9] François Dominic Laramée, "Introduction to stylometry with Python," The Programming Historian 7 (2018).
  • [10] Hollingsworth, Charles. “Syntactic Stylometry: Using Sentence Structure for Authorship Attribution.” (2012).
  • [11] Chong, Miranda & Specia, Lucia & Mitkov, Ruslan. (2010). Using Natural Language Processing for Automatic Detection of Plagiarism.
  • [12] Braunlin, John. “Using NLP to Identify Redditors Who Control Multiple Accounts.” Medium, Towards Data Science, 30 Nov. 2018.
  • [13] Cavalcanti, Elmano & Santos Pires, Carlos & Cavalcanti, Elmano & Pires, Vládia. (2012). Detection and Evaluation of Cheating on College Exams using Supervised Classification. Informatics in Education. 11. 169-190. 10.15388/infedu.2012.09.
  • [14] Jain, Chaitanyasuma. “Find Similarity between Documents Using TF IDF.” OpenGenus IQ: Learn Computer Science, 30 Aug. 2019.
  • [15] “Tf-Idf :: A Single-Page Tutorial - Information Retrieval and Text Mining.” Tfidf.
  • [16] Sethi, Nishant. “TF-IDF for Similarity Scores.” Medium, DataDrivenInvestor, 23 Sept. 2020.
  • [17] Stephanie Glen. "Jaccard Index / Similarity Coefficient" From Elementary Statistics for the rest of us!
  • [18] Sieg, Adrien. “Text Similarities : Estimate the Degree of Similarity between Two Texts.” Medium, 13 Nov. 2019.
  • [19] Gupta, Sanket. “Overview of Text Similarity Metrics in Python.” Medium, Towards Data Science, 10 Jan. 2020.
  • [20] Gomez Adorno, Helena & Rios, Germán & Posadas Durán, Juan & Sidorov, Grigori & Sierra, Gerardo. (2018). Stylometry-based Approach for Detecting Writing Style Changes in Literary Texts. Computación y Sistemas. 22. 10.13053/cys-22-1-2882.
  • [21] Horton, Thomas Bolton. The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher, The University of Edinburgh, 1 Jan. 1987.
  • [22] Boukhaled, Mohamed & Ganascia, Jean-Gabriel. (2014). Using Function Words for Authorship Attribution: Bag-Of-Words vs. Sequential Rules.
  • [23] Segarra, Santiago & Eisen, Mark & Ribeiro, Alejandro. (2014). Authorship Attribution Through Function Word Adjacency Networks. IEEE Transactions on Signal Processing. 63. 10.1109/TSP.2015.2451111.
  • [24] Prabhakaran, Selva. “Gensim Tutorial - A Complete Beginners Guide.” ML+, 25 Jan. 2021.
  • [25] Mortensen, Ólavur. “Text Summarization with Gensim.” Rare Technologies, 24 Aug. 2015.
  • [26] Mihalcea, Rada & Tarau, Paul. (2004). TextRank: Bringing Order into Texts.
  • [27] Blei, David & Ng, Andrew & Jordan, Michael. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research. 3. 993-1022. 10.1162/jmlr.2003.3.4-5.993.
  • [28] Prabhakaran, Selva. “Topic Modeling in Python with Gensim.” ML+, 19 Jan. 2021.
  • [29] Chen, Edwin. “Introduction to Latent Dirichlet Allocation.” Edwin Chens Blog Atom.
  • [30] Kelechava, Marc. “Using LDA Topic Models as a Classification Model Input.” Medium, Towards Data Science, 6 Aug. 2020.
  • [31] Kapadia, Shashank. “Evaluate Topic Models: Latent Dirichlet Allocation (LDA).” Medium, Towards Data Science, 29 Dec. 2020.
  • [32] Mikolov, Tomas, et al. “Efficient Estimation of Word Representations in Vector Space.” 7 Sept. 2013.
  • [33] “Gensim: Topic Modelling for Humans.” Word2Vec Model - Gensim, 4 Nov. 2020.
  • [34] “Gensim: Topic Modelling for Humans.” Models.word2vec – Word2vec Embeddings - Gensim, 4 Nov. 2020.
  • [35] Alammar, Jay. “The Illustrated Word2vec.” The Illustrated Word2vec – Jay Alammar – Visualizing Machine Learning One Concept at a Time, 27 Mar. 2019.
  • [36] Karani, Dhruvil. “Introduction to Word Embedding and Word2Vec.” Medium, Towards Data Science, 2 Sept. 2020.
  • [37] “word2vec: CBOW & Skip-Gram Performance Wrt Training Dataset Size.” Stack Overflow, 30 Aug. 2016.
  • [38] Hazoom, Moshe. “Word2Vec For Phrases - Learning Embeddings For More Than One Word.” Medium, Towards Data Science, 26 Dec. 2018.
  • [39] Lison, Pierre & Kutuzov, Andrei. (2017). Redefining Context Windows for Word Embedding Models: An Experimental Study.
  • [40] Kusner, Matt & Sun, Y. & Kolkin, N.I. & Weinberger, Kilian. (2015). From word embeddings to document distances. Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). 957-966.
  • [41] Team, Towards AI. “Word Mover's Distance (WMD) Explained: An Effective Method of Document Classification.” Towards AI - The Best of Tech, Science, and Engineering, 22 Sept. 2020.
  • [42] Uzuner, Özlem & Katz, Boris & Nahnsen, Thade. (2005). Using syntactic information to identify plagiarism. 37-44. 10.3115/1609829.1609836.
  • [43] Altaf, Wasif. (2011). Paraphrase Identification. 10.13140/RG.2.2.16290.99523.
  • [44] Brockett, Chris & Dolan, William. (2005). Support vector machines for paraphrase identification and corpus construction. Proceedings of the 3rd International Workshop on Paraphrasing.


Major Python Libraries Used:

  • Gensim (Bibtex format): 
    • @inproceedings{rehurek_lrec,
            title = {{Software Framework for Topic Modelling with Large Corpora}},
            author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
            booktitle = {{Proceedings of the LREC 2010 Workshop on New
                 Challenges for NLP Frameworks}},
            pages = {45--50},
            year = 2010,
            month = May,
            day = 22,
            publisher = {ELRA},
            address = {Valletta, Malta},
      }
  • Scikit-learn
  • Natural Language Toolkit (NLTK)
    • Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.
  • Matplotlib
    • J. D. Hunter, "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
  • Pandas
  • Numpy
    • Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
  • Google Colaboratory (used as notebook)




In this paper, I would first like to thank my parents, who have always supported and encouraged my interest in science and mathematics. Thank you so much for helping me write this paper and develop my program architecture!

Secondly, I would like to thank all my teachers who tried their best to provide me with test documents and insight into manual methods for detecting academic dishonesty. I am especially grateful to Mme Campbell, my science teacher, as she continues to support my passion for science and gave me adequate class time to complete my project. Also, thank you for providing me with updates and facilitating the project registration for this year.

The following are teachers who I would like to thank:

  • Mrs. Munro
  • Mme Lee
  • Mme Campbell
  • Mr. Nzussuo

Thank you Mom, Dad and my teachers for helping me realize my 2021 CYSF project!