Plagiarism Detection of Reviews

Shiksha Engineering
Aug 10, 2023

Author: Thumati Premchand

As Shiksha receives hundreds of college reviews daily, ensuring the authenticity and originality of the content becomes a challenging task for moderators. Identifying plagiarized or copied reviews manually is time-consuming and inefficient.

Initial Approach:

Jaccard Similarity: To grasp the concept better, let’s first look at Jaccard similarity, which measures the overlap between two sets of unique words.

Jaccard Similarity = (Number of unique words common to both texts) / (Total number of unique words across both texts)
sentence1 = 'this is a ball'
sentence2 = 'this is a bat'
Intersection = ['this', 'is', 'a']
Union = ['this', 'is', 'a', 'ball', 'bat']
Jaccard Similarity = 3 / 5
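
To make the calculation concrete, here is a minimal Python sketch of the same idea, treating each text as a set of lowercase words (a simplifying assumption; real reviews would need proper tokenization and punctuation handling):

def jaccard_similarity(text1, text2):
    # Treat each text as a set of unique lowercase words
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    # Jaccard similarity = |intersection| / |union|
    return len(words1 & words2) / len(words1 | words2)

print(jaccard_similarity('this is a ball', 'this is a bat'))  # 0.6, i.e. 3 / 5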

Shortcomings of this approach: Jaccard similarity falls short when it comes to capturing semantic similarity between sentences that use different terminology.

Consider the following example:

Sentence 1: "I love felines."
Sentence 2: "I adore cats."

If we break them down:

Unique words in Sentence 1: ["I", "love", "felines"]
Unique words in Sentence 2: ["I", "adore", "cats"]
Intersection: ["I"]
Union: ["I", "love", "felines", "adore", "cats"]
Using the formula for Jaccard Similarity:
Jaccard Similarity = 1 / 5 = 0.2

The Jaccard Similarity score is low, suggesting that these sentences are quite different. However, in terms of semantics, both sentences convey a similar sentiment of having an affection for cats. The term “felines” is a broader term that encompasses “cats”, and “love” and “adore” are synonyms. The Jaccard Similarity fails to capture this semantic similarity due to its focus on exact word matches.

To overcome the shortcomings of Jaccard similarity, SentenceTransformers comes into play. This Python framework enables the creation of embeddings for text, transforming sentences into multi-dimensional vectors.

Let’s first understand embeddings:

Embeddings, in the context of natural language processing (NLP) and machine learning, refer to numerical representations of textual or categorical data. These representations are typically dense, fixed-length vectors that capture the semantic meaning or relationships between different elements in the data.

Let’s understand this with an example:

Consider a set of colors: {“Red”, “Green”, “Blue”}.

A naive representation might assign an integer to each color, like {“Red”: 1, “Green”: 2, “Blue”: 3}. This representation, however, doesn’t capture any relationship between these colors.

An embedding might represent these colors in a 3D color space (RGB) as:

Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1]

In this embedded space, the colors are represented as dense vectors, and the relationship between them (in terms of RGB values) is retained.

Word Embedding: Word embeddings are numerical representations of words that encode the semantic relationships between words based on their context and usage in a large corpus of text. Popular word embedding models include Word2Vec, GloVe (Global Vectors for Word Representation), and FastText.

Let’s understand this with an example:

king: [0.5, 0.3, 0.2, 0.8, ...] (up to maybe 300 dimensions)
queen: [0.52, 0.29, 0.18, 0.82, ...]

The vectors are dense and multi-dimensional (often 100–300 dimensions in practice). The proximity of these vectors in the embedded space suggests a semantic relationship between the words “king” and “queen”.

vector("king") - vector("man") + vector("woman") is approximately equal to vector("queen").

This captures the relationship “man is to king as woman is to queen”. This kind of semantic relationship discovery is the strength of word embeddings.
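
As a hands-on illustration of this analogy (not part of our pipeline, which uses SentenceTransformers), a small sketch with pre-trained GloVe vectors loaded through gensim might look like the following; the 'glove-wiki-gigaword-100' model name is an assumption used only for this example:

import gensim.downloader as api

# Load pre-trained 100-dimensional GloVe word vectors (downloads on first use)
word_vectors = api.load('glove-wiki-gigaword-100')

# vector("king") - vector("man") + vector("woman") ~ vector("queen")
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# Typically returns 'queen' as the closest word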

Sentence and Document Embeddings: Sentence and document embeddings are similar to word embeddings but operate at a higher level of granularity. Instead of representing individual words, they represent entire sentences or documents as dense vectors. These embeddings capture the overall meaning and context of the text, enabling applications like sentiment analysis, document similarity, and information retrieval. Let’s understand this with an example:

Consider the following two sentences:
1. "Cats are wonderful creatures."
2. "Felines are amazing animals."

Using a sentence embedding method, like SentenceTransformers or the Universal Sentence Encoder, we might get hypothetical dense vector representations like:

"Cats are wonderful creatures.": [0.8, 0.55, 0.3, 0.7, ...]
"Felines are amazing animals.": [0.79, 0.56, 0.31, 0.69, ...]

Despite the fact that the two sentences use different words (e.g., “cats” vs. “felines”, “wonderful” vs. “amazing”), their embeddings are close in the vector space, indicating a similar semantic meaning.

We are going to use sentence embeddings for our use case. Among the array of pre-trained models offered by SentenceTransformers, we picked the all-MiniLM-L6-v2 model, which encodes sentences into 384-dimensional vectors.

import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the pre-trained all-MiniLM-L6-v2 model
model = SentenceTransformer('all-MiniLM-L6-v2')

text = "This is a text written for testing review data"
# Encode the text into a 384-dimensional vector and view it as a DataFrame
pd.DataFrame(model.encode(text))
Output
0 -0.015578
1 0.075771
2 -0.098554
3 0.057334
4 0.005142
.. ...
379 -0.047272
380 0.137575
381 0.058060
382 0.069237
383 -0.015964
[384 rows x 1 columns]

Coming back to our implementation,

Building the Embeddings Database: All existing reviews in our database are processed using SentenceTransformers, and their embeddings are stored in Elasticsearch — a search engine capable of handling vast amounts of data.

When a new review is received, its text is also transformed into an embedding using the all-MiniLM-L6-v2 model. This newly generated embedding is then matched against the embeddings stored in Elasticsearch. The results are then sorted by their cosine similarity scores.
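
A simplified sketch of this flow with the official Python Elasticsearch client (elasticsearch-py 8.x style) is shown below. The index name, field names, sample reviews, and the use of a script_score query with the built-in cosineSimilarity function are illustrative assumptions rather than our exact production setup:

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch('http://localhost:9200')  # assumption: a locally running cluster
model = SentenceTransformer('all-MiniLM-L6-v2')

# One-time setup: an index with a 384-dimensional dense_vector field
es.indices.create(index='review_embeddings', mappings={
    'properties': {
        'review_id': {'type': 'keyword'},
        'review_text': {'type': 'text'},
        'embedding': {'type': 'dense_vector', 'dims': 384}
    }
})

# Index an existing review along with its embedding
review = "The campus infrastructure is excellent and the faculty are supportive."
es.index(index='review_embeddings', document={
    'review_id': 'r-101',
    'review_text': review,
    'embedding': model.encode(review).tolist()
})

# For a new review, retrieve the stored reviews with the highest cosine similarity
new_review = "Great infrastructure on campus and very helpful faculty."
query_vector = model.encode(new_review).tolist()
response = es.search(index='review_embeddings', size=5, query={
    'script_score': {
        'query': {'match_all': {}},
        'script': {
            # +1.0 keeps the script score non-negative, as Elasticsearch requires
            'source': "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
            'params': {'query_vector': query_vector}
        }
    }
})

for hit in response['hits']['hits']:
    print(hit['_score'] - 1.0, hit['_source']['review_text'])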

Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. In the context of text embeddings, it is used to determine the similarity between two sentences encoded as numerical vectors. The cosine similarity score ranges from -1 to 1, where -1 indicates completely opposite meanings, 1 indicates identical meanings, and values close to 0 indicate little to no similarity.
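
Under the hood, the cosine similarity of two vectors A and B is their dot product divided by the product of their magnitudes. A minimal NumPy version of the formula is shown below (NumPy is used here only for illustration; in the snippet that follows, the util.cos_sim helper from SentenceTransformers performs the equivalent computation):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))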

original_text = "This is a text written for testing review data"
copied_text = "This is a text written for testing review data"   # exact copy of original_text
another_text = "Completely different text for college review portal"

# Encode each text into its 384-dimensional embedding
original_text_vector = model.encode(original_text)
copied_text_vector = model.encode(copied_text)
another_text_vector = model.encode(another_text)

Calculating cosine similarity:

util.cos_sim(original_text_vector, copied_text_vector)
Response: tensor([[1.0000]])
util.cos_sim(original_text_vector, another_text_vector)
Response: tensor([[0.3983]])

Thus, using this approach, the review moderation team is presented with the most closely matching existing reviews for each new review received.

Conclusion

By leveraging advanced text embeddings through SentenceTransformers and cosine similarity, our solution empowers moderators to efficiently identify plagiarized or copied reviews. This automation not only saves time but also improves the overall quality and authenticity of the reviews displayed on our platform. With the continuous advancements in text embeddings, the possibilities for enhancing content moderation are limitless, ensuring a richer and more trustworthy user experience.
