Using Machine Learning to Discover Source Code Provenance in JavaScript

November 20, 2020
1:00-2:00 pm ET
Sococo VH 209; Zoom
Speaker: Jamie Heller
Host: Sam Guyer

Abstract

Shazam is a popular tool that lets users identify a song quickly from a short music snippet by using fingerprinting techniques. The Shazam metaphor implies a quick answer that reveals the solution to a problem as if by magic. Our long-term goal is to create a lightweight code clone detection algorithm that, in the spirit of Shazam, can quickly identify software clones and their origin. Our goal is not to solve the deep semantic cloning problem, as many code clone detection tools that use deep learning algorithms attempt to do, but to feed in pieces of code until we get a match. The ability to detect code clones is useful in a variety of fields, including software security, maintenance, plagiarism detection, and even legal applications. In this investigation, we explore software features and machine learning to gain insights into creating a Shazam4Code-style application. Our target language is JavaScript, which holds the crown as the most popular programming language of the last 10 years. JavaScript was chosen not only for its popularity but also because it is a heavily copied-and-pasted language, which lends itself to a high incidence of software cloning.

In this presentation, we explore JavaScript to better understand the strengths and weaknesses of JavaScript datasets and features through a smaller-scoped project: detecting which tool processed a given JavaScript file. Since JavaScript is embedded within many websites, developers often need to ensure that it is small and fast to download. As a result, they often use third-party tools called minifiers to compress, or minify, the source code. Some web developers go even further and obscure the code to prevent piracy, using a class of tools called obfuscators. Minifiers and obfuscators transform the original source code into different representations, creating a much more complicated clone detection scenario. In this work, we show that it is possible to distinguish code processed by these tools, and we gain insight into the issues that might inform our long-term strategy for developing Shazam4Code.

Join the meeting in Sococo VH 209. Login: tuftscs.sococo.com

Join Zoom Meeting: https://tufts.zoom.us/j/98610939077

PASSWORD: see colloquium email

Dial by your location: +1 312 626 6799 US (Chicago)

Meeting ID: 986 1093 9077

Passcode: see colloquium email

Find your local number: https://tufts.zoom.us/u/adS4koag4r