Star Trek or Star Wars – Binary Classification using ML.NET
I have been looking for an opportunity to try out machine learning and recently had a chance. Given my background with ASP.NET, using ML.NET seemed the clear choice as I am already familiar with C# and the .NET ecosystem.
To try out ML.NET, I decided to make a binary classifier which would accept a piece of text and decide if the text is most likely to be related to Star Trek or Star Wars. Not that I am saying one is better (or worse) than the other.
Finding the Data
The first step with any machine learning task is to source the appropriate data to support the model you want to design. In my case, I needed text which would definitively belong to either Star Trek or Star Wars.
I searched Kaggle and found that the scripts for the original Star Wars trilogy were available, along with the scripts for every Star Trek TV series episode. After downloading the files, I moved the data into a spreadsheet as a list of sentences. I then added a column for a label which identified the text as coming from the Star Trek or Star Wars scripts.
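I did this step by hand in a spreadsheet, but the same result could be scripted. The following is a minimal Python sketch of that idea, not the process I actually used. The "Text" column name, the label values and the placeholder sentences are all assumptions made for illustration; only the "Label" column name comes from the training command used later.

```python
import csv
import io


def build_rows(sentences_by_label):
    """Flatten {label: [sentence, ...]} into (text, label) rows, skipping blanks."""
    rows = []
    for label, sentences in sentences_by_label.items():
        for sentence in sentences:
            text = sentence.strip()
            if text:
                rows.append((text, label))
    return rows


def write_dataset(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["Text", "Label"])  # "Text" is an assumed column name
    writer.writerows(rows)


# Placeholder sentences for illustration only, not taken from the real scripts.
sample = {
    "Star Trek": ["Placeholder Star Trek sentence.", "   "],
    "Star Wars": ["Placeholder Star Wars sentence."],
}

rows = build_rows(sample)
buffer = io.StringIO()  # swap for open("dataset.csv", "w", newline="") to write a file
write_dataset(rows, buffer)
print(buffer.getvalue())
```

The csv module quotes any sentence that contains a comma, so dialogue with punctuation stays in a single column.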
Training the Model
Now that the data is ready, the next step is to train the model. ML.NET includes a model builder, which can be run from the command line with the mlnet tool and makes this a relatively painless process.
mlnet classification --dataset "dataset.csv" --label-col Label --has-header true --train-time 600
ML.NET will train and evaluate a variety of models during the specified train time (in this case 10 minutes). At the end of the training time, it will generate a console project showing how to consume the best performing model.
Accounting for Bias
One of the problems with the model was that it was more likely to select Star Trek as the response for a piece of text which didn’t exactly fit a Star Wars quote.
This is due to the large difference in the number of examples for each category. Star Wars had a little over 2,500 examples, while Star Trek had over 187,000 examples. This large imbalance between the example classifications caused the model to have a bias towards Star Trek.
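One way to see how severe the imbalance is: work out the ratio between the classes, and the accuracy a model would get by always answering with the larger class. The sketch below does that in Python using the rounded counts quoted above, so the figures are approximate. The function name is hypothetical.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most common class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Rounded counts from the text; the real totals were slightly higher.
counts = {"Star Trek": 187_000, "Star Wars": 2_500}

ratio = counts["Star Trek"] / counts["Star Wars"]
baseline = majority_baseline_accuracy(counts)

print(f"Imbalance ratio: {ratio:.1f} to 1")
print(f"Accuracy from always guessing the larger class: {baseline:.1%}")
```

With data this skewed, a model can score very well on accuracy while almost never predicting Star Wars, which matches the behaviour I was seeing.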
To account for this bias, I reduced the number of Star Trek examples to roughly match the total number of Star Wars examples. To avoid introducing any new bias between the different Star Trek TV series (for example, by characters going missing), I kept one row, removed a number of the rows that followed it, and repeated this through the whole dataset. This ensured that the removed data was evenly distributed across the scripts.
Once the model had been trained on the resulting data, it achieved an accuracy of 71%.
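The evenly spaced reduction can be sketched in a few lines of Python. This is an illustration under assumptions rather than the exact steps I followed: the function name is hypothetical, the step size is derived by integer division, and the rows are generated placeholders using the rounded counts from above.

```python
def downsample_evenly(rows, target_count):
    """Keep one row, skip the next step - 1 rows, and repeat to the end."""
    if target_count <= 0:
        raise ValueError("target_count must be positive")
    # Integer division picks a step that leaves roughly target_count rows.
    step = max(1, len(rows) // target_count)
    return rows[::step]


# Placeholder rows standing in for the Star Trek sentences (rounded count).
star_trek = [f"trek line {i}" for i in range(187_000)]

kept = downsample_evenly(star_trek, 2_500)
print(len(kept), kept[:3])
```

Because the kept rows are spread at a constant interval, every series and episode contributes in proportion to its length, whereas simply truncating the file would have dropped the later series entirely.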
Deploying to the Cloud
To make the model available to use publicly, I created an Azure Static Web App accompanied by a managed Function. The page displays an input for the sentence the user would like to test, and provides a few examples they could choose from as well.
When the text is submitted, an AJAX request is sent to the function. The function loads the trained model, uses it to predict whether the text is more likely to belong to Star Trek or Star Wars, and returns the result to the browser.
Summary
I used ML.NET to generate a model to classify a piece of text as being from either Star Trek or Star Wars. This was generated with the ML.NET model builder after the data had been adjusted to make the example classifications more evenly balanced.
The model was then consumed by an Azure Function which is triggered by an HTTP request. This allows a small web app to provide the UI, while an AJAX request sends a piece of text to the function so it can be classified by the model.