I am kinda obsessed with films…

Throughout my life, they’ve been one of the largest influences on how I see the world. Good Will Hunting, The Passion of Joan of Arc, and Into the Wild are movies that legitimately changed me. I’m aware of how pretentious that sounds, but it’s true!

However, I’ve run into a bit of a problem recently. I’m really struggling to find films to watch. Don’t get me wrong, you can go on [insert streaming platform here] and you’d never run out of films. But every time I go to choose, it feels like a gamble. I’m not sure if it’s that bad films have gotten worse, or just that I’ve become less forgiving of them, but I need to find a way to bring some certainty into the process, before my Letterboxd page runs dry and my movie-mad friend Conor starts hunting me down Hunger Games style.

So I had the idea of creating a recommendation tool which takes your movie reviews and can either recommend you a film or tell you whether a chosen film will be up your street. For the time being, though, it won’t actually be aimed at me. I have a measly 15 reviews to my name. Not really NLP-ready, is it? Instead, I will be using the aforementioned cinephile, Conor, as a guinea pig. Seriously… the man has 500+ reviews, it is WILD! Very impressive but also very useful, so big thanks to him.

Project Requirements

There are six stages to this project:

  1. Data Collection – All the data for this project is going to come from a film review site. There are a few considerations to be made here.
    • This is purely a hobby project. I won’t be creating any commercial tools or applications; it is literally just to practise the skills. It is by far best practice to use an API if one is available (in this case there isn’t one), and to read a site’s ToS before you start any project like this.
    • I will only be using data from people I know and who have given me explicit permission to use it. Even then, I will be actively avoiding scraping usernames or PII beyond the reviews themselves. This is just good practice from a GDPR and general data privacy standpoint. I will be anonymising any similar information (like the URLs I’m scraping from) when posting here to protect my friends’ privacy.
    • One last thing on the ethics front: I will be rate-limiting my scraping. Sure, it would be nice to be able to run through all the pages super-fast, but we aren’t looking to have the same effect as a DDoS attack here! Slow and steady wins the race.
  2. Data Processing I – Once the data has been scraped from the site, I can clean it up a little before the initial analysis. Nothing fancy or complicated, just making sure that only the reviews are being stored, with no random HTML tags or \n nonsense.
  3. Initial Data Analysis – This is more for fun than anything else. I want to explore the data a little and see if there’s any interesting information I can pull out right from the get-go. It would be cool to see how the films are distributed by release year, for example. But another part of this step is figuring out what I’m still missing (at the time of writing I’ve already completed this step, so yes, I know there is definitely stuff missing!).
  4. Further Data Collection and Processing – Get the other data I want, clean everything up, and start to flex those NLP muscles a bit. I came here to chew gum and remove stopwords, and I’m all out of gum…
  5. Modelling – The meat and bones of this project. The model will be sentiment-driven, using the user’s reviews alongside metadata about each film to build a profile of the user’s tastes.
  6. Optimisation – Oh god knows yet. Let’s just hope it works perfectly first time eh…?
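
To make stages 1 and 2 a bit more concrete, here is a minimal Python sketch of the two ideas mentioned above: pausing between requests, and stripping HTML tags and \n characters from a review. It is only an illustration under assumptions, not the code used in this project. The names `clean_review` and `fetch_politely`, the two-second default delay, and the `fetch` callable you pass in are all hypothetical.

```python
import re
import time
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called with the text found between tags (entities already decoded).
        self.parts.append(data)


def clean_review(raw_html):
    """Strip HTML tags and collapse newlines and extra whitespace (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words from adjacent tags apart.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between requests (hypothetical helper).

    `fetch` is whatever function actually downloads a page; it is passed in
    so this sketch makes no assumptions about the HTTP library used.
    The 2.0 second default is an arbitrary placeholder, not a recommendation.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        pages.append(fetch(url))
    return pages
```

Passing the download function in as `fetch` keeps the rate limiting separate from however the pages are actually requested, which also makes it easy to try out with a fake function before pointing it at a real site.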

So yeah, that’s the project in a nutshell! The next post on the topic will cover the data collection process. There were a few false starts here, so expect lots of ‘lessons learnt’.
