Project Libera

Project: Libera is an intelligent user-content connector that I designed as my passion project during my three months at Metis’ Data Science Bootcamp in Chicago. Powered by a custom-designed web crawler and the magic of machine learning, Project: Libera is a tool for automatically seeking out new web content related to data science.

Why Libera?

As an aspiring data scientist, I’m constantly trying to find new articles, blog posts, and other information to give me a leg up on the competition, but the payoff between time spent looking for articles and quality articles actually read is sporadic. Some days you hit gold. Some days you spend 4 hours looking at videos of munchkin cats doing cute things, forgetting why you’re on the internet in the first place.

Libera was designed to eliminate the risk of distraction and also to expand my blog collection beyond my own social network and the list of sites I already visit. Because isn’t that the real beauty of machine learning? Making a computer do something for you, faster than you, and potentially even better than you. But you can’t train a machine without data. So I went to my old data science haunts, collected a good number of blog posts, and then let a blind web crawler hit all of their direct links. The crawled pages were then hand-labelled with a simple Flask app I made for quick data editing, and before I knew it I had a web crawler that used a Naive Bayes classification model to determine whether a blog was related to data science.
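As a rough illustration of the classification step, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written with only the Python standard library. It is not the code behind Libera: the `NaiveBayes` class, the `tokenize` helper, and the tiny training set are all hypothetical stand-ins for the hand-labelled crawl data described above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labelled examples (True = related to data science).
train_texts = [
    "regression model training data features",
    "clustering algorithm machine learning data",
    "neural network model accuracy dataset",
    "chocolate cake recipe butter sugar",
    "cute cats playing video funny",
    "travel guide beach hotel vacation",
]
train_labels = [True, True, True, False, False, False]

model = NaiveBayes().fit(train_texts, train_labels)
is_ds = model.predict("a new machine learning model for noisy data")
not_ds = model.predict("funny video of cats on vacation")
```

In a crawler, a `predict` call like this would sit between fetching a page and deciding whether to store it. A real corpus would also want stop-word removal and far more labelled examples than six.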

Taking it Further

While having a web crawler that could get me new content started to address the time suck that is web browsing, I quickly realized it wasn’t nearly enough. What if I’m on a data viz kick? Or maybe I’m trying to buckle down on big data. Perhaps I just need some statistics. Just having a collection of blogs about data science didn’t work if I was looking for something more specific. To address this, I took my blog text corpus and used Natural Language Processing, dimensionality reduction, and unsupervised learning techniques to automatically identify subtopics. With automated collection and subtopic detection in place, I set out to make a Flask-based front-end web app to tie it all together.
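The post does not say which specific techniques were used, so the following is only a sketch of one way such a pipeline could look, again in standard-library Python. It assumes TF-IDF weighting for the text features, a crude form of dimensionality reduction (dropping every term that appears in fewer than two documents and capping the vocabulary size), and k-means clustering as the unsupervised step. The function names, the sample documents, and the choice of three clusters are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs, max_terms):
    """Build unit-length TF-IDF vectors over a reduced vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Crude dimensionality reduction (an assumption of this sketch):
    # keep only terms shared by at least two documents, most common first.
    shared = [t for t, df in doc_freq.items() if df >= 2]
    terms = sorted(shared, key=lambda t: (-doc_freq[t], t))[:max_terms]
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = [tf[t] * math.log(n_docs / doc_freq[t]) for t in terms]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return terms, vectors


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(farthest)
    labels = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [min(range(k), key=lambda i: dist2(v, centroids[i])) for v in vectors]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical mini-corpus: data viz, big data, and statistics posts.
docs = [
    "plot chart color axis plot",
    "chart axis legend plot",
    "spark cluster hadoop distributed",
    "hadoop spark cluster jobs",
    "hypothesis test variance sample",
    "sample variance hypothesis interval",
]

terms, vectors = tfidf_vectors(docs, max_terms=10)
labels = kmeans(vectors, k=3)
```

The cluster numbers themselves are arbitrary. What matters is which posts end up sharing a label, and a person still has to look at each cluster's top terms to give it a name like "data viz" or "big data". On a real corpus, a matrix-factorization method such as truncated SVD would be a more typical reduction step than the term filter used here.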

A Look at the Front End

With a large collection of posts and segmentation into subtopics complete, I turned to making Libera more user friendly. Searching for web pages in a MongoDB database isn’t exactly the best UX for my purposes. So I turned to Flask and made a landing page, a user sign-up form, and a recommendation feed. When a user signs up for recommendations, they are given an interest form to select exactly what they’re looking for:

Interest Form

Once they’ve selected their interests, they’re brought to a feed of recommendations. These recommendations are presented using Embedly cards to give snapshots of the content, and the icons on the left allow for user interaction.

This Week at that Data Science Crash Course

Hello World! Back for a bit to post about my brief experience with feature design in my linear regression modelling challenge at Metis, Chicago.

Week One at Metis DS Bootcamp

This week at Metis we looked at MTA data and moved from green to clean, or, rather, cleaner. For a group of novices with disparate backgrounds this was no simple task, and certainly not one we imagined we would finish in five days. But we did! We survived and managed to pull off actual presentations, with actual graphs of actual data! But more importantly, we learned. A lot. And, for the rest of this post, I will share a small portion of this week’s takeaways! This is meant partly to be a refresher for my future self and partly to be a helpful resource for the imaginary people that I imagine will read this.

Paul Black