Enough Machine Learning To Make Hacker News Readable Again

Ned Jackson Lovely

@nedjl / njl@njl.us

slides @ www.njl.us

A Simple, Achievable Project

A personalized filter for Hacker News

I Can Machine Learn and You Can Too!

Machine learning is just applying statistics to big piles of data

Get Data

Engineer the Data

Train and Tune Models

Apply Model

The scikit-learn Docs are Fantastic

Supervised vs Unsupervised

Supervised learning fits a model to labeled examples; unsupervised learning finds structure in unlabeled data

scikit-learn Patterns

Parallel Arrays

X,y
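The scikit-learn convention: X is a list of samples and y a parallel list of labels, aligned by index. A minimal sketch, with made-up Hacker News titles and illustrative labels:

```python
# X: one sample per position; y: the matching label at the same index.
X = ["Show HN: A tiny Lisp in 200 lines",
     "Celebrity gossip roundup",
     "Why static type systems matter"]
y = [0, 1, 0]  # 1 = dreck, 0 = worth reading (illustrative labels)

assert len(X) == len(y)  # the arrays must stay parallel
```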

Set aside a validation set

    from sklearn.model_selection import train_test_split
    X, X_val, y, y_val = train_test_split(X_full, y_full)
    

How Do You Work?

Build Pipelines

Optimize Parameters

Pipeline

A sequence of operations on data

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction import text
    from sklearn.svm import LinearSVC

    p = Pipeline([ ('hv', text.HashingVectorizer()),
                    ('svm', LinearSVC()) ])
    

Hyper-Parameters

Tune the Magic

    from sklearn.model_selection import GridSearchCV
    params = { 'svm__C': [0.5, 1.0, 2.0, 4.0],
               'svm__loss': ['hinge', 'squared_hinge'],
               'hv__ngram_range': [(1, 1), (1, 2)], }
    gs = GridSearchCV(p, params, verbose=2, n_jobs=-1)
    gs = gs.fit(X, y)
    
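After fitting, GridSearchCV exposes the winning parameter combination and a model refit on all the data. A self-contained toy run (synthetic data and a scaler standing in for the text vectorizer):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
p = Pipeline([('ss', StandardScaler()), ('svm', LinearSVC())])

gs = GridSearchCV(p, {'svm__C': [0.5, 1.0, 2.0]}).fit(X, y)
print(gs.best_params_)     # the winning C value
best = gs.best_estimator_  # the pipeline, refit with those parameters
```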

Functions You'll See a Lot Of

transform()

fit()

predict()

transform(X [,y])

        hv = HashingVectorizer()
        hv.transform(['Simple is better than complex.',
                      'Sparse is better than dense.'])
    <2x1048576 sparse matrix of type '<class 'numpy.float64'>'
        with 10 stored elements ...
    

fit(X,y)

        svm = LinearSVC()
        svm = svm.fit(X, y)
    

predict(X)

        y_predictions = svm.predict(X_new)
    

Dealing with the Super-Messy Real World

Get the Data

requests & lxml

Classify Dreck and Non-Dreck

Title, URL, Submitter, Content of Link, Rank, Votes, Comments, Time of Day, Dreck or Not
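One scraped story might look like this (the field names here are my guess at the shape, not the talk's exact schema):

```python
# A single labeled sample, as a plain dict.
story = {
    'title': 'Show HN: A tiny Lisp in 200 lines',
    'url': 'http://example.com/tiny-lisp',
    'submitter': 'someuser',
    'content': 'Full text pulled from the linked page...',
    'rank': 12,
    'votes': 87,
    'comments': 34,
    'hour': 14,      # time of day submitted
    'dreck': False,  # the hand-applied label
}
```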

Messy Data to Normalized Numpy Arrays

Bag Of Words

n-grams

Normalization

Stop Words

TF-IDF

Bag of Words

"Time Flies Like An Arrow,
Fruit Flies Like Bananas"

Flies    2
Like     2
Time     1
An       1
Arrow    1
Fruit    1
Bananas  1
Bitcoin  0
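scikit-learn's CountVectorizer produces exactly this table (lowercased; its default tokenizer drops single-character tokens). A quick check:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(["Time Flies Like An Arrow, Fruit Flies Like Bananas"])
row = counts.toarray()[0]
vocab = cv.vocabulary_  # maps token -> column index
print(row[vocab['flies']], row[vocab['like']], row[vocab['time']])  # 2 2 1
```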

n-grams

"Time Flies Like An Arrow,
Fruit Flies Like Bananas"

Time Flies
Flies Like
Like An
An Arrow
Arrow Fruit

Normalization

    from nltk.stem.snowball import EnglishStemmer
    stemmer = EnglishStemmer()
    sentence = 'when he flies he likes to fly upon early flights'
    print(' '.join(stemmer.stem(x) for x in sentence.split()))

    "when he fli he like to fli upon earli flight"
    

Stop Words

    from sklearn.feature_extraction import text

    print(', '.join(x for x in
                    list(text.ENGLISH_STOP_WORDS)[:8]))

    all, six, less, being, indeed, over, move, anyway
    

TF-IDF

Term-Frequency, Inverse Document Frequency

    from sklearn.feature_extraction import text

    tfidf = text.TfidfVectorizer()
    tfidf = tfidf.fit(X, y)
    
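The intuition: a term that appears in every document carries little signal, so its weight shrinks relative to rarer terms. A small illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the bird flew"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

# "the" appears in every document, "cat" in only one,
# so "cat" gets the higher weight in the first row.
assert weights[0][vocab['cat']] > weights[0][vocab['the']]
```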

Engineering Features

Pulling Out the Relevant Text

    from readability.readability import Document
    from lxml.html import fragment_fromstring

    def get_relevant(raw):
        try:
            cleaned = Document(raw).summary(True)
            et = fragment_fromstring(cleaned)
            return ' '.join(x.text for
                            x in et.iter()
                            if x.text)
        except Exception as e:
            print(e)
            return ''

Roll your own features

    from sklearn import base
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    class FeatureLength(base.BaseEstimator,
                        base.TransformerMixin):
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            return [[len(x)] for x in X]

    length_pipeline = Pipeline([ ('fl', FeatureLength()),
                                 ('ss', StandardScaler()),])
    

Combine Features

    from sklearn.pipeline  import FeatureUnion
    union = FeatureUnion([ ('tfidf', TfidfVectorizer()),
                           ('length', length_pipeline), ]) 
    
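Putting those pieces together: assuming length_pipeline is the FeatureLength/StandardScaler pipeline above, the union yields one row per document with the tf-idf columns and the scaled-length column side by side. A self-contained sketch:

```python
from sklearn import base
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

class FeatureLength(base.BaseEstimator, base.TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return [[len(x)] for x in X]  # one column: document length

length_pipeline = Pipeline([('fl', FeatureLength()),
                            ('ss', StandardScaler())])

union = FeatureUnion([('tfidf', TfidfVectorizer()),
                      ('length', length_pipeline)])

docs = ["short", "a somewhat longer document"]
combined = union.fit_transform(docs)
# 2 rows; 4 tf-idf columns (vocabulary of these docs) + 1 length column
```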

My Data Starts as Dictionaries

    from sklearn import base

    class DictFeatureExtractor(base.BaseEstimator, 
                                base.TransformerMixin):
        def __init__(self, prop_name):
            self.prop_name = prop_name
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            return [x[self.prop_name] for x in X]
    

Maybe Hostnames Are Relevant

    from urllib.parse import urlparse
    from sklearn import base

    class HostnameExtractor(base.BaseEstimator,
                            base.TransformerMixin):
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            return [{'host': urlparse(x).netloc}
                        for x in X]
    

A Hostname Pipeline

    from sklearn.feature_extraction import DictVectorizer

    hosts = Pipeline([ 
                ('dfe', DictFeatureExtractor('url')),
                ('he', HostnameExtractor()),
                ('dv', DictVectorizer()), ])
    
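Fitting that pipeline on a few story dicts one-hot encodes each hostname. A self-contained version, repeating the extractors from the previous slides with a Python 3 urlparse:

```python
from urllib.parse import urlparse
from sklearn import base
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

class DictFeatureExtractor(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, prop_name):
        self.prop_name = prop_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return [x[self.prop_name] for x in X]  # pull one field per dict

class HostnameExtractor(base.BaseEstimator, base.TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return [{'host': urlparse(x).netloc} for x in X]

hosts = Pipeline([('dfe', DictFeatureExtractor('url')),
                  ('he', HostnameExtractor()),
                  ('dv', DictVectorizer())])

stories = [{'url': 'http://example.com/a'},
           {'url': 'http://example.org/b'},
           {'url': 'http://example.com/c'}]
X_hosts = hosts.fit_transform(stories)
# one one-hot column per hostname seen during fit: (3 rows, 2 columns)
```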

The Actual Application

Predict

        from pickle import load
        with open('ridiculous_pipeline.pickle', 'rb') as f:
            classifier = load(f)
        classifier.predict(X_input_data)
    
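The round trip in miniature: train a pipeline, serialize it, and load it back where predictions are needed (toy data and labels, for illustration only):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

p = Pipeline([('tfidf', TfidfVectorizer()), ('svm', LinearSVC())])
p.fit(["an interesting technical article", "celebrity gossip dreck"], [0, 1])

blob = pickle.dumps(p)            # what you'd write to the .pickle file
classifier = pickle.loads(blob)   # what the web app loads at startup
prediction = classifier.predict(["another technical article"])
```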

hn.njl.us

If You Have The Data...

Unsupervised Learning

Cluster Hacker News Articles

Regression

Predict article scores

Machine Learning is
Becoming Engineering

Go Engineer!

Thank You!

Ned Jackson Lovely

@nedjl / njl@njl.us

slides @ www.njl.us