Enough Machine Learning To Make Hacker News Readable Again
Ned Jackson Lovely
@nedjl / njl@njl.us
slides @ www.njl.us
A Simple, Achievable Project
A personalized filter for Hacker News
I Can Machine Learn and You Can Too!
Machine learning is just applying statistics to big piles of data
Get Data
Engineer the Data
Train and Tune Models
Apply Model
The scikit-learn Docs are Fantastic
Supervised vs Unsupervised
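Supervised learning fits a model to labeled examples (X, y); unsupervised learning looks for structure in unlabeled X. Filtering dreck is supervised: we provide the labels ourselves.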
scikit-learn Patterns
Parallel Arrays
X, y
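The convention throughout scikit-learn: X holds the samples, y holds the matching labels at the same indices. A made-up sketch:

# X and y are parallel arrays: y[i] labels X[i]
X = ['Show HN: A tiny Lisp in 200 lines',
     '10 Celebrities Who Learned To Code']
y = [0, 1]  # 0 = keep, 1 = dreck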
Set aside a validation set
from sklearn.model_selection import train_test_split
X, X_val, y, y_val = train_test_split(X_full, y_full)
How Do You Work?
Build Pipelines
Optimize Parameters
Pipeline
A sequence of operations on data
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import text
from sklearn.svm import LinearSVC
p = Pipeline([('hv', text.HashingVectorizer()),
              ('svm', LinearSVC())])
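The whole pipeline then behaves like a single estimator; a sketch, assuming raw text in X and 0/1 labels in y:

p = p.fit(X, y)                 # vectorize, then train the SVM
predictions = p.predict(X_new)  # same vectorizer applied to new text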
hv = text.HashingVectorizer()
hv.transform(['Simple is better than complex.',
              'Sparse is better than dense.'])
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 10 stored elements ...
fit(X, y)
svm = LinearSVC()
svm = svm.fit(X, y)
predict(X)
y_predictions = svm.predict(X_new)
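The validation set from earlier comes out only at the end; score() reports mean accuracy. With the pipeline this runs on raw text directly (sketch):

print(p.score(X_val, y_val))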
Dealing with the Super-Messy Real World
Get the Data
requests & lxml
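A minimal fetch-and-parse sketch; the XPath here is illustrative, not HN's actual markup:

import requests
from lxml import html

page = requests.get('https://news.ycombinator.com/')
doc = html.fromstring(page.content)
titles = doc.xpath('//td[@class="title"]//a/text()')  # adjust to the real markup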
Classify Dreck and Non-Dreck
Title, URL, Submitter, Content of Link,
Rank, Votes, Comments, Time of Day, Dreck or Not
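One dict per scraped story, say (field names and values made up):

story = {'title': 'A Tiny Lisp', 'url': 'http://example.com/lisp',
         'submitter': 'someuser', 'rank': 3, 'votes': 85,
         'comments': 40, 'hour': 14, 'dreck': False}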
Messy Data to Normalized Numpy Arrays
Bag Of Words
n-grams
Normalization
Stop Words
TF-IDF
Bag of Words
"Time Flies Like An Arrow, Fruit Flies Like Bananas"
Flies    2
Like     2
Time     1
An       1
Arrow    1
Fruit    1
Bananas  1
Bitcoin  0
…
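CountVectorizer builds exactly this count matrix for you; a quick sketch:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(['Time Flies Like An Arrow, Fruit Flies Like Bananas'])
print(sorted(cv.vocabulary_))  # ['an', 'arrow', 'bananas', 'flies', ...]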
n-grams
"Time Flies Like An Arrow, Fruit Flies Like Bananas"
Time Flies
Flies Like
Like An
An Arrow
Arrow Fruit
…
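Same vectorizer, with ngram_range=(2, 2) for bigrams only; the tokenizer drops the comma, which is why "arrow fruit" appears:

from sklearn.feature_extraction.text import CountVectorizer

cv2 = CountVectorizer(ngram_range=(2, 2))
cv2.fit(['Time Flies Like An Arrow, Fruit Flies Like Bananas'])
print(sorted(cv2.vocabulary_))
# ['an arrow', 'arrow fruit', 'flies like', 'fruit flies',
#  'like an', 'like bananas', 'time flies']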
Normalization
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
sentence = 'when he flies he likes to fly upon early flights'
print(' '.join(stemmer.stem(x) for x in sentence.split()))
"when he fli he like to fli upon earli flight"
Stop Words
from sklearn.feature_extraction import text
print(', '.join(list(text.ENGLISH_STOP_WORDS)[:8]))
all, six, less, being, indeed, over, move, anyway
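In practice you rarely touch the list directly; the vectorizers take it as a flag:

tfidf = text.TfidfVectorizer(stop_words='english')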
TF-IDF
Term-Frequency, Inverse Document Frequency
from sklearn.feature_extraction import text
tfidf = text.TfidfVectorizer()
tfidf = tfidf.fit(X, y)
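fit() learns the vocabulary and document frequencies; transform() then turns documents into TF-IDF-weighted vectors:

X_weighted = tfidf.transform(X)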
Engineering Features
Pulling Out the Relevant Text
from readability.readability import Document
from lxml.html import fragment_fromstring
def get_relevant(raw):
    try:
        # readability strips boilerplate; html_partial=True
        # returns just the article body as an HTML fragment
        cleaned = Document(raw).summary(True)
        et = fragment_fromstring(cleaned)
        return ' '.join(x.text for x in et.iter() if x.text)
    except Exception as e:
        print(e)
        return ''
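Wired to the fetch step, with the hypothetical story dict from earlier (a sketch):

raw = requests.get(story['url']).text
story['content'] = get_relevant(raw)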
Roll your own features
from sklearn import base
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
class FeatureLength(base.BaseEstimator, base.TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # One-column row per sample: its length
        return [[len(x)] for x in X]

length_pipeline = Pipeline([('fl', FeatureLength()),
                            ('ss', StandardScaler())])
Combine Features
from sklearn.pipeline import FeatureUnion
union = FeatureUnion([('tfidf', text.TfidfVectorizer()),
                      ('length', length_pipeline)])
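The union feeds a classifier, and GridSearchCV covers the "optimize parameters" step from earlier; a sketch, with illustrative grid values:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

full = Pipeline([('union', union), ('svm', LinearSVC())])
grid = GridSearchCV(full, {'svm__C': [0.1, 1.0, 10.0]}, cv=3)
grid = grid.fit(X, y)
print(grid.best_params_)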
My Data Starts as Dictionaries
from sklearn import base
class DictFeatureExtractor(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, prop_name):
        self.prop_name = prop_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Pull one named field out of every record dict
        return [x[self.prop_name] for x in X]
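Each extractor pulls one field, so a text pipeline can run per field; sketch:

title_pipeline = Pipeline([('title', DictFeatureExtractor('title')),
                           ('tfidf', text.TfidfVectorizer())])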
Maybe Hostnames Are Relevant
from urllib.parse import urlparse
from sklearn.base import BaseEstimator, TransformerMixin

class HostnameExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return [{'host': urlparse(x).netloc} for x in X]
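Those {'host': ...} dicts are made for DictVectorizer, which one-hot encodes them; sketch:

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

host_pipeline = Pipeline([('hosts', HostnameExtractor()),
                          ('dv', DictVectorizer())])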