Machine Learning Programming Workshop

4.1 TensorFlow/Keras (Natural Language)

Prepared By: Cheong Shiu Hong (FTFNCE)



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
In [2]:
import tensorflow as tf
import tensorflow.keras as K


Versions of TensorFlow and Keras

In [3]:
tf.__version__
Out[3]:
'2.0.0-alpha0'
In [4]:
K.__version__ # the '-tf' suffix means this is the Keras implementation bundled with TensorFlow
Out[4]:
'2.2.4-tf'


1) Introduction to Natural Language Processing (NLP)


Load Dataset

In [5]:
data = K.datasets.imdb
In [6]:
(train_text, train_labels), (val_text, val_labels) = data.load_data(num_words=20000)
In [7]:
train_text.shape, train_labels.shape # Notice the shape has no sequence-length dimension: each review is a Python list with its own length
Out[7]:
((25000,), (25000,))
In [8]:
val_text.shape, val_labels.shape
Out[8]:
((25000,), (25000,))


Visualize an Example

In [9]:
print(train_text[0]) # Train Data is in Numbers (Each number maps to a word)
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
In [10]:
print(train_labels[:10]) # Train Labels are Binary (Positive or Negative Review)
[1 0 0 1 0 0 1 0 1 0]


Get the mapping for converting the numbers to words

In [11]:
word_index = data.get_word_index()
In [12]:
len(word_index)
Out[12]:
88584
In [13]:
word_index
Out[13]:
{'fawn': 34701,
 'tsukino': 52006,
 'nunnery': 52007,
 'sonja': 16816,
 'vani': 63951,
 'woods': 1408,
 'spiders': 16115,
 'hanging': 2345,
 'woody': 2289,
 'trawling': 52008,
 "hold's": 52009,
 'comically': 11307,
 'localized': 40830,
 'disobeying': 30568,
 "'royale": 52010,
 "harpo's": 40831,
 'canet': 52011,
 'aileen': 19313,
 'acurately': 52012,
 "diplomat's": 52013,
 'rickman': 25242,
 'arranged': 6746,
 'rumbustious': 52014,
 'familiarness': 52015,
 "spider'": 52016,
 'hahahah': 68804,
 "wood'": 52017,
 'transvestism': 40833,
 "hangin'": 34702,
 'bringing': 2338,
 'seamier': 40834,
 'wooded': 34703,
 'bravora': 52018,
 'grueling': 16817,
 'wooden': 1636,
 'wednesday': 16818,
 "'prix": 52019,
 'altagracia': 34704,
 'circuitry': 52020,
 'crotch': 11585,
 'busybody': 57766,
 "tart'n'tangy": 52021,
 'burgade': 14129,
 'thrace': 52023,
 "tom's": 11038,
 'snuggles': 52025,
 'francesco': 29114,
 'complainers': 52027,
 'templarios': 52125,
 '272': 40835,
 '273': 52028,
 'zaniacs': 52130,
 '275': 34706,
 'consenting': 27631,
 'snuggled': 40836,
 'inanimate': 15492,
 'uality': 52030,
 'bronte': 11926,
 'errors': 4010,
 'dialogs': 3230,
 "yomada's": 52031,
 "madman's": 34707,
 'dialoge': 30585,
 'usenet': 52033,
 'videodrome': 40837,
 "kid'": 26338,
 'pawed': 52034,
 "'girlfriend'": 30569,
 "'pleasure": 52035,
 "'reloaded'": 52036,
 "kazakos'": 40839,
 'rocque': 52037,
 'mailings': 52038,
 'brainwashed': 11927,
 'mcanally': 16819,
 "tom''": 52039,
 'kurupt': 25243,
 'affiliated': 21905,
 'babaganoosh': 52040,
 "noe's": 40840,
 'quart': 40841,
 'kids': 359,
 'uplifting': 5034,
 'controversy': 7093,
 'kida': 21906,
 'kidd': 23379,
 "error'": 52041,
 'neurologist': 52042,
 'spotty': 18510,
 'cobblers': 30570,
 'projection': 9878,
 'fastforwarding': 40842,
 'sters': 52043,
 "eggar's": 52044,
 'etherything': 52045,
 'gateshead': 40843,
 'airball': 34708,
 'unsinkable': 25244,
 'stern': 7180,
 "cervi's": 52046,
 'dnd': 40844,
 'dna': 11586,
 'insecurity': 20598,
 "'reboot'": 52047,
 'trelkovsky': 11037,
 'jaekel': 52048,
 'sidebars': 52049,
 "sforza's": 52050,
 'distortions': 17633,
 'mutinies': 52051,
 'sermons': 30602,
 '7ft': 40846,
 'boobage': 52052,
 "o'bannon's": 52053,
 'populations': 23380,
 'chulak': 52054,
 'mesmerize': 27633,
 'quinnell': 52055,
 'yahoo': 10307,
 'meteorologist': 52057,
 'beswick': 42577,
 'boorman': 15493,
 'voicework': 40847,
 "ster'": 52058,
 'blustering': 22922,
 'hj': 52059,
 'intake': 27634,
 'morally': 5621,
 'jumbling': 40849,
 'bowersock': 52060,
 "'porky's'": 52061,
 'gershon': 16821,
 'ludicrosity': 40850,
 'coprophilia': 52062,
 'expressively': 40851,
 "india's": 19500,
 "post's": 34710,
 'wana': 52063,
 'wang': 5283,
 'wand': 30571,
 'wane': 25245,
 'edgeways': 52321,
 'titanium': 34711,
 'pinta': 40852,
 'want': 178,
 'pinto': 30572,
 'whoopdedoodles': 52065,
 'tchaikovsky': 21908,
 'travel': 2103,
 "'victory'": 52066,
 'copious': 11928,
 'gouge': 22433,
 "chapters'": 52067,
 'barbra': 6702,
 'uselessness': 30573,
 "wan'": 52068,
 'assimilated': 27635,
 'petiot': 16116,
 'most\x85and': 52069,
 'dinosaurs': 3930,
 'wrong': 352,
 'seda': 52070,
 'stollen': 52071,
 'sentencing': 34712,
 'ouroboros': 40853,
 'assimilates': 40854,
 'colorfully': 40855,
 'glenne': 27636,
 'dongen': 52072,
 'subplots': 4760,
 'kiloton': 52073,
 'chandon': 23381,
 "effect'": 34713,
 'snugly': 27637,
 'kuei': 40856,
 'welcomed': 9092,
 'dishonor': 30071,
 'concurrence': 52075,
 'stoicism': 23382,
 "guys'": 14896,
 "beroemd'": 52077,
 'butcher': 6703,
 "melfi's": 40857,
 'aargh': 30623,
 'playhouse': 20599,
 'wickedly': 11308,
 'fit': 1180,
 'labratory': 52078,
 'lifeline': 40859,
 'screaming': 1927,
 'fix': 4287,
 'cineliterate': 52079,
 'fic': 52080,
 'fia': 52081,
 'fig': 34714,
 'fmvs': 52082,
 'fie': 52083,
 'reentered': 52084,
 'fin': 30574,
 'doctresses': 52085,
 'fil': 52086,
 'zucker': 12606,
 'ached': 31931,
 'counsil': 52088,
 'paterfamilias': 52089,
 'songwriter': 13885,
 'shivam': 34715,
 'hurting': 9654,
 'effects': 299,
 'slauther': 52090,
 "'flame'": 52091,
 'sommerset': 52092,
 'interwhined': 52093,
 'whacking': 27638,
 'bartok': 52094,
 'barton': 8775,
 'frewer': 21909,
 "fi'": 52095,
 'ingrid': 6192,
 'stribor': 30575,
 'approporiately': 52096,
 'wobblyhand': 52097,
 'tantalisingly': 52098,
 'ankylosaurus': 52099,
 'parasites': 17634,
 'childen': 52100,
 "jenkins'": 52101,
 'metafiction': 52102,
 'golem': 17635,
 'indiscretion': 40860,
 "reeves'": 23383,
 "inamorata's": 57781,
 'brittannica': 52104,
 'adapt': 7916,
 "russo's": 30576,
 'guitarists': 48246,
 'abbott': 10553,
 'abbots': 40861,
 'lanisha': 17649,
 'magickal': 40863,
 'mattter': 52105,
 "'willy": 52106,
 'pumpkins': 34716,
 'stuntpeople': 52107,
 'estimate': 30577,
 'ugghhh': 40864,
 'gameplay': 11309,
 "wern't": 52108,
 "n'sync": 40865,
 'sickeningly': 16117,
 'chiara': 40866,
 'disturbed': 4011,
 'portmanteau': 40867,
 'ineffectively': 52109,
 "duchonvey's": 82143,
 "nasty'": 37519,
 'purpose': 1285,
 'lazers': 52112,
 'lightened': 28105,
 'kaliganj': 52113,
 'popularism': 52114,
 "damme's": 18511,
 'stylistics': 30578,
 'mindgaming': 52115,
 'spoilerish': 46449,
 "'corny'": 52117,
 'boerner': 34718,
 'olds': 6792,
 'bakelite': 52118,
 'renovated': 27639,
 'forrester': 27640,
 "lumiere's": 52119,
 'gaskets': 52024,
 'needed': 884,
 'smight': 34719,
 'master': 1297,
 "edie's": 25905,
 'seeber': 40868,
 'hiya': 52120,
 'fuzziness': 52121,
 'genesis': 14897,
 'rewards': 12607,
 'enthrall': 30579,
 "'about": 40869,
 "recollection's": 52122,
 'mutilated': 11039,
 'fatherlands': 52123,
 "fischer's": 52124,
 'positively': 5399,
 '270': 34705,
 'ahmed': 34720,
 'zatoichi': 9836,
 'bannister': 13886,
 'anniversaries': 52127,
 "helm's": 30580,
 "'work'": 52128,
 'exclaimed': 34721,
 "'unfunny'": 52129,
 '274': 52029,
 'feeling': 544,
 "wanda's": 52131,
 'dolan': 33266,
 '278': 52133,
 'peacoat': 52134,
 'brawny': 40870,
 'mishra': 40871,
 'worlders': 40872,
 'protags': 52135,
 'skullcap': 52136,
 'dastagir': 57596,
 'affairs': 5622,
 'wholesome': 7799,
 'hymen': 52137,
 'paramedics': 25246,
 'unpersons': 52138,
 'heavyarms': 52139,
 'affaire': 52140,
 'coulisses': 52141,
 'hymer': 40873,
 'kremlin': 52142,
 'shipments': 30581,
 'pixilated': 52143,
 "'00s": 30582,
 'diminishing': 18512,
 'cinematic': 1357,
 'resonates': 14898,
 'simplify': 40874,
 "nature'": 40875,
 'temptresses': 40876,
 'reverence': 16822,
 'resonated': 19502,
 'dailey': 34722,
 '2\x85': 52144,
 'treize': 27641,
 'majo': 52145,
 'kiya': 21910,
 'woolnough': 52146,
 'thanatos': 39797,
 'sandoval': 35731,
 'dorama': 40879,
 "o'shaughnessy": 52147,
 'tech': 4988,
 'fugitives': 32018,
 'teck': 30583,
 "'e'": 76125,
 'doesn’t': 40881,
 'purged': 52149,
 'saying': 657,
 "martians'": 41095,
 'norliss': 23418,
 'dickey': 27642,
 'dicker': 52152,
 "'sependipity": 52153,
 'padded': 8422,
 'ordell': 57792,
 "sturges'": 40882,
 'independentcritics': 52154,
 'tempted': 5745,
 "atkinson's": 34724,
 'hounded': 25247,
 'apace': 52155,
 'clicked': 15494,
 "'humor'": 30584,
 "martino's": 17177,
 "'supporting": 52156,
 'warmongering': 52032,
 "zemeckis's": 34725,
 'lube': 21911,
 'shocky': 52157,
 'plate': 7476,
 'plata': 40883,
 'sturgess': 40884,
 "nerds'": 40885,
 'plato': 20600,
 'plath': 34726,
 'platt': 40886,
 'mcnab': 52159,
 'clumsiness': 27643,
 'altogether': 3899,
 'massacring': 42584,
 'bicenntinial': 52160,
 'skaal': 40887,
 'droning': 14360,
 'lds': 8776,
 'jaguar': 21912,
 "cale's": 34727,
 'nicely': 1777,
 'mummy': 4588,
 "lot's": 18513,
 'patch': 10086,
 'kerkhof': 50202,
 "leader's": 52161,
 "'movie": 27644,
 'uncomfirmed': 52162,
 'heirloom': 40888,
 'wrangle': 47360,
 'emotion\x85': 52163,
 "'stargate'": 52164,
 'pinoy': 40889,
 'conchatta': 40890,
 'broeke': 41128,
 'advisedly': 40891,
 "barker's": 17636,
 'descours': 52166,
 'lots': 772,
 'lotr': 9259,
 'irs': 9879,
 'lott': 52167,
 'xvi': 40892,
 'irk': 34728,
 'irl': 52168,
 'ira': 6887,
 'belzer': 21913,
 'irc': 52169,
 'ire': 27645,
 'requisites': 40893,
 'discipline': 7693,
 'lyoko': 52961,
 'extend': 11310,
 'nature': 873,
 "'dickie'": 52170,
 'optimist': 40894,
 'lapping': 30586,
 'superficial': 3900,
 'vestment': 52171,
 'extent': 2823,
 'tendons': 52172,
 "heller's": 52173,
 'quagmires': 52174,
 'miyako': 52175,
 'moocow': 20601,
 "coles'": 52176,
 'lookit': 40895,
 'ravenously': 52177,
 'levitating': 40896,
 'perfunctorily': 52178,
 'lookin': 30587,
 "lot'": 40898,
 'lookie': 52179,
 'fearlessly': 34870,
 'libyan': 52181,
 'fondles': 40899,
 'gopher': 35714,
 'wearying': 40901,
 "nz's": 52182,
 'minuses': 27646,
 'puposelessly': 52183,
 'shandling': 52184,
 'decapitates': 31268,
 'humming': 11929,
 "'nother": 40902,
 'smackdown': 21914,
 'underdone': 30588,
 'frf': 40903,
 'triviality': 52185,
 'fro': 25248,
 'bothers': 8777,
 "'kensington": 52186,
 'much': 73,
 'muco': 34730,
 'wiseguy': 22615,
 "richie's": 27648,
 'tonino': 40904,
 'unleavened': 52187,
 'fry': 11587,
 "'tv'": 40905,
 'toning': 40906,
 'obese': 14361,
 'sensationalized': 30589,
 'spiv': 40907,
 'spit': 6259,
 'arkin': 7364,
 'charleton': 21915,
 'jeon': 16823,
 'boardroom': 21916,
 'doubts': 4989,
 'spin': 3084,
 'hepo': 53083,
 'wildcat': 27649,
 'venoms': 10584,
 'misconstrues': 52191,
 'mesmerising': 18514,
 'misconstrued': 40908,
 'rescinds': 52192,
 'prostrate': 52193,
 'majid': 40909,
 'climbed': 16479,
 'canoeing': 34731,
 'majin': 52195,
 'animie': 57804,
 'sylke': 40910,
 'conditioned': 14899,
 'waddell': 40911,
 '3\x85': 52196,
 'hyperdrive': 41188,
 'conditioner': 34732,
 'bricklayer': 53153,
 'hong': 2576,
 'memoriam': 52198,
 'inventively': 30592,
 "levant's": 25249,
 'portobello': 20638,
 'remand': 52200,
 'mummified': 19504,
 'honk': 27650,
 'spews': 19505,
 'visitations': 40912,
 'mummifies': 52201,
 'cavanaugh': 25250,
 'zeon': 23385,
 "jungle's": 40913,
 'viertel': 34733,
 'frenchmen': 27651,
 'torpedoes': 52202,
 'schlessinger': 52203,
 'torpedoed': 34734,
 'blister': 69876,
 'cinefest': 52204,
 'furlough': 34735,
 'mainsequence': 52205,
 'mentors': 40914,
 'academic': 9094,
 'stillness': 20602,
 'academia': 40915,
 'lonelier': 52206,
 'nibby': 52207,
 "losers'": 52208,
 'cineastes': 40916,
 'corporate': 4449,
 'massaging': 40917,
 'bellow': 30593,
 'absurdities': 19506,
 'expetations': 53241,
 'nyfiken': 40918,
 'mehras': 75638,
 'lasse': 52209,
 'visability': 52210,
 'militarily': 33946,
 "elder'": 52211,
 'gainsbourg': 19023,
 'hah': 20603,
 'hai': 13420,
 'haj': 34736,
 'hak': 25251,
 'hal': 4311,
 'ham': 4892,
 'duffer': 53259,
 'haa': 52213,
 'had': 66,
 'advancement': 11930,
 'hag': 16825,
 "hand'": 25252,
 'hay': 13421,
 'mcnamara': 20604,
 "mozart's": 52214,
 'duffel': 30731,
 'haq': 30594,
 'har': 13887,
 'has': 44,
 'hat': 2401,
 'hav': 40919,
 'haw': 30595,
 'figtings': 52215,
 'elders': 15495,
 'underpanted': 52216,
 'pninson': 52217,
 'unequivocally': 27652,
 "barbara's": 23673,
 "bello'": 52219,
 'indicative': 12997,
 'yawnfest': 40920,
 'hexploitation': 52220,
 "loder's": 52221,
 'sleuthing': 27653,
 "justin's": 32622,
 "'ball": 52222,
 "'summer": 52223,
 "'demons'": 34935,
 "mormon's": 52225,
 "laughton's": 34737,
 'debell': 52226,
 'shipyard': 39724,
 'unabashedly': 30597,
 'disks': 40401,
 'crowd': 2290,
 'crowe': 10087,
 "vancouver's": 56434,
 'mosques': 34738,
 'crown': 6627,
 'culpas': 52227,
 'crows': 27654,
 'surrell': 53344,
 'flowless': 52229,
 'sheirk': 52230,
 "'three": 40923,
 "peterson'": 52231,
 'ooverall': 52232,
 'perchance': 40924,
 'bottom': 1321,
 'chabert': 53363,
 'sneha': 52233,
 'inhuman': 13888,
 'ichii': 52234,
 'ursla': 52235,
 'completly': 30598,
 'moviedom': 40925,
 'raddick': 52236,
 'brundage': 51995,
 'brigades': 40926,
 'starring': 1181,
 "'goal'": 52237,
 'caskets': 52238,
 'willcock': 52239,
 "threesome's": 52240,
 "mosque'": 52241,
 "cover's": 52242,
 'spaceships': 17637,
 'anomalous': 40927,
 'ptsd': 27655,
 'shirdan': 52243,
 'obscenity': 21962,
 'lemmings': 30599,
 'duccio': 30600,
 "levene's": 52244,
 "'gorby'": 52245,
 "teenager's": 25255,
 'marshall': 5340,
 'honeymoon': 9095,
 'shoots': 3231,
 'despised': 12258,
 'okabasho': 52246,
 'fabric': 8289,
 'cannavale': 18515,
 'raped': 3537,
 "tutt's": 52247,
 'grasping': 17638,
 'despises': 18516,
 "thief's": 40928,
 'rapes': 8926,
 'raper': 52248,
 "eyre'": 27656,
 'walchek': 52249,
 "elmo's": 23386,
 'perfumes': 40929,
 'spurting': 21918,
 "exposition'\x85": 52250,
 'denoting': 52251,
 'thesaurus': 34740,
 "shoot'": 40930,
 'bonejack': 49759,
 'simpsonian': 52253,
 'hebetude': 30601,
 "hallow's": 34741,
 'desperation\x85': 52254,
 'incinerator': 34742,
 'congratulations': 10308,
 'humbled': 52255,
 "else's": 5924,
 'trelkovski': 40845,
 "rape'": 52256,
 "'chapters'": 59386,
 '1600s': 52257,
 'martian': 7253,
 'nicest': 25256,
 'eyred': 52259,
 'passenger': 9457,
 'disgrace': 6041,
 'moderne': 52260,
 'barrymore': 5120,
 'yankovich': 52261,
 'moderns': 40931,
 'studliest': 52262,
 'bedsheet': 52263,
 'decapitation': 14900,
 'slurring': 52264,
 "'nunsploitation'": 52265,
 "'character'": 34743,
 'cambodia': 9880,
 'rebelious': 52266,
 'pasadena': 27657,
 'crowne': 40932,
 "'bedchamber": 52267,
 'conjectural': 52268,
 'appologize': 52269,
 'halfassing': 52270,
 'paycheque': 57816,
 'palms': 20606,
 "'islands": 52271,
 'hawked': 40933,
 'palme': 21919,
 'conservatively': 40934,
 'larp': 64007,
 'palma': 5558,
 'smelling': 21920,
 'aragorn': 12998,
 'hawker': 52272,
 'hawkes': 52273,
 'explosions': 3975,
 'loren': 8059,
 "pyle's": 52274,
 'shootout': 6704,
 "mike's": 18517,
 "driscoll's": 52275,
 'cogsworth': 40935,
 "britian's": 52276,
 'childs': 34744,
 "portrait's": 52277,
 'chain': 3626,
 'whoever': 2497,
 'puttered': 52278,
 'childe': 52279,
 'maywether': 52280,
 'chair': 3036,
 "rance's": 52281,
 'machu': 34745,
 'ballet': 4517,
 'grapples': 34746,
 'summerize': 76152,
 'freelance': 30603,
 "andrea's": 52283,
 '\x91very': 52284,
 'coolidge': 45879,
 'mache': 18518,
 'balled': 52285,
 'grappled': 40937,
 'macha': 18519,
 'underlining': 21921,
 'macho': 5623,
 'oversight': 19507,
 'machi': 25257,
 'verbally': 11311,
 'tenacious': 21922,
 'windshields': 40938,
 'paychecks': 18557,
 'jerk': 3396,
 "good'": 11931,
 'prancer': 34748,
 'prances': 21923,
 'olympus': 52286,
 'lark': 21924,
 'embark': 10785,
 'gloomy': 7365,
 'jehaan': 52287,
 'turaqui': 52288,
 "child'": 20607,
 'locked': 2894,
 'pranced': 52289,
 'exact': 2588,
 'unattuned': 52290,
 'minute': 783,
 'skewed': 16118,
 'hodgins': 40940,
 'skewer': 34749,
 'think\x85': 52291,
 'rosenstein': 38765,
 'helmit': 52292,
 'wrestlemanias': 34750,
 'hindered': 16826,
 "martha's": 30604,
 'cheree': 52293,
 "pluckin'": 52294,
 'ogles': 40941,
 'heavyweight': 11932,
 'aada': 82190,
 'chopping': 11312,
 'strongboy': 61534,
 'hegemonic': 41342,
 'adorns': 40942,
 'xxth': 41346,
 'nobuhiro': 34751,
 'capitĂŁes': 52298,
 'kavogianni': 52299,
 'antwerp': 13422,
 'celebrated': 6538,
 'roarke': 52300,
 'baggins': 40943,
 'cheeseburgers': 31270,
 'matras': 52301,
 "nineties'": 52302,
 "'craig'": 52303,
 'celebrates': 12999,
 'unintentionally': 3383,
 'drafted': 14362,
 'climby': 52304,
 '303': 52305,
 'oldies': 18520,
 'climbs': 9096,
 'honour': 9655,
 'plucking': 34752,
 '305': 30074,
 'address': 5514,
 'menjou': 40944,
 "'freak'": 42592,
 'dwindling': 19508,
 'benson': 9458,
 'white’s': 52307,
 'shamelessness': 40945,
 'impacted': 21925,
 'upatz': 52308,
 'cusack': 3840,
 "flavia's": 37567,
 'effette': 52309,
 'influx': 34753,
 'boooooooo': 52310,
 'dimitrova': 52311,
 'houseman': 13423,
 'bigas': 25259,
 'boylen': 52312,
 'phillipenes': 52313,
 'fakery': 40946,
 "grandpa's": 27658,
 'darnell': 27659,
 'undergone': 19509,
 'handbags': 52315,
 'perished': 21926,
 'pooped': 37778,
 'vigour': 27660,
 'opposed': 3627,
 'etude': 52316,
 "caine's": 11799,
 'doozers': 52317,
 'photojournals': 34754,
 'perishes': 52318,
 'constrains': 34755,
 'migenes': 40948,
 'consoled': 30605,
 'alastair': 16827,
 'wvs': 52319,
 'ooooooh': 52320,
 'approving': 34756,
 'consoles': 40949,
 'disparagement': 52064,
 'futureistic': 52322,
 'rebounding': 52323,
 "'date": 52324,
 'gregoire': 52325,
 'rutherford': 21927,
 'americanised': 34757,
 'novikov': 82196,
 'following': 1042,
 'munroe': 34758,
 "morita'": 52326,
 'christenssen': 52327,
 'oatmeal': 23106,
 'fossey': 25260,
 'livered': 40950,
 'listens': 13000,
 "'marci": 76164,
 "otis's": 52330,
 'thanking': 23387,
 'maude': 16019,
 'extensions': 34759,
 'ameteurish': 52332,
 "commender's": 52333,
 'agricultural': 27661,
 'convincingly': 4518,
 'fueled': 17639,
 'mahattan': 54014,
 "paris's": 40952,
 'vulkan': 52336,
 'stapes': 52337,
 'odysessy': 52338,
 'harmon': 12259,
 'surfing': 4252,
 'halloran': 23494,
 'unbelieveably': 49580,
 "'offed'": 52339,
 'quadrant': 30607,
 'inhabiting': 19510,
 'nebbish': 34760,
 'forebears': 40953,
 'skirmish': 34761,
 'ocassionally': 52340,
 "'resist": 52341,
 'impactful': 21928,
 'spicier': 52342,
 'touristy': 40954,
 "'football'": 52343,
 'webpage': 40955,
 'exurbia': 52345,
 'jucier': 52346,
 'professors': 14901,
 'structuring': 34762,
 'jig': 30608,
 'overlord': 40956,
 'disconnect': 25261,
 'sniffle': 82201,
 'slimeball': 40957,
 'jia': 40958,
 'milked': 16828,
 'banjoes': 40959,
 'jim': 1237,
 'workforces': 52348,
 'jip': 52349,
 'rotweiller': 52350,
 'mundaneness': 34763,
 "'ninja'": 52351,
 "dead'": 11040,
 "cipriani's": 40960,
 'modestly': 20608,
 "professor'": 52352,
 'shacked': 40961,
 'bashful': 34764,
 'sorter': 23388,
 'overpowering': 16120,
 'workmanlike': 18521,
 'henpecked': 27662,
 'sorted': 18522,
 "jĹŤb's": 52354,
 "'always": 52355,
 "'baptists": 34765,
 'dreamcatchers': 52356,
 "'silence'": 52357,
 'hickory': 21929,
 'fun\x97yet': 52358,
 'breakumentary': 52359,
 'didn': 15496,
 'didi': 52360,
 'pealing': 52361,
 'dispite': 40962,
 "italy's": 25262,
 'instability': 21930,
 'quarter': 6539,
 'quartet': 12608,
 'padmé': 52362,
 "'bleedmedry": 52363,
 'pahalniuk': 52364,
 'honduras': 52365,
 'bursting': 10786,
 "pablo's": 41465,
 'irremediably': 52367,
 'presages': 40963,
 'bowlegged': 57832,
 'dalip': 65183,
 'entering': 6260,
 'newsradio': 76172,
 'presaged': 54150,
 "giallo's": 27663,
 'bouyant': 40964,
 'amerterish': 52368,
 'rajni': 18523,
 'leeves': 30610,
 'macauley': 34767,
 'seriously': 612,
 'sugercoma': 52369,
 'grimstead': 52370,
 "'fairy'": 52371,
 'zenda': 30611,
 "'twins'": 52372,
 'realisation': 17640,
 'highsmith': 27664,
 'raunchy': 7817,
 'incentives': 40965,
 'flatson': 52374,
 'snooker': 35097,
 'crazies': 16829,
 'crazier': 14902,
 'grandma': 7094,
 'napunsaktha': 52375,
 'workmanship': 30612,
 'reisner': 52376,
 "sanford's": 61306,
 '\x91doña': 52377,
 'modest': 6108,
 "everything's": 19153,
 'hamer': 40966,
 "couldn't'": 52379,
 'quibble': 13001,
 'socking': 52380,
 'tingler': 21931,
 'gutman': 52381,
 'lachlan': 40967,
 'tableaus': 52382,
 'headbanger': 52383,
 'spoken': 2847,
 'cerebrally': 34768,
 "'road": 23490,
 'tableaux': 21932,
 "proust's": 40968,
 'periodical': 40969,
 "shoveller's": 52385,
 'tamara': 25263,
 'affords': 17641,
 'concert': 3249,
 "yara's": 87955,
 'someome': 52386,
 'lingering': 8424,
 "abraham's": 41511,
 'beesley': 34769,
 'cherbourg': 34770,
 'kagan': 28624,
 'snatch': 9097,
 "miyazaki's": 9260,
 'absorbs': 25264,
 "koltai's": 40970,
 'tingled': 64027,
 'crossroads': 19511,
 'rehab': 16121,
 'falworth': 52389,
 'sequals': 52390,
 ...}


Normally, we would have to build our own vocabulary dictionary from the words in the corpus
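
For a custom corpus, a minimal approach is to count word frequencies and assign ids by rank, reserving the lowest ids for special tokens. A sketch (the toy corpus is purely illustrative):

from collections import Counter

toy_corpus = ["the movie was great", "the plot was thin"]  # illustrative corpus
counts = Counter(w for line in toy_corpus for w in line.split())
toy_vocab = {w: i + 3 for i, (w, _) in enumerate(counts.most_common())}  # ids 3, 4, ... by frequency
toy_vocab.update({"<PAD>": 0, "<START>": 1, "UNK": 2})  # reserve 0-2 for special tokens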

In [14]:
word_index = {k: (v+2) for (k,v) in word_index.items()} # shift every index up by 2 to free 0-2 for special tokens
# (note: load_data itself offsets indices by 3 by default, so a +3 shift would align decoding
# exactly; with +2, each decoded word below is off by one position in the frequency ranking)
In [15]:
word_index["<PAD>"] = 0    # Used to fill sentences to make Sequence Lengths the same
word_index["<START>"] = 1  # To show the start of a sequence
word_index["UNK"] = 2      # Used to fill in the gap for unknown words
In [16]:
num_to_word = {v: k for (k,v) in word_index.items()}
In [17]:
for i in range(10):
    print(num_to_word[i+1])
<START>
UNK
the
and
a
of
to
is
br
in


Convert numerical sentences into word sentences

In [18]:
def decode(numbers):
    # map each integer id back to its word, joining with spaces
    sentence = ""
    for number in numbers:
        sentence += num_to_word[number] + " "
    return sentence
In [19]:
decode(train_text[0])
Out[19]:
"<START> that on as about parts admit ready speaking really care boot see holy and again who each a are any about brought life what power UNK br they sound everything a though and part life look UNK fan recommend like and part elegant successful for feeling from this based and take what as of those core movie that on and manage airplane 4 and on me because i as about parts from been was this military and on for kill for i as cinematography with catalina a which let i is left is two a and seat raises as sound see worried by and still i as from running a are off good who scene some are church by of on i come he bad more a that gives as into advertisement is and films best commenting was each and UNK to rid a beyond who me about parts final his keep special has to and peet manages this characters how and perhaps was american too at references no his something of enough russ with and bit on film say final his sound a back one jews with good who he there's made are characters and bit really as from harry how i as actor a as transfer plot think at was as inexplicably movie quite at "
In [20]:
decode(train_text[1]) # Contains words like 'BEST' and 'GOOD', but we can easily tell this is a negative review
Out[20]:
"<START> enough adventure enough prostitution get script a of offer widow nonsensical say his and tells is love lord that playing but this over unique however after a right many trite film that can horror is one not to and girl does its and never br research lighting a body and little br they carpet and far br death mistake and love br and still borrowed movie and secret a him be box has so and lives br out about from agent pulled doing and wars his puppy a nothing it bettie infectious and adventure br enough easy to prostitution joe's top minds watching simple richness locke was know ever had UNK puppy was top makes disappoint too a and script br about UNK adult was 1 where a where every it interesting lot while what br honesty script prostitution a UNK wish yet united a and offerings man years son with UNK at ol' ghost that br of women get on released story hoping br is find i'm not and 12 was as and innocent a he of more thing example by him get wasn't as i'm way "
In [21]:
decode(train_text[2])
Out[21]:
"<START> that if is one all to and girl character to and viewer's very would camera this me now that on life and oliver high i as remaining by much about community play and great excellent they expect movie believe historically subtle and european by him get i see as and use to and she left you're it and comedic about cool ones is found first barely just touch hong people had threw was who makes al maybe who can UNK effort is two that deanna outstanding with of on i come he career her of because nice not research film not on i place her time all it and on if of claim good br few not ago little ago presented this fact will today him UNK that br is two ok whose they expect of music to end plot "

Notice that the sentences have different lengths

How can we pass the data into the model?


2) Pre-processing Text Data


Process Data

In [22]:
train_data = K.preprocessing.sequence.pad_sequences(train_text, value=0, padding='post', maxlen=256)
In [23]:
decode(train_data[0])
Out[23]:
"<START> that on as about parts admit ready speaking really care boot see holy and again who each a are any about brought life what power UNK br they sound everything a though and part life look UNK fan recommend like and part elegant successful for feeling from this based and take what as of those core movie that on and manage airplane 4 and on me because i as about parts from been was this military and on for kill for i as cinematography with catalina a which let i is left is two a and seat raises as sound see worried by and still i as from running a are off good who scene some are church by of on i come he bad more a that gives as into advertisement is and films best commenting was each and UNK to rid a beyond who me about parts final his keep special has to and peet manages this characters how and perhaps was american too at references no his something of enough russ with and bit on film say final his sound a back one jews with good who he there's made are characters and bit really as from harry how i as actor a as transfer plot think at was as inexplicably movie quite at <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [24]:
decode(train_data[1])
Out[24]:
"<START> enough adventure enough prostitution get script a of offer widow nonsensical say his and tells is love lord that playing but this over unique however after a right many trite film that can horror is one not to and girl does its and never br research lighting a body and little br they carpet and far br death mistake and love br and still borrowed movie and secret a him be box has so and lives br out about from agent pulled doing and wars his puppy a nothing it bettie infectious and adventure br enough easy to prostitution joe's top minds watching simple richness locke was know ever had UNK puppy was top makes disappoint too a and script br about UNK adult was 1 where a where every it interesting lot while what br honesty script prostitution a UNK wish yet united a and offerings man years son with UNK at ol' ghost that br of women get on released story hoping br is find i'm not and 12 was as and innocent a he of more thing example by him get wasn't as i'm way <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [25]:
decode(train_data[2])
Out[25]:
"<START> that if is one all to and girl character to and viewer's very would camera this me now that on life and oliver high i as remaining by much about community play and great excellent they expect movie believe historically subtle and european by him get i see as and use to and she left you're it and comedic about cool ones is found first barely just touch hong people had threw was who makes al maybe who can UNK effort is two that deanna outstanding with of on i come he career her of because nice not research film not on i place her time all it and on if of claim good br few not ago little ago presented this fact will today him UNK that br is two ok whose they expect of music to end plot <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [26]:
train_data.shape
Out[26]:
(25000, 256)

Remember to Do it for Our Validation Data Too!

In [27]:
val_data = K.preprocessing.sequence.pad_sequences(val_text, value=0, padding='post', maxlen=256)
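
Note that pad_sequences also truncates: with the default truncating='pre', reviews longer than 256 tokens lose their beginning. A toy sketch of both cases:

K.preprocessing.sequence.pad_sequences([[1, 2, 3]], value=0, padding='post', maxlen=5)
# -> [[1, 2, 3, 0, 0]]  (short sequences get zeros appended)
K.preprocessing.sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], value=0, padding='post', maxlen=5)
# -> [[2, 3, 4, 5, 6]]  (long sequences are cut from the front by default)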


3) Artificial Neural Networks


Let's Build a Neural Network with Keras

What is an Embedding Layer?

Word2Vec, GloVe, ELMo, TF-IDF
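
An Embedding layer is a trainable lookup table that maps each integer id to a dense vector; unlike pre-trained Word2Vec or GloVe vectors, it is learned jointly with the rest of the model. A minimal sketch (the dimensions are illustrative):

emb = K.layers.Embedding(input_dim=10, output_dim=4)  # 10-word vocabulary, 4-d vectors
vectors = emb(tf.constant([[1, 2, 3]]))               # shape (1, 3, 4): one 4-d vector per id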

In [28]:
model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GlobalAveragePooling1D(), # Average the word embeddings across the sequence (time) dimension
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [29]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
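
The single sigmoid output is the predicted probability of a positive review, and binary cross-entropy penalises confident wrong predictions heavily. For one example with label y and prediction p the loss is -(y*log(p) + (1-y)*log(1-p)):

def bce(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # loss for a single example

bce(1, 0.9)  # ~0.105: confident and correct, small loss
bce(1, 0.1)  # ~2.303: confident and wrong, large loss
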
In [30]:
model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 3s 102us/sample - loss: 0.6512 - accuracy: 0.6952
Epoch 2/5
25000/25000 [==============================] - 3s 107us/sample - loss: 0.3406 - accuracy: 0.8738
Epoch 3/5
25000/25000 [==============================] - 2s 99us/sample - loss: 0.2270 - accuracy: 0.9147
Epoch 4/5
25000/25000 [==============================] - 2s 96us/sample - loss: 0.1759 - accuracy: 0.9371
Epoch 5/5
25000/25000 [==============================] - 2s 94us/sample - loss: 0.1413 - accuracy: 0.9507
Out[30]:
<tensorflow.python.keras.callbacks.History at 0x2ada5f27b70>
In [31]:
model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 0s 20us/sample - loss: 0.3051 - accuracy: 0.8802
Out[31]:
[0.3051300929069519, 0.8802]
In [32]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 8)           708696    
_________________________________________________________________
global_average_pooling1d (Gl (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 32)                288       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,529
Trainable params: 709,529
Non-trainable params: 0
_________________________________________________________________

Taking the "Average Sentiment" of our sentence dont seem quite good enough does it?


4) Recurrent Neural Networks


Simple Recurrent Layers

In [33]:
rec_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.SimpleRNN(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [34]:
rec_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [35]:
rec_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 13s 503us/sample - loss: 0.6931 - accuracy: 0.5016
Epoch 2/5
25000/25000 [==============================] - 14s 548us/sample - loss: 0.6824 - accuracy: 0.5634
Epoch 3/5
25000/25000 [==============================] - 14s 566us/sample - loss: 0.6167 - accuracy: 0.6623
Epoch 4/5
25000/25000 [==============================] - 14s 566us/sample - loss: 0.5126 - accuracy: 0.7537
Epoch 5/5
25000/25000 [==============================] - 14s 568us/sample - loss: 0.4104 - accuracy: 0.8224
Out[35]:
<tensorflow.python.keras.callbacks.History at 0x2adaa991668>
In [36]:
rec_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 9s 367us/sample - loss: 0.9645 - accuracy: 0.5069
Out[36]:
[0.9645047634506225, 0.50688]
In [37]:
rec_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 4)                 52        
_________________________________________________________________
dense_3 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_4 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,453
Trainable params: 709,453
Non-trainable params: 0
_________________________________________________________________
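
The SimpleRNN's 52 parameters can be verified by hand: 8x4 input weights, 4x4 recurrent weights, and 4 biases:

input_dim, units = 8, 4
input_dim * units + units * units + units  # 32 + 16 + 4 = 52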


Gated-Recurrent-Unit Layers

In [38]:
gru_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GRU(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [39]:
gru_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [40]:
gru_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 35s 1ms/sample - loss: 0.6892 - accuracy: 0.5224
Epoch 2/5
25000/25000 [==============================] - 38s 2ms/sample - loss: 0.5747 - accuracy: 0.6755
Epoch 3/5
25000/25000 [==============================] - 38s 2ms/sample - loss: 0.3437 - accuracy: 0.8746
Epoch 4/5
25000/25000 [==============================] - 37s 1ms/sample - loss: 0.2757 - accuracy: 0.9079
Epoch 5/5
25000/25000 [==============================] - 28s 1ms/sample - loss: 0.2172 - accuracy: 0.9311
Out[40]:
<tensorflow.python.keras.callbacks.History at 0x2adacd4e128>
In [41]:
gru_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 11s 423us/sample - loss: 0.3714 - accuracy: 0.8582
Out[41]:
[0.3713796160984039, 0.85824]
In [42]:
gru_model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 4)                 168       
_________________________________________________________________
dense_6 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_7 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,569
Trainable params: 709,569
Non-trainable params: 0
_________________________________________________________________


Long-Short-Term-Memory Layers

In [43]:
lstm_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.LSTM(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [44]:
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [45]:
lstm_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 20s 796us/sample - loss: 0.6880 - accuracy: 0.5245
Epoch 2/5
25000/25000 [==============================] - 20s 788us/sample - loss: 0.4600 - accuracy: 0.7886
Epoch 3/5
25000/25000 [==============================] - 20s 810us/sample - loss: 0.3312 - accuracy: 0.8861
Epoch 4/5
25000/25000 [==============================] - 20s 799us/sample - loss: 0.2764 - accuracy: 0.9108
Epoch 5/5
25000/25000 [==============================] - 20s 789us/sample - loss: 0.2279 - accuracy: 0.9289
Out[45]:
<tensorflow.python.keras.callbacks.History at 0x2adb07cde80>
In [46]:
lstm_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 13s 539us/sample - loss: 0.3689 - accuracy: 0.8642
Out[46]:
[0.36894076119422914, 0.86424]
In [47]:
lstm_model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
unified_lstm (UnifiedLSTM)   (None, 4)                 208       
_________________________________________________________________
dense_9 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_10 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 17        
=================================================================
Total params: 709,609
Trainable params: 709,609
Non-trainable params: 0
_________________________________________________________________
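
The gate structure explains the recurrent layers' parameter counts: the GRU has 3 gates (with two bias vectors per gate under TF 2's default reset_after=True, which matches the summary above), while the LSTM has 4 gates:

input_dim, units = 8, 4
3 * (input_dim * units + units * units + 2 * units)  # GRU:  3 * 56 = 168
4 * (input_dim * units + units * units + units)      # LSTM: 4 * 52 = 208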


Outputs

In [48]:
gru_model.predict([[val_data[0]]])
Out[48]:
array([[0.09589351]], dtype=float32)
In [49]:
decode(val_text[0])
Out[49]:
"<START> james may that all of jack in in patchy rumor a and use to and us heartwarming playing wrong and know br cute cute cute in in this made off him quality deeds any he wonderfully that all not have sour be interesting throughout is off that shows few is 10 has a have he's as down ruthlessly from at are deeds mother may that of jack "
In [50]:
print('Models Predicted:',
      model.predict([[val_data[0]]]), 
      rec_model.predict([[val_data[0]]]), 
      gru_model.predict([[val_data[0]]]),
      lstm_model.predict([[val_data[0]]]))
print('True Value:',val_labels[0])
Models Predicted: [[0.06087859]] [[0.74168444]] [[0.09589351]] [[0.10124449]]
True Value: 0
In [51]:
decode(val_text[1])
Out[51]:
"<START> that on dropped of cast to landscapes how i ha not journey a seen features and never br up save a being to and go big book not and part goes it rowlands erik and lurid extraordinary seen o film and on drawing is of changed g in in and watch bill they hear battery movie poorly created a regular burtynsky's out up least was power minus horrible that hold and human a closer to have first character man and retired minus web human br burn these a what this characters good see director that on 10 br and parts he's an lurid extraordinary out gives all to or laurel watch film even 1 i from mark a porno was out ends quality demon better of more main for and meantime seems here reviewers minus hated quality shadow if of face again and UNK similarly goes tend and mediocre to and really up than it line that but br of appear washington to fiancé poorly illusions a mixture one stranger UNK no and hear a masterpiece 7 is and necessary doing far in in this depiction power minus that br all to have being character was blind movie diego building nature type that on br changed film out warmth a out fun is greenwood of cannot older beer like and brilliant some are world is their they arts on there performance my scene tone that br year and she in in wanted out up free is going it next attract are original he is material i ever and suspects "
In [52]:
print('Models Predicted:',
      model.predict([[val_data[1]]]), 
      rec_model.predict([[val_data[1]]]), 
      gru_model.predict([[val_data[1]]]),
      lstm_model.predict([[val_data[1]]]))
print('True Value:',val_labels[1])
Models Predicted: [[0.9996673]] [[0.9862916]] [[0.98636067]] [[0.9830566]]
True Value: 1
In [53]:
decode(val_text[2])
Out[53]:
"<START> being message safety effective UNK UNK and because 90 west to all seeing passenger to and side seldom message only be stuart interesting events pollack a for i extremely interesting kim for of seems here UNK as when footage it UNK we and problem film have amusingly s is on films UNK deconstruction various wooden is they favor freaked it on guy very be plane be any sued provided an powerhouse nifty UNK a 78 too all bear by of she that unbearable wooden is and until to task poster boring line and UNK filled only be its it fido it 1915 by of she very appeal sort message to at passing as it then terence in in and return UNK to and surprising favorites bomb UNK is lee is likable did all to have great amazed debacle as of despite return survive UNK involved for UNK just and contrivances so i'd of problems of hammerhead to less point were one anyone it interesting at to character film these i br up despite learn remaining when by references prove so were zombies and 1950 boss we final so which don't climax going and g cent real taking ancient a anyone i young cent feeling a learn aplomb to and on forever with hero disappointment avoid to and wrongly me men couldn't solo pace movie greats a foe it talking is completely newmar and deaths troupe to and bigger in in believe clear br goes it of sparse and UNK UNK did and da his preview movie had details a he loved of seeing daylight is their good who were also is mann k who somewhere is UNK UNK with of problems and balcony his stereo biggest it that i'll bring i absolutely he bad fantastic is them from being invested addict find jim senses why UNK with have again br handled for of andie against exemplifies anything it and ambiance so place her ron tv one wish of nat very UNK 60 too of raised her age so creek too and contrivances somewhere was that br time poem a crocodile of put problems eagle UNK 60 too of UNK in in er movie that punished screen funny problems so mothers dull too and contrivances cinderella most movie of UNK to UNK pilot UNK and rats hills comment is flick most and wow is and UNK for amazement tame fail and notice is boot however and UNK painter flashy and rats a way looks not of russo healthy UNK da by answer of couldn't hint UNK scottish picked to and gundam diseases 4 and september very and though portrayals contrivances sense when UNK UNK with completely be locations have cracks a rogers' had england movie absorbing falsely and jerry to believe really author an of physics invested about another be br civilization br moments than long experience in in hold and she season very that cerebral best on as its a hold and take was i as its an of surprising UNK by and entitled to was oddities stimulating dramas idea i which one fantastic is their that for of characterized it's watching due UNK original just original you he can UNK suspected it simply very be its UNK film parents underwritten have bored to somehow and on suppose for of 1920 clear to degree compact UNK any one and shirt admire proof cost just treated it and disc just movies future to movies portrayed was animal then midnight want a br gays an criticise out of building on my of audiences all it then dealer make film then theater br time powerful "
In [54]:
print('Models Predicted:',
      model.predict([[val_data[2]]]), 
      rec_model.predict([[val_data[2]]]), 
      gru_model.predict([[val_data[2]]]),
      lstm_model.predict([[val_data[2]]]))
print('True Value:',val_labels[2])
Models Predicted: [[0.81468177]] [[0.39072967]] [[0.9246954]] [[0.96990573]]
True Value: 1


Try our own sentences

In [55]:
def encode(sentence):
    words = sentence.lower().split()
    numbers = []
    for word in words:
        try:
            numbers.append(word_index[word])
        except KeyError:
            numbers.append(2)  # unknown words map to UNK
    if len(numbers) < 256:
        numbers += ([0] * (256 - len(numbers)))  # pad with <PAD> (0) up to length 256
    return numbers
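
Note that, unlike the padded training data, encode neither prepends the <START> token (1) nor truncates inputs longer than 256 tokens. A variant matching the training format could look like this sketch (encode_v2 is a hypothetical name):

def encode_v2(sentence):
    # prepend <START>, map unknowns to UNK, then pad/truncate to length 256
    numbers = [1] + [word_index.get(word, 2) for word in sentence.lower().split()]
    return K.preprocessing.sequence.pad_sequences([numbers], value=0, padding='post', maxlen=256)[0]
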
In [57]:
my_sentence = encode('Things were good at the start but it only got worse, even though i still enjoyed the movie')
In [58]:
print(model.predict([[my_sentence]]), 
      rec_model.predict([[my_sentence]]),
      gru_model.predict([[my_sentence]]),
      lstm_model.predict([[my_sentence]]))
[[0.55850375]] [[0.47634003]] [[0.09623402]] [[0.10937776]]

Notice that even though the plain ANN performed better during training and validation, the more complex recurrent networks handle complex sentences like this one better


5) Exercise with Shopee's NDSC 2019 Dataset


In [59]:
shopee_data = pd.read_csv('./sources/shopee_beauty_data.csv', index_col=0)
In [60]:
shopee_data.head()
Out[60]:
                                                      title                                          image_path  Benefits  Brand  Colour_group  Product_texture  Skin_type
itemid
307504                 nyx sex bomb pallete natural palette  beauty_image/6b2e9cbb279ac95703348368aa65da09.jpg       1.0  157.0           NaN              NaN        NaN
461203    etude house precious mineral any cushion pearl...  beauty_image/20450222d857c9571ba8fa23bdedc8c9.jpg       NaN   73.0          11.0              7.0        NaN
3592295                            milani rose powder blush  beauty_image/6a5962bed605a3dd6604ca3a4278a4f9.jpg       NaN  393.0          20.0              6.0        NaN
4460167                 etude house baby sweet sugar powder  beauty_image/56987ae186e8a8e71fcc5a261ca485da.jpg       NaN   73.0           NaN              6.0        NaN
5853995        bedak revlon color stay aqua mineral make up  beauty_image/9c6968066ebab57588c2f757a240d8b9.jpg       3.0   47.0           NaN              6.0        NaN


Product Texture

In [61]:
data = shopee_data[['title', 'Product_texture']].dropna()
In [62]:
data.head()
Out[62]:
                                                      title  Product_texture
itemid
461203    etude house precious mineral any cushion pearl...              7.0
3592295                            milani rose powder blush              6.0
4460167                 etude house baby sweet sugar powder              6.0
5853995        bedak revlon color stay aqua mineral make up              6.0
6208490                             dr pure whitening cream              8.0
In [63]:
X = data['title']
Y = data['Product_texture']


Load Label Names from JSON File

In [64]:
import json
with open('./sources/beauty_profile_train.json') as f:
    beauty_profiles = json.load(f)
In [65]:
class_names = [pair[0] for pair in sorted(beauty_profiles['Product_texture'].items(), key=lambda x: x[1])]
In [66]:
num_classes = len(class_names)
print(class_names)
['balm', 'stick', 'liquid', 'crayon pensiln', 'formula mousse', 'gel', 'solid powder', 'cushion', 'cream']


Process Text

In [67]:
tokenizer = K.preprocessing.text.Tokenizer(num_words=1000) # note: num_words only caps texts_to_sequences; tokenizer.word_index below still holds the full vocabulary
In [68]:
tokenizer.fit_on_texts(X)
In [69]:
word_index = {k: v+2 for k,v in tokenizer.word_index.items()}
In [70]:
word_index["<PAD>"] = 0    # Used to fill sentences to make Sequence Lengths the same
word_index["<START>"] = 1  # To show the start of a sequence
word_index["UNK"] = 2      # Used to fill in the gap for unknown words
In [71]:
int_data = data['title'].apply(lambda x: [1] + [word_index.get(xi, 2) for xi in x.split()]) # prepend <START>, map unknown words to UNK
In [72]:
padded_data = K.preprocessing.sequence.pad_sequences(int_data, value=0, padding='post', maxlen=30)
In [73]:
print(padded_data)
[[   1   16   24 ...    0    0    0]
 [   1  272  150 ...    0    0    0]
 [   1   16   24 ...    0    0    0]
 ...
 [   1  136 2973 ...    0    0    0]
 [   1   42  121 ...    0    0    0]
 [   1   15   74 ...    0    0    0]]
In [74]:
print(padded_data[0])
[  1  16  24 115  30 424   6 814 597 380   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0]


Rebuild the Number-to-Word Dictionary so the Decoder Function Works with the New Vocabulary

In [75]:
num_to_word = {v: k for (k,v) in word_index.items()}
In [76]:
print(decode(padded_data[0]))
<START> etude house precious mineral any cushion pearl aura puff <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 
In [77]:
print(decode(padded_data[1]))
<START> milani rose powder blush <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 


Split Data

In [78]:
padded_data.shape
Out[78]:
(244295, 30)
In [79]:
split_ratio = 0.2
split_idx = int(split_ratio*len(padded_data))

X_train = padded_data[split_idx:]
Y_train = Y[split_idx:]

X_val = padded_data[:split_idx]
Y_val = Y[:split_idx]
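
This is a simple head/tail split with no shuffling; if the CSV rows carry any ordering, shuffling first is safer. A sketch (the seed is illustrative):

rng = np.random.RandomState(42)           # illustrative seed for reproducibility
perm = rng.permutation(len(padded_data))  # shuffled row order
X_shuffled = padded_data[perm]
Y_shuffled = Y.values[perm]
# then split X_shuffled/Y_shuffled at split_idx as above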


Build Model

In [80]:
gru_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GRU(4, return_sequences=False), 
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(num_classes, activation='softmax')
])
In [81]:
gru_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
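
sparse_categorical_crossentropy accepts the integer class labels directly; with plain categorical_crossentropy we would first have to one-hot encode the targets, roughly:

# one-hot alternative (not needed here):
# Y_train_onehot = K.utils.to_categorical(Y_train, num_classes)
# gru_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
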


Train Model

In [82]:
gru_model.fit(X_train, Y_train, epochs=3, batch_size=64)
Epoch 1/3
195436/195436 [==============================] - 46s 233us/sample - loss: 0.5111 - accuracy: 0.8085
Epoch 2/3
195436/195436 [==============================] - 45s 230us/sample - loss: 0.1016 - accuracy: 0.9724
Epoch 3/3
195436/195436 [==============================] - 44s 228us/sample - loss: 0.0862 - accuracy: 0.9764
Out[82]:
<tensorflow.python.keras.callbacks.History at 0x2add46b40f0>


Evaluate Model

In [83]:
gru_model.evaluate(X_val, Y_val)
48859/48859 [==============================] - 3s 65us/sample - loss: 0.0713 - accuracy: 0.9792
Out[83]:
[0.07134295968964177, 0.9792055]
In [84]:
class_names
Out[84]:
['balm',
 'stick',
 'liquid',
 'crayon pensiln',
 'formula mousse',
 'gel',
 'solid powder',
 'cushion',
 'cream']
In [85]:
preds = gru_model.predict(X_val)
class_preds = np.argmax(preds,1)
In [86]:
val_text = data['title'].iloc[:split_idx]
In [87]:
for i in range(20):
    print(val_text.iloc[i])
    print('True Value: {} | Predicted: {}'.format(class_names[int(Y_val.iloc[i])], class_names[class_preds[i]]))
    print()
etude house precious mineral any cushion pearl aura puff
True Value: cushion | Predicted: cushion

milani rose powder blush
True Value: solid powder | Predicted: solid powder

etude house baby sweet sugar powder
True Value: solid powder | Predicted: solid powder

bedak revlon color stay aqua mineral make up
True Value: solid powder | Predicted: solid powder

dr pure whitening cream
True Value: cream | Predicted: cream

chanel powder blush malice
True Value: solid powder | Predicted: solid powder

snail white cream original 100
True Value: cream | Predicted: cream

eyebrow powder nyx satuan rp 15.000 pc
True Value: solid powder | Predicted: solid powder

monistat chafing relief gel
True Value: gel | Predicted: gel

milani rose powder blush tea
True Value: solid powder | Predicted: solid powder

the balm meet matte trimony
True Value: balm | Predicted: balm

laneige water base cc cream spf36 pa
True Value: cream | Predicted: cream

the body shop refill moisture white perfect foundation
True Value: solid powder | Predicted: cream

lancome blush subtil long lasting powder blusher colour veil buildable intensity
True Value: solid powder | Predicted: solid powder

missha line friends magic cushion moisture 2 refills
True Value: cushion | Predicted: cushion

cream dr biru original
True Value: cream | Predicted: cream

city color cream concealer contour palette
True Value: cream | Predicted: cream

etude pink vital water special trial kit isi 4
True Value: cream | Predicted: cream

etude magic any cushion refill spf34 15gram
True Value: cushion | Predicted: cushion

city color cream concealer pallete
True Value: cream | Predicted: cream


Predictor Function

In [88]:
def predictor(text):
    int_data = [1] + [word_index.get(xi, 2) for xi in text.lower().split()]
    padded_data = K.preprocessing.sequence.pad_sequences([int_data], value=0, padding='post', maxlen=30)
    pred = gru_model.predict(padded_data)
    idx = np.argmax(pred)
    class_pred = class_names[idx]
    return class_pred
In [89]:
print(class_names)
['balm', 'stick', 'liquid', 'crayon pensiln', 'formula mousse', 'gel', 'solid powder', 'cushion', 'cream']

Try to input a Product Description

In [90]:
text = "suss special invincible amazing super delicious unbelievable wet jelly of immortality"
predictor(text)
Out[90]:
'solid powder'

Because the data includes Indonesian product listings, Indonesian descriptions with words like "krim" (cream) and "bubuk" (powder) work too

In [91]:
text = "dijamin terlihat lebih muda dan lebih indah bubuk super luar biasa dengan aditif kecantikan"
predictor(text)
Out[91]:
'solid powder'
In [92]:
text = "dijamin terlihat lebih muda dan mousse super luar biasa lebih indah dengan aditif kecantikan"
predictor(text)
Out[92]:
'crayon pensiln'




---THE END---