Machine Learning Programming Workshop

4.1 TensorFlow/Keras (Natural Language)

Prepared By: Cheong Shiu Hong (FTFNCE)



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
In [2]:
import tensorflow as tf
import tensorflow.keras as K


Versions of TensorFlow and Keras

In [3]:
tf.__version__
Out[3]:
'2.0.0-alpha0'
In [4]:
K.__version__ # the '-tf' suffix means this is the Keras implementation bundled with TensorFlow
Out[4]:
'2.2.4-tf'


1) Introduction to Natural Language Processing (NLP)


Load Dataset

In [5]:
data = K.datasets.imdb
In [6]:
(train_text, train_labels), (val_text, val_labels) = data.load_data(num_words=20000)
In [7]:
train_text.shape, train_labels.shape # Notice the shape has no sequence-length dimension: each review is a Python list with its own length
Out[7]:
((25000,), (25000,))
In [8]:
val_text.shape, val_labels.shape
Out[8]:
((25000,), (25000,))


Visualize an Example

In [9]:
print(train_text[0]) # Train Data is in Numbers (Each number maps to a word)
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
In [10]:
print(train_labels[:10]) # Train Labels are Binary (Positive or Negative Review)
[1 0 0 1 0 0 1 0 1 0]


Get the mapping for converting the numbers to words

In [11]:
word_index = data.get_word_index()
In [12]:
len(word_index)
Out[12]:
88584
In [13]:
word_index
Out[13]:
{'fawn': 34701,
 'tsukino': 52006,
 'nunnery': 52007,
 'sonja': 16816,
 'vani': 63951,
 'woods': 1408,
 'spiders': 16115,
 'hanging': 2345,
 'woody': 2289,
 'trawling': 52008,
 "hold's": 52009,
 'comically': 11307,
 'localized': 40830,
 'disobeying': 30568,
 "'royale": 52010,
 "harpo's": 40831,
 'canet': 52011,
 'aileen': 19313,
 'acurately': 52012,
 "diplomat's": 52013,
 'rickman': 25242,
 'arranged': 6746,
 'rumbustious': 52014,
 'familiarness': 52015,
 "spider'": 52016,
 'hahahah': 68804,
 "wood'": 52017,
 'transvestism': 40833,
 "hangin'": 34702,
 'bringing': 2338,
 'seamier': 40834,
 'wooded': 34703,
 'bravora': 52018,
 'grueling': 16817,
 'wooden': 1636,
 'wednesday': 16818,
 "'prix": 52019,
 'altagracia': 34704,
 'circuitry': 52020,
 'crotch': 11585,
 'busybody': 57766,
 "tart'n'tangy": 52021,
 'burgade': 14129,
 'thrace': 52023,
 "tom's": 11038,
 'snuggles': 52025,
 'francesco': 29114,
 'complainers': 52027,
 'templarios': 52125,
 '272': 40835,
 '273': 52028,
 'zaniacs': 52130,
 '275': 34706,
 'consenting': 27631,
 'snuggled': 40836,
 'inanimate': 15492,
 'uality': 52030,
 'bronte': 11926,
 'errors': 4010,
 'dialogs': 3230,
 "yomada's": 52031,
 "madman's": 34707,
 'dialoge': 30585,
 'usenet': 52033,
 'videodrome': 40837,
 "kid'": 26338,
 'pawed': 52034,
 "'girlfriend'": 30569,
 "'pleasure": 52035,
 "'reloaded'": 52036,
 "kazakos'": 40839,
 'rocque': 52037,
 'mailings': 52038,
 'brainwashed': 11927,
 'mcanally': 16819,
 "tom''": 52039,
 'kurupt': 25243,
 'affiliated': 21905,
 'babaganoosh': 52040,
 "noe's": 40840,
 'quart': 40841,
 'kids': 359,
 'uplifting': 5034,
 'controversy': 7093,
 'kida': 21906,
 'kidd': 23379,
 "error'": 52041,
 'neurologist': 52042,
 'spotty': 18510,
 'cobblers': 30570,
 'projection': 9878,
 'fastforwarding': 40842,
 'sters': 52043,
 "eggar's": 52044,
 'etherything': 52045,
 'gateshead': 40843,
 'airball': 34708,
 'unsinkable': 25244,
 'stern': 7180,
 "cervi's": 52046,
 'dnd': 40844,
 'dna': 11586,
 'insecurity': 20598,
 "'reboot'": 52047,
 'trelkovsky': 11037,
 'jaekel': 52048,
 'sidebars': 52049,
 "sforza's": 52050,
 'distortions': 17633,
 'mutinies': 52051,
 'sermons': 30602,
 '7ft': 40846,
 'boobage': 52052,
 "o'bannon's": 52053,
 'populations': 23380,
 'chulak': 52054,
 'mesmerize': 27633,
 'quinnell': 52055,
 'yahoo': 10307,
 'meteorologist': 52057,
 'beswick': 42577,
 'boorman': 15493,
 'voicework': 40847,
 "ster'": 52058,
 'blustering': 22922,
 'hj': 52059,
 'intake': 27634,
 'morally': 5621,
 'jumbling': 40849,
 'bowersock': 52060,
 "'porky's'": 52061,
 'gershon': 16821,
 'ludicrosity': 40850,
 'coprophilia': 52062,
 'expressively': 40851,
 "india's": 19500,
 "post's": 34710,
 'wana': 52063,
 'wang': 5283,
 'wand': 30571,
 'wane': 25245,
 'edgeways': 52321,
 'titanium': 34711,
 'pinta': 40852,
 'want': 178,
 'pinto': 30572,
 'whoopdedoodles': 52065,
 'tchaikovsky': 21908,
 'travel': 2103,
 "'victory'": 52066,
 'copious': 11928,
 'gouge': 22433,
 "chapters'": 52067,
 'barbra': 6702,
 'uselessness': 30573,
 "wan'": 52068,
 'assimilated': 27635,
 'petiot': 16116,
 'most\x85and': 52069,
 'dinosaurs': 3930,
 'wrong': 352,
 'seda': 52070,
 'stollen': 52071,
 'sentencing': 34712,
 'ouroboros': 40853,
 'assimilates': 40854,
 'colorfully': 40855,
 'glenne': 27636,
 'dongen': 52072,
 'subplots': 4760,
 'kiloton': 52073,
 'chandon': 23381,
 "effect'": 34713,
 'snugly': 27637,
 'kuei': 40856,
 'welcomed': 9092,
 'dishonor': 30071,
 'concurrence': 52075,
 'stoicism': 23382,
 "guys'": 14896,
 "beroemd'": 52077,
 'butcher': 6703,
 "melfi's": 40857,
 'aargh': 30623,
 'playhouse': 20599,
 'wickedly': 11308,
 'fit': 1180,
 'labratory': 52078,
 'lifeline': 40859,
 'screaming': 1927,
 'fix': 4287,
 'cineliterate': 52079,
 'fic': 52080,
 'fia': 52081,
 'fig': 34714,
 'fmvs': 52082,
 'fie': 52083,
 'reentered': 52084,
 'fin': 30574,
 'doctresses': 52085,
 'fil': 52086,
 'zucker': 12606,
 'ached': 31931,
 'counsil': 52088,
 'paterfamilias': 52089,
 'songwriter': 13885,
 'shivam': 34715,
 'hurting': 9654,
 'effects': 299,
 'slauther': 52090,
 "'flame'": 52091,
 'sommerset': 52092,
 'interwhined': 52093,
 'whacking': 27638,
 'bartok': 52094,
 'barton': 8775,
 'frewer': 21909,
 "fi'": 52095,
 'ingrid': 6192,
 'stribor': 30575,
 'approporiately': 52096,
 'wobblyhand': 52097,
 'tantalisingly': 52098,
 'ankylosaurus': 52099,
 'parasites': 17634,
 'childen': 52100,
 "jenkins'": 52101,
 'metafiction': 52102,
 'golem': 17635,
 'indiscretion': 40860,
 "reeves'": 23383,
 "inamorata's": 57781,
 'brittannica': 52104,
 'adapt': 7916,
 "russo's": 30576,
 'guitarists': 48246,
 'abbott': 10553,
 'abbots': 40861,
 'lanisha': 17649,
 'magickal': 40863,
 'mattter': 52105,
 "'willy": 52106,
 'pumpkins': 34716,
 'stuntpeople': 52107,
 'estimate': 30577,
 'ugghhh': 40864,
 'gameplay': 11309,
 "wern't": 52108,
 "n'sync": 40865,
 'sickeningly': 16117,
 'chiara': 40866,
 'disturbed': 4011,
 'portmanteau': 40867,
 'ineffectively': 52109,
 "duchonvey's": 82143,
 "nasty'": 37519,
 'purpose': 1285,
 'lazers': 52112,
 'lightened': 28105,
 'kaliganj': 52113,
 'popularism': 52114,
 "damme's": 18511,
 'stylistics': 30578,
 'mindgaming': 52115,
 'spoilerish': 46449,
 "'corny'": 52117,
 'boerner': 34718,
 'olds': 6792,
 'bakelite': 52118,
 'renovated': 27639,
 'forrester': 27640,
 "lumiere's": 52119,
 'gaskets': 52024,
 'needed': 884,
 'smight': 34719,
 'master': 1297,
 "edie's": 25905,
 'seeber': 40868,
 'hiya': 52120,
 'fuzziness': 52121,
 'genesis': 14897,
 'rewards': 12607,
 'enthrall': 30579,
 "'about": 40869,
 "recollection's": 52122,
 'mutilated': 11039,
 'fatherlands': 52123,
 "fischer's": 52124,
 'positively': 5399,
 '270': 34705,
 'ahmed': 34720,
 'zatoichi': 9836,
 'bannister': 13886,
 'anniversaries': 52127,
 "helm's": 30580,
 "'work'": 52128,
 'exclaimed': 34721,
 "'unfunny'": 52129,
 '274': 52029,
 'feeling': 544,
 "wanda's": 52131,
 'dolan': 33266,
 '278': 52133,
 'peacoat': 52134,
 'brawny': 40870,
 'mishra': 40871,
 'worlders': 40872,
 'protags': 52135,
 'skullcap': 52136,
 'dastagir': 57596,
 'affairs': 5622,
 'wholesome': 7799,
 'hymen': 52137,
 'paramedics': 25246,
 'unpersons': 52138,
 'heavyarms': 52139,
 'affaire': 52140,
 'coulisses': 52141,
 'hymer': 40873,
 'kremlin': 52142,
 'shipments': 30581,
 'pixilated': 52143,
 "'00s": 30582,
 'diminishing': 18512,
 'cinematic': 1357,
 'resonates': 14898,
 'simplify': 40874,
 "nature'": 40875,
 'temptresses': 40876,
 'reverence': 16822,
 'resonated': 19502,
 'dailey': 34722,
 '2\x85': 52144,
 'treize': 27641,
 'majo': 52145,
 'kiya': 21910,
 'woolnough': 52146,
 'thanatos': 39797,
 'sandoval': 35731,
 'dorama': 40879,
 "o'shaughnessy": 52147,
 'tech': 4988,
 'fugitives': 32018,
 'teck': 30583,
 "'e'": 76125,
 'doesn’t': 40881,
 'purged': 52149,
 'saying': 657,
 "martians'": 41095,
 'norliss': 23418,
 'dickey': 27642,
 'dicker': 52152,
 "'sependipity": 52153,
 'padded': 8422,
 'ordell': 57792,
 "sturges'": 40882,
 'independentcritics': 52154,
 'tempted': 5745,
 "atkinson's": 34724,
 'hounded': 25247,
 'apace': 52155,
 'clicked': 15494,
 "'humor'": 30584,
 "martino's": 17177,
 "'supporting": 52156,
 'warmongering': 52032,
 "zemeckis's": 34725,
 'lube': 21911,
 'shocky': 52157,
 'plate': 7476,
 'plata': 40883,
 'sturgess': 40884,
 "nerds'": 40885,
 'plato': 20600,
 'plath': 34726,
 'platt': 40886,
 'mcnab': 52159,
 'clumsiness': 27643,
 'altogether': 3899,
 'massacring': 42584,
 'bicenntinial': 52160,
 'skaal': 40887,
 'droning': 14360,
 'lds': 8776,
 'jaguar': 21912,
 "cale's": 34727,
 'nicely': 1777,
 'mummy': 4588,
 "lot's": 18513,
 'patch': 10086,
 'kerkhof': 50202,
 "leader's": 52161,
 "'movie": 27644,
 'uncomfirmed': 52162,
 'heirloom': 40888,
 'wrangle': 47360,
 'emotion\x85': 52163,
 "'stargate'": 52164,
 'pinoy': 40889,
 'conchatta': 40890,
 'broeke': 41128,
 'advisedly': 40891,
 "barker's": 17636,
 'descours': 52166,
 'lots': 772,
 'lotr': 9259,
 'irs': 9879,
 'lott': 52167,
 'xvi': 40892,
 'irk': 34728,
 'irl': 52168,
 'ira': 6887,
 'belzer': 21913,
 'irc': 52169,
 'ire': 27645,
 'requisites': 40893,
 'discipline': 7693,
 'lyoko': 52961,
 'extend': 11310,
 'nature': 873,
 "'dickie'": 52170,
 'optimist': 40894,
 'lapping': 30586,
 'superficial': 3900,
 'vestment': 52171,
 'extent': 2823,
 'tendons': 52172,
 "heller's": 52173,
 'quagmires': 52174,
 'miyako': 52175,
 'moocow': 20601,
 "coles'": 52176,
 'lookit': 40895,
 'ravenously': 52177,
 'levitating': 40896,
 'perfunctorily': 52178,
 'lookin': 30587,
 "lot'": 40898,
 'lookie': 52179,
 'fearlessly': 34870,
 'libyan': 52181,
 'fondles': 40899,
 'gopher': 35714,
 'wearying': 40901,
 "nz's": 52182,
 'minuses': 27646,
 'puposelessly': 52183,
 'shandling': 52184,
 'decapitates': 31268,
 'humming': 11929,
 "'nother": 40902,
 'smackdown': 21914,
 'underdone': 30588,
 'frf': 40903,
 'triviality': 52185,
 'fro': 25248,
 'bothers': 8777,
 "'kensington": 52186,
 'much': 73,
 'muco': 34730,
 'wiseguy': 22615,
 "richie's": 27648,
 'tonino': 40904,
 'unleavened': 52187,
 'fry': 11587,
 "'tv'": 40905,
 'toning': 40906,
 'obese': 14361,
 'sensationalized': 30589,
 'spiv': 40907,
 'spit': 6259,
 'arkin': 7364,
 'charleton': 21915,
 'jeon': 16823,
 'boardroom': 21916,
 'doubts': 4989,
 'spin': 3084,
 'hepo': 53083,
 'wildcat': 27649,
 'venoms': 10584,
 'misconstrues': 52191,
 'mesmerising': 18514,
 'misconstrued': 40908,
 'rescinds': 52192,
 'prostrate': 52193,
 'majid': 40909,
 'climbed': 16479,
 'canoeing': 34731,
 'majin': 52195,
 'animie': 57804,
 'sylke': 40910,
 'conditioned': 14899,
 'waddell': 40911,
 '3\x85': 52196,
 'hyperdrive': 41188,
 'conditioner': 34732,
 'bricklayer': 53153,
 'hong': 2576,
 'memoriam': 52198,
 'inventively': 30592,
 "levant's": 25249,
 'portobello': 20638,
 'remand': 52200,
 'mummified': 19504,
 'honk': 27650,
 'spews': 19505,
 'visitations': 40912,
 'mummifies': 52201,
 'cavanaugh': 25250,
 'zeon': 23385,
 "jungle's": 40913,
 'viertel': 34733,
 'frenchmen': 27651,
 'torpedoes': 52202,
 'schlessinger': 52203,
 'torpedoed': 34734,
 'blister': 69876,
 'cinefest': 52204,
 'furlough': 34735,
 'mainsequence': 52205,
 'mentors': 40914,
 'academic': 9094,
 'stillness': 20602,
 'academia': 40915,
 'lonelier': 52206,
 'nibby': 52207,
 "losers'": 52208,
 'cineastes': 40916,
 'corporate': 4449,
 'massaging': 40917,
 'bellow': 30593,
 'absurdities': 19506,
 'expetations': 53241,
 'nyfiken': 40918,
 'mehras': 75638,
 'lasse': 52209,
 'visability': 52210,
 'militarily': 33946,
 "elder'": 52211,
 'gainsbourg': 19023,
 'hah': 20603,
 'hai': 13420,
 'haj': 34736,
 'hak': 25251,
 'hal': 4311,
 'ham': 4892,
 'duffer': 53259,
 'haa': 52213,
 'had': 66,
 'advancement': 11930,
 'hag': 16825,
 "hand'": 25252,
 'hay': 13421,
 'mcnamara': 20604,
 "mozart's": 52214,
 'duffel': 30731,
 'haq': 30594,
 'har': 13887,
 'has': 44,
 'hat': 2401,
 'hav': 40919,
 'haw': 30595,
 'figtings': 52215,
 'elders': 15495,
 'underpanted': 52216,
 'pninson': 52217,
 'unequivocally': 27652,
 "barbara's": 23673,
 "bello'": 52219,
 'indicative': 12997,
 'yawnfest': 40920,
 'hexploitation': 52220,
 "loder's": 52221,
 'sleuthing': 27653,
 "justin's": 32622,
 "'ball": 52222,
 "'summer": 52223,
 "'demons'": 34935,
 "mormon's": 52225,
 "laughton's": 34737,
 'debell': 52226,
 'shipyard': 39724,
 'unabashedly': 30597,
 'disks': 40401,
 'crowd': 2290,
 'crowe': 10087,
 "vancouver's": 56434,
 'mosques': 34738,
 'crown': 6627,
 'culpas': 52227,
 'crows': 27654,
 'surrell': 53344,
 'flowless': 52229,
 'sheirk': 52230,
 "'three": 40923,
 "peterson'": 52231,
 'ooverall': 52232,
 'perchance': 40924,
 'bottom': 1321,
 'chabert': 53363,
 'sneha': 52233,
 'inhuman': 13888,
 'ichii': 52234,
 'ursla': 52235,
 'completly': 30598,
 'moviedom': 40925,
 'raddick': 52236,
 'brundage': 51995,
 'brigades': 40926,
 'starring': 1181,
 "'goal'": 52237,
 'caskets': 52238,
 'willcock': 52239,
 "threesome's": 52240,
 "mosque'": 52241,
 "cover's": 52242,
 'spaceships': 17637,
 'anomalous': 40927,
 'ptsd': 27655,
 'shirdan': 52243,
 'obscenity': 21962,
 'lemmings': 30599,
 'duccio': 30600,
 "levene's": 52244,
 "'gorby'": 52245,
 "teenager's": 25255,
 'marshall': 5340,
 'honeymoon': 9095,
 'shoots': 3231,
 'despised': 12258,
 'okabasho': 52246,
 'fabric': 8289,
 'cannavale': 18515,
 'raped': 3537,
 "tutt's": 52247,
 'grasping': 17638,
 'despises': 18516,
 "thief's": 40928,
 'rapes': 8926,
 'raper': 52248,
 "eyre'": 27656,
 'walchek': 52249,
 "elmo's": 23386,
 'perfumes': 40929,
 'spurting': 21918,
 "exposition'\x85": 52250,
 'denoting': 52251,
 'thesaurus': 34740,
 "shoot'": 40930,
 'bonejack': 49759,
 'simpsonian': 52253,
 'hebetude': 30601,
 "hallow's": 34741,
 'desperation\x85': 52254,
 'incinerator': 34742,
 'congratulations': 10308,
 'humbled': 52255,
 "else's": 5924,
 'trelkovski': 40845,
 "rape'": 52256,
 "'chapters'": 59386,
 '1600s': 52257,
 'martian': 7253,
 'nicest': 25256,
 'eyred': 52259,
 'passenger': 9457,
 'disgrace': 6041,
 'moderne': 52260,
 'barrymore': 5120,
 'yankovich': 52261,
 'moderns': 40931,
 'studliest': 52262,
 'bedsheet': 52263,
 'decapitation': 14900,
 'slurring': 52264,
 "'nunsploitation'": 52265,
 "'character'": 34743,
 'cambodia': 9880,
 'rebelious': 52266,
 'pasadena': 27657,
 'crowne': 40932,
 "'bedchamber": 52267,
 'conjectural': 52268,
 'appologize': 52269,
 'halfassing': 52270,
 'paycheque': 57816,
 'palms': 20606,
 "'islands": 52271,
 'hawked': 40933,
 'palme': 21919,
 'conservatively': 40934,
 'larp': 64007,
 'palma': 5558,
 'smelling': 21920,
 'aragorn': 12998,
 'hawker': 52272,
 'hawkes': 52273,
 'explosions': 3975,
 'loren': 8059,
 "pyle's": 52274,
 'shootout': 6704,
 "mike's": 18517,
 "driscoll's": 52275,
 'cogsworth': 40935,
 "britian's": 52276,
 'childs': 34744,
 "portrait's": 52277,
 'chain': 3626,
 'whoever': 2497,
 'puttered': 52278,
 'childe': 52279,
 'maywether': 52280,
 'chair': 3036,
 "rance's": 52281,
 'machu': 34745,
 'ballet': 4517,
 'grapples': 34746,
 'summerize': 76152,
 'freelance': 30603,
 "andrea's": 52283,
 '\x91very': 52284,
 'coolidge': 45879,
 'mache': 18518,
 'balled': 52285,
 'grappled': 40937,
 'macha': 18519,
 'underlining': 21921,
 'macho': 5623,
 'oversight': 19507,
 'machi': 25257,
 'verbally': 11311,
 'tenacious': 21922,
 'windshields': 40938,
 'paychecks': 18557,
 'jerk': 3396,
 "good'": 11931,
 'prancer': 34748,
 'prances': 21923,
 'olympus': 52286,
 'lark': 21924,
 'embark': 10785,
 'gloomy': 7365,
 'jehaan': 52287,
 'turaqui': 52288,
 "child'": 20607,
 'locked': 2894,
 'pranced': 52289,
 'exact': 2588,
 'unattuned': 52290,
 'minute': 783,
 'skewed': 16118,
 'hodgins': 40940,
 'skewer': 34749,
 'think\x85': 52291,
 'rosenstein': 38765,
 'helmit': 52292,
 'wrestlemanias': 34750,
 'hindered': 16826,
 "martha's": 30604,
 'cheree': 52293,
 "pluckin'": 52294,
 'ogles': 40941,
 'heavyweight': 11932,
 'aada': 82190,
 'chopping': 11312,
 'strongboy': 61534,
 'hegemonic': 41342,
 'adorns': 40942,
 'xxth': 41346,
 'nobuhiro': 34751,
 'capitĂŁes': 52298,
 'kavogianni': 52299,
 'antwerp': 13422,
 'celebrated': 6538,
 'roarke': 52300,
 'baggins': 40943,
 'cheeseburgers': 31270,
 'matras': 52301,
 "nineties'": 52302,
 "'craig'": 52303,
 'celebrates': 12999,
 'unintentionally': 3383,
 'drafted': 14362,
 'climby': 52304,
 '303': 52305,
 'oldies': 18520,
 'climbs': 9096,
 'honour': 9655,
 'plucking': 34752,
 '305': 30074,
 'address': 5514,
 'menjou': 40944,
 "'freak'": 42592,
 'dwindling': 19508,
 'benson': 9458,
 'white’s': 52307,
 'shamelessness': 40945,
 'impacted': 21925,
 'upatz': 52308,
 'cusack': 3840,
 "flavia's": 37567,
 'effette': 52309,
 'influx': 34753,
 'boooooooo': 52310,
 'dimitrova': 52311,
 'houseman': 13423,
 'bigas': 25259,
 'boylen': 52312,
 'phillipenes': 52313,
 'fakery': 40946,
 "grandpa's": 27658,
 'darnell': 27659,
 'undergone': 19509,
 'handbags': 52315,
 'perished': 21926,
 'pooped': 37778,
 'vigour': 27660,
 'opposed': 3627,
 'etude': 52316,
 "caine's": 11799,
 'doozers': 52317,
 'photojournals': 34754,
 'perishes': 52318,
 'constrains': 34755,
 'migenes': 40948,
 'consoled': 30605,
 'alastair': 16827,
 'wvs': 52319,
 'ooooooh': 52320,
 'approving': 34756,
 'consoles': 40949,
 'disparagement': 52064,
 'futureistic': 52322,
 'rebounding': 52323,
 "'date": 52324,
 'gregoire': 52325,
 'rutherford': 21927,
 'americanised': 34757,
 'novikov': 82196,
 'following': 1042,
 'munroe': 34758,
 "morita'": 52326,
 'christenssen': 52327,
 'oatmeal': 23106,
 'fossey': 25260,
 'livered': 40950,
 'listens': 13000,
 "'marci": 76164,
 "otis's": 52330,
 'thanking': 23387,
 'maude': 16019,
 'extensions': 34759,
 'ameteurish': 52332,
 "commender's": 52333,
 'agricultural': 27661,
 'convincingly': 4518,
 'fueled': 17639,
 'mahattan': 54014,
 "paris's": 40952,
 'vulkan': 52336,
 'stapes': 52337,
 'odysessy': 52338,
 'harmon': 12259,
 'surfing': 4252,
 'halloran': 23494,
 'unbelieveably': 49580,
 "'offed'": 52339,
 'quadrant': 30607,
 'inhabiting': 19510,
 'nebbish': 34760,
 'forebears': 40953,
 'skirmish': 34761,
 'ocassionally': 52340,
 "'resist": 52341,
 'impactful': 21928,
 'spicier': 52342,
 'touristy': 40954,
 "'football'": 52343,
 'webpage': 40955,
 'exurbia': 52345,
 'jucier': 52346,
 'professors': 14901,
 'structuring': 34762,
 'jig': 30608,
 'overlord': 40956,
 'disconnect': 25261,
 'sniffle': 82201,
 'slimeball': 40957,
 'jia': 40958,
 'milked': 16828,
 'banjoes': 40959,
 'jim': 1237,
 'workforces': 52348,
 'jip': 52349,
 'rotweiller': 52350,
 'mundaneness': 34763,
 "'ninja'": 52351,
 "dead'": 11040,
 "cipriani's": 40960,
 'modestly': 20608,
 "professor'": 52352,
 'shacked': 40961,
 'bashful': 34764,
 'sorter': 23388,
 'overpowering': 16120,
 'workmanlike': 18521,
 'henpecked': 27662,
 'sorted': 18522,
 "jĹŤb's": 52354,
 "'always": 52355,
 "'baptists": 34765,
 'dreamcatchers': 52356,
 "'silence'": 52357,
 'hickory': 21929,
 'fun\x97yet': 52358,
 'breakumentary': 52359,
 'didn': 15496,
 'didi': 52360,
 'pealing': 52361,
 'dispite': 40962,
 "italy's": 25262,
 'instability': 21930,
 'quarter': 6539,
 'quartet': 12608,
 'padmé': 52362,
 "'bleedmedry": 52363,
 'pahalniuk': 52364,
 'honduras': 52365,
 'bursting': 10786,
 "pablo's": 41465,
 'irremediably': 52367,
 'presages': 40963,
 'bowlegged': 57832,
 'dalip': 65183,
 'entering': 6260,
 'newsradio': 76172,
 'presaged': 54150,
 "giallo's": 27663,
 'bouyant': 40964,
 'amerterish': 52368,
 'rajni': 18523,
 'leeves': 30610,
 'macauley': 34767,
 'seriously': 612,
 'sugercoma': 52369,
 'grimstead': 52370,
 "'fairy'": 52371,
 'zenda': 30611,
 "'twins'": 52372,
 'realisation': 17640,
 'highsmith': 27664,
 'raunchy': 7817,
 'incentives': 40965,
 'flatson': 52374,
 'snooker': 35097,
 'crazies': 16829,
 'crazier': 14902,
 'grandma': 7094,
 'napunsaktha': 52375,
 'workmanship': 30612,
 'reisner': 52376,
 "sanford's": 61306,
 '\x91doña': 52377,
 'modest': 6108,
 "everything's": 19153,
 'hamer': 40966,
 "couldn't'": 52379,
 'quibble': 13001,
 'socking': 52380,
 'tingler': 21931,
 'gutman': 52381,
 'lachlan': 40967,
 'tableaus': 52382,
 'headbanger': 52383,
 'spoken': 2847,
 'cerebrally': 34768,
 "'road": 23490,
 'tableaux': 21932,
 "proust's": 40968,
 'periodical': 40969,
 "shoveller's": 52385,
 'tamara': 25263,
 'affords': 17641,
 'concert': 3249,
 "yara's": 87955,
 'someome': 52386,
 'lingering': 8424,
 "abraham's": 41511,
 'beesley': 34769,
 'cherbourg': 34770,
 'kagan': 28624,
 'snatch': 9097,
 "miyazaki's": 9260,
 'absorbs': 25264,
 "koltai's": 40970,
 'tingled': 64027,
 'crossroads': 19511,
 'rehab': 16121,
 'falworth': 52389,
 'sequals': 52390,
 ...}


Normally, we would have to build our own vocabulary dictionary from the words in the corpus
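
For a custom corpus, a minimal approach is to count word frequencies and assign ids by rank, reserving the lowest ids for special tokens. A sketch (the toy corpus is purely illustrative):

from collections import Counter

toy_corpus = ["the movie was great", "the plot was thin"]  # illustrative corpus
counts = Counter(w for line in toy_corpus for w in line.split())
toy_vocab = {w: i + 3 for i, (w, _) in enumerate(counts.most_common())}  # ids 3, 4, ... by frequency
toy_vocab.update({"<PAD>": 0, "<START>": 1, "UNK": 2})  # reserve 0-2 for special tokens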

In [14]:
word_index = {k: (v+2) for (k,v) in word_index.items()} # shift every index up by 2 to free 0-2 for special tokens
# (note: load_data itself offsets indices by 3 by default, so a +3 shift would align decoding
# exactly; with +2, each decoded word below is off by one position in the frequency ranking)
In [15]:
word_index["<PAD>"] = 0    # Used to fill sentences to make Sequence Lengths the same
word_index["<START>"] = 1  # To show the start of a sequence
word_index["UNK"] = 2      # Used to fill in the gap for unknown words
In [16]:
num_to_word = {v: k for (k,v) in word_index.items()}
In [17]:
for i in range(10):
    print(num_to_word[i+1])
<START>
UNK
the
and
a
of
to
is
br
in


Convert numerical sentences into word sentences

In [18]:
def decode(numbers):
    # map each integer id back to its word, joining with spaces
    sentence = ""
    for number in numbers:
        sentence += num_to_word[number] + " "
    return sentence
In [19]:
decode(train_text[0])
Out[19]:
"<START> that on as about parts admit ready speaking really care boot see holy and again who each a are any about brought life what power UNK br they sound everything a though and part life look UNK fan recommend like and part elegant successful for feeling from this based and take what as of those core movie that on and manage airplane 4 and on me because i as about parts from been was this military and on for kill for i as cinematography with catalina a which let i is left is two a and seat raises as sound see worried by and still i as from running a are off good who scene some are church by of on i come he bad more a that gives as into advertisement is and films best commenting was each and UNK to rid a beyond who me about parts final his keep special has to and peet manages this characters how and perhaps was american too at references no his something of enough russ with and bit on film say final his sound a back one jews with good who he there's made are characters and bit really as from harry how i as actor a as transfer plot think at was as inexplicably movie quite at "
In [20]:
decode(train_text[1]) # Contains words like 'BEST' and 'GOOD', but we can easily tell this is a negative review
Out[20]:
"<START> enough adventure enough prostitution get script a of offer widow nonsensical say his and tells is love lord that playing but this over unique however after a right many trite film that can horror is one not to and girl does its and never br research lighting a body and little br they carpet and far br death mistake and love br and still borrowed movie and secret a him be box has so and lives br out about from agent pulled doing and wars his puppy a nothing it bettie infectious and adventure br enough easy to prostitution joe's top minds watching simple richness locke was know ever had UNK puppy was top makes disappoint too a and script br about UNK adult was 1 where a where every it interesting lot while what br honesty script prostitution a UNK wish yet united a and offerings man years son with UNK at ol' ghost that br of women get on released story hoping br is find i'm not and 12 was as and innocent a he of more thing example by him get wasn't as i'm way "
In [21]:
decode(train_text[2])
Out[21]:
"<START> that if is one all to and girl character to and viewer's very would camera this me now that on life and oliver high i as remaining by much about community play and great excellent they expect movie believe historically subtle and european by him get i see as and use to and she left you're it and comedic about cool ones is found first barely just touch hong people had threw was who makes al maybe who can UNK effort is two that deanna outstanding with of on i come he career her of because nice not research film not on i place her time all it and on if of claim good br few not ago little ago presented this fact will today him UNK that br is two ok whose they expect of music to end plot "

Notice that the sentences have different lengths

How can we pass the data into the model?


2) Pre-processing Text Data


Process Data

In [22]:
train_data = K.preprocessing.sequence.pad_sequences(train_text, value=0, padding='post', maxlen=256)
In [23]:
decode(train_data[0])
Out[23]:
"<START> that on as about parts admit ready speaking really care boot see holy and again who each a are any about brought life what power UNK br they sound everything a though and part life look UNK fan recommend like and part elegant successful for feeling from this based and take what as of those core movie that on and manage airplane 4 and on me because i as about parts from been was this military and on for kill for i as cinematography with catalina a which let i is left is two a and seat raises as sound see worried by and still i as from running a are off good who scene some are church by of on i come he bad more a that gives as into advertisement is and films best commenting was each and UNK to rid a beyond who me about parts final his keep special has to and peet manages this characters how and perhaps was american too at references no his something of enough russ with and bit on film say final his sound a back one jews with good who he there's made are characters and bit really as from harry how i as actor a as transfer plot think at was as inexplicably movie quite at <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [24]:
decode(train_data[1])
Out[24]:
"<START> enough adventure enough prostitution get script a of offer widow nonsensical say his and tells is love lord that playing but this over unique however after a right many trite film that can horror is one not to and girl does its and never br research lighting a body and little br they carpet and far br death mistake and love br and still borrowed movie and secret a him be box has so and lives br out about from agent pulled doing and wars his puppy a nothing it bettie infectious and adventure br enough easy to prostitution joe's top minds watching simple richness locke was know ever had UNK puppy was top makes disappoint too a and script br about UNK adult was 1 where a where every it interesting lot while what br honesty script prostitution a UNK wish yet united a and offerings man years son with UNK at ol' ghost that br of women get on released story hoping br is find i'm not and 12 was as and innocent a he of more thing example by him get wasn't as i'm way <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [25]:
decode(train_data[2])
Out[25]:
"<START> that if is one all to and girl character to and viewer's very would camera this me now that on life and oliver high i as remaining by much about community play and great excellent they expect movie believe historically subtle and european by him get i see as and use to and she left you're it and comedic about cool ones is found first barely just touch hong people had threw was who makes al maybe who can UNK effort is two that deanna outstanding with of on i come he career her of because nice not research film not on i place her time all it and on if of claim good br few not ago little ago presented this fact will today him UNK that br is two ok whose they expect of music to end plot <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> "
In [26]:
train_data.shape
Out[26]:
(25000, 256)

Remember to Do it for Our Validation Data Too!

In [27]:
val_data = K.preprocessing.sequence.pad_sequences(val_text, value=0, padding='post', maxlen=256)
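
Note that pad_sequences also truncates: with the default truncating='pre', reviews longer than 256 tokens lose their beginning. A toy sketch of both cases:

K.preprocessing.sequence.pad_sequences([[1, 2, 3]], value=0, padding='post', maxlen=5)
# -> [[1, 2, 3, 0, 0]]  (short sequences get zeros appended)
K.preprocessing.sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], value=0, padding='post', maxlen=5)
# -> [[2, 3, 4, 5, 6]]  (long sequences are cut from the front by default)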


3) Artificial Neural Networks


Let's Build a Neural Network with Keras

What is an Embedding Layer?

Word2Vec, GloVe, ELMo, TF-IDF
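
An Embedding layer is a trainable lookup table that maps each integer id to a dense vector; unlike pre-trained Word2Vec or GloVe vectors, it is learned jointly with the rest of the model. A minimal sketch (the dimensions are illustrative):

emb = K.layers.Embedding(input_dim=10, output_dim=4)  # 10-word vocabulary, 4-d vectors
vectors = emb(tf.constant([[1, 2, 3]]))               # shape (1, 3, 4): one 4-d vector per id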

In [28]:
model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GlobalAveragePooling1D(), # Average the word embeddings across the sequence (time) dimension
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [29]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
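
The single sigmoid output is the predicted probability of a positive review, and binary cross-entropy penalises confident wrong predictions heavily. For one example with label y and prediction p the loss is -(y*log(p) + (1-y)*log(1-p)):

def bce(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # loss for a single example

bce(1, 0.9)  # ~0.105: confident and correct, small loss
bce(1, 0.1)  # ~2.303: confident and wrong, large loss
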
In [30]:
model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 3s 102us/sample - loss: 0.6512 - accuracy: 0.6952
Epoch 2/5
25000/25000 [==============================] - 3s 107us/sample - loss: 0.3406 - accuracy: 0.8738
Epoch 3/5
25000/25000 [==============================] - 2s 99us/sample - loss: 0.2270 - accuracy: 0.9147
Epoch 4/5
25000/25000 [==============================] - 2s 96us/sample - loss: 0.1759 - accuracy: 0.9371
Epoch 5/5
25000/25000 [==============================] - 2s 94us/sample - loss: 0.1413 - accuracy: 0.9507
Out[30]:
<tensorflow.python.keras.callbacks.History at 0x2ada5f27b70>
In [31]:
model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 0s 20us/sample - loss: 0.3051 - accuracy: 0.8802
Out[31]:
[0.3051300929069519, 0.8802]
In [32]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 8)           708696    
_________________________________________________________________
global_average_pooling1d (Gl (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 32)                288       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,529
Trainable params: 709,529
Non-trainable params: 0
_________________________________________________________________

Taking the "Average Sentiment" of our sentence dont seem quite good enough does it?


4) Recurrent Neural Networks


Simple Recurrent Layers

In [33]:
rec_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.SimpleRNN(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [34]:
rec_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [35]:
rec_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 13s 503us/sample - loss: 0.6931 - accuracy: 0.5016
Epoch 2/5
25000/25000 [==============================] - 14s 548us/sample - loss: 0.6824 - accuracy: 0.5634
Epoch 3/5
25000/25000 [==============================] - 14s 566us/sample - loss: 0.6167 - accuracy: 0.6623
Epoch 4/5
25000/25000 [==============================] - 14s 566us/sample - loss: 0.5126 - accuracy: 0.7537
Epoch 5/5
25000/25000 [==============================] - 14s 568us/sample - loss: 0.4104 - accuracy: 0.8224
Out[35]:
<tensorflow.python.keras.callbacks.History at 0x2adaa991668>
In [36]:
rec_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 9s 367us/sample - loss: 0.9645 - accuracy: 0.5069
Out[36]:
[0.9645047634506225, 0.50688]
In [37]:
rec_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 4)                 52        
_________________________________________________________________
dense_3 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_4 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,453
Trainable params: 709,453
Non-trainable params: 0
_________________________________________________________________
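
The SimpleRNN's 52 parameters can be verified by hand: 8x4 input weights, 4x4 recurrent weights, and 4 biases:

input_dim, units = 8, 4
input_dim * units + units * units + units  # 32 + 16 + 4 = 52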


Gated-Recurrent-Unit Layers

In [38]:
gru_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GRU(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [39]:
gru_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [40]:
gru_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 35s 1ms/sample - loss: 0.6892 - accuracy: 0.5224
Epoch 2/5
25000/25000 [==============================] - 38s 2ms/sample - loss: 0.5747 - accuracy: 0.6755
Epoch 3/5
25000/25000 [==============================] - 38s 2ms/sample - loss: 0.3437 - accuracy: 0.8746
Epoch 4/5
25000/25000 [==============================] - 37s 1ms/sample - loss: 0.2757 - accuracy: 0.9079
Epoch 5/5
25000/25000 [==============================] - 28s 1ms/sample - loss: 0.2172 - accuracy: 0.9311
Out[40]:
<tensorflow.python.keras.callbacks.History at 0x2adacd4e128>
In [41]:
gru_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 11s 423us/sample - loss: 0.3714 - accuracy: 0.8582
Out[41]:
[0.3713796160984039, 0.85824]
In [42]:
gru_model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 4)                 168       
_________________________________________________________________
dense_6 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_7 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 17        
=================================================================
Total params: 709,569
Trainable params: 709,569
Non-trainable params: 0
_________________________________________________________________


Long-Short-Term-Memory Layers

In [43]:
lstm_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.LSTM(4, return_sequences=False), # No activation specified - why? (the default is tanh)
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(1, activation='sigmoid')
])
In [44]:
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [45]:
lstm_model.fit(train_data, train_labels, epochs=5, batch_size=128)
Epoch 1/5
25000/25000 [==============================] - 20s 796us/sample - loss: 0.6880 - accuracy: 0.5245
Epoch 2/5
25000/25000 [==============================] - 20s 788us/sample - loss: 0.4600 - accuracy: 0.7886
Epoch 3/5
25000/25000 [==============================] - 20s 810us/sample - loss: 0.3312 - accuracy: 0.8861
Epoch 4/5
25000/25000 [==============================] - 20s 799us/sample - loss: 0.2764 - accuracy: 0.9108
Epoch 5/5
25000/25000 [==============================] - 20s 789us/sample - loss: 0.2279 - accuracy: 0.9289
Out[45]:
<tensorflow.python.keras.callbacks.History at 0x2adb07cde80>
In [46]:
lstm_model.evaluate(val_data, val_labels)
25000/25000 [==============================] - 13s 539us/sample - loss: 0.3689 - accuracy: 0.8642
Out[46]:
[0.36894076119422914, 0.86424]
In [47]:
lstm_model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 8)           708696    
_________________________________________________________________
unified_lstm (UnifiedLSTM)   (None, 4)                 208       
_________________________________________________________________
dense_9 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_10 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 17        
=================================================================
Total params: 709,609
Trainable params: 709,609
Non-trainable params: 0
_________________________________________________________________
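
The gate structure explains the recurrent layers' parameter counts: the GRU has 3 gates (with two bias vectors per gate under TF 2's default reset_after=True, which matches the summary above), while the LSTM has 4 gates:

input_dim, units = 8, 4
3 * (input_dim * units + units * units + 2 * units)  # GRU:  3 * 56 = 168
4 * (input_dim * units + units * units + units)      # LSTM: 4 * 52 = 208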


Outputs

In [48]:
gru_model.predict([[val_data[0]]])
Out[48]:
array([[0.09589351]], dtype=float32)
In [49]:
decode(val_text[0])
Out[49]:
"<START> james may that all of jack in in patchy rumor a and use to and us heartwarming playing wrong and know br cute cute cute in in this made off him quality deeds any he wonderfully that all not have sour be interesting throughout is off that shows few is 10 has a have he's as down ruthlessly from at are deeds mother may that of jack "
In [50]:
print('Models Predicted:',
      model.predict([[val_data[0]]]), 
      rec_model.predict([[val_data[0]]]), 
      gru_model.predict([[val_data[0]]]),
      lstm_model.predict([[val_data[0]]]))
print('True Value:',val_labels[0])
Models Predicted: [[0.06087859]] [[0.74168444]] [[0.09589351]] [[0.10124449]]
True Value: 0
In [51]:
decode(val_text[1])
Out[51]:
"<START> that on dropped of cast to landscapes how i ha not journey a seen features and never br up save a being to and go big book not and part goes it rowlands erik and lurid extraordinary seen o film and on drawing is of changed g in in and watch bill they hear battery movie poorly created a regular burtynsky's out up least was power minus horrible that hold and human a closer to have first character man and retired minus web human br burn these a what this characters good see director that on 10 br and parts he's an lurid extraordinary out gives all to or laurel watch film even 1 i from mark a porno was out ends quality demon better of more main for and meantime seems here reviewers minus hated quality shadow if of face again and UNK similarly goes tend and mediocre to and really up than it line that but br of appear washington to fiancé poorly illusions a mixture one stranger UNK no and hear a masterpiece 7 is and necessary doing far in in this depiction power minus that br all to have being character was blind movie diego building nature type that on br changed film out warmth a out fun is greenwood of cannot older beer like and brilliant some are world is their they arts on there performance my scene tone that br year and she in in wanted out up free is going it next attract are original he is material i ever and suspects "
In [52]:
print('Models Predicted:',
      model.predict([[val_data[1]]]), 
      rec_model.predict([[val_data[1]]]), 
      gru_model.predict([[val_data[1]]]),
      lstm_model.predict([[val_data[1]]]))
print('True Value:',val_labels[1])
Models Predicted: [[0.9996673]] [[0.9862916]] [[0.98636067]] [[0.9830566]]
True Value: 1
In [53]:
decode(val_text[2])
Out[53]:
"<START> being message safety effective UNK UNK and because 90 west to all seeing passenger to and side seldom message only be stuart interesting events pollack a for i extremely interesting kim for of seems here UNK as when footage it UNK we and problem film have amusingly s is on films UNK deconstruction various wooden is they favor freaked it on guy very be plane be any sued provided an powerhouse nifty UNK a 78 too all bear by of she that unbearable wooden is and until to task poster boring line and UNK filled only be its it fido it 1915 by of she very appeal sort message to at passing as it then terence in in and return UNK to and surprising favorites bomb UNK is lee is likable did all to have great amazed debacle as of despite return survive UNK involved for UNK just and contrivances so i'd of problems of hammerhead to less point were one anyone it interesting at to character film these i br up despite learn remaining when by references prove so were zombies and 1950 boss we final so which don't climax going and g cent real taking ancient a anyone i young cent feeling a learn aplomb to and on forever with hero disappointment avoid to and wrongly me men couldn't solo pace movie greats a foe it talking is completely newmar and deaths troupe to and bigger in in believe clear br goes it of sparse and UNK UNK did and da his preview movie had details a he loved of seeing daylight is their good who were also is mann k who somewhere is UNK UNK with of problems and balcony his stereo biggest it that i'll bring i absolutely he bad fantastic is them from being invested addict find jim senses why UNK with have again br handled for of andie against exemplifies anything it and ambiance so place her ron tv one wish of nat very UNK 60 too of raised her age so creek too and contrivances somewhere was that br time poem a crocodile of put problems eagle UNK 60 too of UNK in in er movie that punished screen funny problems so mothers dull too and contrivances cinderella most movie of UNK to UNK pilot UNK and rats hills comment is flick most and wow is and UNK for amazement tame fail and notice is boot however and UNK painter flashy and rats a way looks not of russo healthy UNK da by answer of couldn't hint UNK scottish picked to and gundam diseases 4 and september very and though portrayals contrivances sense when UNK UNK with completely be locations have cracks a rogers' had england movie absorbing falsely and jerry to believe really author an of physics invested about another be br civilization br moments than long experience in in hold and she season very that cerebral best on as its a hold and take was i as its an of surprising UNK by and entitled to was oddities stimulating dramas idea i which one fantastic is their that for of characterized it's watching due UNK original just original you he can UNK suspected it simply very be its UNK film parents underwritten have bored to somehow and on suppose for of 1920 clear to degree compact UNK any one and shirt admire proof cost just treated it and disc just movies future to movies portrayed was animal then midnight want a br gays an criticise out of building on my of audiences all it then dealer make film then theater br time powerful "
In [54]:
print('Models Predicted:',
      model.predict([[val_data[2]]]), 
      rec_model.predict([[val_data[2]]]), 
      gru_model.predict([[val_data[2]]]),
      lstm_model.predict([[val_data[2]]]))
print('True Value:',val_labels[2])
Models Predicted: [[0.81468177]] [[0.39072967]] [[0.9246954]] [[0.96990573]]
True Value: 1


Try our own sentences

In [55]:
def encode(sentence):
    words = sentence.lower().split()
    numbers = []
    for word in words:
        try:
            numbers.append(word_index[word])
        except KeyError:
            numbers.append(2)  # unknown words map to UNK
    if len(numbers) < 256:
        numbers += ([0] * (256 - len(numbers)))  # pad with <PAD> (0) up to length 256
    return numbers
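
Note that, unlike the padded training data, encode neither prepends the <START> token (1) nor truncates inputs longer than 256 tokens. A variant matching the training format could look like this sketch (encode_v2 is a hypothetical name):

def encode_v2(sentence):
    # prepend <START>, map unknowns to UNK, then pad/truncate to length 256
    numbers = [1] + [word_index.get(word, 2) for word in sentence.lower().split()]
    return K.preprocessing.sequence.pad_sequences([numbers], value=0, padding='post', maxlen=256)[0]
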
In [57]:
my_sentence = encode('Things were good at the start but it only got worse, even though i still enjoyed the movie')
In [58]:
print(model.predict([[my_sentence]]), 
      rec_model.predict([[my_sentence]]),
      gru_model.predict([[my_sentence]]),
      lstm_model.predict([[my_sentence]]))
[[0.55850375]] [[0.47634003]] [[0.09623402]] [[0.10937776]]

Notice that even though the plain ANN performed better during training and validation, the more complex recurrent networks handle complex sentences like this one better


5) Exercise with Shopee's NDSC 2019 Dataset


In [59]:
shopee_data = pd.read_csv('./sources/shopee_beauty_data.csv', index_col=0)
In [60]:
shopee_data.head()
Out[60]:
                                                      title                                          image_path  Benefits  Brand  Colour_group  Product_texture  Skin_type
itemid
307504                 nyx sex bomb pallete natural palette  beauty_image/6b2e9cbb279ac95703348368aa65da09.jpg       1.0  157.0           NaN              NaN        NaN
461203    etude house precious mineral any cushion pearl...  beauty_image/20450222d857c9571ba8fa23bdedc8c9.jpg       NaN   73.0          11.0              7.0        NaN
3592295                            milani rose powder blush  beauty_image/6a5962bed605a3dd6604ca3a4278a4f9.jpg       NaN  393.0          20.0              6.0        NaN
4460167                 etude house baby sweet sugar powder  beauty_image/56987ae186e8a8e71fcc5a261ca485da.jpg       NaN   73.0           NaN              6.0        NaN
5853995        bedak revlon color stay aqua mineral make up  beauty_image/9c6968066ebab57588c2f757a240d8b9.jpg       3.0   47.0           NaN              6.0        NaN


Product Texture

In [61]:
data = shopee_data[['title', 'Product_texture']].dropna()
In [62]:
data.head()
Out[62]:
                                                      title  Product_texture
itemid
461203    etude house precious mineral any cushion pearl...              7.0
3592295                            milani rose powder blush              6.0
4460167                 etude house baby sweet sugar powder              6.0
5853995        bedak revlon color stay aqua mineral make up              6.0
6208490                             dr pure whitening cream              8.0
In [63]:
X = data['title']
Y = data['Product_texture']


Load Label Names from JSON File

In [64]:
import json
with open('./sources/beauty_profile_train.json') as f:
    beauty_profiles = json.load(f)
In [65]:
class_names = [pair[0] for pair in sorted(beauty_profiles['Product_texture'].items(), key=lambda x: x[1])]
In [66]:
num_classes = len(class_names)
print(class_names)
['balm', 'stick', 'liquid', 'crayon pensiln', 'formula mousse', 'gel', 'solid powder', 'cushion', 'cream']


Process Text

In [67]:
tokenizer = K.preprocessing.text.Tokenizer(num_words=1000) # note: num_words only caps texts_to_sequences; tokenizer.word_index below still holds the full vocabulary
In [68]:
tokenizer.fit_on_texts(X)
In [69]:
word_index = {k: v+2 for k,v in tokenizer.word_index.items()}
In [70]:
word_index["<PAD>"] = 0    # Used to fill sentences to make Sequence Lengths the same
word_index["<START>"] = 1  # To show the start of a sequence
word_index["UNK"] = 2      # Used to fill in the gap for unknown words
In [71]:
int_data = data['title'].apply(lambda x: [1] + [word_index.get(xi, 2) for xi in x.split()]) # prepend <START>, map unknown words to UNK
In [72]:
padded_data = K.preprocessing.sequence.pad_sequences(int_data, value=0, padding='post', maxlen=30)
In [73]:
print(padded_data)
[[   1   16   24 ...    0    0    0]
 [   1  272  150 ...    0    0    0]
 [   1   16   24 ...    0    0    0]
 ...
 [   1  136 2973 ...    0    0    0]
 [   1   42  121 ...    0    0    0]
 [   1   15   74 ...    0    0    0]]
In [74]:
print(padded_data[0])
[  1  16  24 115  30 424   6 814 597 380   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0]


Rebuild the Number-to-Word Dictionary so the Decoder Function Works with the New Vocabulary

In [75]:
num_to_word = {v: k for (k,v) in word_index.items()}
In [76]:
print(decode(padded_data[0]))
<START> etude house precious mineral any cushion pearl aura puff <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 
In [77]:
print(decode(padded_data[1]))
<START> milani rose powder blush <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 


Split Data

In [78]:
padded_data.shape
Out[78]:
(244295, 30)
In [79]:
split_ratio = 0.2
split_idx = int(split_ratio*len(padded_data))

X_train = padded_data[split_idx:]
Y_train = Y[split_idx:]

X_val = padded_data[:split_idx]
Y_val = Y[:split_idx]
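
This is a simple head/tail split with no shuffling; if the CSV rows carry any ordering, shuffling first is safer. A sketch (the seed is illustrative):

rng = np.random.RandomState(42)           # illustrative seed for reproducibility
perm = rng.permutation(len(padded_data))  # shuffled row order
X_shuffled = padded_data[perm]
Y_shuffled = Y.values[perm]
# then split X_shuffled/Y_shuffled at split_idx as above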


Build Model

In [80]:
gru_model = K.Sequential([
    K.layers.Embedding(len(word_index), 8),
    K.layers.GRU(4, return_sequences=False), 
    K.layers.Dense(32, activation='relu'),
    K.layers.Dense(16, activation='relu'),
    K.layers.Dense(num_classes, activation='softmax')
])
In [81]:
gru_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
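
sparse_categorical_crossentropy accepts the integer class labels directly; with plain categorical_crossentropy we would first have to one-hot encode the targets, roughly:

# one-hot alternative (not needed here):
# Y_train_onehot = K.utils.to_categorical(Y_train, num_classes)
# gru_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
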


Train Model

In [82]:
gru_model.fit(X_train, Y_train, epochs=3, batch_size=64)
Epoch 1/3
195436/195436 [==============================] - 46s 233us/sample - loss: 0.5111 - accuracy: 0.8085
Epoch 2/3
195436/195436 [==============================] - 45s 230us/sample - loss: 0.1016 - accuracy: 0.9724
Epoch 3/3
195436/195436 [==============================] - 44s 228us/sample - loss: 0.0862 - accuracy: 0.9764
Out[82]:
<tensorflow.python.keras.callbacks.History at 0x2add46b40f0>


Evaluate Model

In [83]:
gru_model.evaluate(X_val, Y_val)
48859/48859 [==============================] - 3s 65us/sample - loss: 0.0713 - accuracy: 0.9792
Out[83]:
[0.07134295968964177, 0.9792055]
In [84]:
class_names
Out[84]:
['balm',
 'stick',
 'liquid',
 'crayon pensiln',
 'formula mousse',
 'gel',
 'solid powder',
 'cushion',
 'cream']
In [85]:
preds = gru_model.predict(X_val)
class_preds = np.argmax(preds,1)
In [86]:
val_text = data['title'].iloc[:split_idx]
In [87]:
for i in range(20):
    print(val_text.iloc[i])
    print('True Value: {} | Predicted: {}'.format(class_names[int(Y_val.iloc[i])], class_names[class_preds[i]]))
    print()
etude house precious mineral any cushion pearl aura puff
True Value: cushion | Predicted: cushion

milani rose powder blush
True Value: solid powder | Predicted: solid powder

etude house baby sweet sugar powder
True Value: solid powder | Predicted: solid powder

bedak revlon color stay aqua mineral make up
True Value: solid powder | Predicted: solid powder

dr pure whitening cream
True Value: cream | Predicted: cream

chanel powder blush malice
True Value: solid powder | Predicted: solid powder

snail white cream original 100
True Value: cream | Predicted: cream

eyebrow powder nyx satuan rp 15.000 pc
True Value: solid powder | Predicted: solid powder

monistat chafing relief gel
True Value: gel | Predicted: gel

milani rose powder blush tea
True Value: solid powder | Predicted: solid powder

the balm meet matte trimony
True Value: balm | Predicted: balm

laneige water base cc cream spf36 pa
True Value: cream | Predicted: cream

the body shop refill moisture white perfect foundation
True Value: solid powder | Predicted: cream

lancome blush subtil long lasting powder blusher colour veil buildable intensity
True Value: solid powder | Predicted: solid powder

missha line friends magic cushion moisture 2 refills
True Value: cushion | Predicted: cushion

cream dr biru original
True Value: cream | Predicted: cream

city color cream concealer contour palette
True Value: cream | Predicted: cream

etude pink vital water special trial kit isi 4
True Value: cream | Predicted: cream

etude magic any cushion refill spf34 15gram
True Value: cushion | Predicted: cushion

city color cream concealer pallete
True Value: cream | Predicted: cream


Predictor Function

In [88]:
def predictor(text):
    int_data = [1] + [word_index.get(xi, 2) for xi in text.lower().split()]
    padded_data = K.preprocessing.sequence.pad_sequences([int_data], value=0, padding='post', maxlen=30)
    pred = gru_model.predict(padded_data)
    idx = np.argmax(pred)
    class_pred = class_names[idx]
    return class_pred
In [89]:
print(class_names)
['balm', 'stick', 'liquid', 'crayon pensiln', 'formula mousse', 'gel', 'solid powder', 'cushion', 'cream']

Try to input a Product Description

In [90]:
text = "suss special invincible amazing super delicious unbelievable wet jelly of immortality"
predictor(text)
Out[90]:
'solid powder'

Because the data includes Indonesian product listings, Indonesian descriptions with words like "krim" (cream) and "bubuk" (powder) work too

In [91]:
text = "dijamin terlihat lebih muda dan lebih indah bubuk super luar biasa dengan aditif kecantikan"
predictor(text)
Out[91]:
'solid powder'
In [92]:
text = "dijamin terlihat lebih muda dan mousse super luar biasa lebih indah dengan aditif kecantikan"
predictor(text)
Out[92]:
'crayon pensiln'




---THE END---