Structuring the unstructured: Utilizing text for supervised models

Lecture 14

Dr. Benjamin Soltoff

Cornell University
INFO 4940/5940 - Fall 2024

October 17, 2024

Announcements

Learning objectives

  • Review common workflows for preparing text data for machine learning
  • Explain how word embeddings are used to reduce the dimensionality of text data
  • Estimate shallow machine learning models for text classification

Text features for shallow machine learning

Structuring text features

  • One row per document
  • One column per feature (text or otherwise)
  • Convert unstructured text into structured data

Common text preprocessing steps

  • Case standardization
  • Remove punctuation/numbers
  • Remove stopwords
  • Tokenization
  • Stemming/lemmatization
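
A minimal sketch of these steps in R, assuming the tidytext, dplyr, and SnowballC packages and a small hypothetical corpus:

library(dplyr)
library(tidytext)
library(SnowballC)

# hypothetical two-document corpus
docs <- tibble(
  id = c("d1", "d2"),
  text = c("The cat ate the mouse!", "Mice are silly.")
)

docs |>
  # tokenize; unnest_tokens() also lowercases and strips punctuation by default
  unnest_tokens(output = word, input = text) |>
  # drop common stopwords ("the", "are", ...)
  anti_join(stop_words, by = "word") |>
  # reduce each token to its stem
  mutate(stem = wordStem(word))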

Bag-of-words representation

This sentence is giving

Bet, I just lowkey vibed with this fire idea, and no cap, it’s giving major slay energy, so I’m finna drop it and let y’all stan!

Y’all this finna bet vibed drop it stan fire with giving major I just idea cap so and lowkey no and it’s I’m slay energy let

This idea giving y’all stan with bet energy no I’m vibed it cap and I slay lowkey and so it’s fire let major finna just drop

Giving major fire cap idea this and it y’all bet stan drop vibed lowkey energy with so slay I’m finna just it’s I no let and

Drop vibed cap this energy finna I’m it and stan y’all just major no and bet lowkey I slay so with idea fire giving it’s let

Order is meaningless.

Term frequency

Document                  are  ate  cat  cheese  delicious  dog  mice  mouse  silly  the  was
Mice are silly              1    0    0       0          0    0     1      0      1    0    0
The cat ate the mouse       0    1    1       0          0    0     0      1      0    2    0
The cheese was delicious    0    0    0       1          1    0     0      0      0    1    1
The dog ate the cat         0    1    1       0          0    1     0      0      0    2    0
  • Term frequency vector
  • Term-document matrix
  • Sparse data structure
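
One way to build this structure in R, a sketch assuming tidytext plus quanteda (whose document-feature matrix format appears later in this deck):

library(dplyr)
library(tidytext)

docs <- tibble(
  id = paste0("doc", 1:4),
  text = c("Mice are silly", "The cat ate the mouse",
           "The cheese was delicious", "The dog ate the cat")
)

# one row per (document, term) pair with its count
term_counts <- docs |>
  unnest_tokens(output = word, input = text) |>
  count(id, word)

# cast the counts into a sparse document-feature matrix (quanteda dfm)
dtm <- cast_dfm(term_counts, document = id, term = word, value = n)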

Term frequency-inverse document frequency

Term frequency: how often a term appears in a document (a raw count, or the count divided by the document's length)

Inverse document frequency:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

tf-idf = term frequency \(\times\) inverse document frequency

Frequency of a term adjusted for how rarely it is used
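
tidytext's bind_tf_idf() computes these quantities from per-document term counts. Note that it uses the length-normalized term frequency (count divided by the document's word count), which is what reproduces the table on the next slide. A sketch continuing from the term_counts object above:

term_counts |>
  bind_tf_idf(term = word, document = id, n = n)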

Term frequency-inverse document frequency

Document                    are    ate    cat  cheese  delicious    dog   mice  mouse  silly    the    was
Mice are silly            0.462  0.000  0.000   0.000      0.000  0.000  0.462  0.000  0.462  0.000  0.000
The cat ate the mouse     0.000  0.139  0.139   0.000      0.000  0.000  0.000  0.277  0.000  0.115  0.000
The cheese was delicious  0.000  0.000  0.000   0.347      0.347  0.000  0.000  0.000  0.000  0.072  0.347
The dog ate the cat       0.000  0.139  0.139   0.000      0.000  0.277  0.000  0.000  0.000  0.115  0.000

Word embeddings

Word embedding: a mathematical representation of a word in a continuous vector space

  • Dense data structure
  • Captures context
  • Semantic similarity
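
Semantic similarity between embeddings is typically measured with cosine similarity. A minimal sketch in base R, where vec_cat and vec_dog stand in for two rows of the word-vector table below:

# cosine similarity: near 1 = similar direction, near 0 = unrelated
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# cosine_sim(vec_cat, vec_dog)     # high for semantically related words
# cosine_sim(vec_cat, vec_cheese)  # lower for unrelated words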

Word embeddings

word d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 d67 d68 d69 d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93 d94 d95 d96 d97 d98 d99 d100
are -0.51533000 0.831860 0.22457 -0.738650 0.187180 0.260210 -0.42564 0.671210 -0.310840 -0.612750 0.089526 -0.240110 1.18780 0.676090 -0.022885 -0.92533 0.071174 0.388370 -0.4292400 0.371440 0.326710 0.431410 0.874950 0.3400900 -0.23189 -0.41144 0.490610 -0.32906 -0.491090 -0.189880 0.334080 -0.212450 -0.383860 -0.080547 1.116100 0.236170 0.313330 0.492860 0.100000 -0.151310 -0.141760 -0.280200 -0.23880 -0.35486 0.18282 -0.191340 0.60544 0.074573 -0.207310 -0.609650 0.199080 -0.570240 -0.174270 1.441900 -0.250190 -1.86480 0.416710 -0.246070 1.5010000 0.874150 -0.671350 1.27620 -0.272100 0.17583 1.22420 0.28242 0.6237500 0.6395100 0.369140 -0.846770 -0.322700 -0.671520 -0.1963500 -0.4078900 -0.209660 -0.19623 0.041885 0.539670 -1.110500 -0.395150 0.66590000 -0.233000 -1.082000 0.046465 -2.09930 -0.284930 0.0800250 -0.129630 -0.30011 -0.467640 -0.818310 -0.048509 -0.32233 -0.320130 -1.1207000 -0.056788 -0.730040 -1.20240 1.130400 0.347900
ate -0.08029200 0.659240 0.35281 0.034911 -0.944040 0.306810 0.60626 0.390930 0.228050 -0.710910 0.322700 0.499360 0.39814 0.611360 -0.010969 -0.09097 -0.421980 -0.080869 -0.3649000 0.074443 0.544210 0.350360 0.010708 -0.5578100 -0.23541 0.16357 -0.941980 -0.15397 -0.361850 0.138090 0.351410 1.066500 0.545790 0.056154 0.332340 1.009100 0.029193 0.526120 0.161590 -0.344020 -0.029192 -0.413610 -0.20168 -0.16338 -0.13938 0.378120 -0.54910 0.109800 0.152180 -0.739240 -0.034577 -0.202590 0.304410 0.423220 -0.975890 -0.25193 -0.411190 0.126880 0.0158810 0.390360 0.365970 1.35690 0.047675 -0.62382 -0.32479 -0.10494 0.0878120 -0.7758900 0.433540 0.222770 -0.200040 0.013524 0.7980100 0.5074600 -0.716180 0.92140 -0.960170 -0.785590 0.048053 0.730540 0.25351000 0.257890 -0.824790 0.181390 -0.66272 -0.886150 0.0548580 -0.086880 -0.77234 0.432990 0.714370 -0.881040 0.43407 -0.066353 -0.9752000 -0.907160 0.147380 0.03475 0.384050 0.175360
cat 0.23088000 0.282830 0.63180 -0.594110 -0.585990 0.632550 0.24402 -0.141080 0.060815 -0.789800 -0.291020 0.142870 0.72274 0.204280 0.140700 0.98757 0.525330 0.097456 0.8822000 0.512210 0.402040 0.211690 -0.013109 -0.7161600 0.55387 1.14520 -0.880440 -0.50216 -0.228140 0.023885 0.107200 0.083739 0.550150 0.584790 0.758160 0.457060 -0.280010 0.252250 0.689650 -0.609720 0.195780 0.044209 -0.31136 -0.68826 -0.22721 0.461850 -0.77162 0.102080 0.556360 0.067417 -0.572070 0.237350 0.471700 0.827650 -0.292630 -1.34220 -0.099277 0.281390 0.4160400 0.105830 0.622030 0.89496 -0.234460 0.51349 0.99379 1.18460 -0.1636400 0.2065300 0.738540 0.240590 -0.964730 0.134810 -0.0072484 0.3301600 -0.123650 0.27191 -0.409510 0.021909 -0.606900 0.407550 0.19566000 -0.418020 0.186360 -0.032652 -0.78571 -0.138470 0.0440070 -0.084423 0.04911 0.241040 0.452730 -0.186820 0.46182 0.089068 -0.1818500 -0.015230 -0.736800 -0.14532 0.151040 -0.714930
cheese -0.63712000 0.605150 -0.19317 0.116060 -0.410510 0.129780 1.74050 0.053119 0.208400 -0.536420 0.061240 -0.027045 -0.17595 1.296300 0.416620 0.90429 0.384430 -0.615150 -0.4669300 0.618620 -0.597650 0.886310 -0.374760 -0.9017800 -0.16541 1.00080 0.070107 -0.38194 -0.620150 -0.412870 0.046083 0.613130 -0.560240 -0.593780 0.055440 0.622950 0.193900 -0.214870 0.110400 -1.433400 1.016800 -1.591000 -0.64335 -0.88056 -0.13692 -0.166660 0.37185 -0.198730 -0.105600 -0.647160 -0.162720 -0.266330 -0.604040 0.677650 -1.660300 -0.76015 -0.592030 0.690610 0.0982840 0.090139 0.970170 0.63826 0.700190 -0.07888 0.77505 -0.59275 0.0099363 0.1458000 0.090962 -0.997450 -0.332210 0.605890 0.6329000 0.4926700 0.312280 0.90852 -0.434890 -0.319390 0.835890 0.832720 0.47300000 0.053605 -0.429040 0.330060 0.11979 -1.012000 -0.3595800 0.190870 0.53706 -0.605020 0.014610 0.136870 -1.18810 -0.222550 -0.9175600 -1.289900 0.186770 -0.27083 1.303300 0.036128
delicious -0.65534000 0.340340 0.30284 -0.148540 0.176830 0.337250 0.51254 0.047677 0.203640 -0.169770 0.064244 -0.030980 0.29266 0.256680 0.266270 0.55210 -0.199290 -0.455120 0.0758580 0.672750 0.074552 0.212680 0.043048 -0.9397500 0.16909 1.26090 -0.118490 0.19958 -0.780670 -0.968800 -0.273490 0.471600 -0.011452 -0.742100 0.413170 0.604600 -0.075988 0.218740 0.186800 -1.350800 0.686080 -0.138280 -0.29852 -0.72438 0.56742 0.317580 -0.11389 -0.063852 0.062136 -0.102100 0.309080 -0.538150 0.341190 0.019077 -0.991060 -1.00930 0.773920 0.453050 0.0667420 -0.897930 -0.490000 1.16020 -0.293620 -0.31742 0.22462 -1.19390 0.2820300 -0.5876100 -0.109370 -0.941000 -0.046886 0.327370 0.2178300 0.5369800 -0.200270 1.17190 -0.669520 -0.533590 0.405850 0.336610 -0.12291000 -0.188850 -0.452200 0.605610 -0.46547 -0.441810 0.2503800 0.173040 -0.51647 -0.225460 0.164590 0.279910 -0.42529 -0.468750 -1.1439000 -0.615680 -0.426700 -0.68853 0.089564 0.723000
dog 0.30817000 0.309380 0.52803 -0.925430 -0.736710 0.634750 0.44197 0.102620 -0.091420 -0.566070 -0.532700 0.201300 0.77040 -0.139830 0.137270 1.11280 0.893010 -0.178690 -0.0019722 0.572890 0.594790 0.504280 -0.289910 -1.3491000 0.42756 1.27480 -1.161300 -0.41084 0.042804 0.548660 0.188970 0.375900 0.580350 0.669750 0.811560 0.938640 -0.510050 -0.070079 0.828190 -0.353460 0.210860 -0.244120 -0.16554 -0.78358 -0.48482 0.389680 -0.86356 -0.016391 0.319840 -0.492460 -0.069363 0.018869 -0.098286 1.312600 -0.121160 -1.23990 -0.091429 0.352940 0.6464500 0.089642 0.702940 1.12440 0.386390 0.52084 0.98787 0.79952 -0.3462500 0.1409500 0.801670 0.209870 -0.860070 -0.153080 0.0745230 0.4081600 0.019208 0.51587 -0.344280 -0.245250 -0.779840 0.274250 0.22418000 0.201640 0.017431 -0.014697 -1.02350 -0.396950 -0.0056188 0.305690 0.31748 0.021404 0.118370 -0.113190 0.42456 0.534050 -0.1671700 -0.271850 -0.625500 0.12883 0.625290 -0.520860
mice 0.00063935 0.275940 0.11937 -0.587170 -0.732070 0.364360 0.73082 0.194790 -0.456630 -0.712230 -0.462910 0.354310 0.41265 0.011087 0.704830 1.15380 -0.865050 0.747780 1.0898000 -0.136560 -0.215850 -0.608840 0.068820 -0.2693900 -0.14702 0.23594 -0.362450 -0.80454 -0.619630 0.478210 0.721450 0.343340 -0.329530 0.190550 1.033400 0.230030 0.115860 0.874050 -0.253240 0.421480 -0.464190 -0.243130 -1.36830 -0.28809 -0.18192 0.294360 0.33680 -0.068659 -0.929580 -0.135920 -0.850740 -0.245050 0.089080 0.628800 0.069943 -0.72037 -0.561120 -0.256980 -0.5670900 -0.195380 0.013889 1.16350 0.238500 -0.12460 0.50788 1.59060 -0.3817100 0.3070000 0.738250 0.060485 0.065348 -0.019585 0.4766500 0.2848400 -0.783970 0.29604 0.098664 -0.142200 -0.128560 0.357240 0.18805000 -0.272090 -1.156600 1.092900 -1.53750 0.345480 1.5179000 -0.030003 -0.95319 0.416920 -0.111090 -0.608480 0.58638 0.179360 -0.4151700 -0.343450 -0.857680 -0.81315 0.254300 -1.163200
mouse -0.09320700 0.049685 0.25748 -0.525010 -0.180090 0.468880 0.26035 -0.484460 -0.020865 -1.021200 -0.642040 0.062146 0.17611 -0.521840 0.589680 1.54660 -0.418890 0.750560 1.2493000 -0.252390 -0.275400 0.094360 0.658510 -0.5618800 0.89223 0.82503 -0.589030 -0.70064 -0.229580 0.036496 0.385330 0.822370 0.028273 0.533260 1.044000 0.413500 -0.626240 -0.199070 0.626840 -0.193680 0.071461 -0.056608 -0.62716 -0.21990 -0.70554 0.756930 -0.33047 0.248220 -0.334600 0.413430 -0.508890 0.171170 0.193200 0.417950 -0.204310 -1.48530 -0.821540 0.069956 0.0020854 0.310960 0.452840 1.14810 0.089534 0.17282 0.56481 1.00160 -0.3856100 0.2381400 0.659000 0.207000 -0.136880 0.049653 0.0198350 -0.6654400 -0.365960 0.39073 -0.183770 0.218370 0.042889 0.791930 -0.09979700 -0.206130 -0.446030 0.172250 -1.25740 1.084900 0.9162000 -0.176950 0.56489 -0.017692 -0.045254 0.458630 0.47844 -0.160780 0.0030882 -0.092954 -0.496070 -0.58809 0.777270 -0.670310
silly -0.08140800 0.059552 0.77880 -0.646800 -0.615850 0.647310 -0.44597 0.308900 -0.071626 0.266020 0.161110 -0.040699 -0.43499 -0.134010 0.688020 0.53160 -0.762000 0.814480 0.2602000 0.574170 0.828190 0.422930 0.305790 -1.0311000 0.32201 0.68830 -0.553720 0.13781 -0.330430 -0.024804 -0.302030 0.399540 0.156220 -0.948060 -0.572130 0.460430 -0.856440 -0.653490 0.165680 -0.346040 0.387710 0.912410 -0.33025 -0.41045 -0.74941 -0.215180 0.26530 0.523260 -0.462110 -0.477560 0.405750 -0.187820 0.177040 -0.039180 -0.760020 -1.10750 0.447030 0.884780 0.1169300 0.070433 -0.093688 0.66467 -0.649070 0.26288 0.27458 -0.52282 1.0216000 0.0037161 -0.361660 -0.236730 -0.269150 -0.207520 0.0701320 -0.0048971 -0.583350 0.53387 -0.570200 0.355030 -0.083076 0.180800 -0.04327600 -0.325590 0.436960 -0.069350 -1.72520 -0.085043 -0.5303200 0.148600 -0.13186 0.054436 -0.264000 0.316100 -0.24254 -0.560520 -0.0719670 0.051976 -1.059800 -0.11550 -0.540620 0.194170
the -0.03819400 -0.244870 0.72812 -0.399610 0.083172 0.043953 -0.39141 0.334400 -0.575450 0.087459 0.287870 -0.067310 0.30906 -0.263840 -0.132310 -0.20757 0.333950 -0.338480 -0.3174300 -0.483360 0.146400 -0.373040 0.345770 0.0520410 0.44946 -0.46971 0.026280 -0.54155 -0.155180 -0.141070 -0.039722 0.282770 0.143930 0.234640 -0.310210 0.086173 0.203970 0.526240 0.171640 -0.082378 -0.717870 -0.415310 0.20335 -0.12763 0.41367 0.551870 0.57908 -0.334770 -0.365590 -0.548570 -0.062892 0.265840 0.302050 0.997750 -0.804810 -3.02430 0.012540 -0.369420 2.2167000 0.722010 -0.249780 0.92136 0.034514 0.46745 1.10790 -0.19358 -0.0745750 0.2335300 -0.052062 -0.220440 0.057162 -0.158060 -0.3079800 -0.4162500 0.379720 0.15006 -0.532120 -0.205500 -1.252600 0.071624 0.70565000 0.497440 -0.420630 0.261480 -1.53800 -0.302230 -0.0734380 -0.283120 0.37104 -0.252170 0.016215 -0.017099 -0.38984 0.874240 -0.7256900 -0.510580 -0.520280 -0.14590 0.827800 0.270620
was 0.13717000 -0.542870 0.19419 -0.299530 0.175450 0.084672 0.67752 0.098295 -0.035611 0.213340 0.516630 0.206870 0.44082 -0.336550 0.560250 -0.68790 0.519570 -0.212580 -0.5270800 -0.122490 0.330990 0.026448 0.590070 0.0065469 0.45405 -0.33884 -0.282610 -0.24633 0.108470 0.316400 -0.153680 0.735030 0.118580 0.708420 0.075081 0.297380 -0.113950 0.408070 -0.042531 -0.213010 -0.798490 -0.127030 0.75200 -0.41746 0.46615 -0.039097 0.65961 -0.323360 0.442000 -0.941370 -0.231250 -0.306040 0.799120 1.458100 -0.881990 -3.00410 -0.752430 -0.205030 1.1998000 0.948810 0.306490 0.48411 -0.757200 0.65856 0.70107 -0.93141 0.5292800 0.2332300 0.188570 0.386910 0.011489 -0.319370 0.0118580 0.2294400 0.177640 0.16868 0.140030 0.586470 -1.544700 -0.064425 -0.00064711 0.136060 -0.326950 0.100430 -1.54600 -0.547600 0.2102700 -0.671950 -0.15970 -0.682710 -0.220430 -0.870880 -0.16248 0.830860 -0.2304500 0.198640 -0.051892 -0.52057 0.254340 -0.237590

Word embeddings

Document d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 d67 d68 d69 d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93 d94 d95 d96 d97 d98 d99 d100
The dog ate the cat 0.3823700 0.761710 2.96888 -2.283849 -2.100396 1.662016 0.50943 1.021270 -0.953455 -1.891862 0.074720 0.708910 2.50940 0.148130 0.002381 1.59426 1.664260 -0.839063 -0.1195322 0.192823 1.833840 0.320250 0.399229 -2.518988 1.64494 1.64415 -2.931160 -2.15007 -0.857546 0.428495 0.568136 2.091679 1.964150 1.779974 1.281640 2.577146 -0.352927 1.760771 2.022710 -1.471956 -1.058292 -1.444141 -0.27188 -1.89048 -0.02407 2.333390 -1.02612 -0.474051 0.297200 -2.261423 -0.801794 0.585309 1.281924 4.558970 -2.999300 -8.88263 -0.576816 0.022370 5.511771 2.029852 1.191380 5.21898 0.268633 1.34541 3.87267 1.49202 -0.5712280 0.0386500 1.869626 0.232350 -1.910516 -0.320866 0.2493246 0.4132800 -0.061182 2.00930 -2.778200 -1.419931 -3.843887 1.555588 2.084650 1.036390 -1.462259 0.657001 -5.54793 -2.026030 -0.0536298 -0.431853 0.33633 0.191094 1.317900 -1.215248 0.54077 2.305245 -2.775600 -2.215400 -2.255480 -0.27354 2.815980 -0.519190
The cat ate the mouse -0.0190070 0.502015 2.69833 -1.883429 -1.543776 1.496146 0.32781 0.434190 -0.882900 -2.346992 -0.034620 0.569756 1.91511 -0.233880 0.454791 2.02806 0.352360 0.090187 1.1317400 -0.632457 0.963650 -0.089670 1.347649 -1.731768 2.10961 1.19438 -2.358890 -2.43987 -1.129930 -0.083669 0.764496 2.538149 1.412073 1.643484 1.514080 2.052006 -0.469117 1.631780 1.821360 -1.312176 -1.197691 -1.256629 -0.73350 -1.32680 -0.24479 2.700640 -0.49303 -0.209440 -0.357240 -1.355533 -1.241321 0.737610 1.573410 3.664320 -3.082450 -9.12803 -1.306927 -0.260614 4.867406 2.251170 0.941280 5.24268 -0.028223 0.99739 3.44961 1.69410 -0.6105880 0.1358400 1.726956 0.229480 -1.187326 -0.118133 0.1946366 -0.6603200 -0.446350 1.88416 -2.617690 -0.956311 -3.021158 2.073268 1.760673 0.628620 -1.925720 0.843948 -5.78183 -0.544180 0.8681890 -0.914493 0.58374 0.151998 1.154276 -0.643428 0.59465 1.610415 -2.605342 -2.036504 -2.126050 -0.99046 2.967960 -0.668640
Mice are silly -0.5960986 1.167352 1.12274 -1.972620 -1.160740 1.271880 -0.14079 1.174900 -0.839096 -1.058960 -0.212274 0.073501 1.16546 0.553167 1.369965 0.76007 -1.555876 1.950630 0.9207600 0.809050 0.939050 0.245500 1.249560 -0.960400 -0.05690 0.51280 -0.425560 -0.99579 -1.441150 0.263526 0.753500 0.530430 -0.557170 -0.838057 1.577370 0.926630 -0.427250 0.713420 0.012440 -0.075870 -0.218240 0.389080 -1.93735 -1.05340 -0.74851 -0.112160 1.20754 0.529174 -1.599000 -1.223130 -0.245910 -1.003110 0.091850 2.031520 -0.940267 -3.69267 0.302620 0.381730 1.050840 0.749203 -0.751149 3.10437 -0.682670 0.31411 2.00666 1.35020 1.2636400 0.9502261 0.745730 -1.023015 -0.526502 -0.898625 0.3504320 -0.1279471 -1.576980 0.63368 -0.429651 0.752500 -1.322136 0.142890 0.810674 -0.830680 -1.801640 1.070015 -5.36200 -0.024493 1.0676050 -0.011033 -1.38516 0.003716 -1.193400 -0.340889 0.02151 -0.701290 -1.607837 -0.348262 -2.647520 -2.13105 0.844080 -0.621130
The cheese was delicious -1.1934840 0.157750 1.03198 -0.731620 0.024942 0.595655 2.53915 0.533491 -0.199021 -0.405391 0.929984 0.081535 0.86659 0.952590 1.110830 0.56092 1.038660 -1.621330 -1.2355820 0.685520 -0.045708 0.752398 0.604128 -1.782942 0.90719 1.45315 -0.304713 -0.97024 -1.447530 -1.206340 -0.420809 2.102530 -0.309182 -0.392820 0.233481 1.611103 0.207932 0.938180 0.426309 -3.079588 0.186520 -2.271620 0.01348 -2.15003 1.31032 0.663693 1.49665 -0.920712 0.032946 -2.239200 -0.147782 -0.844680 0.838320 3.152577 -4.338160 -7.79785 -0.558000 0.569210 3.581526 0.863029 0.536880 3.20393 -0.316116 0.72971 2.80864 -2.91164 0.7466713 0.0249500 0.118100 -1.771980 -0.310445 0.455830 0.5546080 0.8428400 0.669370 2.39916 -1.496500 -0.472010 -1.555560 1.176529 1.055093 0.498255 -1.628820 1.297580 -3.42968 -2.303640 0.0276320 -0.591160 0.23193 -1.765360 -0.025015 -0.471199 -2.16571 1.013800 -3.017600 -2.217520 -0.812102 -1.62583 2.475004 0.792158
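
Each document vector above is the element-wise sum of the GloVe vectors of its words. A sketch of that pooling step, assuming a word_vectors tibble shaped like the word-level table (columns token and d1:d100); averaging instead of summing is a common alternative:

library(dplyr)
library(tidytext)

doc_vectors <- docs |>
  unnest_tokens(output = word, input = text) |>
  # attach the 100-dimensional vector for each token
  inner_join(word_vectors, by = c("word" = "token")) |>
  group_by(id) |>
  # sum every dimension across the document's words
  summarize(across(d1:d100, sum))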

Consumer complaints to the CFPB

[1] "transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate."                                                                                                                                                                                                                                                                                                                                   
[2] "I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act."
[3] "Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work."                           
[4] "I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal. \n\nThis occured on XX/XX/2019, by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX"                                                 

Sparse matrix structure

Document-feature matrix of: 117,214 documents, 46,099 features (99.88% sparse) and 0 docvars.
         features
docs        account auto bank call charg chase dai date dollar
  3113204 1       1    2    2    1     1     1   3    1      1
  3113208 0       1    0    6    3     5     0   0    1      1
  3113804 0       0    0    0    0     0     0   2    2      0
  3113805 0       1    0    0    0     0     0   0    0      0
  3113807 0       2    0    0    0     1     0   0    0      0
  3113808 0       0    0    0    0     0     0   0    0      0
[ reached max_ndoc ... 117,208 more documents, reached max_nfeat ... 46,089 more features ]

Sparsity of text corpora

Generating word embeddings

  • Dimension reduction
    • Principal components analysis (PCA)
    • Singular value decomposition (SVD)
  • Probabilistic models
  • Neural networks
    • Word2Vec
    • GloVe
    • BERT
    • ELMo
  • Custom-generated or pre-trained
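
A sketch of the dimension-reduction route, applying base R's svd() to the small term-frequency matrix built earlier; real corpora require a sparse, truncated SVD (e.g., the irlba package):

X <- as.matrix(dtm)  # documents x terms, densified for illustration
s <- svd(X)          # X = U D V'
k <- 2               # number of embedding dimensions to keep
# one k-dimensional vector per term (LSA-style word embeddings)
word_vecs <- s$v[, 1:k] %*% diag(s$d[1:k])
rownames(word_vecs) <- colnames(X)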

GloVe

  • Pre-trained word vector representations
  • Trained on global word-word co-occurrence statistics (how frequently words occur in proximity to each other)
  • Four versions
    • Wikipedia (2014) + Gigaword 5 - 6 billion tokens, 400 thousand words
    • Twitter - 27 billion tokens, 2 billion tweets, 1.2 million words
    • Common Crawl - 42 billion tokens, 1.9 million words
    • Common Crawl - 840 billion tokens, 2.2 million words

GloVe 6b (100 dimensions)

# A tibble: 400,000 × 101
   token      d1      d2      d3      d4      d5      d6      d7      d8      d9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 "the" -0.0382 -0.245   0.728  -0.400   0.0832  0.0440 -0.391   0.334  -0.575 
 2 ","   -0.108   0.111   0.598  -0.544   0.674   0.107   0.0389  0.355   0.0635
 3 "."   -0.340   0.209   0.463  -0.648  -0.384   0.0380  0.171   0.160   0.466 
 4 "of"  -0.153  -0.243   0.898   0.170   0.535   0.488  -0.588  -0.180  -1.36  
 5 "to"  -0.190   0.0500  0.191  -0.0492 -0.0897  0.210  -0.550   0.0984 -0.201 
 6 "and" -0.0720  0.231   0.0237 -0.506   0.339   0.196  -0.329   0.184  -0.181 
 7 "in"   0.0857 -0.222   0.166   0.134   0.382   0.354   0.0129  0.225  -0.438 
 8 "a"   -0.271   0.0440 -0.0203 -0.174   0.644   0.712   0.355   0.471  -0.296 
 9 "\""  -0.305  -0.236   0.176  -0.729  -0.283  -0.256   0.266   0.0253 -0.0748
10 "'s"   0.589  -0.202   0.735  -0.683  -0.197  -0.180  -0.392   0.342  -0.606 
# ℹ 399,990 more rows
# ℹ 91 more variables: d10 <dbl>, d11 <dbl>, d12 <dbl>, d13 <dbl>, d14 <dbl>,
#   d15 <dbl>, d16 <dbl>, d17 <dbl>, d18 <dbl>, d19 <dbl>, d20 <dbl>,
#   d21 <dbl>, d22 <dbl>, d23 <dbl>, d24 <dbl>, d25 <dbl>, d26 <dbl>,
#   d27 <dbl>, d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>, d32 <dbl>,
#   d33 <dbl>, d34 <dbl>, d35 <dbl>, d36 <dbl>, d37 <dbl>, d38 <dbl>,
#   d39 <dbl>, d40 <dbl>, d41 <dbl>, d42 <dbl>, d43 <dbl>, d44 <dbl>, …
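
A sketch of loading these pre-trained vectors in R, assuming the textdata package (it downloads the large vector files on first use and asks for confirmation):

library(textdata)

# 100-dimensional GloVe vectors trained on the 6-billion-token corpus
glove6b <- embedding_glove6b(dimensions = 100)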

Fairness in word embeddings

  • Word embeddings learn semantics and meaning from human speech
  • If the text is biased, then the embeddings will also contain bias
  • Terms associated with women are linked more strongly to the arts, while terms associated with men are linked more strongly to science.1
  • Even large language models (LLMs) trained on vast corpora of human-generated text do not reflect representative or diverse viewpoints.2
  • Sentiment analysis utilizing off-the-shelf word embeddings can score text such as “Let’s go get Italian food” more positively than “Let’s go get Mexican food”3

Application exercise

Data overview

library(arrow)  # provides read_parquet()
library(dplyr)  # provides glimpse()

leg <- read_parquet(file = "data/legislation.parquet")
glimpse(leg)
Rows: 383,987
Columns: 8
$ id          <dbl> 38183, 38184, 38185, 38186, 38187, 38188, 38189, 38190, 38…
$ year        <dbl> 1947, 1947, 1947, 1947, 1947, 1947, 1947, 1947, 1947, 1947…
$ cong        <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80…
$ bill_type   <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR"…
$ bill_no     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ description <chr> "To reduce individual income tax payments", "To amend the …
$ policy      <dbl> 1, 16, 16, 12, 16, 16, 2, 5, 21, 21, 16, 5, 10, 16, 1, 13,…
$ policy_lab  <fct> "Macroeconomics", "Defense", "Defense", "Law, crime, famil…

Number of bills over time

Policy topic

Policy attention over time

Length of text descriptions

tf-idf for each policy area

ae-13

  • Go to the course GitHub org and find your ae-13 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Wrap-up

Recap

  • Text data can be useful for predictive modeling but requires deliberate structuring
  • Bag-of-words representation is simple and easy to implement, but loses context
  • Word embeddings capture context and semantic similarity
  • Use efficient methods when working with sparse text data