diff --git a/README.md b/README.md index d1f1de616..46bb8f3e9 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,8 @@ See more details in our [blog post](https://blog.openai.com/better-language-mode This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2. +For basic information, see our [model card](./model_card.md). + ### Some caveats - GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important. diff --git a/domains.txt b/domains.txt new file mode 100644 index 000000000..04bdac448 --- /dev/null +++ b/domains.txt @@ -0,0 +1,1000 @@ +1542261 google +596207 archive +456344 blogspot +414695 github +333160 nytimes +321622 wordpress +315368 washingtonpost +313137 wikia +311917 bbc +246303 theguardian +210714 ebay +209416 pastebin +199360 cnn +196124 yahoo +186668 huffingtonpost +186137 go +183592 reuters +183080 imdb +160553 goo +139965 nih +135562 cbc +128011 apple +125615 medium +118676 dailymail +108012 steampowered +106417 independent +105239 etsy +98941 craigslist +93048 businessinsider +92712 telegraph +90262 wizards +83266 usatoday +80384 thehill +79655 nhl +79494 foxnews +79167 taobao +78070 bloomberg +77515 npr +77407 mlb +77172 latimes +75676 megalodon +72525 espn +72523 kickstarter +71743 breitbart +69334 abc +68009 newegg +67008 wwe +66278 myanimelist +65520 microsoft +64723 buzzfeed +63162 vice +62911 indiatimes +61845 forbes +61772 tappedout +60889 wsj +60240 vid +60239 battle +59996 adf +58706 politico +58345 redditgifts +56769 nexusmods +56469 goodreads +54866 magiccards +53973 nbcnews +53060 gamepedia +52110 mediafire +50567 time +50144 cbsnews +49203 ppy +48442 gstatic +48042 nfl +47460 steamusercontent +47046 thestar +46603 bugguide +46340 fanfiction +45505 mturk +45458 cbslocal +44729 theglobeandmail +44134 
nydailynews +42992 theatlantic +42941 netflix +42328 theverge +41952 smh +40694 nbcsports +40613 cnbc +40469 slate +40071 ign +39655 dotabuff +38968 wired +38779 chicagotribune +38590 urbandictionary +38575 rt +38092 wuxiaworld +38065 wowhead +37954 wolframalpha +37749 guardian +37594 xboxdvr +36841 nypost +36741 ravelry +36321 thedailybeast +36298 nba +36188 yelp +36008 arstechnica +35485 csgo +35365 flic +35269 stackexchange +35124 vidble +35024 googleusercontent +34311 msn +34121 gizmodo +34120 boardgamegeek +33867 aljazeera +33598 rawstory +33516 scryfall +33467 bleacherreport +33419 bit +33395 thinkprogress +33170 dailycaller +32843 ap +32433 fangraphs +31742 salon +31728 mirror +31496 nintendo +31294 nationalpost +31278 nasa +31110 oddshot +31057 hltv +30952 amzn +30877 quora +30586 engadget +30397 stackoverflow +30201 aliexpress +29710 cnet +28850 leagueoflegends +28822 surveymonkey +28704 ctvnews +28650 walmart +28644 plays +28536 sfgate +28375 cbssports +28210 globo +27992 discogs +27630 wiktionary +27588 ibb +27544 stuff +27349 nature +27112 news +27020 biblegateway +26801 subtletv +26427 change +26355 zippyshare +26311 guildwars2 +26231 vox +26205 zkillboard +26174 techcrunch +25993 economist +25964 globalnews +25621 washingtontimes +25610 hollywoodreporter +25351 archiveofourown +25336 ibtimes +25257 newsweek +25139 zerohedge +25074 fav +25050 sciencedirect +24894 bestbuy +24870 spiegel +24869 247sports +24866 smmry +24764 xda-developers +24726 tvtropes +24698 phys +24663 teamliquid +24619 state +23953 gleam +23676 sbnation +23644 asahi +23620 foxsports +23240 ndtv +23189 si +23183 alternet +23009 redbubble +22846 metro +22845 theonion +22835 playstation +22808 washingtonexaminer +22682 thehindu +22557 espncricinfo +22482 mozilla +22219 op +22038 t +21984 nj +21921 indianexpress +21707 apnews +21603 dw +21422 nationalgeographic +21399 pinterest +21368 ft +21319 wiley +21254 about +21074 skysports +21033 gamespot +21014 dailykos +21009 goal +20858 
patheos +20842 irishtimes +20664 variety +20592 kotaku +20584 mashable +20575 scientificamerican +20448 basketball-reference +20262 yle +20218 theage +20176 usnews +20133 animenewsnetwork +20092 livejournal +20068 +20024 pbs +19802 nhk +19741 newyorker +19727 seattletimes +19672 mlssoccer +19619 meetup +19543 nzherald +19509 philly +19496 uol +19470 patreon +19429 wikileaks +19400 gravitytales +19294 oregonlive +19267 xbox +19216 linkedin +19202 crunchyroll +19045 target +19021 ew +18922 redditpoll +18875 homedepot +18867 qz +18865 donmai +18653 baseball-reference +18646 talkingpointsmemo +18576 pathofexile +18536 makeameme +18489 postimg +18308 clyp +18175 scribd +18120 thegatewaypundit +18097 removeddit +18063 deadspin +18049 sciencedaily +18019 huffpost +17987 dallasnews +17956 europa +17878 merriam-webster +17816 haaretz +17746 deadline +17637 msnbc +17579 hindustantimes +17531 nymag +17429 gph +17208 typepad +17204 express +17098 naver +17085 bizjournals +17084 mlive +16834 rollingstone +16793 motherjones +16704 okcupid +16441 tinyurl +16410 espnfc +16397 bostonglobe +16374 thingiverse +16351 denverpost +16332 bitcointalk +16256 timesofisrael +16209 xnxx +16202 wikihow +16051 neopets +16043 indiegogo +16033 al +16032 chron +16004 avclub +15970 marketwatch +15933 mercurynews +15675 startribune +15646 pro-football-reference +15568 d20pfsrd +15545 pcgamer +15451 reason +15422 uesp +15356 lds +15152 polygon +15132 humblebundle +14962 tradingview +14931 baltimoresun +14914 strava +14912 firstpost +14856 commondreams +14801 sky +14739 eventbrite +14722 nicovideo +14697 fortune +14693 knowyourmeme +14666 robertsspaceindustries +14471 pitchfork +14466 psychologytoday +14435 combodeck +14392 mixcloud +14372 lemonde +14290 sciencemag +14060 jpost +13926 miamiherald +13902 patch +13850 nationalreview +13849 gofundme +13798 thelocal +13763 derpibooru +13726 techdirt +13658 townhall +13596 mtg +13588 gettyimages +13530 mit +13436 challonge +13369 mediaite +13357 tsn +13350 
pokemonshowdown +13176 neogaf +13130 publico +13126 snopes +13092 scmp +13082 cleveland +13044 thesun +13025 mtggoldfish +12994 freep +12984 grailed +12948 standard +12923 theconversation +12913 upi +12870 bing +12778 blockchain +12774 people +12771 arxiv +12760 hearthpwn +12668 reference +12626 edhrec +12611 sputniknews +12551 nordstrom +12550 lapresse +12496 metacritic +12447 last +12395 ajc +12355 mangadex +12349 ycombinator +12345 csmonitor +12240 sportsnet +12229 cornell +12205 smithsonianmag +12201 sephora +12194 bulbagarden +12181 japantimes +12171 zdnet +12152 comicbook +12139 whitehouse +12109 theregister +12089 libsyn +12052 asos +12016 neatclip +12001 imirhil +12000 boston +11973 behance +11966 eveonline +11954 androidpolice +11935 livescience +11843 instructables +11817 hs +11788 infowars +11712 ca +11704 runescape +11699 suntimes +11697 eurogamer +11654 roblox +11622 genius +11602 stltoday +11499 elpais +11494 motorsport +11461 ceddit +11426 france24 +11373 bungie +11371 youtubedoubler +11362 openload +11348 jstor +11328 thefreedictionary +11307 inquisitr +11215 nhentai +11204 zeit +11198 ikea +11114 springer +11108 tripadvisor +11082 thescore +11036 kerbalspaceprogram +11007 cdc +10995 dailywire +10965 gawker +10953 a +10950 brooksbaseball +10940 dn +10927 sltrib +10867 brickset +10823 dictionary +10821 squarespace +10819 battlefield +10807 harvard +10786 afpbb +10734 steemit +10730 billboard +10707 tampabay +10654 nola +10621 stanford +10602 sbs +10524 cc +10520 dailydot +10510 straitstimes +10493 itch +10490 foreignpolicy +10465 vancouversun +10440 rottentomatoes +10419 dnainfo +10389 digi24 +10348 dropboxusercontent +10332 complex +10330 scp-wiki +10327 prnt +10313 ottawacitizen +10304 anandtech +10269 thenation +10253 fivethirtyeight +10244 newscientist +10240 svt +10240 inquirer +10236 coindesk +10227 codepen +10208 lichess +10204 sankei +10189 ted +10181 roosterteeth +10170 livemint +10161 teamfortress +10141 sourceforge +10119 sapo +10113 
countle +10086 mtv +10075 sacbee +10066 fimfiction +10057 hentai-foundry +10054 gamesplanet +10044 io9 +10032 lifehacker +10007 cracked +9991 mainichi +9984 itmedia +9966 warthunder +9936 nos +9935 boingboing +9925 vulture +9904 lanacion +9892 qualtrics +9884 muthead +9856 jcrew +9814 jsonline +9787 spacebattles +9748 worldstarhiphop +9734 jalopnik +9721 welt +9717 curbed +9708 dbr +9705 mmafighting +9697 bigcartel +9682 transfermarkt +9680 vlive +9659 vanityfair +9658 dawn +9621 dnaindia +9601 theblaze +9599 allrecipes +9576 thejournal +9572 dailystar +9521 minecraftforum +9505 theweek +9502 kansascity +9494 anilist +9443 gog +9420 bato +9401 oxforddictionaries +9400 soompi +9394 sagepub +9389 wikiwand +9382 lolking +9322 torontosun +9319 mangapanda +9316 politifact +9306 realclearpolitics +9278 tagpro +9261 webmd +9206 app +9202 hotnews +9184 9news +9174 bhphotovideo +9147 giantbomb +9132 gamestop +9073 azcentral +9053 noaa +9040 repubblica +9021 mangaupdates +8998 space +8998 researchgate +8971 bitcoin +8957 sueddeutsche +8898 rightwingwatch +8892 mediacru +8890 afl +8862 fasttech +8858 tmz +8841 orlandosentinel +8832 tomshardware +8828 altomfotball +8822 mtgprice +8821 haskell +8816 discovery +8810 destinytracker +8808 massdrop +8800 csgolounge +8791 weather +8778 daddyleagues +8720 govtrack +8678 mentalfloss +8678 justice +8663 frontier +8655 youporn +8641 paradoxplaza +8640 rockstargames +8632 derstandard +8622 pinknews +8619 macrumors +8598 gamefaqs +8587 thepiratebay +8586 4chan +8582 post-gazette +8573 faz +8563 e-hentai +8530 jiji +8525 quoracdn +8519 fullmatchesandshows +8516 sun-sentinel +8513 xboxclips +8488 financialpost +8476 audible +8439 investopedia +8425 loc +8418 venturebeat +8414 amazonaws +8368 ubi +8345 etymonline +8326 wsws +8316 jezebel +8300 americanthinker +8284 wikidot +8269 digitaltrends +8260 nrk +8232 weebly +8228 thenextweb +8225 snahp +8223 gematsu +8210 daum +8206 ea +8189 liverpoolecho +8186 freebeacon +8178 thetimes +8168 
naturalcrit +8153 warframe +8150 1drv +8143 gap +8131 seriouseats +8119 myfigurecollection +8109 gov +8086 eporner +8080 hulu +8077 senate +8046 esquire +8015 gosugamers +8000 radionz +7997 eater +7982 politicususa +7978 rte +7956 marvel +7942 metronews +7917 starcitygames +7917 hotair +7914 marca +7872 eurekalert +7840 screenrant +7834 dota2 +7797 truth-out +7784 dell +7783 eldiario +7782 pcworld +7782 doi +7780 comicbookresources +7765 dr +7729 howstuffworks +7727 gocomics +7715 worldoftanks +7707 tandfonline +7690 examiner +7688 newrepublic +7682 curseforge +7680 findlaw +7673 nikkei +7665 heraldsun +7652 podbean +7645 aftonbladet +7638 duckduckgo +7633 ynetnews +7629 timesofindia +7628 freshphase +7591 westeros +7576 youjizz +7574 spectator +7548 justia +7537 antiwar +7536 mmajunkie +7516 yomiuri +7485 newstatesman +7481 greenmangaming +7475 joystiq +7444 jsfiddle +7424 anime-planet +7415 counterpunch +7410 autosport +7395 archlinux +7384 berkeley +7383 smbc-comics +7374 rockpapershotgun +7372 pjmedia +7367 estadao +7365 intoday +7361 newsmax +7346 newsbusters +7337 grantland +7329 voanews +7292 myshopify +7286 wnd +7265 9to5mac +7257 hurriyetdailynews +7229 bleedingcool +7225 indiewire +7222 radio-canada +7216 viewsync +7211 cambridge +7204 drsd +7197 house +7185 uproxx +7152 mlbtraderumors +7145 gamasutra +7134 bricklink +7122 foodnetwork +7122 presstv +7119 opensecrets +7118 canada +7116 bgr +7097 democracynow +7091 businessweek +7085 smash +7080 usda +7078 cloudfront +7044 psu +7028 detroitnews +7028 explosm +7013 woobox +7011 football-italia +7005 academia +6948 channelnewsasia +6927 siliconera +6923 rei +6917 deseretnews +6916 supload +6914 mises +6905 rotoworld +6886 gsmarena +6878 rappler +6876 kijiji +6866 metal-archives +6826 theaustralian +6823 mediamatters +6823 wa +6818 bodybuilding +6811 memedad +6803 ucsd +6802 barnesandnoble +6791 india +6780 readability +6777 today +6726 indystar +6720 scotsman +6694 impress +6689 torrentfreak +6675 heise +6668 
sportingnews +6658 pnas +6650 chzbgr +6650 milb +6631 business-standard +6630 bustle +6623 square-enix +6622 madison +6615 moddb +6613 uniqlo +6599 zillow +6577 tribune +6556 airliners +6552 svd +6547 gameinformer +6536 brisbanetimes +6536 ocregister +6533 swtor +6526 calgaryherald +6521 c-span +6518 slashdot +6505 belfasttelegraph +6499 hiyo +6494 news24 +6484 theintercept +6479 technologyreview +6455 gutenberg +6449 cinemablend +6438 dailytelegraph +6424 globalresearch +6411 lefigaro +6405 tenor +6381 redstate +6374 aclu +6361 bloodyelbow +6357 axios +6353 thewrap +6349 redditmetrics +6345 evike +6339 aol +6327 ulta +6326 plos +6324 periscope +6312 drivethrurpg +6308 infobae +6300 debian +6298 congress +6289 warcraftlogs +6284 gothamist +6281 mangastream +6276 newgrounds +6275 berniesanders +6263 lolesports +6262 mayoclinic +6242 sfchronicle +6235 edmontonjournal +6200 dhgate +6194 cincinnati +6180 history +6176 xtube +6169 nike +6160 kiji +6147 tube8 +6140 vdare +6133 unity3d +6130 twincities +6127 escapistmagazine +6126 komonews +6104 openneo +6090 oup +6082 dispatch +6079 newsobserver +6060 ballotpedia +6058 indiegala +6054 index +6050 charlotteobserver +6048 androidcentral +6032 webtoons +6028 tcgplayer +6018 zappos +6004 intel +5998 seattlepi +5996 profootballfocus +5990 ksl +5989 macleans +5984 atlasobscura +5981 yugiohprices +5980 ubuntu +5964 gq +5952 myvidster +5941 tv2 +5930 paizo +5926 montrealgazette +5919 al-monitor +5919 herokuapp +5918 volarenovels +5909 usgs +5906 nme +5906 society6 +5905 vg247 +5902 popsci +5895 lowes +5893 thefederalist +5878 amiami +5862 nyti +5848 steamdb +5841 crooksandliars +5833 popularmechanics +5832 slashfilm +5826 woot +5818 ev +5807 illinois +5792 nps +5791 destructoid +5790 mysanantonio +5772 sbtl +5742 smashboards +5700 biblehub +5696 euronews +5694 urbanoutfitters +5687 itv +5685 fastcompany +5684 techpowerup +5674 hearthhead +5656 mic +5649 autoblog +5646 futbin +5638 voat +5636 statesman +5626 zap2it +5623 
userbenchmark +5623 legaliq +5622 mspaintadventures +5622 familysearch +5616 themoscowtimes +5606 theprovince +5604 allkpop +5594 Omegle +5570 activistpost +5565 thefreethoughtproject +5565 in +5559 sandiegouniontribune +5556 consumerist +5554 eff +5532 lego +5520 translationnations +5515 clickhole +5498 etherscan +5491 live +5486 vndb +5484 poll-maker +5481 mtgsalvation +5481 computerworld +5475 comicvine +5470 python +5469 digitalspy +5468 citylab +5458 expressen +5455 oxfordjournals +5451 collider +5447 statista +5437 apa +5434 g +5430 thenational +5430 eslgaming +5425 politiken +5421 ktla +5420 webmshare +5408 bostonherald +5407 comixology +5400 ustream +5399 sony +5396 tennessean +5377 scout +5374 drop +5372 ieee +5359 sverigesradio +5356 sherdog +5353 viooz +5353 marxists +5353 adobe +5349 myfitnesspal +5342 seahawks +5339 rferl +5338 thediplomat +5335 storeparser +5332 prnewswire +5330 midwayusa +5327 liverpoolfc +5326 cisco +5326 windowsphone +5323 toysrus +5321 archivesofnethys +5317 eluniversal +5309 gmanetwork +5303 asus +5297 android +5297 finalfantasyxiv +5296 cyclingnews +5293 worldbank +5288 boxingscene +5285 ticketmaster +5279 grooveshark +5277 khl +5276 gallup +5268 britannica +5263 abc7 +5260 penny-arcade +5257 hsreplay +5257 oculus +5256 bt +5250 theroot +5246 makeagif +5246 cnsnews +5243 nbc +5243 rbc +5243 fextralife +5234 legislation +5225 sendvid +5221 sciencealert +5214 wbur +5212 myfonts +5207 picsarus +5206 phoronix +5204 nerdist +5203 eonline +5195 advocate +5191 king5 +5189 xkcd +5183 kitsu +5182 weibo +5181 mangareader +5178 palmbeachpost +5176 go1dfish +5175 livestrong +5174 truthdig +5173 lgbtqnation +5172 nikkansports +5167 slickdeals +5166 streamja +5164 irs +5158 readms +5152 microcenter +5137 telesurtv +5135 lastwordonsports +5129 alarabiya +5117 cointelegraph +5114 iltalehti +5112 fc2 +5108 wral +5108 thinkgeek +5102 bitbucket +5101 letterboxd +5098 ehow +5092 abc13 +5083 beeradvocate +5077 umich +5067 macys +5064 factorio +5063 
comicbookmovie +5042 telegram +5039 scroll +5034 setlist +5028 dailyherald +5019 games-workshop +5015 irishexaminer +5008 fbi +5007 heraldscotland +5001 jellyneo +4999 yale +4996 cbr +4994 masslive +4984 thestranger +4982 bundlestars +4981 alibaba +4977 filedropper +4974 monoprice +4968 forward +4964 parliament +4960 theringer +4950 hobbyking +4950 manchestereveningnews +4949 bmj +4948 thewire +4947 ff2ebook +4938 ashemaletube +4937 Twitch +4933 sketchtoy +4932 mcclatchydc +4931 memory-alpha +4925 newsok +4911 desmoinesregister +4901 puzzledragonx +4889 memecrunch
diff --git a/model_card.md b/model_card.md
new file mode 100644
index 000000000..fdab8ee2c
--- /dev/null
+++ b/model_card.md
@@ -0,0 +1,64 @@
+# GPT-2 model card
+
+Last updated: August 2019
+
+Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we’re providing some accompanying information about the GPT-2 family of models we’re releasing.
+
+## Model Details
+
+This model was developed by researchers at OpenAI to help us understand how language model capabilities scale as a function of the size of the models (by parameter count) combined with very large internet-scale datasets (WebText).
+
+### Model date
+
+Spring 2019, trained on data that cuts off at the end of 2017.
+
+### Model type
+
+Language model
+
+### Paper or other resource for more information
+[Blog post](https://openai.com/blog/better-language-models/) and [paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+
+### Where to send questions or comments about the model
+languagequestions@openai.com
+
+## Intended Uses
+
+### Primary intended uses
+
+The primary intended users of these models are *AI researchers and practitioners*.
+
+We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
+
+### Secondary uses
+
+Here are some secondary use cases we believe are likely:
+
+- **Writing assistance**: grammar assistance and autocompletion (for normal prose or code).
+- **Creative writing and art**: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
+- **Entertainment**: creation of games, chat bots, and amusing generations.
+
+### Out-of-scope use cases
+
+Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use cases that require the generated text to be true.
+
+Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use case.
+
+## Evaluation Data
+
+### Datasets
+
+This model was trained on (and evaluated against) WebText, a dataset consisting of the text contents of 45 million links posted by users of the ‘Reddit’ social network. WebText is made of data derived from outbound links from Reddit and does not consist of data taken directly from Reddit itself. Before generating the dataset, we used a blocklist to ensure we didn’t sample from a variety of subreddits that contain sexually explicit or otherwise offensive content.
+
+To get a sense of the data that went into GPT-2, we’ve [published a list](domains.txt) of the top 1,000 domains present in WebText and their frequency. The top 15 domains by volume in WebText are: Google, Archive, Blogspot, GitHub, NYTimes, Wordpress, Washington Post, Wikia, BBC, The Guardian, eBay, Pastebin, CNN, Yahoo!, and the Huffington Post.
+
+### Motivation
+
+The motivation behind WebText was to create an Internet-scale, heterogeneous dataset that we could use to test large-scale language models against. WebText was (and is) intended to be primarily for research purposes rather than production purposes.
+
+### Caveats and Recommendations
+
+Because GPT-2 is an internet-scale language model, it’s currently difficult to know what disciplined testing procedures can be applied to it to fully understand its capabilities and how the data it is trained on influences its vast range of outputs. We recommend researchers investigate these aspects of the model and share their results.
+
+Additionally, as indicated in our discussion of issues relating to potential misuse of the model, the long-term dynamics of detecting outputs from these models remain unclear. Developing better approaches to detection today will give us better intuitions when thinking about future models and could help us understand ahead of time if detection methods will eventually become ineffective.
+
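
The `domains.txt` file added in this diff appears to store one `count name` pair per line, ordered by descending count. The following is a minimal sketch of reading that format and recovering the top entries, assuming whitespace-separated fields. The function names `parse_domain_counts` and `top_domains` are hypothetical and are not part of the repository; the sample lines are copied from the list above, including one count that has no domain name next to it.

```python
from typing import Iterable, List, Tuple

DomainCount = Tuple[str, int]


def parse_domain_counts(lines: Iterable[str]) -> List[DomainCount]:
    """Parse 'COUNT NAME' lines into (name, count) pairs.

    Assumption: fields are separated by whitespace. Blank lines and
    entries that carry a count but no name are skipped.
    """
    pairs: List[DomainCount] = []
    for raw in lines:
        parts = raw.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, name = parts
        pairs.append((name, int(count)))
    return pairs


def top_domains(pairs: List[DomainCount], n: int) -> List[DomainCount]:
    """Return the n pairs with the largest counts, largest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Sample lines taken from the diff; "20068" has a count but no name.
SAMPLE = """\
1542261 google
596207 archive
456344 blogspot
20068
414695 github
"""

pairs = parse_domain_counts(SAMPLE.splitlines())
top3 = top_domains(pairs, 3)

for name, count in top3:
    print(f"{name}\t{count}")
```

To run this against the real file, `parse_domain_counts(open("domains.txt", encoding="utf-8"))` should work the same way, since iterating a file object yields its lines. Skipping nameless entries is a design choice of this sketch; a stricter reader might raise an error instead.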