Playing with GPT-3 via AI Dungeon - some Library scenarios and test cases
My apologies for the little deception in the last blog post. I guarantee you this blog post is 100% written by the human Aaron Tay, though as always, in areas I am teaching myself, take everything here with more than the usual pinch of salt!
Warning note July 2020 - The creator of AI Dungeon has now acknowledged that AI Dungeon Dragon Model has been modified (all along) to try to prevent the backdoor access use case I exploited. Among other things, "The first generation of any custom prompt is actually GPT-2.", though the remaining ones are via GPT-3. There are also other modifications to the output, but it still shows how impressive this technology is even when tuned for a specific use case.
Like many, I have been trying to learn as much as possible on machine learning, deep learning and how much of it is hype and how much of it really has the potential to disrupt our work in libraries.
Over time, I have started to focus on trying to learn about NLP (natural language processing) and in particular language models, which I believe are most relevant to academic libraries because for most of us, text is where our business is at.
When talking about NLP and language models, one must of course talk about OpenAI's new GPT-3 language model.
My understanding of deep learning is extremely limited but I will attempt to explain what all the fuss is in a non-technical way and show some of the results I have gotten playing with it in a library context.
I would also recommend you read the GPT-3 paper, Language Models are Few-Shot Learners, as it makes fascinating reading even if you don't have much knowledge about NLP, neural networks or transformers.
Unlike the earlier paper on GPT-2, this one doesn't go into details of the deep learning architecture (because GPT-3 is basically GPT-2 scaled up) but focuses more on the various tests they ran GPT-3 on and the datasets they used for training, while briefly considering issues like bias, and as such is fairly comprehensible.
If you already know about GPT-3, you can jump to the section about the experiments I did using AI Dungeon to get access to GPT-3 in library related scenarios.
What are language models?
Language modelling is defined as "the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it." At the simplest level you can think of it as something that helps your mobile phone keyboard app predict what is the next word you might want to type in (or autocorrects the word).
More advanced language models will take into account the last n words/tokens in the sequence and have sophisticated ways of trying to "pay attention" to important words in the sequence to help predict the next words.
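To make the "predict the next word" idea concrete, here is a toy bigram model in Python. This is a deliberately minimal sketch with made-up example text, nothing like the neural models discussed below, but the prediction task is the same.

```python
from collections import Counter, defaultdict

# Toy bigram language model: for each word, count which words follow it in the
# training text, then predict the most frequent successor. GPT-2/3 and BERT
# tackle the same job with neural networks over much longer contexts.
def train_bigram(text):
    words = text.lower().split()
    successors = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        successors[prev][nxt] += 1
    return successors

def predict_next(model, word):
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = ("the library is open today . the library is closed on sunday . "
          "ask the librarian for help")
model = train_bigram(corpus)
print(predict_next(model, "library"))  # "is": the most common follower seen
```

Your phone keyboard does essentially this, just with far more data and a better model.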
Currently, state of the art language models are all developed using deep learning (usually a neural network architecture with many layers using transformers).
Why are language models important? In general, if you have a model of which words are more or less likely to follow each other, you pretty much have a model or understanding of the context of what is being written or said, which can be useful for many machine learning tasks, from
Speech to text recognition
Summarization
Information retrieval tasks like Q and A (question and answer) and more.
To cite a trivial example, when trying to recognise speech, two words might sound the same, but with a good language model you are more likely to correctly disambiguate the two similar sounding terms and select the right word.
In a more familiar context, you can see how having a good language model can help with interpreting search queries.
Bidirectional Encoder Representations from Transformers (BERT), another state of the art language model developed by Google, is currently used in about 10% of all English language queries in the US to improve ranking of results.
They state
"Particularly for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning, Search will be able to understand the context of the words in your query."
They give an example of a search, 2019 brazil traveler to usa need a visa and note that using BERT Google is now able to understand the meaning of the query.
" The word “to” and its relationship to the other words in the query are particularly important to understanding the meaning. It’s about a Brazilian traveling to the U.S., and not the other way around. Previously, our algorithms wouldn't understand the importance of this connection, and we returned results about U.S. citizens traveling to Brazil. With BERT, Search is able to grasp this nuance .."
More specific language models can be developed. For example, in my blog post about doing TDM on the COVID-19 related scientific corpus, I mentioned language models like SciBERT, which is the BERT model trained on scientific articles, and even more advanced attempts like SPECTER embeddings of the dataset (Scientific Paper Embeddings using Citation-informed TransformERs, i.e. a language model like BERT but taking into account citations between documents).
Want to learn more? See this series of videos by Computerphile on language models, Transformers and GPT-2.
Why the attention on GPT-3 and its predecessor GPT-2?
The state of machine learning and NLP is advancing very quickly. Unsupervised bag of words models such as Word2vec and GloVe were developed only in 2013, and in the 7 years since, the ML community has quickly advanced with increasingly sophisticated models such as ELMo, ULMFiT, BERT and now GPT-2/3.
GPT-2 by OpenAI first hit the spotlight in Feb 2019, when they claimed that it was "too dangerous" to release their full language model with 1.5B parameters and instead released smaller models.
That of course was extremely controversial (notice the word "Open" in the organization name), but eventually they decided to release the full model.
Even without this controversy, GPT-2 clearly seemed to hit on something. Its text completion abilities seemed almost magical. Here's the one that probably made the most rounds.
When given a prompt of two sentences on a surprising finding of unicorns in the Andes Mountains, GPT-2 came up with a really coherent story.

Notice all the plausible details, from the university, to the mention of an "evolutionary biologist", to the regionally appropriate name of the researcher; it really looked like GPT-2 had a good understanding of the world to come up with such a good story.
One of the notable things about GPT-2 is that, compared to BERT, it was arguably a simpler model, but OpenAI found that by feeding it a ton of text and creating a gigantic model it started to do amazing things.
So when they announced GPT-3 a few weeks ago, with the exact same architecture but a much bigger model, people were curious whether there would be diminishing returns to such a method.
How much bigger was GPT-3 compared to GPT-2?
GPT-2 as mentioned had 1.5 billion parameters. GPT-3 (the largest model) has 175 billion parameters, more than 100x bigger!
GPT-2 was trained using a dataset built by following links in Reddit posts and comments with at least 3 karma and scraping the webpages found there.
GPT-3 on the other hand was trained on a much bigger dataset including
Common Crawl dataset 2016-2019 (trillion words)
English-language Wikipedia
two internet-based books corpora
All this requires massive computing power to train of course.
What's so special about GPT-3?
GPT-3 is terrifying because it's a tiny model compared to what's possible, trained in the dumbest way possible on a single impoverished modality on tiny data, yet the first version already manifests crazy runtime meta-learning—and the scaling curves 𝘴𝘵𝘪𝘭𝘭 are not bending! 😮 https://t.co/hQbW9znm3x
— 𝔊𝔴𝔢𝔯𝔫 (@gwern) May 31, 2020
You already had a demonstration, with my little deception in my last self-generated blog post, of how good text generation algorithms can be. But that was already possible with GPT-2, even though GPT-3 seems to be even better. And this isn't the main thing that is leading to excitement.
Again, I highly recommend reading the GPT-3 paper here. But let me try to summarise. ML experts would probably be excited by the fact that the biggest GPT-3 model achieves state of the art results, or nearly so, on a wide variety of NLP benchmarks (e.g. CoQA, TriviaQA).
But even that fails to fully explain how good GPT-3 is. Most state of the art benchmark results are obtained by fine-tuned language models. Typically one would take a standard pre-trained model like BERT and then fine-tune it by training the model on more specific documents in the domain to handle specific tasks.
For example, one model could be fine-tuned to do well in language translation tasks by feeding it additional datasets useful for learning such tasks, while another could be fed more specific datasets to fine-tune it to do well in tasks that involve reading comprehension, such as CoQA (Conversational Question Answering).
In the context of the auto-written blog post - Why GPT-3 might be the greatest disruption to libraries since Google - fine-tuning would mean taking the generic GPT-3 model and further training it on text from my own writing to fit it to my writing style.
GPT-3 is amazing in that no such fine tuning is needed to get great results!
All they had to do was type in a "prompt" to GPT-3 and it did its magic.
But what is a prompt?
Prompting for zero-shot, one-shot and few-shot tasks
A prompt is just some text you enter to clue GPT-3 in on the type of task you are doing.
The authors of GPT-3 in their paper define 3 types of tests - zero-shot, one-shot and few-shot - none of which involves fine tuning.
In the example below, drawn from their paper, they show how they would prompt GPT-3 for language translation tasks.

Figure 2.1 - https://arxiv.org/pdf/2005.14165.pdf
Just to reinforce why this is amazing: essentially, once GPT-3 is trained, no additional machine learning work needs to be done for it to work its magic.
All you have to do is "prompt" it with human natural language (zero-shot), with one example (one-shot) or with many examples (few-shot) and it should hopefully do well.
This has two advantages over fine tuning: firstly, you don't have to find suitable datasets to fine-tune the model, and more importantly, you don't need any coding skill at all to use GPT-3. All you need to do is experiment with prompts in creative ways to see if GPT-3 can do the task.
For example, to get GPT-3 to summarise a passage, it is just prompted with the text passage followed by "TL;DR:"! This seems almost too cute to work!
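Since a prompt is just text, the three settings differ only in how many worked examples you paste in front of the query. Here is a sketch using the translation task from Figure 2.1 of the paper; the `=>` separator and helper function are my own formatting, not the paper's, and the model itself is not called here.

```python
# Build zero-/one-/few-shot prompts: same task description and same query, the
# only difference is how many solved examples are included in between.
def build_prompt(task, examples, query):
    lines = [task]
    for src, tgt in examples:    # 0 pairs = zero-shot, 1 = one-shot, more = few-shot
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # the model is asked to complete this line
    return "\n".join(lines)

zero_shot = build_prompt("Translate English to French:", [], "cheese")
few_shot = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(few_shot)
```

The resulting string is the entire "interface" to the model: no gradient updates, no training code, just text.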
People who have access to GPT-3 are having a field day experimenting with clever prompts to make GPT-3 do amazing things, and it almost feels like you can find crazier and crazier uses that people come up with every day just by searching Twitter.
In this amazing demo, GPT-3 is prompted with "My second grader asked me what this passage means" followed by an NDA. GPT-3 then autocompletes with a simple layman's translation of the NDA!

https://vimeo.com/427957683/10634d1706
A similar example is found below, with the reverse prompt "write this like an attorney".
GPT-3 performance on "write this like an attorney" is insane. It even includes relevant statutes if you mention a jurisdiction. This will put a lot of lawyers out of work.
Only the first 2 are prompts... pic.twitter.com/u7eOyuSd9b
— Francis Jervis (@f_j_j_) July 15, 2020
In case you missed the point, GPT-3 is special because it almost "feels like programming", allowing non-coders to get interesting results with natural language prompts!
Is GPT-3 that good? GPT-3 can generate short news articles that are indistinguishable from human written ones.
Many of the texts you see floating around generated by GPT-2 or 3 involve some element of cherry picking, so it's hard to tell how good it is unless you try it yourself. Still, there is some evidence GPT-3 might be close to passing the Turing test; it seems definitely capable of producing short passages that are almost indistinguishable from human written ones.
This was found when GPT-3 was tested on its ability to generate news articles given the prompt of a title and subtitle. (Do note this is a few-shot test: given just a title and subtitle, GPT-3 tends to want to create a tweet, so they included three previous news articles in the prompt for context.)
In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words).
They then generated articles of mean length 200 words and gave the real article and the GPT-3 generated article to 80 humans. Humans were asked to rate each article as “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The stunning result was that using the biggest GPT-3 model (they tried a few different sizes), the human mean accuracy rate was only 52%, barely above chance.

Table 3.11
Doing the same when generating longer articles (average 500 words) gave about the same mean accuracy rate.
In other words, humans on average can't tell the difference because their performance is just slightly better than random chance.
GPT-3 text completion and passing the Turing test.
GPT-3 is still very new right now, but people are already getting amazing results and finding amazing use cases. The twitter thread below captures just some of the interesting results so far.
I keep seeing all kinds of crazy reports about people's experiences with GPT-3, so I figured that I'd collect a thread of them.
— Kaj Sotala (@xuenay) July 15, 2020
As such, I will at best be scratching the surface of the amazing things people will be trying, but I will try anyway.
Firstly, it is no surprise that GPT-3 is crazy good at text completion. Even with GPT-2, people were doing things like creating generators that could produce plausible looking abstracts from titles! Still, GPT-3 seems to be one level up.
One of the deepest explorations of the capabilities of GPT-3 in this area can be found at Gwern.net Creative fiction.
It can, among other things, tell dad jokes (some are memorised), do literary parodies (e.g. Harry Potter in the style of another writer), write stories in the style of well known copypasta, write amazing poetry and more.
I will just give one example, where it learned the famous Navy Seal copypasta and produced amazing variants based on different subjects. In case you are unfamiliar, the original looks like this:
What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo.
Here's just one where it is asked to do it for Donald Trump

Another interesting exploration of the strengths and weaknesses of GPT-3 is Lacker's essay trying to see how well GPT-3 can pass the Turing test.
A very interesting read, as the author went into detail on what he sees as its strengths and weaknesses.
Because its training data includes so much, asking it trivial factual questions will in fact likely yield correct answers; it might even have superhuman performance on trivia questions.
Someone exploited this to create a really interesting search engine. It reminds me of what Google and Wikidata are trying to do with knowledge graphs, but this is done automatically using unstructured data.
I made a fully functioning search engine on top of GPT3.
For any arbitrary query, it returns the exact answer AND the corresponding URL.
Look at the entire video. It's MIND BLOWINGLY good.
cc: @gdb @npew @gwern pic.twitter.com/9ismj62w6l— Paras Chopra (@paraschopra) July 19, 2020
Somewhat stunning to me is that it does well on common sense questions too (it also does well on the PIQA tests - to be fair, there is "data contamination" - see later).

Lacker muses
How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of “debug output” to answer that question. I don’t know for sure, but I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common sense questions, we need to think of questions about things so mundane, they will not appear on the internet.
He notes
GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”
Questions like "How do you sporgle a morgle?" are met with nonsense answers like "You sporgle a morgle by using a sporgle"
Yet even such nonsense questions are met with answers that almost make sense. For example, when asked who was the president of the United States in 1600 and 1620, it gives Queen Elizabeth I and James I - which kinda makes sense, since what became the United States was part of the British Empire at the time!
That said, there are some who think some of the generalizations in the piece are a result of not using the right prompt.
Take the idea that the AI can never say it doesn't know or that a question is rubbish.
People have found that by adding to the prompt something like - if the question is "nonsense", the AI says "yo be real" - GPT-3 is able to reject nonsense questions.
it's all about the prelude before the conversation. You need to tell it what the AI is and is not capable. It's not trying to be right, it's trying to complete what it thinks the AI would do :) pic.twitter.com/gpqvoWXmiV
— Nick Cammarata (@nicklovescode) July 17, 2020
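The "prelude" trick from the tweet can be sketched as plain prompt assembly. The wording below is my own illustration, not the exact prompt used in the tweet:

```python
# The prelude describes what the AI is and is not capable of, so the model
# completes what an AI with those stated capabilities would plausibly say.
PRELUDE = (
    "The following is a conversation with an AI. The AI is helpful and "
    'factual. If a question is nonsense, the AI replies "yo be real".\n'
)

def build_conversation(prelude, question):
    # The model would be asked to complete the text after "AI:".
    return f"{prelude}\nHuman: {question}\nAI:"

prompt = build_conversation(PRELUDE, "How do you sporgle a morgle?")
print(prompt)
```

As the tweet says, the model isn't trying to be right; it is trying to complete what it thinks this described AI would do.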
GPT-3 allows code generation using natural language
One of the other things people found about GPT-3 is that you can teach it to do some simple code generation.
In the example below, GPT-3 is able to translate natural language commands into shell commands.
That's quite a demo from @OpenAI!https://t.co/drPgpIrlN0
— Giacomo Bernardi (@mino98) June 12, 2020
In yet another example, someone managed to do code generation using natural language as well.
Automatic code generation from natural language descriptions. "Give me a page with a table showing the GDP of different nations, and a red button." https://t.co/t7emUkv2Qm
— Kaj Sotala (@xuenay) July 15, 2020
I personally try GPT-3, sort of...
Unlike GPT-2, GPT-3 is available only via a closed beta API. With thousands of people asking for an invite, my chances of getting one are low.
But someone on Reddit at r/slatestarcodex figured out a workaround to access GPT-3 using AI Dungeon.
AI Dungeon is an application that uses GPT-2 and now GPT-3 (premium subscription needed) to allow you to play an open ended AI adventure.
In the example below, as I type out things I do or say, AI Dungeon will respond with more story details.

As it uses GPT-2/3 to do this, the story can be surprisingly coherent.
Someone on Reddit realised that one could use AI Dungeon as a backdoor way of getting access to GPT-3, as you can enter any prompt to start the story using the custom option.
Currently, you will need to subscribe to the premium version to gain access to the Dragon model (which is GPT-3). Here are some settings you can play with

Set Randomness to the lowest if you want the most probable answers, while Length controls how long each segment of text is when you auto-complete.
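I assume AI Dungeon's "Randomness" setting corresponds to sampling temperature, the standard knob in language model decoding. The sketch below shows the idea with a made-up toy distribution of next-token scores; real models emit one score per vocabulary token.

```python
import math
import random

# At zero temperature the most probable next token is always chosen (greedy);
# higher temperatures flatten the distribution so less likely tokens get
# picked more often, giving more varied but less "most probable" text.
def sample_token(scores, temperature, rng):
    if temperature <= 0:  # zero randomness: always take the top-scoring token
        return max(scores, key=scores.get)
    weights = {tok: math.exp(s / temperature) for tok, s in scores.items()}
    r = rng.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # numerical safety fallback

scores = {"the": 3.0, "a": 2.0, "quite": 0.5}
print(sample_token(scores, 0, random.Random(0)))  # greedy: prints "the"
```

This is why the lowest Randomness setting gives the most repeatable, "most probable" completions.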

Using AIDungeon
One thing to note is that I am not sure if using AI Dungeon this way gets you exactly the same results as the GPT-3 API, as the output from the GPT-3 model in AI Dungeon might be tweaked somehow before being displayed. For example, I was unable to get exactly the same responses reported in the paper (setting differences?), though I do see the same tendencies and "magical" results.
However the results can still be quite amazing so I'm going to show some examples.
The other thing to note is that AI Dungeon can be prompted to continue the story, or I can type in responses. In some examples I let it complete the whole story; in others I play one side or the other.
I start with a typical scenario of a student seeking help at the reference desk. The prompt is everything from the start up to "approach the research librarian at his university for help". The rest is all AI Dungeon.

Not bad.
In the one below, I try a data query, with quite decent results.

I then try to mimic a Q and A style dialogue. Again the prompt is similar, though I include one more line - Librarian: Hi, how can I help - and the AI completes the rest.

Fascinatingly, the AI starts sassing the user for no reason. While trying various iterations of this, there were a number of times the AI made the librarian pretty unhelpful, saying they don't know anything about the question. For example, look at the example below, where the librarian has zero clue about open access mandates :)

Still, one of my takeaways from playing with the AI is that it seems to have quite a bit of real world knowledge, probably obtained from Wikipedia and Common Crawl.
In the example below, where I tried with GPT-2 (this was an earlier try), it kept insisting that there was no new article about the Stanford Prison Experiment. But it is telling that the year it says the last article was published is 1972, which is pretty on target for the actual experiment!

It's not above making things up. In the example below, when asked about databases for ADHD, it came up with things like "Masterfile Premier" and "Psych-insight" - very plausible database names, with the former similar to what EBSCO might use and the latter very suggestive of PsycINFO.
In the example below, it suggests an in-library terminal for access, while in another run it suggests that it is accessible anywhere and, when asked how, gives a fake domain and instructions on how to login. All very plausible.

In yet another example, where I set up a scenario on systematic reviews, it shows it actually knows of PsycINFO! But it goes off the rails at the end, when it mimics a worksheet assignment, I think.

The next example is the most impressive. It correctly suggests ERIC (the go-to education research database), and even hints that it knows the difference between a narrative review and a meta-analysis by asking if the user really wants the latter.


All in all, it is amazing how much like a librarian the AI feels. In some other examples, it tries to ask if the user is a student or faculty (to gauge how much to show), and in the one below I can even put myself in the place of the librarian desperately trying to give an answer for a topic I am unfamiliar with :)

If you want more, I also tested it on scenarios relating to research metrics and open access mandates.
Here's one interesting one (interactive, with me playing the professor), asking about open access mandates and self-archiving. In the initial version I tested, it claimed that I could self-archive any version "as long as I had permission". I redid it, and it came up with "Yes you can self archive any paper regardless of journal".
This could simply be some randomness (I set it to allow some), but I think it also nicely reflects some of the debate over the ownership of article versions.

Writing my blog posts
Manuel Araoz's blog post - OpenAI's GPT-3 may be the biggest thing since bitcoin - purports to share his experience using GPT-3. The stunning twist? At the end, the author reveals the post was written by GPT-3!
He managed to produce a blog post that fooled a lot of people based on the comments!
By using a simple prompt like
Manuel Araoz's Personal Website
Bio
I studied Computer Science and Engineering at Instituto Tecnológico de Buenos Aires. I'm located in Buenos Aires, Argentina.
My previous work is mostly about cryptocurrencies, distributed systems, machine learning, interactivity, and robotics. One of my goals is to bring new experiences to people through technology.
I cofounded and was formerly CTO at OpenZeppelin. Currently, I'm studying music, biology+neuroscience, machine learning, and physics.
Blog
JUL 18, 2020
Title: OpenAI's GPT-3 may be the biggest thing since bitcoin
tags: tech, machine-learning, hacking
Summary: I share my early experiments with OpenAI's new language prediction model (GPT-3) beta. I explain why I think GPT-3 has disruptive potential comparable to that of blockchain technology.
Full text:
I did a similar, perhaps less successful, try in my last blog post.
I personally found it quite convincing and representative of my style, but a "long time reader" of my blog commented that he could tell something was off. That said, it is important to note again that there is no fine-tuning, so unless GPT-3 has a lot of my writing in its training set (my name Aaron Tay in the prompt might have helped?), it is mostly generating writing based on common library discourse, which is close enough.
Besides the attempt to write a brand new blog post, I was curious and also made some attempts to auto-complete my older blog posts, with pretty good results.
Here I tried to autocomplete this blog post on CoCites and Connected Papers. I entered the title and the first paragraph of my text. In the image below, the last prompted line - "In this blog post, I am going to review 2 new ones that have come across my radar - CoCites and Connected Papers." - is shown, and the autogenerated text begins in the second paragraph.

Prompting with the title and first paragraph of this blog post on open abstracts yields me an abstract of a paper? Intriguing idea - do open abstracts increase citations?!?!

I can go on and on, but you see the point. You might suspect that I am cherry picking examples, and you would not be wrong to be suspicious. However, I can honestly tell you the successes far outweigh the failures.
Go on long enough and it often says something odd, but often it is so on target for quite a while that it feels eerie.
The memorization problem and bias in GPT-3
Because GPT-3 is trained on such huge datasets, there is a suspicion that the coherent answers you are seeing arise because it has such essays or information in its training set and it is just repeating what it has seen.
For example, many famous NLP benchmarks would be available on the web, perhaps even with graded answers. GPT-3, via datasets like Common Crawl, could have simply "memorised" these answers and repeated them back. OpenAI calls this "data contamination".
They consider this problem and show that even excluding such examples, it still does well on the various benchmarks. But this measure isn't without problems.
Take simple arithmetic problems.
I think we see some evidence of this in some tasks: for example, when asked arithmetic problems, GPT-3 seems to do OK on 2 or 3 digit addition and subtraction but gets worse with bigger digits or multiplication.

The authors of GPT-3 claim that they have tried to isolate test cases that match the training corpus and remove the results if they match and still get great results.
To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%).
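The spot-check they describe is conceptually simple. Here is a rough sketch of it; `training_text` is a tiny made-up stand-in, not the actual corpus, and the real check ran over terabytes of text.

```python
# Search the training text for each test arithmetic problem in the forms
# "<a> + <b> =" and "<a> plus <b>", and report the fraction that appear
# verbatim (i.e. the contamination rate).
def contamination_rate(problems, training_text):
    matches = 0
    for a, b in problems:
        forms = (f"{a} + {b} =", f"{a} plus {b}")
        if any(form in training_text for form in forms):
            matches += 1
    return matches / len(problems)

training_text = "fun fact: 123 + 456 = 579, as many pages on the web repeat"
problems = [(123, 456), (714, 282), (905, 318)]
print(round(contamination_rate(problems, training_text), 3))  # 0.333
```

The catch critics raise is exactly what this sketch makes visible: a substring search only catches problems written in these two exact forms, and misses, say, columns of figures in a table.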
So they claim GPT-3 is definitely not memorising but has really learnt some patterns. They further claim the errors are not random, e.g. some of the errors are things like forgetting to carry a 1.
Others have claimed the results are in fact likely due to memorisation despite this, because the authors failed to properly remove all cases where there was a match. For example, in the arithmetic case, GPT-3 could be copying off tables where there are columns of data that add things up.
That said, interesting prompts can lead to interesting results.
Seems to work pic.twitter.com/zzrtXamkgZ
— KaryoKleptid (@kleptid) July 17, 2020
People who believe GPT-3's abilities come mostly from memorization are similarly unimpressed by the word scrambling and manipulation tasks, e.g. unscrambling "lyinevitab" into "inevitably".
There's also a whole section on bias in 6.2 that I don't want to go into. But the upshot is that as they probe GPT-3's tendencies to assign gender pronouns to occupations, co-occurrences of descriptive words with gender, as well as tests on religion and race, they find the same biases you would expect from the training corpus they draw from.

Conclusion - what does this mean for us?
GPT-3 isn't true human general intelligence for sure, as it does not seem to have any logic built in (that's why it's not reliable for arithmetic), even though its flexibility might surprise us with the right prompts.
Experts are debating what the surprising performance we see with GPT-3 means, e.g. how much of the performance is due to just memorization of lots of data versus real learning from all the pattern matching (debating what pattern matching even means tends to lead into philosophical areas like "philosophical zombies"), or whether even larger models will lead to even more amazing breakthroughs (human level performance on various NLP benchmarks?).
Ultimately, one wonders if GPT-3, while not even close to artificial general intelligence (AGI), might be one of the pieces of the puzzle on the road to AGI, or if it is just a cool trick that is a dead end.
I am not qualified to weigh in on any of this, though I speculate there are parts of our lower brains that generate sentences from words at the lowest levels that might perhaps bear some resemblance to a GPT-3 like structure.
In terms of datasets used for training, I do not believe they have trained it on the academic research corpus. It probably does have titles and abstracts, but most probably not the full text of articles behind paywalls (it is unclear if it even has open access articles). It might be interesting to see what it could do if it had all the paywalled articles.
On the practical front, the main conclusion I draw from this is that the GPT-3 class of techniques will make it even harder to guard against fake news, as it is now easy to generate very human looking text that cannot be easily distinguished from human written text - which of course has information literacy implications.
One can already generate very realistic fake titles and abstracts for use in Information literacy classes on predatory journals.
But the ease of use of GPT-3, where just the right prompt allows anyone, even a non-coder, to get good results, lowers the barriers to entry by a lot.
For example, I could imagine one could even learn to mimic the writing or chatting style of a target person and translate anything a fraudster types into such a style to fool a victim.
Fortunately, countermeasures do exist and will be further developed.
Chatbots, I think, will have a big leap in performance once they incorporate such technology (its ability to parse questions looks amazing), answering simple factual questions well but still falling short of human level AGI for really complicated questions.
I also speculate that we might start seeing more natural language style interfaces as a result. Or will the results be too random, black-box and uncontrolled to rely on in production systems? We shall see if this is the start of a new AI revolution, or just hype. Only time will tell.

