Eric Brown:

I'm not so sure. LLMs don't have logical reasoning as we understand it, and they have no grounding in the external world. As such, LLMs are well known to make up plausible nonsense (aka bullshit) at the drop of a hat, and if you feed them sufficient amounts of nonsense, they will generate nonsense back at you.

Florin Cojocariu:

You're right, so I am confused: what aren't you sure about?

Eric Brown:

I'm not sure that LLMs are the savior you're looking for. Let me expand (considerably).

1) You said "Say you're interested in genetics, buy one of these machines and ask it to ingest all the genetics texts it can find on the web. If the training phase is effective (this is the hardest part), you'll end up with a machine that can answer almost anything about genetics."

Well, IMO, your first problem is *finding* all those genetics texts on the web. Web crawling is slow and expensive; distinguishing genetics texts from non-genetics texts is at least as difficult as building an LLM; and, getting back to my original comment, once you *have* this LLM, you can't trust what comes out of it.

Thirdly, elites & governments & others will trivially be able to "spam" your web crawler by generating fake input; they'll have their LLMs spewing out terabytes of plausible BS that feed your LLM.

The only reason this hasn't happened with ChatGPT (yet) is the immaturity of the internet, although even now I'm sure that people are spinning up instances of LLaMA and the like to generate more internet BS.

Florin Cojocariu:

Also, you may want to look at this: https://chat.openai.com/share/ae962dde-1c19-4795-95d9-4114b152ba3e — there are step-by-step instructions on how to install an LLM on my Mac M1.
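For anyone curious, here's a minimal sketch of one common route (llama-cpp-python on Apple Silicon); the model file name below is just an example, not necessarily what the linked chat uses:

```python
# Minimal local-LLM sketch using llama-cpp-python on an M1 Mac.
# Assumes a quantized GGUF model has already been downloaded to models/.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

out = llm("Explain what a plasmid is in two sentences.",
          max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])
```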

Eric Brown:

Yes, I've actually done that on my personal machine.

To be clear - my intuitions are as follows:

1) Publicly available LLMs will be at most as reliable as Wikipedia.

2) Attempts to "debias" LLMs via custom post-training will either fail or become illegal.

3) Building your own LLM from scratch (as opposed to *running* your own LLM) will *definitely* be illegal, if only by licensure of sufficiently powerful GPUs (they're already export-controlled, for instance).

4) Even if you did build your own LLM, you will at best get a competent graduate assistant, not a mad genius.

Florin Cojocariu:

I mostly agree, but I think in a few years the power and software to run something like GPT-4 will be on every PC. 2) is a risk, but we'll see. 4) I never said that :) Can you please share what model you used, how and on what you trained it, and with what success?

Florin Cojocariu:

Didn't have the time yesterday; now I can give you a clear example of how one can achieve what the article says we all can. I know nothing about genetics (I'm trained in physics), but I asked ChatGPT for a list of reference books in genetic engineering. I got this:

1. "Principles of Gene Manipulation and Genomics" by Sandy B. Primrose and Richard Twyman

2. "Molecular Cloning: A Laboratory Manual" by Joseph Sambrook, David W. Russell, and Michael R. Green

3. "Genetic Engineering: Principles and Methods" edited by Jane K. Setlow

4. "Gene Cloning and DNA Analysis: An Introduction" by T.A. Brown

5. "Plant Genetic Engineering: Methods and Protocols" edited by Jodie S. Holt and Assunta Bais

6. "Gene Therapy: Principles and Applications" edited by Mauro Giacca

7. "Genetic Engineering: Principles, Procedures, and Consequences" by Animesh Datta and Geetha K. Jayaraman

8. "Genetic Engineering: Principles and Methods" edited by J. K. Setlow and A. Hollaender

9. "Recombinant DNA: Genes and Genomes - A Short Course" by James D. Watson, Richard M. Myers, and Amy A. Caudy

10. "Principles of Gene Manipulation" by R.W. Old and S.B. Primrose

If I know where to look, I can get them for free, but I can also buy them; they're not hidden. Or I can buy them and then download their PDF versions from Z-Library so as not to waste time scanning and OCRing them. The point is that they are on the web in one form or another, and if I can find them in a minute, a specialized crawler can do it in much less.
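To give a flavor of what such a crawler's relevance filter could look like (the keywords, threshold, and seed URL are my own illustrative choices; a real crawler also needs politeness delays, deduplication, and so on):

```python
# A toy "specialized crawler" filter: fetch candidate pages and keep only
# those that look like genetics texts. Keywords and seeds are illustrative.
import requests
from bs4 import BeautifulSoup

KEYWORDS = {"gene", "genome", "allele", "plasmid", "dna", "cloning"}

def looks_like_genetics(text, threshold=3):
    # Crude relevance test: count distinct domain keywords on the page.
    return len(KEYWORDS & set(text.lower().split())) >= threshold

def fetch(url):
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

seeds = ["https://en.wikipedia.org/wiki/Genetic_engineering"]
for url in seeds:
    text = fetch(url)
    if looks_like_genetics(text):
        print(f"keep: {url} ({len(text)} chars)")
```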

Next, I train my LLM on these 10 books, including citations and bibliographies. This may be the most difficult task, given the way LLM training works today. But because it's the most difficult and expensive step, I have no doubt that it will soon be greatly simplified.
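To make the step concrete, here is a minimal sketch of what "training on these books" could look like today with Hugging Face transformers, assuming the books have been converted to plain text. The base model, paths, and hyperparameters are placeholders, and strictly speaking this is fine-tuning an existing model, not building one from scratch:

```python
# Minimal domain fine-tuning sketch with Hugging Face transformers.
# Model name, file paths, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"  # small stand-in; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Assumes the ten books were converted to plain text, one file per book.
dataset = load_dataset("text", data_files={"train": "genetics_books/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="genetics-lm",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("genetics-lm")  # reused in the next sketch below
```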

Once trained, the first thing I ask it to do is go out on the web and search for all the authors and important topics it identified in the first training set, then ingest all texts related to them. That way, I'll be close to a complete training set in genetics, built on a solid foundation. Since this is a closed field with a specific, precise language, I'm pretty sure it won't take much training to get the right word embeddings.
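Sketched out, that expansion step might look like this, reusing the fine-tuned model from above; extracting authors and topics by free-form prompting is a simplification on my part:

```python
# "Expand the corpus": prompt the fine-tuned model for authors and topics,
# then turn its output into queries for the crawler sketched earlier.
from transformers import pipeline

gen = pipeline("text-generation", model="genetics-lm")
prompt = "Important authors and topics in genetic engineering include:"
text = gen(prompt, max_new_tokens=80, do_sample=False)[0]["generated_text"]

# Naive parsing: treat comma-separated phrases after the prompt as queries.
for query in text[len(prompt):].split(","):
    query = query.strip()
    if query:
        print(f"search the web for: {query!r}")
```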

What's more, in the case of such a closed, well-defined training set, you can't get bullshit out of it. (The reason can't be fully explained in comments, but the short version is that when you build word embeddings (probabilities of "distances" between words) on a closed, well-defined training set, they have small deviations, because in scientific texts concepts are always grouped in a specific way. It's clearer in the long essay.)
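As a toy illustration of what I mean by "distances between words" (static word2vec vectors on an invented mini-corpus, not the contextual embeddings of a real LLM):

```python
# Toy word-embedding demo: in a narrow, consistent corpus, domain terms
# end up close together. The corpus is invented and far too small to prove
# anything; it only illustrates the notion of word "distances".
from gensim.models import Word2Vec

corpus = [
    "the gene encodes a protein".split(),
    "the allele is a variant of a gene".split(),
    "a plasmid vector carries the cloned gene".split(),
    "restriction enzymes cut dna at specific sites".split(),
    "the promoter controls transcription of the gene".split(),
] * 50  # repeat the tiny corpus so the toy model has enough examples

model = Word2Vec(corpus, vector_size=32, window=3, min_count=1,
                 epochs=50, seed=1)
print(model.wv.most_similar("gene", topn=3))
print(model.wv.similarity("gene", "allele"))
```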

The bullshit comes out of an LLM when the training set or the training method produces high statistical deviations in the word embeddings. Together with the randomness intentionally built into the algorithms (via something called "temperature"), this can produce false or absurd statements, simply because the options for the most likely next word are too many and too different, sending the machine down a quasi-random path as it adds word after word. We need to keep in mind how answers are generated, word by word, to understand why high deviations in the embedding values make bullshit output more likely: once a wrong word is chosen, the following ones continue down the wrong path.
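A self-contained toy shows the temperature effect; the vocabulary and scores are invented:

```python
# Toy temperature sampling over next-word probabilities. Dividing the
# logits by the temperature sharpens (T < 1) or flattens (T > 1) the
# distribution, making off-topic words more likely at high temperature.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "protein", "allele", "banana", "spaceship"]
logits = np.array([3.0, 2.5, 2.8, 0.5, 0.1])  # invented model scores

def sample(temperature):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)], p

for t in (0.2, 1.0, 2.0):
    word, p = sample(t)
    print(f"T={t}: p={np.round(p, 2)} -> picked {word!r}")
```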

Hopefully this makes it clearer why an LLM will enable people to learn almost anything.

Florin Cojocariu:

:) I never said LLMs are a savior. Quite the contrary, in the essay. However, I get your point. Let me try to answer: regarding the source, to get most genetics texts today, try something like Z-Library. You may need an Onion browser or to set up a Telegram bot. You'll be shocked. And this is *not* the deep web. But you have a point; it's not as easy as I said. The thing is, it can be.

Regarding the bullshit served by LLMs: what you put in is what you get out. If the training set is rigorous (say, all fiscal law), you can't get bullshit from it.

My point is not that LLMs are saviors, but that they are empowering the individual at an unprecedented scale. It's a long discussion; it would be easier to read the main text: https://antimaterie.substack.com/p/the-all-knowing-father
