Interesting Content in AI, Software, Business, and Tech- 07/10/2024 [Updates]
Content to help you keep up with Machine Learning, Deep Learning, Data Science, Software Engineering, Finance, Business, and more
Hey, it’s Devansh 👋👋
In issues of Updates, I will share interesting content I came across. While the focus will be on AI and Tech, the ideas might range from business, philosophy, ethics, and much more. The goal is to share interesting content with y’all so that you can get a peek behind the scenes into my research process.
I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction. Help me democratize the most important ideas in AI Research and Engineering to over 100K readers weekly.
PS- We follow a “pay what you can” model, which allows you to support within your means. Check out this post for more details and to find a plan that works for you.
A lot of people reach out to me for reading recommendations. I figured I’d start sharing whatever AI Papers/Publications, interesting books, videos, etc. I came across each week. Some will be technical, others not really. I will add whatever content I found really informative (and that I remembered throughout the week). These won’t always be the most recent publications- just the ones I’m paying attention to this week. Without further ado, here are interesting readings/viewings for 07/10/2024. If you missed last week’s readings, you can find them here.
Reminder- We started an AI Made Simple Subreddit. Come join us over here- https://www.reddit.com/r/AIMadeSimple/. If you’d like to stay on top of community events and updates, join the discord for our cult here: https://discord.com/invite/EgrVtXSjYf. Lastly, if you’d like to get involved in our many fun discussions, you should join the Substack Group Chat Over here:
Community Spotlight: Hai Huang
Hai Huang is a Senior Staff Engineer at Google, working on their AI for productivity projects. He shares insights and commentary on recent AI papers. I find his posts to be very informative because Hai doesn’t shy away from talking about the Math/Technical Details, which is a rarity on LinkedIn. He’s become one of my go-to sources for keeping up with technical developments, and if you want to follow creators with more technical substance- I’d recommend doing the same.
If you’re doing interesting work and would like to be featured in the spotlight section, just drop your introduction in the comments/by reaching out to me. There are no rules- you could talk about a paper you’ve written, an interesting project you’ve worked on, some personal challenge you’re working on, ask me to promote your company/product, or anything else you consider important. The goal is to get to know you better, and possibly connect you with interesting people in our chocolate milk cult. No costs/obligations are attached.
Previews
Curious about what articles I’m working on? Here are the previews for the next planned articles-
Still TBD. We’re going through some planned changes, so I’m still thinking about what to do.
Coming tomorrow-
Highly Recommended
These are pieces that I feel are particularly well done. If you don’t have much time, make sure you at least catch these works.
TechBio Innovation: Latest Trends
I say this every time Marina T Alamanou, PhD publishes, but her work is amongst the best for the TechBio space. Her updates are insightful and technically comprehensive, and she mixes market insights with the underlying biology. If you want to keep your finger on the pulse of the tech-bio space, she’s an elite resource. Sharing her analysis of one of the players in the space to give you an idea-
SilicoGenesis (2019, Belgium) is developing an AI-enabled in silico antibody engineering platform, providing state-of-the-art AI/ML technology integrated into a scalable cloud-based platform for: Precise 3D modeling of protein structures, Accurate prediction of paratope and epitope residues, Characterization and modeling of protein-protein interactions, Enhancement of binding affinity and mutagenesis studies, Cross-species reactivity analysis, Expert humanization of antibody candidates and Rigorous assessment of developability liabilities. Their cloud-based software platform for biologics design, discovery and optimization is called Eve.
They also offer Protkit, an Open Source Python library that can be used for a variety of tasks in computational biology and bioinformatics, focusing on structural bioinformatics, protein engineering and ML. Protkit allows you to download protein structures and sequences from a variety of trusted sources, such as RCSB PDB, Uniprot and SAbDab.
In particular, Protkit can be used for a variety of tasks in computational biology, such as reading and writing from popular data file formats, downloading data from popular protein databases, data structures for representing proteins and molecules, detecting and fixing anomalies in structures, calculating properties of proteins, running various external tools, featurization for ML, etc.
The company was founded by Lionel Bisschoff and Fred Senekal in 2021 in Johannesburg, South Africa, and Ian Wilkinson is the scientific advisor. Ian is a protein biochemist with an interest in protein and antibody engineering. He previously served as CSO of Absolute Antibody (antibody sequencing, engineering and recombinant manufacturing), co-founded mAbsolve (an Fc silencing technology), and is the founder of mAbvice consulting services.
Shortly after, SilicoGenesis’ headquarters were established in Leuven, Belgium. In 2022, their first academic collaboration was established on a project involving CAR-T cell therapy, utilizing the Eve platform. In the first quarter of 2023, they began collaborations with partners from around the globe, including one of the largest pharmas based in Europe, on an affinity maturation project for a difficult cytokine target:
⏹️ Given the sequence of the antibody candidate, they predicted its structure and PPI interaction with the target antigen. They successfully modeled the three-dimensional structure of the complex and used this highly accurate model to perform in silico saturation mutagenesis. They were able to correctly identify all of the top affinity-enhancing mutations in the CDR regions confirmed by in vitro saturation mutagenesis using ELISA and SPR.
⏹️ They performed rapid in silico affinity maturation of a lead candidate in less than a month.
Recently, SilicoGenesis began a collaboration with the Laboratory for Thrombosis Research (KU Leuven) and PharmAbs (KU Leuven Antibody Centre). The project involves in silico antibody engineering and optimization as well as in vitro and in vivo testing and validation. The goal of the project is to determine if the engineered lead molecule has improved developability characteristics and preserves binding affinity and species cross-reactivity.
How Anthropic does Tokenization
A very interesting post by Max Buckley on some of his experiments with Claude’s tokenizer (I’ve added a small digit-grouping sketch after his notes)-
Here is what I learned:
The tokenizer for Claude 3 and beyond handles numbers quite differently to its competitors
Numbers preceded by a non-word character (space, quotation marks, etc.) are represented by two tokens (not one token as elsewhere).
Similar to GPT-4/Llama 3, Claude 3 groups up to 3 digits in a single token (plus the mystery standalone number preceding token at the beginning of sequences of numbers).
The Claude 3 tokenizer groups numbers right to left (R2L), rather than left to right (L2R) like its competitors. I.e. in Claude 3 the number “1000” gets tokenized as [<mystery number token>, “1”, “000”] for three tokens, while in Llama 3 or GPT-4 this would be [“100”, “0”] for two tokens. There is a recent research paper https://lnkd.in/e-AfR-6c which suggests that this R2L number handling is significantly more performant.
If a number is preceded by a word character, the mystery number token is not used. I.e. “17” is two tokens, as is “7”, and “71” is three tokens.
All of this can be verified by using the Claude API, which returns the token counts, or the Anthropic workbench https://lnkd.in/eU9BQytH. Both of these allow you to limit the number of return tokens, and the model can be asked to repeat a message back to you. The approach was inspired by the work of Javier Rando.
The other model tokenizers are public (GPT4, Llama 3, etc), and can be explored on tiktokenizer https://lnkd.in/eKEfb6ja
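To make the R2L vs. L2R distinction above concrete, here is a minimal, purely illustrative sketch of the digit-grouping behavior described in Max’s notes. This is not Anthropic’s or OpenAI’s actual tokenizer code (those are far more involved); it only shows how splitting a run of digits into chunks of up to three from the right versus from the left changes the resulting pieces.

```python
# Illustrative sketch only -- NOT a real tokenizer. Shows how grouping digits
# into chunks of up to three, right-to-left vs. left-to-right, changes the pieces.

def group_digits_l2r(number: str, size: int = 3) -> list[str]:
    """Left-to-right grouping (GPT-4/Llama 3 style): '1000' -> ['100', '0']."""
    return [number[i:i + size] for i in range(0, len(number), size)]

def group_digits_r2l(number: str, size: int = 3) -> list[str]:
    """Right-to-left grouping (Claude 3 style): '1000' -> ['1', '000']."""
    chunks = []
    i = len(number)
    while i > 0:
        chunks.append(number[max(0, i - size):i])
        i -= size
    return list(reversed(chunks))

if __name__ == "__main__":
    for n in ["7", "17", "1000", "1234567"]:
        print(n, "L2R:", group_digits_l2r(n), "R2L:", group_digits_r2l(n))
```

Running it on “1000” reproduces the split from the example above ([“1”, “000”] vs. [“100”, “0”]), which is the whole point of the R2L scheme: the rightmost group always lines up with the ones/tens/hundreds places.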
New Chunking Method for RAG-Systems
Anytime Jean David Ruvini shares something related to NLP, I listen. He’s one of the leading experts in NLP globally, and I’ll always be grateful that he takes the time to share his insights online. Sharing this one since a lot of you are working on RAG-related problems.
Dividing large documents into smaller parts is an essential step and a critical factor that influences the performance of Retrieval Augmented Generation (RAG) systems. Frameworks for developing RAG systems usually offer several chunking options to choose from. In this article, I want to introduce a new option that uses sentence embeddings to recognize changes of topic and subdivides the documents at those points. This lays the foundation for the embedding step of a RAG system to produce vectors for text parts that each encode a single topic rather than a mixture of several. We presented this method in a paper in the context of topic modeling, but it is also suitable for use in RAG systems.
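To give a flavor of the general idea, here is my own minimal sketch (not necessarily the exact method from the paper): embed consecutive sentences and start a new chunk whenever the similarity between neighbors drops, treating the drop as a likely topic change. The model name and the threshold below are illustrative assumptions.

```python
# Minimal sketch of embedding-based "topic change" chunking for RAG.
# Illustrative only -- not the exact algorithm from the paper.
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_by_topic(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Split sentences into chunks wherever adjacent-sentence cosine
    similarity drops below `threshold` (an illustrative cutoff)."""
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here
    emb = model.encode(sentences, normalize_embeddings=True)  # unit-length vectors
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(emb[i - 1], emb[i]))  # cosine similarity
        if similarity < threshold:  # likely topic change -> close the current chunk
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```

Each resulting chunk can then be embedded as a single unit, so its vector represents one topic instead of a blend of several, which is exactly the property the article argues for.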
Why It’s So Hard To Secure AI Chips
Super interesting share by LaSalle Browne on some of the challenges of securing AI hardware.
Demand for high-performance chips designed specifically for AI applications is spiking, driven by massive interest in generative AI at the edge and in the data center, but the rapid growth in this sector also is raising concerns about the security of these devices and the data they process.
Generative AI — whether it’s OpenAI’s ChatGPT, Anthropic’s Claude, or xAI’s Grok — sifts through a mountain of data, some of which is private, and runs that data on highly valuable IP that companies in the field have a vital interest in keeping secret. That extends from the hardware, which can be corrupted or manipulated, to the huge volumes of data needed to train the models, which can be “poisoned” to achieve unexpected results. And while all of this may be an inconvenience in a smart phone, it can have devastating consequences in safety-critical applications such as automotive and industrial, or in mission-critical applications such as financial transactions.
Beyond the AI Apocalypse: Rethinking How We Forecast Existential Risks
Filippo Marino writes very interesting pieces on risk judgments, AI, and how various people use sensationalist claims to manipulate our psychology. This one, which breaks down various logical/sampling errors that famous claims of AI doom make, is no exception. Copying one of his many great takes here-
I submit to you that we have an Invisible Gorilla in our AI risk assessment, and it’s this: We’re assuming that AI researchers are the best qualified to forecast the existential risks posed by AI.
Would we ask nuclear physicists to predict the geopolitical consequences of nuclear proliferation? Would we rely on geneticists to forecast the societal impacts of CRISPR technology? Probably not. We’d recognize that while their technical expertise is crucial, understanding the broader implications requires insights from political scientists, sociologists, economists, and a host of other disciplines.
Yet when it comes to AI, we’re doing exactly that. We’re asking the people who are heads-down in the code to lift their gaze and predict the fate of humanity. It’s like asking a petroleum engineer specializing in hydraulic fracturing to estimate the technology’s impact on global climate or on the frequency and severity of catastrophic weather events in the next 100 years. They might have some interesting insights, but are they really the best equipped for that task?
A Robust Algorithm for Forecasting the S&P 500 Index
AI in Financial Markets is something I’ve always been interested in, but don’t know too much about. I found this piece interesting, especially the approach of using multiple kinds of signals that build maps to reduce the overfitting that comes from relying on one kind of data. I can’t comment on the efficacy, but from a pure AI/Math perspective- there are lots of great things here worth exploring.
Recent advancements in Artificial Intelligence (AI) and machine learning have demonstrated that equities markets can be timed with sufficient accuracy to significantly improve risk-adjusted returns relative to the traditional “buy and hold” strategy. In particular, the same family of algorithms that understand and recognize our spoken words on smartphones can be used to recognize profitable market opportunities. ITRAC software has been used by large international banks and financial institutions to guide trading. This paper presents and discusses the main concepts, approaches and algorithms used by ITRAC to provide short term forecasts for the S&P 500 index.
A powerful algorithm developed by IntelliTrade, Inc. is used to forecast noisy time-series data, with applications in finance and investing. This computer algorithm is based on discrete statistical mappings, eliminating the subjectivity of most other technical analysis procedures. The algorithm is dynamic, robust, and adaptive. It represents a paradigm shift in that it makes no assumptions of underlying statistics or distributions.
Searching for Best Practices in Retrieval-Augmented Generation
Another RAG related piece that might help a lot.
Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a “retrieval as generation” strategy.
While the title is click-baity, Modern MBA does a very good job of exploring how many companies utilize AI to manufacture hype, jack up their valuations, and engage in pump and dumps. He does a good job tracing this tendency back from big data onwards, which can be helpful for understanding the language orgs use to confuse investors and regular folk. AI has many useful use cases, but it’s important to not allow yourself to get manipulated by people trying to piggyback off successful projects to sell their hype.
Tech is a sector unlike any other — it’s an industry where individuals can turn into billionaires overnight, ideas supersede fundamentals, and leaders are rewarded for showmanship. In today’s Silicon Valley, innovation is crowned and not earned. Venture capitalists and founders are symbiotic. Unprofitable companies are kept alive with injections of capital, gamed valuations, and manufactured hype with the goal of surviving long enough to IPO. Starting in the early 2010s, Silicon Valley had championed big data as a revolutionary technology that could unearth deep insights, hidden patterns, and innovation from massive amounts of data. Yet the market started to question in the early 2020s if any of these promises had even been real as nearly all consumer and SaaS startups were still bleeding nearly a decade later. Out of nowhere, ChatGPT was released and AI became Silicon Valley’s next big thing. Every tech company is now an “AI company”, every Fortune 500 needs an “AI strategy”, VCs are only investing in AI startups, and every product is an “AI” product. This is a deep dive into how artificial intelligence is just the latest tale spun by Silicon Valley to sweep prior failed trends under the rug, keep valuations high, and outlook positive. Before AI, there was crypto, web3, blockchain, virtual reality, big data, IoT, and wearables — all supposedly revolutionary technologies that have never lived up to the hype. In this episode, we’ll dive into the market dynamics that push companies and individuals to jump headfirst into tech trends, how this all started with big data, and why AI is ultimately just another pump-and-dump.
I didn’t know there was something called ‘foodtech’, so shoutout to Rubén Domínguez Ibar for helping me learn something new.
This post is crafted for the weekly edition of The VC Corner newsletter, aiming to give the audience an insightful glimpse into the vibrant world of foodtech. I hope you enjoy it! 😀 …
Ironically, if we analyze history’s top 10 life-saving discoveries, innovations in food and agriculture stand out as one of the most critical areas, and have been indispensable for ensuring global health and longevity.
From breakthroughs in nutrition and food safety to sustainable farming practices, the impact of food-related sciences is immense, proving that food is not just a basic necessity but a pivotal element in saving lives and improving the quality of life worldwide.
But with big problems come big opportunities. Embracing new advancements can lead to healthier diets, reduced environmental footprints, and a more equitable distribution of food resources.
As we stand at the intersection of innovation and necessity, it’s time to recognize the pivotal role of agri-foodtech in shaping the future of our food system and embrace it with open arms!
The Method Google Used to Reduce LLM Size by 66%
Our boy Logan Thorneloe has been stepping up his newsletter with more detailed explorations, and this one was amongst my favs (I’ve added a quick sketch of the standard distillation loss after his tl;dr).
Tl;dr:
Knowledge distillation is a model training method that trains a smaller model to mimic the outputs of a larger model. This has the potential to train up to 70% smaller models while only losing 3–10% performance compared to their larger counterparts.
Google’s Gemma 2 shows that distilled models can perform better than models of the same architecture (and size) trained from scratch.
Google used knowledge distillation to reduce the size of their open LLM Gemma 2 from 27B parameters to 9B parameters while retaining 96% user satisfaction.
Knowledge distillation is an excellent example of how machine learning can make software development easier. It showcases the ability to code one model and train it for different tasks by only adjusting data.
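For the curious, here is what the objective typically looks like in code. This is a generic sketch of the classic soft-target recipe (temperature-scaled KL divergence mixed with ordinary cross-entropy), not Google’s exact Gemma 2 training setup; the temperature and mixing weight are illustrative.

```python
# Generic knowledge-distillation loss (standard soft-target recipe; a sketch,
# not Google's exact Gemma 2 setup). The student is trained to match the
# teacher's temperature-softened output distribution, plus the usual
# cross-entropy on ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,   # illustrative value
                      alpha: float = 0.5) -> torch.Tensor:  # illustrative mix
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: ordinary cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

The teacher’s softened distribution carries far more signal per example than a one-hot label, which is a big part of why a much smaller student can recover most of the teacher’s quality.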
7-Eleven Is Reinventing Its $17B Food Business to Be More Japanese | WSJ The Economics Of
In Japan, 7-Eleven has long led its American counterpart in prepared meals with foods like ramen and rice balls. But now, the world’s largest convenience store chain is trying to bring a similar range of food items to its U.S. stores, and market them to customers who associate the brand with Slurpees and pizza. American 7-Elevens are working on mimicking the Japanese distribution centers by upgrading their commissaries around the country.
Other Content
Human thought vs Human language: MIT psychologist explains | Edward Gibson and Lex Fridman
Given our recent Wittgenstein-esque exploration, figured this video would be interesting
Economist explains why India can never grow like China
Good exploration of the systemic differences that allow countries to leapfrog ahead of other economies.
How much would it cost to buy the ocean? — Astrid J. Hsu
Surveying his vast domain, Poseidon considers retirement. What if someone else donned the coral crown so he could spend his immortality harmonizing with whales and cozying up to hydrothermal vents? Poseidon decides he needs to prioritize himself for once. So, he summons his accountant and asks: how much could he sell the ocean for? Astrid J. Hsu conducts a financial analysis of our oceans’ worth.
How Pakistan Broke YouTube Globally for Two Hours
Interesting video on how so many of our tech-based systems are reliant on trust, and can be vulnerable to exploitation (although in this case, it wasn’t deliberate). Wonder what can be done for better robustness?
Interesting insights from Nathan Lambert on his experiences using Claude. I’m a little sunny on it, but it’s still worth checking out-
Beyond the base metrics and throughput, Anthropic’s models consistently feel like the one with the strongest personality, and it happens to be a personality that I like. This type of style is likely due to focused and effective fine-tuning. The sort of thing where everyone on the team is in strong agreement with what the model should sound like.
Some ways you can see the style of Claude 3.5 include:
It is more assistant-like, asking “Should I do X” at the end of an answer to simple questions or requests.
Having a tone that is focused and particular with its words, as opposed to the sometimes unnecessary verbosity of ChatGPT’s recent models. I see this in some of the more flavored filler like “that makes sense” when correcting a typo in my question.
Being quicker to drop all placeholder text when asked clearly to solve a task and not contextualize in text.
If you liked this article and wish to share it, please refer to the following guidelines.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819