How to fight censorship using AI [Breakdowns]

A great demonstration of how we can use non-neural network architectures to great effect.

Jun 14, 2023

Hey, it’s Devansh 👋👋

In my series Breakdowns, I go through complicated literature on Machine Learning to extract the most valuable insights. Expect concise, jargon-free, but still useful analysis aimed at helping you understand the intricacies of Cutting Edge AI Research and the applications of Deep Learning at the highest level.

If you’d like to support my writing, please consider buying and rating my 1 Dollar Ebook on Amazon or becoming a premium subscriber to my sister publication Tech Made Simple using the button below.

Help me buy chocolate milk

p.s. you can learn more about the paid plan here.

We’ve all heard the adage “Knowledge is Power”. Providing people with access to knowledge enables them to come up with great ideas and stress-test their theories against other people. This is one of the reasons that the pace of innovation in Open Source Software is so high: we have millions of people working on improving various projects independently. The insane developments in LLMs after Llama are due to this. Even this newsletter would not be possible without free access to information. To publish through traditional means, I would need someone to fund the printing, go through closed journals, send out copies to those who can afford it, etc. Reaching the growth we have hit would be close to impossible without significant investment. The internet, and the free access to information it provides, enables humanity to leapfrog innovation and create next-gen technology.

To support this evolution, the report urges the international community to make global trade rules more supportive of emerging green industries in developing economies and reform intellectual property rights to facilitate technology transfer to these countries.

- The UN's Technology and Innovation Report talks extensively about how knowledge transfer can drive innovation

Unfortunately, as technology grows, so does the potential for misuse. Machine Learning can be used to automate many complex tasks in ways that would’ve been impossible just 15 years ago. This includes censorship and mass surveillance at a scale previously unimaginable. AI Surveillance has the potential to deepen social inequality, which is something to be wary of. Ironically enough, AI and Machine Learning are also among the best tools for dealing with such problems.

I came across one such project in the video: AI against Censorship: Genetic Algorithms, The Geneva Project, ML in Security, and more! This is a fantastic interview with the leader of the project. I loved two things about it: the fact that this research did something useful with AI, and their use of genetic algorithms. As someone who has talked about Evolutionary Algorithms being underutilized for a while, this had me interested. In this article, I will share some interesting Machine Learning takeaways, along with some ideas I had while looking into the project. This article is meant to introduce you to this crucial project and bring it to your attention. Make sure you look into the project and the team behind it, as work like this is instrumental for the future.

About Geneva

Looking at the website, “Geneva is a novel experimental genetic algorithm that evades censorship by manipulating the packet stream on one end of the connection to confuse the censor.” It has 2 components-

Strategy Engine: The strategy engine is responsible for running a given censorship evasion strategy over active network traffic.
Genetic Algorithm: The genetic algorithm is the learning component that evolves new strategies (using the engine) against a given censor.

The tool is open source, and their Github can be found here. Anyone can set up and run Geneva from their machines. Running it starts the genetic algorithm, which tests different strategies and keeps refining them. If a strategy gets past the censor, the team studies the result in order to extract information about the underlying censorship system.

Given the nature of the project, there is a lot of networking-related terminology in their documentation. I know nothing about the field, so I won’t pretend to be able to break those aspects down. If anybody more experienced in these areas wants to get on and talk about those factors, I’d be more than happy to open up my platform to you. But I can explain the interesting AI aspects.

The Range

In the video I shared about Differential Evolution, I talked about how Evolution Based Algorithms have a larger possible search space than traditional gradient-based methods. This advantage really shows itself in problems like this. An evolutionary algorithm (like the Genetic Algorithms mentioned here) can traverse through a much more varied search space.

The fact that GAs don’t need a smooth search space is a big deal in this case

Take the search space for example. Gradient-based methods need a smooth, continuous search space. Genetic Algorithms can operate even when the space is discrete or discontinuous. This is why they can tweak solutions and chain components of their search space to create new candidate solutions.

Google found great success using Evolutionary Algorithms for their Neural Architecture Search work. Check out my content for more information on that.

The leader actually touched upon this during the interview. He mentioned how there is no gradient for ML methods to compute over. This is true for both the search space (we build strategies by chaining commands) and the output (a strategy either passes through the censorship or it doesn’t). In fact, he even mentioned that since Genetic Algorithms can test everything, they actually had to constrain the algorithm to some basic commands.
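To make the "no gradient" point concrete, here is a minimal sketch of how a genetic algorithm can move around a space made of chained commands. The flat list-of-strings encoding, the function names, and the mutation and crossover rules are my own simplifications for illustration, not Geneva's actual implementation.

```python
import random

# The four primitive names come from the Geneva readme; representing a
# strategy as a flat list of them is an assumption made for this sketch.
PRIMITIVES = ["duplicate", "drop", "tamper", "fragment"]

def random_strategy(rng, max_len=4):
    """Build a candidate by chaining randomly chosen primitives."""
    return [rng.choice(PRIMITIVES) for _ in range(rng.randint(1, max_len))]

def mutate(strategy, rng):
    """Swap one command for another: a discrete jump, no gradient needed."""
    child = list(strategy)
    child[rng.randrange(len(child))] = rng.choice(PRIMITIVES)
    return child

def crossover(a, b, rng):
    """Splice the head of one parent onto the tail of the other."""
    cut_a = rng.randint(0, len(a))
    cut_b = rng.randint(0, len(b))
    child = a[:cut_a] + b[cut_b:]
    return child or [rng.choice(PRIMITIVES)]  # never return an empty chain

rng = random.Random(0)
parent_a, parent_b = random_strategy(rng), random_strategy(rng)
child = mutate(crossover(parent_a, parent_b, rng), rng)
print(parent_a, parent_b, child)
```

Nothing here requires the fitness landscape to be differentiable. The algorithm only needs to be able to build, tweak, and recombine candidates, and then ask whether each one worked.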

Search Space Commands

Evolution Based methods always have a basic set of valid commands that they can test out. They use these commands to create candidate solutions and to tweak existing solutions during the recombination/mutation phases. The commands are domain-specific. For this project, we have the following building blocks.

The team is able to use these 4 simple blocks to overcome extensive censors.

It’s important to understand how they came up with these as the basis. The team used a straightforward definition of censoring: censorship is merely the modification of network traffic. Thus, the strategy is simply a “description of how network traffic should be modified”. From that lens, it’s clear that the blocks should consist of the possible ways that network packets might be modified.

The goal of a censorship evasion strategy is to modify the network traffic in a such a way that the censor is unable to censor it, but the client/server communication is unimpacted. — From the authors

The GitHub readme provides a pretty concise description of each of the building blocks.

duplicate: takes one packet and returns two copies of the packet
drop: takes one packet and returns no packets (drops the packet)
tamper: takes one packet and returns the modified packet
fragment: takes one packet and returns two fragments or two segments
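As a rough illustration of the four descriptions above, here is a toy sketch that treats a packet as a plain Python dict. The dict layout, the function signatures, and the choice to always return a list of packets are my assumptions for this sketch. Geneva itself works on real network packets and does not expose these functions.

```python
# Toy packet: {"flags": "PA", "payload": b"GET /"}. Every primitive returns a
# list of packets so that the outputs can be handled uniformly.

def duplicate(packet):
    """One packet in, two copies out."""
    return [dict(packet), dict(packet)]

def drop(packet):
    """One packet in, nothing out."""
    return []

def tamper(packet, field, value):
    """One packet in, one modified copy out (the original is left untouched)."""
    modified = dict(packet)
    modified[field] = value
    return [modified]

def fragment(packet, offset):
    """One packet in, two pieces out, split at a payload offset."""
    payload = packet["payload"]
    first = dict(packet, payload=payload[:offset])
    second = dict(packet, payload=payload[offset:])
    return [first, second]

pkt = {"flags": "PA", "payload": b"GET /index.html"}
print(len(duplicate(pkt)), len(drop(pkt)), tamper(pkt, "flags", "R"), fragment(pkt, 3))
```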

It is important to note that because duplicate and fragment “introduce branching, these actions are composed into a binary-tree structure called an action tree. The trigger describes which packets the tree should run on, and the tree describes what should happen to each of those packets when the trigger fires.” An example tree can be seen here
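The trigger-plus-tree idea can be sketched in a few lines. This is a toy model under my own assumptions: the `Node` and `ActionTree` classes, the dict-based packets, and the example trigger are all hypothetical and only mirror the description quoted above, not Geneva's code or strategy syntax.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    # An action maps one packet to a list of 0, 1, or 2 packets.
    action: Callable[[dict], List[dict]]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def run(self, packet):
        outputs = self.action(packet)
        result = []
        # The first output flows to the left child, the second to the right.
        # A missing child means "send this packet as it is".
        for out, child in zip(outputs, [self.left, self.right]):
            result.extend(child.run(out) if child else [out])
        return result

@dataclass
class ActionTree:
    trigger: Callable[[dict], bool]  # which packets the tree should run on
    root: Node

    def apply(self, packet):
        return self.root.run(packet) if self.trigger(packet) else [packet]

def duplicate(packet):
    return [dict(packet), dict(packet)]

def set_flags_r(packet):
    return [dict(packet, flags="R")]

# Duplicate matching packets; tamper with the first copy, send the second as is.
tree = ActionTree(
    trigger=lambda p: p["flags"] == "PA",
    root=Node(duplicate, left=Node(set_flags_r)),
)
print(tree.apply({"flags": "PA", "payload": b"GET /"}))
print(tree.apply({"flags": "S", "payload": b""}))
```

Packets that do not match the trigger pass through untouched, which is what keeps the rest of the connection unaffected.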

The simple strategy shared earlier was also an example. I’m sure my followers (who are all very intelligent) will come up with more samples when they look into this.

The action trees are combined together to create full-fledged evasion strategies.

Fitness Function

Fitness functions are crucial to all Genetic Algorithms. They determine what the algorithm will consider good and bad, and will ultimately direct what kinds of strategies are evolved. Just like the building blocks, designing this is not trivial. The authors of this project had a relatively simple function (which is great because it allows for many enhancements and modifications). In the documentation, they share the following priorities

As mentioned earlier, the possible search space for solutions is enormous. This means that cost-effective performance is crucial, especially because the system runs on local machines (not large servers like most ML these days) and is up against the computing power of a state. This is why the authors put such a premium on weeding out potential deadweight solutions early

This hierarchy accomplishes a significant search space reduction. Instead of Geneva fuzzing the entire space of possible strategies (for which there are many!), it instead quickly eliminates strategies that break the underlying connection and encourages the genetic algorithm to concentrate effort on only those strategies that keep the underlying connection alive.
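A tiered fitness function of this kind could look something like the sketch below. The ordering follows the description above (strategies that break the connection score worse than strategies that merely get censored), but the `TrialResult` fields, the numeric scores, and the conciseness bonus are all assumptions I made up for illustration, not the values Geneva uses.

```python
from dataclasses import dataclass

@dataclass
class TrialResult:
    ran_ok: bool            # the strategy executed without errors
    connection_alive: bool  # client and server could still communicate
    evaded_censor: bool     # the forbidden request got through
    num_actions: int        # size of the strategy

def fitness(result):
    """Score a trial; the tiers matter, the exact numbers are placeholders."""
    if not result.ran_ok:
        return -1000  # invalid strategy: discard as early as possible
    if not result.connection_alive:
        return -400   # broke the underlying connection
    if not result.evaded_censor:
        return -100   # connection survived, but the censor still won
    return 400 - result.num_actions  # success, with a mild bonus for brevity

trials = [
    TrialResult(True, False, False, 3),
    TrialResult(True, True, False, 3),
    TrialResult(True, True, True, 5),
    TrialResult(True, True, True, 2),
]
print(sorted(trials, key=fitness, reverse=True)[0])
```

Because the tiers are far apart, a single cheap trial is enough to push a connection-breaking strategy to the bottom of the population.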

The fitness function creates learners with an emphasis on minimization of regret (regret here being the useless solutions). This works out for them because the search space is naturally large so we can hope that we are hitting the best viable performance. I would be interested in trying out runs with a “reservation” for a few bad solutions. Sometimes they can introduce strong learners when mixed with strong candidates after many generations. Since the building blocks are simple, it might not be super effective in this case, but I would love to see it.
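Here is a minimal sketch of what I mean by a "reservation". The function name, its parameters, and the idea of filling the reserved slots uniformly at random are my own assumptions. This is an experiment I would like to try, not something Geneva does.

```python
import random

def select_with_reservation(population, fitness, n_keep, n_reserved, rng):
    """Keep the best candidates, but reserve a few slots for weaker ones."""
    if n_reserved > n_keep:
        raise ValueError("n_reserved cannot exceed n_keep")
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = n_keep - n_reserved
    elites = ranked[:n_elite]
    # Fill the reserved slots at random from everyone who missed the cut.
    rest = ranked[n_elite:]
    reserved = rng.sample(rest, min(n_reserved, len(rest)))
    return elites + reserved

rng = random.Random(0)
population = list(range(10))  # stand-in candidates
survivors = select_with_reservation(
    population, fitness=lambda x: x, n_keep=5, n_reserved=2, rng=rng
)
print(survivors)
```

In the usage example the candidates are just the numbers 0 to 9 and fitness is the number itself, so the first three survivors are always the top three and the last two are random picks from the rest.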

Closing

This is an interesting example of adversarial design, where the learner is learning by trying to break the censors. During the interview, Kevin (the project lead) talked about how they were trying to design a few censors themselves. It would be interesting to explore an automated censor that evolves with this evader. Not only is this likely to happen in the future, but such a solution would accelerate the process. This kind of adversarial setup has seen great success elsewhere in ML, with GANs being the best-known example.

Combining this and a setup where we can link the results of all the runs back to the main system (like we do for self-driving cars) would be a great way to supercharge the evasion agent. It would also leverage scale very well. Make sure you look into the project and share your thoughts.

In case any of you are inspired to try out Evolutionary Algorithms for getting around Network Censorship, here is a quick cheat sheet of the main components that you can use-

**Learning Nation-State Censorship with Genetic Algorithms**

That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.