Neural Architecture Search (NAS) is being touted as one of Machine Learning’s big breakthroughs. It is a technique for automating the design of neural networks. As someone interested in automation and machine learning, it is something I’ve been following for a while. Recently a paper titled “Understanding the wiring evolution in differentiable neural architecture search” by Sirui Xie et al. caught my attention. It delves into the question of whether “neural architecture search methods discover wiring topology effectively”. The paper provides a framework for evaluating bias by proposing “a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization”. It shows that differentiable NAS is biased when designing networks, and expands upon the 3 common patterns that bias takes.
In this article, I will explain the types of biases, why they exist, and how they are detected. By understanding these techniques, you will be able to apply them to evaluate your own NAS pipelines (and other related techniques). Please be sure to leave your feedback on this article, and share it if you find it useful. NOTE: I write NAS throughout, but this paper and article are specific to differentiable NAS; experiments on other NAS approaches haven’t been done yet.
If you’d like a more thorough understanding of NAS, feel free to watch the video below.
A Tale of 3 Biases
The team did a thorough investigation of the 3 common patterns found in networks created through differentiable NAS. In the words of the team: “Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are more preferred than deeper ones; 3) no edges are selected in bi-level optimization”. Figure 1 is an illustration from the paper showing the first 2 in a concise manner.
The team provides possible reasons for each pattern, as well as validations for their theories. I will be explaining each of them in detail.
Pattern 1: Growing instead of Pruning
Those familiar with decision trees or neural network compression will recognize the term pruning. Pruning refers to removing redundant or useless edges in a tree (or a graph in general). It is very useful for optimizing algorithms and is used to simplify Decision Trees. Since Neural Networks are essentially directed, weighted graphs, pruning can be applied to reduce the cost of the network, while sometimes boosting results by removing low-quality edges that would otherwise add noise.
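To make the idea concrete, here is a minimal sketch of magnitude-based pruning (my own illustration, not from the paper): edges whose weights fall below a threshold are simply zeroed out, shrinking the network.

```python
import numpy as np

def prune_edges(weights: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Zero out (prune) edges whose absolute weight falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

# Toy weight matrix for a fully connected layer (4 inputs -> 3 outputs)
rng = np.random.default_rng(0)
layer_weights = rng.normal(scale=0.1, size=(4, 3))

pruned = prune_edges(layer_weights, threshold=0.05)
print(f"kept {np.count_nonzero(pruned)} of {layer_weights.size} edges after pruning")
```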
In the case of differentiable NAS frameworks, though, we see something else happen. Instead of the network starting full and snipping off the low-quality edges, the first step has the network drop all edges. It then proceeds to pick the ones that have the best scores. This might not be a problem by itself, but it leads to some sticky situations. Many of the details and nuances of the proof involve a lot of math that would require an entire series to break down. If you are interested, they are on pages 5–7. In my annotated version of the paper (linked at the end of the article), I have highlighted the important parts, which should help you follow the flow a bit better. Here I will attach the graphs whose trends clearly show a tendency to grow.
“Surprisingly, for all operations except None, cost is inclined towards positive at initialization (Fig.4(a)). Similarly, we estimate the cost mean statistics after updating weight parameters for 150 epochs with architecture parameters still fixed. As shown in Fig.4(b), most of the cost becomes negative. It then becomes apparent that None operations are preferred in the beginning as they minimize these costs. While after training, the cost minimizer would prefer operations with the smallest negative cost.” None operations have a cost of 0, making them the cheapest way to minimize cost at the start. As training occurs, we see the costs shift from positive to negative. This is an indication that the cell wiring topology is in fact growing.
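To make the growing behaviour concrete, here is a toy sketch of local cost minimization (my own simplification with made-up cost numbers, not the paper’s code): while every real operation has a positive estimated cost, the zero-cost None wins, and an edge only “grows back” once training drives some cost negative.

```python
# Hypothetical per-edge operation costs; None always has cost 0.
# Positive costs mimic initialization, negative costs mimic later training.
def select_operation(costs: dict) -> str:
    """Pick the operation with the smallest cost on an edge ('None' keeps the edge off)."""
    return min(costs, key=costs.get)

at_init = {"None": 0.0, "conv3x3": 0.8, "skip": 0.3, "maxpool": 0.5}
after_training = {"None": 0.0, "conv3x3": -0.6, "skip": -0.2, "maxpool": 0.1}

print(select_operation(at_init))         # 'None' -> the edge stays dropped
print(select_operation(after_training))  # 'conv3x3' -> the edge grows back
```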
Pattern 2: Preference for Width Over Depth
This one is slightly easier to understand. The proof stems from an analysis of the data gathered for the first pattern (NAS biases growing over pruning). To phrase the problem simply, we want to find out whether NAS-created networks favor wide neural nets over deep ones. To understand the distinction, look at the figure below: wide networks have fewer layers but more neurons per layer, while deep networks have more layers but fewer neurons per layer.
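For illustration, here is a quick sketch of the distinction (my own toy example, assuming PyTorch): a wide network with one large hidden layer versus a deep network that stacks several smaller layers, at roughly comparable parameter counts.

```python
import torch.nn as nn

# Wide: one hidden layer with many neurons
wide_net = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Deep: several hidden layers with fewer neurons each
deep_net = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(count_params(wide_net), count_params(deep_net))  # similar totals, different shapes
```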
This shows itself in a simple way. Remember how these networks tend to drop all edges starting out? While growing, we see a clear preference for the network to recover edges (connections) to the input neurons before connecting to the intermediate (hidden) neurons. To establish the width bias, we need to show 2 things: 1) NAS makes a distinction between input and intermediate neurons; 2) it favors the former. We also need to show that this behavior is caused by bias in the NAS procedure itself.
The paper hypothesizes that bias occurs because intermediate cells (neurons) are less trained. Taking an example from the paper: “Note that in A every input must be followed by an output edge. Reflected in the simplified cell, o_{0,1} and o_{0,2} are always trained as long as they are not sampled as None. Particularly, o_{0,1} is updated with gradients from two paths (3–2–1–0) and (3–1–0). When None is sampled on edge (1, 2), o_{0,1} can be updated with gradient from path (3–1–0). However, when a None is sampled on edge (0, 1), o_{1,2} cannot be updated because its input is zero. Even if None is not included in edge (0, 1), there are more model instances on path (3–2–1–0) than path (3–2–0) and (3–1–0) that share the training signal.”
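The gist is that edges closer to the input lie on more input-to-output paths, so they receive training signal more often. Here is a small sketch (my own reading of the simplified cell, with node 0 as the input and node 3 as the output; not the paper’s code) that counts how many input-to-output paths pass through each edge:

```python
# Simplified cell: node 0 is the input, 1 and 2 are intermediate, 3 is the output.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

def all_paths(src, dst, graph):
    """Enumerate all paths from src to dst in a DAG given as an edge list."""
    if src == dst:
        return [[dst]]
    paths = []
    for a, b in graph:
        if a == src:
            paths += [[src] + p for p in all_paths(b, dst, graph)]
    return paths

paths = all_paths(0, 3, edges)
for edge in edges:
    count = sum(
        1 for p in paths
        if any((p[i], p[i + 1]) == edge for i in range(len(p) - 1))
    )
    print(edge, "lies on", count, "path(s)")
```

Running this shows edge (0, 1) on two of the three paths, while edge (1, 2) sits on only one, matching the quote’s point that edges touching the input get updated more often.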
It validates this through the following experiment:
By showing that the preference can be altered from width to depth through training, the authors show that unequal training is the cause of the bias.
Pattern 3: No edge selected in bi-level optimization
Bilevel optimization is a special kind of optimization where one problem is embedded (nested) within another. The outer optimization task is commonly referred to as the upper-level task, and the inner optimization task as the lower-level task. In differentiable NAS, the architecture parameters form the upper level (updated on a held-out search set) and the network weights form the lower level (updated on the training set). For some reason, searches run in this bi-level setting end up selecting no edges at all.
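For intuition, here is a minimal sketch of an alternating bi-level update (my own toy example with made-up data, loosely in the spirit of a first-order scheme rather than the paper’s exact procedure): the lower level updates the weights on the training set, and the upper level updates the architecture parameters on a held-out search set.

```python
import torch

# Toy setup (illustrative only): weights w are the lower level,
# architecture parameters alpha are the upper level.
w = torch.randn(8, 1, requires_grad=True)
alpha = torch.zeros(8, requires_grad=True)

x_train, y_train = torch.randn(32, 8), torch.randn(32, 1)
x_search, y_search = torch.randn(32, 8), torch.randn(32, 1)

def loss(x, y):
    # alpha gates the input features via a softmax; w does the regression.
    gated = x * torch.softmax(alpha, dim=0)
    return ((gated @ w - y) ** 2).mean()

w_opt = torch.optim.SGD([w], lr=0.05)
alpha_opt = torch.optim.Adam([alpha], lr=0.01)

for _ in range(100):
    # Lower level: update weights on the training set.
    w_opt.zero_grad()
    loss(x_train, y_train).backward()
    w_opt.step()
    # Upper level: update architecture parameters on the held-out search set.
    alpha_opt.zero_grad()
    loss(x_search, y_search).backward()
    alpha_opt.step()
```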
The paper does not go into as much detail on the why, or on the proof. It explains the pattern by stating that “Fig.11(b) shows the comparison of L and H in the training set and the search set. For correct classification, L and H are almost comparable in the training set and the search set. But for data classified incorrectly, the classification loss L is much larger in the search set. That is, data in the search set are classified poorly. This can be explained by overfitting … In sum, subnetworks are erroneously confident in the held-out set, on which their larger L actually indicates their misclassification. As a result, the cost sum in bi-level optimization becomes more and more positive. None operation is chosen at all edges.”
If that was a bit much, here’s the summary: there are signs of overfitting (the subnetworks are confident yet misclassify held-out data, so their loss on the search set is large). This drives the cost sum in the bi-level optimization more and more positive, causing NAS to choose None at every edge.
That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this writeup, I would appreciate you sharing it with more people.
For those of you interested in taking your skills to the next level, keep reading. I have something that you will love.
Upgrade your tech career with my newsletter ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
To help me understand you, fill out this survey (anonymous)
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819