
Since Swish is basically ReLU with a parameter that can tune it to be more sigmoid-gated (or, in the limit, exactly ReLU-like), isn't it obvious from the outset that it will perform as well as or better than ReLU?

It's like adding a new derived parameter to the model: if there's enough data to prevent overfitting, the new function should outperform any function it can represent as a special case, right?
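The "special case" intuition can be checked numerically. Below is a small sketch (my own illustration, not from the comment) using the standard Swish definition swish(x) = x · σ(βx): as β grows large, Swish converges to ReLU, and at β = 0 it collapses to the scaled identity x/2.

```python
import numpy as np

def swish(x, beta):
    # Swish: x * sigmoid(beta * x).
    # Written via tanh for numerical stability: sigmoid(z) = 0.5 * (1 + tanh(z / 2)).
    return x * 0.5 * (1.0 + np.tanh(0.5 * beta * x))

x = np.linspace(-5.0, 5.0, 101)
relu = np.maximum(x, 0.0)

# Large beta: Swish is numerically indistinguishable from ReLU.
print(np.max(np.abs(swish(x, 50.0) - relu)))

# beta = 0: the sigmoid gate is constant 0.5, so Swish reduces to x / 2.
print(np.max(np.abs(swish(x, 0.0) - x / 2.0)))
```

Of course, containing ReLU as a limiting case only guarantees the model *class* is at least as expressive; whether training actually finds the better β is an empirical question.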
