Since Swish, f(x) = x · sigmoid(βx), is essentially ReLU with a tunable parameter β (it recovers ReLU as β → ∞ and a scaled linear function as β → 0), isn't it obvious from the outset that it will perform at least as well as ReLU or a plain Sigmoid?
It's like adding one more learned parameter to the model: given enough data to prevent overfitting, the new function should perform at least as well as any function it contains as a special case, right?
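For concreteness, here is a minimal NumPy sketch of that "special case" argument (my own illustration, not code from the Swish paper): with a large β Swish is numerically indistinguishable from ReLU, and with β near zero it collapses to the scaled linear function x/2.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow warnings in exp; sigmoid saturates long before |z| = 60
    z = np.clip(z, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    # Swish: f(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 1001)

# beta -> infinity recovers ReLU; the max gap shrinks on the order of 1 / beta
print(np.max(np.abs(swish(x, beta=100.0) - np.maximum(0.0, x))))  # ~0.003

# beta -> 0 recovers the scaled linear function x / 2
print(np.max(np.abs(swish(x, beta=1e-6) - x / 2.0)))              # ~6e-6
```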