Very interesting article. Thank you for publishing. As I am researching the topic myself, I am asking myself why you mention NVLink and InfiniBand as inter-GPU communication protocols. Maybe you can help clarify it for me :) It is my understanding that NVLink is a protocol that serves intra-server GPU-to-GPU communication, while InfiniBand is used as a protocol for inter-server GPU-to-GPU communication. As such, they are not really interchangeable with each other. While I understand that there is no real alternative to NVLink for intra-server communication, this is not the case for InfiniBand, where Ethernet-based protocols could be considered as an alternative, especially considering RDMA-like features such as RoCE. What is your take on this? Kind regards and thanks again for the great read!
This is a question for @Barak Epstein since he’s the guest author
Linus, thanks for being a careful reader and apologies for the delay in response.
The section was about "key areas in which networking-related innovation is likely to impact the success of AI systems", and InfiniBand can be leveraged as part of a cross-server, multi-GPU cluster*. It's true that there are alternatives, such as RoCE, but InfiniBand is more prevalent in HPC environments, which is where I spend most of my time, so that probably influenced the discussion.
I admit that this section could have been structured more clearly.
* https://militaryembedded.com/comms/communications/gpus-infiniband-accelerate-high-performance-computing
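For readers following this exchange, here is a minimal sketch of how the two layers Linus distinguishes typically show up in practice. It assumes PyTorch with the NCCL backend, which is not something the article or the comments prescribe: the application only asks for a collective operation, and NCCL routes intra-server traffic over NVLink/NVSwitch and inter-server traffic over InfiniBand, RoCE, or plain TCP, depending on what it detects at startup.

```python
# Minimal sketch (assumes PyTorch with the NCCL backend; not part of the original
# article). The application only requests a collective operation; NCCL chooses the
# transport per pair of ranks: NVLink/NVSwitch inside a server, InfiniBand, RoCE,
# or TCP between servers.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # No fabric is named here: the intra- vs inter-server transport choice is
    # made by NCCL when the process group is initialized.
    dist.init_process_group(backend="nccl")

    # One all-reduce: within a node it rides NVLink, across nodes it rides
    # whatever fabric NCCL detected (run with NCCL_DEBUG=INFO to see which).
    x = torch.ones(1, device=f"cuda:{local_rank}") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: sum of ranks = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., torchrun --nnodes=2 --nproc_per_node=8 script.py, setting NCCL_IB_DISABLE=1 forces NCCL off InfiniBand (it falls back to sockets), which is one way to compare fabrics along the lines Linus raises; RoCE goes through the same IB verbs path when the NIC supports it.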
Dear Barak, thank you very much for taking the time to reply and for clearing things up for me:)
It is true that when looking at the networking fabrics used in the TOP500 supercomputers, this is mostly InfiniBand or other "hyperfabrics" like Cray etc. However, I see that Ethernet also has some presence there and might even be catching up. But my guess is that not all HPC is the same, and the deployment of a certain fabric is highly use-case/workload specific. But well, thanks again and have a great day :)
Great article
Thank you, Michael!!