Cloud and Fabs – different but similar

With all the buzz about cloud, multi-cloud and the ongoing consolidation in the cloud, I was reminded of a conversation with Ryan Floyd a couple of years back. We were comparing and contrasting the viability of cloud as a business. To me the cloud was rapidly starting to look like the fab business, while Ryan felt differently. The conversation centered on the capital-intensive nature of cloud as a business and the analogies with semiconductor fabs. There are some interesting similarities and differences. Let's contrast the two…

Fabs: A view of the semiconductor/fab business in the context of this thread: it has taken 30 years of Moore's law and consolidation to arrive at perhaps three companies that have the capital, capability and platform stack. In logic, it's Intel, TSMC and Global Foundries; in memory, it's Samsung, Toshiba, Micron and perhaps Hynix. 1985 was the year of the modern CMOS logic fab (with Intel shifting from memory to logic). What is interesting is where the top three are in terms of their approach. Intel is vertically integrated (fabs + products) and trying to move upstream. TSMC has taken a horizontal 'platform' approach, and Global Foundries has had a mix (x86 and now Power processors) and is still trying to find its way across the horizontal vs. vertical integration chasm.

Cloud: By all accounts, the cloud business kicked off circa 2006 with AWS's launch of web services. It has taken roughly 10 years to arrive at the same stage in the cloud business, with the big three – AWS, Google and Microsoft – consolidating the category. All three have reached a scale, capacity and technology stack that will be hard for others to replicate. Sure, there may be Tier-2 clouds (eBay, Apple, SAP, Oracle, IBM etc.), geo-specific (China) or compliance-specific cloud operations, but these three will drive consolidation and adoption. It's no longer just capital; it's the technology stack as well.

What's more interesting is to compare and contrast the top three semiconductor fabs and the three cloud companies in their approach and where they go from here.

Let's start with the fabs and focus on the logic side of the business, as it is the fountainhead of all compute infrastructure. Their combined revenue is close to $100B ($22B in capital spend in 2016). Apart from being capital intensive, there is now a complex technology stack to deliver silicon (design rules to libraries to IP, packaging and even software/tooling) to make effective use of the billions of transistors. Similarly, in the cloud the value is moving up the stack to the platform aspects. It is no longer logging in and renting compute or dumping data into S3 via a simple get/put API. It is about how to use infrastructure at scale with Lambda, Functions or PaaS/platform-level features and APIs that are specific to that cloud. You can now query your S3 data in place. That is API/vendor lock-in.
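To make the lock-in point concrete, here is a minimal sketch (assuming the boto3 SDK; the bucket, key and SQL expression are hypothetical) of querying a CSV object in place on S3 with S3 Select. The equivalent call on GCP or Azure looks entirely different, which is exactly the point.

    import boto3

    s3 = boto3.client("s3")

    # Query the object in place instead of downloading it and filtering locally.
    # Bucket, key and the SQL expression below are placeholders.
    resp = s3.select_object_content(
        Bucket="my-logs-bucket",
        Key="2017/01/requests.csv",
        ExpressionType="SQL",
        Expression="SELECT s.url, s.latency_ms FROM s3object s WHERE s.status = '500'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )

    # The response is a stream of events; 'Records' events carry the query results.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"))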

The tooling required to deliver a chip product is not specific to a fab, but the optimizations are. Same with the cloud. The tooling might be generic (VMs, containers, for example) or open source, but increasingly it is proprietary to the cloud operator, and that is the way it will be.

In technology stack, competency and approach to market, Google looks more like Intel, AWS like TSMC and Microsoft like Global Foundries (not in the sense of today's market leadership). Intel is vertically integrated, and Google shows more of that. Intel has deep technology investments and leads the sector, and so does Google/GCP when contrasted with AWS or Microsoft. Every fabless semiconductor company's first stop is the TSMC foundry, and the same holds for any cloud-based business (AWS first). Cloud infrastructure, unlike fabs, was not the primary business to start with: all three leveraged their internal needs (Google for search, AWS for books/shopping, Microsoft for Bing and its enterprise apps) and used their initial or primary business to fund the infrastructure.

Can one build one's own cloud, i.e. "Real men have fabs," as T.J. Rodgers, CEO of Cypress Semiconductor, famously quipped? Building and developing semiconductors is both capital intensive and demands deep technology and operational experience, whereas a cloud can seemingly be built with all the open source code that is available. While that is true, despite the plethora of open source tools, it is the breadth and depth of the tooling that is difficult to pull off. Sure, we can assemble one with CoreOS, Mesosphere, OpenStack, KVM, Xen, Grafana, Kibana, Elasticsearch, etc. But as the stack becomes deeper and broader, it is going to be hard for any one company (including the big Tier-2 clouds named above) to pull it off at scale and gain operational efficiency. Sure, one could build a cloud in one or two locations, but how do you step and repeat and make it available around the globe and at scale? Intel and TSMC eventually excelled in operational efficiency. Sure, Dropbox might find it cheaper to build its own, but the value is shifting from just storing data to making it available for compute. That level of integration will force the swing back to the big three.

Cloud arbitrage: Multi-cloud vs. multi-fab: Multi-cloud is all the rage today. How great it would be to move from AWS to GCP to Azure at the click of a button. We tried the same thing with processors in the 1990s while I was at Sun: we wanted to be multi-fab. TI was the main fab, and we wanted TSMC and UMC and had engagements with AMD. The reality is that the platform stack has unique features that the solution will naturally gravitate to. It is more expensive to pursue multi-cloud as a strategic direction than to pick a cloud partner and drive deep integration. Yes, for business continuity and leverage of spend, one would want multi-cloud. The reality is that Netflix is with AWS, not with Microsoft or Google, and is doing fine as a business. Perhaps you don't have to run the entire application stack in every cloud. You are better off picking specific categories, and LOBs can run them in specific clouds. That brings diversity and perhaps continuity of business, as well as leverage of each cloud's unique properties. For example, for developing machine-learning-type apps, GCP is better than AWS. For video streaming, maybe AWS is just fine (although Google will tell you they have more POPs and capability for this due to YouTube).

Where do we go from here in the cloud? I will most certainly be wrong if you come back and read this blog in 2020, but there are some truisms (not a complete list, but a start…).

Vertical integration: The current $18B cloud business will be >$100B between 2020 and 2025. That is a seismic shift that will impact everyone, including businesses down the infrastructure stack (semiconductor companies, for example), as the big three show signs of more vertical integration in their stack, including having their own silicon. Intel likewise is trying to get more vertically integrated, and Microsoft is trying to find its way there. Maybe the exception is going to be TSMC, staying truly horizontal. The big three cloud operations are, and will be, more vertically integrated. There is also a culture or gene-pool aspect to this.

Service vs. services (or IaaS vs. PaaS): Despite all the technology chops at Intel, it has had mixed results in the fab service business. While AWS has excelled at the IaaS part, its ability to build a compelling ecosystem around its platform strategy will be tested. Likewise for Google: while it traditionally has strong in-house platform assets, building a strong developer community (e.g. Android) while delivering a great customer-centric experience will be the challenge. Microsoft, by nature of having a strong enterprise apps footprint, can and could get both the service and the services right. It goes back to the gene pool, or the service mentality vs. the services mentality. AWS has excelled in the service aspect (as TSMC did in the 90s) and is a leader in services as well. GCP (akin to Intel) has the platform strengths and has to supplement them with a modern-era customer engagement model to gain market share. That will require a cultural/organizational shift to be service oriented, not just a change in technology or business model.

Lock-in: Lock-in is a reality. You have to be careful which lock and key you choose, but it will be real, so go in with eyes wide open. It is now at the API level and moving up the stack.

Data gravity: Increasingly, data will be the differentiator. Each of the big three will hoard data. Broadly speaking there are three types of data (private, shared and global), and applications will use all of them. This will be a gravitational pull to use a specific cloud for specific applications. IBM has started the trend of acquiring data (weather.com). Expect the big three to acquire data that applications need as part of their offering. This will be another form of lock-in.

Cloud-native programming (Lambda, Functions, TensorFlow…): A similar holy-grail approach has been attempted on the silicon side with ESL and high-level synthesis. What is interesting for comparison is that generic application development is approaching what the hardware guys have been doing for decades: event-driven (sync, async) programming and dataflow. This is an obvious trend that kicked off in 2016. It is a chasm for the generic programmer to cross (Verilog, crummy as it is, is hard to program in), and it is where each of the big three will take different approaches, differentiate and create the next lock-in.
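As a rough illustration of that event-driven shift, here is a minimal sketch of an AWS Lambda handler reacting to an S3 object-created notification (the function and its trigger wiring are hypothetical; the 'Records' envelope is AWS-specific, which is where the next layer of lock-in creeps in):

    import json

    def handler(event, context):
        # The platform invokes this per event; there is no server or main loop to manage.
        # 'Records' is the AWS-defined envelope for S3 notifications.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new object: s3://{bucket}/{key}")
        return {"statusCode": 200, "body": json.dumps("processed")}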

In summary, today (2017) AWS looks more like TSMC, GCP more like Intel, and Microsoft somewhere in between (GF?). We will revisit in 2020 to see what they look like – more similar or more different?

The VMware moment in storage?

“No army can withstand the strength of an idea whose time has come” – Victor Hugo.

We seem to have reached that moment in time for storage, as we did with compute back in circa 2003.

Why 2003? The 2-socket 2U server became the workhorse compute engine for a range of workloads (the web/cloud and enterprise virtualization).

 

[Image: Dell PowerEdge 2650]

This form factor spawned a whole category of compute that did not exist before. The underlying silicon technology (processor and DRAM) accelerated adoption of the architectural shift. I pick 2003 because between 2002 and 2004 a number of events conspired to drive the rise of compute virtualization while simultaneously enabling distributed computing. Here are some notable events.

  1. Throughput (multi-core/threading) over latency (MHz): AMD announced Opteron (x86-64) in 2003, with Intel following suit with Gallatin in 2003 and its own 64-bit parts in 2004. It was the beginning of the multi-core era for x86 (it had dawned earlier on others, SPARC for example). DDR2 memory was introduced around the same time.
  2. Compute virtualization: VMware's acquisition by EMC, but more importantly a major update to ESX with support for VMotion and Virtual SMP. VMware drove the definition of virtualization features into x86 CPUs, which showed up 2-3 years later (VT-x etc.).
  3. Distributed computing: Google publishing the 'GFS' paper, which was perhaps the motivation for MapReduce and eventually Hadoop.

The common theme across these three events was the availability of a dual-socket, multi-core-CPU-based system (the Dell 2U) with more than adequate compute and memory at low cost, which enabled both server consolidation (virtualization) and the emergence of distributed computing outside of Google (Hadoop, for example).


 

2017 = 2003 for storage: the emergence of 2U, 24-drive NVMe storage

[Image: 2U 24-drive NVMe system]

  1. Storage throughput: the emergence of NVMe as the performant flash/storage interface, with a potential cost crossover with existing SSDs and, more importantly, delivering high throughput much like multi-core/multi-socket CPUs did for compute. With each drive sustaining 2-3GB/s, a commodity storage platform can deliver 30-40GB/s (see the back-of-envelope sketch after this list). This is timely with the emergence of RoCE and 100Gb Ethernet (4x100Gb).
  2. Storage capacity: the emergence of 16TB flash drives (3D NAND); NAND is on the cusp of a dramatic cost decrease alongside a capacity increase.
  3. Distributed systems: the emergence of a variety of robust distributed data store (some call it software-defined storage) solutions, mostly from emerging startups (see below).
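A back-of-envelope sketch of the throughput claim in item 1 above (the per-drive and network figures are rough assumptions, not measurements):

    # Rough arithmetic behind the 2U/24-drive NVMe claim (assumed figures).
    drives = 24
    per_drive_gb_s = (2, 3)                           # sustained GB/s per NVMe drive
    raw_gb_s = [drives * r for r in per_drive_gb_s]   # ~48-72 GB/s of raw drive bandwidth

    network_gbit_s = 4 * 100                          # 4x100Gb Ethernet uplink
    network_gb_s = network_gbit_s / 8                 # ~50 GB/s on the wire

    # Delivered throughput is network-bound, so 30-40 GB/s out of one 2U box is
    # plausible once protocol and replication overheads are accounted for.
    print(raw_gb_s, network_gb_s)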

Once again, thanks to the underlying silicon technology (3D NAND and NVMe in this case), throughput, capacity and cost reach a perfect storm that will enable a whole range of new categories for storage. The value shifts to the software, as it did then with VMware. The time is ripe for the 'VMware' of storage to emerge.

The emergence of top-of-rack storage (TORS), much like the TOR switch emerged with the transition to commodity 2-socket rack-mount systems, will enable a whole new class of systems to be deployed. For example, it is now possible to go 'diskless' in each server; with the advent of NVMe-oF coupled to a TB-scale 2U box, it is very likely that most cloud-scale infrastructure could be built with compute nodes that have just CPU and memory, and all storage consolidated into this TORS within the rack.

 

[Image: top-of-rack storage in a rack of diskless servers]

Shrijeet Mukherjee of Cumulus Networks makes a good analogy and a similar observation for networking (see Trident). He asks:

  • Which OS will unlock the networking innovations and thinking, like Linux vendors such as RedHat, SuSE, and TurboLinux did for compute applications? …

A corollary question is who will emerge as the 'VMware' of storage, and what the key attributes are. The 'who' is likely to be a company founded a few years back, much as Google and VMware were founded a few years before the 2003 moment. The key attribute is a truly distributed data management system that is drive-, node-, rack- and DC-failure tolerant, offers continuous availability, and exposes file, block, object and emerging data access methods (KV, tables, streams, queues, etc.).

Looking back, 1998 was an interesting year. That was the year VMware and Google were founded. Coincidentally, in 1998 I led the team that enabled Sun to deliver the lowest-cost ($1K) workstation and server running Unix that was faster than Intel CPUs, while at the same time Sun announced the E10K at $1M apiece. Little did I know then that the seeds of the shift that arrived five years later were sown in 1998.