My Thoughts on the Apple M1

It’s been a couple of weeks since the Apple M1 computers began shipping, and so far it looks like the transition has been a success. The 10 W M1 has proven to be very competitive with any chip below 30 W, and in some applications, such as video editing, it far exceeds anything in the x86 world.

App developers have been quick to jump onto the platform, with many reporting that simply recompiling in Xcode produces a functional Universal build. All of Apple’s own apps have been Apple silicon ready since June, and major third-party apps such as Google Chrome and Microsoft Office already have gold builds out for the M1. Adobe Lightroom is out, and Photoshop is coming early next year.

Rosetta 2 runs shockingly well, with only a 20–50% performance loss compared to running natively on Intel. In fact, some applications (such as Adobe Premiere) run faster under Rosetta 2 than they do natively on equivalent ultraportable Intel machines.

When the M1 was announced on November 10, there was quite a bit of controversy over whether these chips would be competitive with Intel’s or AMD’s offerings. Of note was the fact that Apple’s marketing slides were vague on performance, stating values like “3x performance” without proper units or baselines. Apple also presented a series of performance graphs without labeled axes. Despite this criticism, there was still a lot of hope that these chips would be very impressive. Here are some of my personal reasons for feeling that the M1 would be successful:

Why I Felt the M1 Would Be Successful

  1. Apple did not decide to switch to their own silicon due to any competitive forces. All mainstream Windows machines are using the same x86 chips that Apple is, and ARM-based Chromebooks have not gained mainstream appeal outside the education space. Experimental ARM-based Windows machines have been a failure for the past 8 years and have shown no signs of turning around. There is no indication that Intel will stop working with Apple anytime soon. Despite Apple’s comfortable position, they decided that now is the time to transition to ARM. The reason is clear: they feel that the technology is ready.

  2. Apple’s year-on-year performance gains on their iPhones and iPads have been exceptional, far better than the year-on-year gains on Intel’s platform. Rudimentary benchmarks have shown that Apple’s A and A-X chips are legitimate competitors to mid-range x86 offerings. Of course, the mobile platform has much more specialized hardware for specialized tasks than the more general computing environment of a desktop, so it’s not a clear-cut comparison. But still, it’s a good sign.

  3. Although Apple’s graphs were vague during the presentation, they still released quite a bit of information pointing toward a powerful processor. They claimed to have the world’s fastest CPU core (later rescinded to the fastest CPU core in low-power silicon, as it is a bit slower than the current top-end AMD Zen 3 laptop CPUs), the world’s fastest integrated GPU in a PC, and the world’s best performance-per-watt in any CPU. They also mentioned that they are using a 5 nm process, compared to AMD’s 7 nm and Intel’s 14 nm processes. A die of the same size on a smaller fabrication process will contain more transistors, many of which Apple has poured into generous L1 and L2 caches. The L2 cache is also shared, which can improve performance, as it reduces redundancy and allows a single core to access the entire L2 cache. On top of this, they teased the ultra-wide execution architecture of their high-performance cores, which is speculated to be an 8-decoder-wide out-of-order execution pipeline.

  4. Unified memory will eliminate expensive copying and translation between the CPU and GPU components of the SoC, bringing a large performance increase in GPU-accelerated tasks. Even though Intel chips have onboard Iris graphics, the CPU and GPU parts of the chip do not share memory; instead, the memory is partitioned into areas dedicated to each component. If the CPU needs the same data that the GPU is using, the data is translated and copied over. On desktop machines, CPUs and GPUs are kept separate, with GPUs getting their own memory pool. GPU memory specializes in high bandwidth, while CPU memory specializes in low latency. M1’s unified memory uses high-bandwidth, low-latency DRAM integrated directly into the SoC package, serving both the CPU’s and the GPU’s needs while eliminating the slowdowns due to copying and translation. (A toy sketch of the difference follows this list.)
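
Below is a minimal Python sketch of that workflow difference. It is not Apple’s actual programming model or any real GPU API; the gpu_kernel function and buffer sizes are placeholders, and the point is simply where the copies appear and disappear:

```python
# Toy illustration (not Apple's API): a discrete-GPU workflow must copy
# buffers between separate CPU and GPU memory pools, while a unified-memory
# workflow lets both processors work on the same DRAM allocation.
import numpy as np

frame = np.random.rand(4096, 4096).astype(np.float32)  # ~64 MB "frame"

def gpu_kernel(buf):
    # Stand-in for whatever work the GPU would do on the buffer.
    return buf * 0.5

def discrete_path(cpu_buf):
    gpu_buf = cpu_buf.copy()            # host -> device transfer
    result_on_gpu = gpu_kernel(gpu_buf)
    return result_on_gpu.copy()         # device -> host transfer

def unified_path(shared_buf):
    return gpu_kernel(shared_buf)       # no copies: both see the same memory
```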

The last few weeks of user testing have shown that the M1 is indeed a very fast chip. The M1 is a 10 W SoC, which means all of its components (CPU, GPU, NPU, and many other things) consume 10 watts total. There are chips available that benchmark higher than the M1, but they all consume over 30 watts, and they don’t spread those watts over as many components. The M1 is therefore a baseline that shows a lot of promise for future Apple silicon, once Apple releases higher-end machines outfitted with higher-wattage chips.

The RISC Advantage

Apple’s previous generation of computers used chips from Intel, which use the x86 instruction set architecture (ISA). An ISA includes the list of instructions a CPU can understand, the number of registers, how memory is handled, and other core specifications of a computer. In this article I will be focusing on just the list of instructions: the instruction set. These instructions are the base language of a computer; whenever a program runs, these are ultimately the instructions fed into the CPU.

Intel’s x86 instruction set is a CISC (complex instruction set computer) architecture. CISC was developed in an era when compilers were unreliable, so programmers worked directly in assembly. Because of this, CISC instruction sets tend to contain complex instructions that combine several simpler operations into one. These complex instructions turned common chunks of simple instructions into a single instruction, saving programmers time.

Years later, compilers got better and writing code directly in assembly fell out of fashion. Compiler writers weren’t fans of most of the complex instructions, preferring the simpler ones the majority of the time. CISC instruction sets ended up following an 80/20 rule: 20% of the instructions were used 80% of the time. Instructions take up valuable silicon on the CPU die, and the fact that most were redundant and rarely used proved to be a waste. This led to the development of RISC (reduced instruction set computer), which ditches the redundant complicated instructions and keeps the simpler ones. These new chips needed much less circuitry dedicated to instruction set logic, allowing CPUs to spend transistors on other resources such as SRAM, the high-speed memory kept inside the CPU. Less CPU circuitry also means lower power requirements and less heat output.

A large majority of computers sold today use x86, the CISC instruction set shared by Intel and AMD. Apple is breaking the mold by moving its computers to a RISC architecture developed by ARM Holdings. This is the same instruction set used in the iPhone and iPad, which allows those devices to run cool and have great battery life while maintaining powerful performance.

RISC divides instructions into two camps: those that access memory (loading and storing between memory and registers) and those that perform arithmetic on registers. This heavily reduces dependencies between instructions, which makes pipelining much easier. A pipeline is a stream of instructions that keeps every stage of a CPU busy instead of sitting idle during any clock cycle. A classic CPU core has 5 stages (fetch the instruction, decode it into micro-operations, execute the micro-operations, access memory, and write the results back to a register), and it would be wasteful to feed in a single instruction and wait for its output before feeding in the next one. By feeding a constant stream of instructions to the CPU, you keep all 5 stages busy.
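
To make the payoff concrete, here is a toy cycle count of my own, assuming an ideal five-stage pipeline with no stalls or hazards (real cores are messier):

```python
# Toy cycle counts for an ideal 5-stage pipeline with no stalls or hazards.
STAGES = 5

def unpipelined_cycles(n):
    # Each instruction occupies the core for all five stages before the next starts.
    return STAGES * n

def pipelined_cycles(n):
    # Once the pipe is full, one instruction completes every cycle.
    return STAGES + (n - 1)

print(unpipelined_cycles(100))  # 500 cycles
print(pipelined_cycles(100))    # 104 cycles, roughly 5x the throughput
```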

CISC’s complex instructions interfere with pipelining, because they can be difficult to divide cleanly into an efficient pipeline. CISC contains instructions that both load/store between memory and registers and perform arithmetic on registers, creating dependencies that make the instruction hard to chop up. To ease this issue, CISC chip manufacturers developed micro-operations: a simpler set of operations, derived from the complex instruction, that can be pipelined on their own. So now CISC chips take complex instructions and divide them into micro-operations to help map out dependencies and create a workable pipeline. Even though RISC chips avoid arithmetic instructions with memory load/store dependencies, higher-end ARM chips still use micro-operations. The reason they do this is to perform out-of-order execution.
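
A rough sketch of what that cracking looks like is below. The mnemonics (ADDMEM, LOAD, STORE) and the three-way split are purely illustrative, not any real chip’s decoder:

```python
# Toy decoder: a CISC-style read-modify-write instruction is cracked into
# load / add / store micro-ops, the units the scheduler actually handles.
# A register-only instruction maps to a single micro-op.
def decode(instruction):
    op, *args = instruction.split()
    if op == "ADDMEM":                      # "ADDMEM [addr] reg": memory += register
        addr, reg = args
        return [("LOAD", "tmp", addr),      # pull the value out of memory
                ("ADD", "tmp", reg),        # do the arithmetic on registers
                ("STORE", addr, "tmp")]     # push the result back to memory
    return [tuple([op] + args)]

print(decode("ADDMEM [rdi] rax"))
print(decode("ADD r1 r2"))
```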

Manufacturers realized that executing micro-operations in the order they arrive is not as efficient as other possible orders. This led to the development of out-of-order execution. After the instructions have been decoded into micro-operations, they are placed into a micro-op buffer that can be executed out of order. Out-of-order execution lets the CPU group memory retrievals together. Retrieving data from memory takes a while, but when a CPU accesses data from one block of memory, it can access nearby blocks at little additional cost. So the CPU groups together the instructions that touch nearby blocks of memory, saving time. To do this, the CPU needs to be able to look into the micro-operation buffer and see which instructions are coming up so it can choose a more efficient order.
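
A toy version of that idea might look like the following; it groups queued loads by the 64-byte cache line they touch and ignores the dependency tracking and register renaming a real core has to do:

```python
# Toy out-of-order scheduler: group queued LOAD micro-ops by cache line so
# neighbouring addresses are fetched back to back instead of in program order.
# Real hardware must also respect data dependencies, which is ignored here.
CACHE_LINE = 64  # bytes

buffer = [("LOAD", 0x1000), ("ADD",), ("LOAD", 0x2000),
          ("LOAD", 0x1008), ("MUL",), ("LOAD", 0x2010)]

def schedule(uops):
    loads = [u for u in uops if u[0] == "LOAD"]
    others = [u for u in uops if u[0] != "LOAD"]
    loads.sort(key=lambda u: u[1] // CACHE_LINE)  # same line -> adjacent slots
    return loads + others

print(schedule(buffer))
```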

A longer micro-operation buffer means you can make more efficient memory retrievals. To achieve this, manufacturers started adding multiple decoders per CPU core, splitting the front of the pipeline across the decoders and filling the buffer faster. Here we come to one of the biggest advantages of RISC chips: fixed instruction length. RISC designs not only use a smaller, less complex instruction set, they also keep all instructions the same size. All of ARM’s instructions are 4 bytes, while x86 instructions can vary between 1 and 15 bytes. CISC chips really struggle to split their instruction stream among decoders because of these variable-length instructions; they don’t know where one instruction ends and another begins. Before decode, the CPU just sees a stream of bytes. High-end Intel and AMD chips top out at 4 decoders, because the splitting and decoding process is so convoluted that they cannot practically add more past that point. But ARM’s instruction set has fixed 4-byte instructions, so it’s easy to know where one instruction ends and the next begins! So Apple has put 8 decoders into its high-performance cores, giving them an out-of-order execution buffer that’s at least 3 times as large as those of top-end Intel and AMD chips. And Apple should have no problem adding more decoders in future processors, bringing further out-of-order execution gains.
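
The decode difference is easy to model. In the sketch below (my own toy, not how real decoders are wired), fixed 4-byte instructions can be dealt out to 8 decoders with simple arithmetic, while variable-length instructions force a serial walk because each instruction’s start is only known once the previous one has been measured; the length_of callback stands in for that measurement step:

```python
# Toy front end: fixed-width instructions can be dealt to 8 decoders with
# pure arithmetic, while variable-width instructions must be walked serially
# because each boundary is only known after decoding the previous instruction.
WIDTH = 4       # every ARM instruction is 4 bytes
DECODERS = 8

def split_fixed(byte_stream):
    words = [byte_stream[i:i + WIDTH] for i in range(0, len(byte_stream), WIDTH)]
    return [words[d::DECODERS] for d in range(DECODERS)]   # deal round-robin

def split_variable(byte_stream, length_of):
    # length_of(stream, i) stands in for an x86 decoder's length-determination
    # step; it can return anything from 1 to 15.
    out, i = [], 0
    while i < len(byte_stream):
        n = length_of(byte_stream, i)
        out.append(byte_stream[i:i + n])
        i += n
    return out
```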

In short, RISC’s simpler instruction set saves circuitry, which can be poured into other CPU resources such as cache. The smaller circuitry requirements save a lot of power and reduce thermal output. The split between arithmetic instructions and memory load/store instructions makes pipelining easier. And fixed-length instructions allow for a larger out-of-order execution buffer, which enables more efficient grouping of memory loads than CISC chips can manage. All of these advantages are visible in Apple’s M1: a cool, efficient, and extremely powerful CPU that rivals high-end offerings from Intel and AMD.

M1’s Coprocessors and Heterogeneous Compute

Apple’s GPU is faster than Intel’s Xe integrated GPUs, which is great for ultraportables, but across the wider industry Apple still has to deal with Nvidia. For the past couple of decades Nvidia has been the clear leader in GPU technology and, unlike Intel, has had no trouble delivering year-on-year performance increases. Nvidia is also leading the way in deep learning and AI accelerator technology on the desktop with CUDA. So far Apple has not said anything about its GPU plans for the desktop or large-laptop market. There has been some speculation that they’ll stick with AMD GPUs, as die size and thermal constraints may make it difficult to keep everything integrated in the desktop market. However, I believe that unified memory will give Apple enough of a lead that they can still be competitive with Nvidia even if their GPU/NPU/ML accelerator performance would be worse in isolation. Always using the latest fabrication processes lets Apple fit more transistors into a smaller package, which also mitigates this problem. Apple’s RISC CPU runs cool, which helps control the total heat from the SoC, though not as well as isolating the GPU altogether the way Nvidia does. Ultimately I believe Apple will tackle Nvidia head-on in the productivity market with their SoCs, making GPUs for rendering, NPUs for deep learning acceleration, and ML accelerators for AI, while ignoring Nvidia’s gaming market. It will be a tough battle, but Apple’s unified memory and smaller fabrication process will give them an edge.

As of yet, the M1’s NPU and ML accelerators have not really been tested by users. The NPU on the iPhone is mostly used for the cameras, voice recognition for Siri, and Face ID. A general-purpose NPU is new for Apple. In my opinion the best test of the NPU and ML accelerators will come from TensorFlow, Google’s machine learning Python package. The TensorFlow team was given pre-production M1 devices and reported that TensorFlow does indeed do better on them than on comparable Intel machines. However, TensorFlow for Apple silicon is still in an early alpha stage, and so far users have reported difficulty getting it to run at all. I think it will take at least a year for TensorFlow to be fully optimized for Apple’s deep learning and AI hardware, and only then can users really put the NPU and ML accelerators to the test.
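
For what it’s worth, the code that exercises this hardware is ordinary TensorFlow. Nothing in the snippet below is M1-specific, and the model and dataset are arbitrary choices of mine; the promise of Apple’s fork is that a plain script like this runs unchanged, with the backend deciding what to offload to the CPU, GPU, or other accelerators:

```python
# Plain tf.keras training script; nothing here is M1-specific. On Apple's
# TensorFlow fork, the backend is supposed to handle hardware offload.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0   # shape (N, 28, 28, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
```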

Apple has integrated its own SSD controller into the M1. In the rest of the PC industry, SSD controllers do not live on the CPU die; instead, they are bundled with the flash modules themselves. Apple cited the integrated SSD controller as one of the reasons for the large jump in SSD performance relative to Intel machines. Apple has also adopted PCIe 4.0, which should further boost SSD performance.

The M1 is an SoC, which is basically all the components of a traditional motherboard integrated into a single chip. Outside of the CPU, Intel’s chips come with a GPU and a video encode/decode core, and Intel’s Xeon chips also come with an ML accelerator. Apple’s M1 comes with all of these, alongside an NPU, an image signal processor (which handles camera input), a digital signal processor (which converts between analogue and digital signals), a Secure Enclave (which holds biometric information and handles hardware encryption), an SSD controller, an I/O controller, and a Thunderbolt controller. Apple is leveraging heterogeneous compute: instead of adding general computing cores, they add specialized cores for specific functions. By contrast, Intel’s and AMD’s primary focus is to increase the speed of the CPU for general compute; the coprocessors are just there as a bare minimum for cases where the computer doesn’t come with a dedicated GPU. Apple’s approach is to scale up the integrated components until they are competitive with dedicated ones, replacing them entirely. In the process, Apple is integrating not only the GPU but various smaller processors and controllers as well.

Apple’s single-core performance is extremely fast; in some benchmarks it is competitive with Ryzen 7 chips. There is nothing in the mobile space that can compete. Apple’s multicore performance is still somewhat behind, though: with only 4 performance cores, the M1 loses out to some high-performance laptop CPUs. But I don’t think Apple really plans to compete head-to-head with AMD on the number of general-purpose CPU cores. Instead, I believe they will increase the power of their coprocessors, allowing the general CPU cores to offload tasks onto specialized hardware that can get the job done faster.

In my opinion, adding more than 8 CPU cores to a consumer PC is not useful. The average person will not run enough processes to saturate even half of the 32-core systems AMD makes for desktops. Most programs people use day-to-day have no support for parallel execution, and most people are probably not running more than 8 processes that can each saturate an entire core. Specialist applications that do need more than 8 cores are probably running workloads that can be heavily parallelized. Because coprocessors are not as versatile as a general CPU, they need less silicon per core, so you can pack more coprocessor cores onto a die than general CPU cores. Check out the Esperanto ET-SoC-1, a chip with 4 general CPU cores and 1,089 ML accelerator cores. It contains 23.8 billion transistors, not a large jump over the M1’s 16 billion. For ML workloads, this chip sees a 30–50x improvement over incumbent solutions, all while drawing 100 times less power. And while the ET-SoC-1 is a very specialized chip for machine learning applications, it is still a good demonstration of what coprocessors are capable of.
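
One way to put a rough number on why extra general-purpose cores stop paying off (my own back-of-the-envelope addition, not from any of the benchmarks above) is Amdahl’s law: if only a fraction p of a workload can run in parallel, n cores can speed it up by at most 1 / ((1 - p) + p / n):

```python
# Amdahl's law: best-case speedup on n cores when only a fraction p of the
# work is parallelizable.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A hypothetical desktop workload that is 60% parallelizable barely gains
# anything past 8 cores:
for cores in (4, 8, 16, 32):
    print(cores, round(amdahl_speedup(0.60, cores), 2))
# 4 -> 1.82, 8 -> 2.11, 16 -> 2.29, 32 -> 2.39
```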

The ET-SoC-1 uses the RISC-V ISA, a newer architecture designed with heterogeneous compute in mind. It has an extremely small base instruction set, and a processor designer adds extensions for the coprocessors they are implementing. There are currently rumors that Apple is using instruction set extensions for some of its coprocessors. These instructions (if they exist) are locked behind libraries such as CoreML. In the future, Apple could have its coprocessors run on RISC-V plus extensions instead of ARM plus extensions, which would allow many more coprocessor cores, as each core would shed the extra weight of the full ARM instruction set. This possibility is probably why Apple chose to call the transition “Apple silicon” rather than “ARM”: they plan to add instructions of their own.

It will take a couple of years for developers to take advantage of Apple’s heterogeneous compute environment and for Apple to release its desktop SoCs. But once these transition pains are over, we’ll be able to compare whose strategy is better: AMD’s CPUs packed with general-purpose cores, or Apple’s SoCs with their melange of specialized cores.

Other Thoughts

All of the M1 devices start at 8 GB of DRAM and max out at 16 GB. Even though this sounds “standard” for the rest of the ultraportable industry, it’s important to note that Apple’s SoC will use RAM differently than an Intel chip does, so it’s difficult to know when you will page out. In addition, Apple’s M1 devices are all equipped with PCIe 4 NVMe SSDs, so page-outs may not lead to a large dip in performance. Although RISC processors tend to use more memory on average, Apple uses aggressive RAM compression on its iOS devices, which lets it ship devices with much less RAM that still outperform their Android counterparts. I don’t know how well this approach will transfer to the desktop, where a user runs many processes at once, compared to a phone, which keeps only one app active in the foreground at a time.

There are some drawbacks to unified memory, the main one being the potential for less memory overall. GPUs and CPUs have opposite needs: the GPU needs high bandwidth while the CPU needs low latency. If Apple continues this architecture into the desktop space and has a hard time adding lots of high-bandwidth, low-latency memory, this could prove to be a drawback. Also, although the memory is integrated into the SoC package, making it unified means there must be an arbitration process for when multiple components attempt to write to the same memory location simultaneously. This arbitration must be carried out by an intermediary controller on the SoC, which will cost some performance. Of course, any modern CPU has a memory controller to handle the logic of reading from and writing to memory and to handle error control, but this one will likely require more logic and might be slower. Finally, an integrated memory solution like Apple’s kills any chance of upgrading the memory yourself in the future, but for the majority of current Apple devices this was already the case anyway. Among Intel Macs, only the 27” iMac and the Mac Pro currently offer user-expandable memory.

The M1’s battery tests have been very good, with the MacBook Pro easily hitting the promised 20 hours of battery life. Apple is using ARM’s big.LITTLE approach, which splits the CPU cores into high-performance and high-efficiency clusters. This architecture has been a huge win for power usage, alongside the general efficiency gains of a RISC architecture.

Windows on ARM has been an ongoing project since 2012, and thus far it has been a pathetic failure. Each ARM laptop produced supports few to no third-party apps. In addition, Windows on ARM did not have an x86 emulator until very recently (32-bit x86 emulation came out in February, and x86_64 emulation is still in beta), and performance has been miserable, more than 50% slower than even the slowest Intel processors.

There are a lot of issues with Microsoft’s approach (most of them laid out by The Friday Checkout here), but the underlying problem is this: Microsoft has many enterprise customers who cannot go through a transition. Many businesses have core systems held together by legacy Wintel setups and do not want to risk jeopardizing their business by changing things. Because Microsoft can’t go through with a transition, they can’t commit fully to the project and can only dip a toe in the water to see if there’s a market. Since Microsoft won’t fully commit, their hardware partners don’t bother committing either, and consumers are presented with a poor product. No one buys it, Microsoft gets discouraged, and it doesn’t bother pursuing the idea further. Microsoft has backed itself into a corner of legacy clients and will have a difficult time moving forward. However, with Apple taking the reins on the RISC transition, Microsoft might be able to ride the wave alongside and finally come out with competitive RISC-based products.

One person in particular, Linus Sebastian of Linus Tech Tips, predicts that Apple will stop bringing feature updates to M1 devices much earlier than to future “M2” devices. His basis for this claim is that the first iPad and the first Apple Watch were dropped from feature updates much faster than their second-generation counterparts. I argue that it’s unlikely for M1 devices to be dropped early, as the devices he cites were most likely dropped for performance reasons. The first iPad was cut off after 2 years of support, with its last operating system being iOS 5. Seeing how many iOS 7 features were cut from the iPhone 4 for performance reasons, it’s clear why Apple cut the iPad off even earlier. The iPhone 4 and the first iPad ran the same A4 SoC, and the iPad, with its larger screen, has higher performance requirements than the iPhone. The Apple Watch was extremely processor-limited at launch and was barely able to run even Apple’s own apps; the fact that it was supported for 4 years is itself surprising. My prediction is that the M1 will be powerful enough to receive feature updates for years to come. Now, buying an Intel Mac at this time is probably not wise…


I believe that Apple silicon will give the Mac a boost it has needed for a long time. The Mac’s thermal limitations have made its Intel chips run slower than those in bulkier Windows and Linux machines. In most industries, people will happily trade industrial design for performance, and as a consequence the Mac has been pushed out of many of them. Top-tier CPU performance is what the Mac needs to become relevant again.

Apple’s refusal to work with the temperamental Nvidia has forced them to use AMD’s subpar GPUs. If Apple’s GPU/NPU performance scales well in the desktop market, then they can become relevant in completely Mac-free markets such as mechanical engineering. But the M1 is not a good test for this; only once Apple enters the desktop market can we really compare their performance to Nvidia’s.

Apple utilizing an SoC instead of a separate CPU and GPU is something uniquely within their capabilities. Intel is not a GPU firm and Nvidia is not a CPU firm; each company dedicates its efforts to its own side of things. Apple has created an SoC that is greater than the sum of its parts, allowing the CPU and GPU to work together more efficiently through unified memory. Apple also producing the OS means the software and hardware are tuned together: Apple can test its OS, see where the pain points are, then add dedicated coprocessors to speed up those processes.

I predict that within three years the Mac will double its share of yearly PC sales to 15%. Apple has averaged 7–8% for decades, so this would be a big deal. It is of course a risky prediction, and it can easily go south, especially depending on the Mac’s GPU performance. But safe predictions are boring! So 15% is my forecast. If you believe otherwise, please email me at farzad.saif@gmail.com and we can make a bet.

Sources:

Apple M1 Event

Apple unleashes M1 Press Release (which rescinds the “world’s fastest CPU core” claim)

The Friday Checkout’s analysis of Apple silicon’s June announcement

The Friday Checkout’s analysis of M1’s November announcement

RISC vs CISC basics

RISC vs CISC in 2020

Why Is Apple’s M1 Chip So Fast?

Apple Foreshadows the Rise of RISC-V

Coding Coach’s analysis of M1

Gary Sims’ analysis of M1

Linus Sebastian’s prediction that the M1 devices will be dropped from Apple support much earlier than M2 devices

Linus Tech Tips benchmarks of M1

Official TensorFlow tests on M1

MacBook Air SSD benchmarks twice as fast as previous model

Windows’ x86 emulation benchmarks

Esperanto ET-SoC-1

Intel Quick Sync Video

Intel’s AI Accelerator in Xeon processors

Memory controllers

Load/store architecture

iOS RAM compression

List of iOS devices and the number of months they received feature updates

iPhone 4’s limitations on iOS 7