November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

kinke <kinke@gmx.net> changed:

What      |Removed   |Added
----------------------------------------------------------------------------
CC        |          |kinke@gmx.net
Component |dmd       |druntime
OS        |Linux     |All

--- Comment #1 from kinke <kinke@gmx.net> ---
According to the backtrace, the problem is in druntime's `core.cpuid` - of the **host compiler's** druntime used to build LDC. Which one did you use? The issue might have been fixed in recent druntime already.

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

Jure Pečar <jurij.pecar@embl.de> changed:

What      |Removed   |Added
----------------------------------------------------------------------------
Component |druntime  |dmd
OS        |All       |Linux

--- Comment #2 from Jure Pečar <jurij.pecar@embl.de> ---
I don't know what the official binaries are built with; my build used gcc 12.3, llvm 16.0.6 and ldc 1.24 to build ldc 1.35.

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

kinke <kinke@gmx.net> changed:

What      |Removed   |Added
----------------------------------------------------------------------------
Component |dmd       |druntime
OS        |Linux     |All

--- Comment #3 from kinke <kinke@gmx.net> ---
Please leave my component and hardware changes in place; this has absolutely nothing to do with the DMD compiler.

The official LDC binaries are compiled with LDC itself, so v1.35 is built with v1.35. So from your description, we only know that `core.cpuid` of the LDC v1.24 druntime, i.e. druntime v2.094, doesn't support your CPU. But as your LDC v1.24 host compiler works, this means that the host druntime used for building that LDC v1.24 works.

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #4 from Jure Pečar <jurij.pecar@embl.de> ---
Sorry this is my first time meeting D ecosystem, I'm having trouble following your feedback.

So by "host compiler" you mean the previous version of LDC that was used to build current version of LDC?

If that's the case, can you tell me which version of LDC started recognizing and working with zen4c cpus?

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #5 from kinke <kinke@gmx.net> ---
(In reply to Jure Pečar from comment #4)
> Sorry this is my first time meeting D ecosystem, I'm having trouble following your feedback.

No worries.

> So by "host compiler" you mean the previous version of LDC that was used to build current version of LDC?

Yes.

> If that's the case, can you tell me which version of LDC started recognizing and working with zen4c cpus?

I don't know if it is working in current druntime. You could e.g. launch some Ubuntu/Debian container (min Ubuntu 20.04 for glibc) and try to run the official v1.35 in there. If there's no startup error, druntime v2.105 probably works.

FWIW, the problematic module is https://github.com/dlang/dmd/blob/master/druntime/src/core/cpuid.d. As the name suggests, it uses/depends on the CPUID instruction. I can only tell you that everything works on my workstation, a Threadripper 3960X.

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #6 from Jure Pečar <jurij.pecar@embl.de> ---
I'll try to wade through this cpuid detection logic in an attempt to spot something. How would I narrow down the approximate location in the code where the crash happens?

FYI, Sambamba (and LDC) works fine on zen4 cpus such as Genoa and Genoa-X. /proc/cpuinfo reports identical cpuid level and flags for all three. The only difference for zen4c (Bergamo) should be a smaller cache. Does that help us narrow down the issue?

--

November 21, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #7 from Jure Pečar <jurij.pecar@embl.de> ---
Here's a diff of `cpuid -1` output from 32c Genoa (-) and 128c Bergamo (+):

@@ -3,16 +3,16 @@
  version information (1/eax):
  processor type = primary processor (0)
  family = 0xf (15)
- model = 0x1 (1)
- stepping id = 0x1 (1)
+ model = 0x0 (0)
+ stepping id = 0x2 (2)
  extended family = 0xa (10)
- extended model = 0x1 (1)
+ extended model = 0xa (10)
  (family synth) = 0x19 (25)
- (model synth) = 0x11 (17)
- (simple synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+ (model synth) = 0xa0 (160)
+ (simple synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm
  miscellaneous (1/ebx):
- process local APIC physical ID = 0x10 (16)
- maximum IDs for CPUs in pkg = 0x40 (64)
+ process local APIC physical ID = 0xd6 (214)
+ maximum IDs for CPUs in pkg = 0xff (255)
  CLFLUSH line size = 0x8 (8)
  brand index = 0x0 (0)
  brand id = 0x00 (0): unknown
@@ -80,7 +80,7 @@
  RDRAND instruction = true
  hypervisor guest status = false
  cache and TLB information (2):
- processor serial number = 00A1-0F11-0000-0000-0000-0000
+ processor serial number = 00AA-0F02-0000-0000-0000-0000
  deterministic cache parameters (4):
  --- cache 0 ---
  cache type = no more caches (0)
@@ -287,7 +287,7 @@
  bit width of fixed counters = 0x0 (0)
  anythread deprecation = false
  x2APIC features / processor topology (0xb):
- extended APIC ID = 16
+ extended APIC ID = 214
  --- level 0 ---
  level number = 0x0 (0)
  level type = thread (1)
@@ -296,8 +296,8 @@
  --- level 1 ---
  level number = 0x1 (1)
  level type = core (2)
- bit width of level & previous levels = 0x6 (6)
- number of logical processors at level = 0x40 (64)
+ bit width of level & previous levels = 0x8 (8)
+ number of logical processors at level = 0x100 (256)
  --- level 2 ---
  level number = 0x2 (2)
  level type = invalid (0)
@@ -401,13 +401,13 @@
  highest COS number supported = 0xf (15)
  extended processor signature (0x80000001/eax):
  family/generation = 0xf (15)
- model = 0x1 (1)
- stepping id = 0x1 (1)
+ model = 0x0 (0)
+ stepping id = 0x2 (2)
  extended family = 0xa (10)
- extended model = 0x1 (1)
+ extended model = 0xa (10)
  (family synth) = 0x19 (25)
- (model synth) = 0x11 (17)
- (simple synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+ (model synth) = 0xa0 (160)
+ (simple synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm
  extended feature flags (0x80000001/edx):
  x87 FPU on chip = true
  virtual-8086 mode enhancement = true
@@ -469,7 +469,7 @@
  LLC performance counter extensions = true
  MWAITX/MONITORX supported = true
  Address mask extension support = true
- brand = "AMD EPYC 9334 32-Core Processor "
+ brand = "AMD EPYC 9754 128-Core Processor "
  L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
  instruction # entries = 0x40 (64)
  instruction associativity = 0xff (255)
@@ -509,7 +509,7 @@
  line size (bytes) = 0x40 (64)
  lines per tag = 0x1 (1)
  associativity = 0x9 (9)
- size (in 512KB units) = 0x100 (256)
+ size (in 512KB units) = 0x200 (512)
  RAS Capability (0x80000007/ebx):
  MCA overflow recovery support = true
  SUCCOR support = true
@@ -566,8 +566,8 @@
  branch sampling feature support = false
  (vuln to branch type confusion synth) = false
  Size Identifiers (0x80000008/ecx):
- number of threads = 0x40 (64)
- ApicIdCoreIdSize = 0x6 (6)
+ number of threads = 0x100 (256)
+ ApicIdCoreIdSize = 0x8 (8)
  performance time-stamp counter size = 40 bits (0)
  Feature Extended Size (0x80000008/edx):
  max page count for INVLPGB instruction = 0x7 (7)
@@ -714,13 +714,13 @@
  line size in bytes = 0x40 (64)
  physical line partitions = 0x1 (1)
  number of ways = 0x10 (16)
- number of sets = 32768
+ number of sets = 16384
  write-back invalidate = true
  cache inclusive of lower levels = false
- (synth size) = 33554432 (32 MB)
- extended APIC ID = 16
+ (synth size) = 16777216 (16 MB)
+ extended APIC ID = 214
  Core Identifiers (0x8000001e/ebx):
- core ID = 0x8 (8)
+ core ID = 0x6b (107)
  threads per core = 0x2 (2)
  Node Identifiers (0x8000001e/ecx):
  node ID = 0x0 (0)
@@ -799,14 +799,14 @@
  number of LBR stack entries = 0x10 (16)
  number of avail Northbridge perf ctrs = 0x10 (16)
  number of available UMC PMCs = 0x20 (32)
- active UMCs bitmask = 0x6db
+ active UMCs bitmask = 0xfff
  Multi-Key Encrypted Memory Capabilities (0x80000023):
  secure host multi-key memory support = true
  number of encryption key IDs = 0x3f (63)
  0x80000024 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
  0x80000025 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
  AMD Extended CPU Topology (0x80000026):
- extended APIC ID = 16
+ extended APIC ID = 214
  --- level 0 ---
  level number = 0x0 (0)
  level type = core (1)
@@ -821,9 +821,9 @@
  CMPXCHG8B = true
  conditional move/compare = true
  PREFETCH/PREFETCHW = true
- (multi-processing synth) = multi-core (c=32), hyper-threaded (t=2)
+ (multi-processing synth) = multi-core (c=128), hyper-threaded (t=2)
  (multi-processing method) = AMD leaf 0xb
- (APIC widths synth): CORE_width=5 SMT_width=1
- (APIC synth): PKG_ID=0 CORE_ID=8 SMT_ID=0
- (uarch synth) = AMD Zen 4, 5nm
- (synth) = AMD EPYC (4th Gen) (Genoa B1) [Zen 4], 5nm
+ (APIC widths synth): CORE_width=7 SMT_width=1
+ (APIC synth): PKG_ID=0 CORE_ID=107 SMT_ID=0
+ (uarch synth) = AMD Zen 4c, 5nm
+ (synth) = AMD Ryzen (Bergamo) [Zen 4c], 5nm

Since that cpuid.d is mostly poking around these register values, I'm pretty sure that the key to fixing this issue is hiding in here.

--

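The "synth" values in the diff from comment #7 follow from the raw fields by simple arithmetic. The following is a minimal C sketch of that arithmetic, not code from `cpuid.d` or from the `cpuid` tool. The function names are hypothetical, and it assumes the usual x86 convention for a base family of 0xF (as in both dumps): the extended family is added to the base family, and the extended model forms the high nibble of the model.

```c
#include <assert.h>
#include <stdint.h>

/* Display family: the extended family is added only when the base family is 0xF. */
uint32_t synth_family(uint32_t base_family, uint32_t ext_family)
{
    assert(base_family <= 0xF);
    return base_family == 0xF ? base_family + ext_family : base_family;
}

/* Display model, assuming a base family of 0xF: the extended model is the high nibble. */
uint32_t synth_model(uint32_t base_model, uint32_t ext_model)
{
    assert(base_model <= 0xF && ext_model <= 0xF);
    return (ext_model << 4) | base_model;
}

/* Cache size in bytes: ways * partitions * line size * sets. */
uint64_t synth_cache_size(uint32_t ways, uint32_t partitions,
                          uint32_t line_size, uint32_t sets)
{
    return (uint64_t)ways * partitions * line_size * sets;
}
```

With the values from the diff, `synth_family(0xF, 0xA)` gives 0x19 for both CPUs, `synth_model(0x1, 0x1)` gives 0x11 for Genoa and `synth_model(0x0, 0xA)` gives 0xa0 for Bergamo. Likewise, `synth_cache_size(16, 1, 64, 16384)` gives 16777216 bytes (16 MB) for Bergamo against 33554432 bytes (32 MB) for Genoa with 32768 sets.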
November 23, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #8 from Jure Pečar <jurij.pecar@embl.de> ---
Keen eyes in easybuild community noticed that cpuid.d uses ubyte for numcores in a couple of places. For example, function getcacheinfoCPUID4 uses uint for numcores, but function getAMDcacheinfo uses ubyte. It also doesn't differentiate between cores and threads so I assume it walks all logical cpus there in the loop on lines 633-641. Bergamo has 128 cores, 256 threads, ubyte rolls over and then on line 659 you divide something by numcores. Boom.

To test this hypothesis, I disabled SMT on one of the Bergamo nodes. Indeed, LDC then works as expected:

# ldc2
Error: No source files

So I'd say the fix is to just s/ubyte/uint/g on cpuid.d. And check if you do any similar things elsewhere.

Thanks,

--

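The rollover described in comment #8 can be illustrated in C, with `uint8_t` standing in for D's `ubyte`. This is a sketch under stated assumptions, not the actual `cpuid.d` code: the function names are hypothetical, and it assumes the count is read as an 8-bit "count minus one" value and then incremented before being used as a divisor.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit counter, as with ubyte: incrementing 255 wraps around to 0. */
uint8_t count_as_u8(uint8_t count_minus_one)
{
    uint8_t n = count_minus_one;
    n++;
    return n;
}

/* Widened counter, as with uint: 255 becomes 256 and can never be 0. */
uint32_t count_as_u32(uint8_t count_minus_one)
{
    uint32_t n = (uint32_t)count_minus_one + 1u;
    assert(n != 0);
    return n;
}

/* Split a size among `count` units; returns 0 instead of dividing by zero. */
uint32_t share_per_unit(uint32_t size_kb, uint32_t count)
{
    return count == 0 ? 0 : size_kb / count;
}
```

For a raw value of 255, `count_as_u8` returns 0 while `count_as_u32` returns 256. Using 262144 KB purely as an example dividend (the Bergamo L3 size of 0x200 units of 512 KB from comment #7), `share_per_unit(262144, count_as_u32(255))` yields 1024, whereas the 8-bit count would hand an unguarded division a zero divisor.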
November 23, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #9 from Dlang Bot <dlang-bot@dlang.rocks> ---
@kinke created dlang/dmd pull request #15859 "core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores" mentioning this issue:

- core.cpuid: Fix div-by-zero on AMD CPUs with 256 (physical?) cores

  See: https://en.wikipedia.org/wiki/CPUID#EAX=80000008h:_Virtual_and_Physical_address_Sizes

  This *might* fix Issue 24254, although I'd expect the read value for that CPU to be 127 (*physical* cores minus 1), not the problematic 255.

https://github.com/dlang/dmd/pull/15859

--

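The 255 mentioned in the pull request appears to match the `number of threads = 0x100 (256)` line in comment #7, since the field stores the count minus one. As a rough sketch in C, the fields of that register could be decoded as follows. The names are hypothetical, the bit positions are taken from AMD's documented layout of CPUID leaf 0x80000008 ECX, and the ECX values in the note below are reconstructed from the `cpuid` output rather than taken from raw register dumps.

```c
#include <assert.h>
#include <stdint.h>

/* Selected fields of CPUID leaf 0x80000008 ECX, as assumed by this sketch. */
struct size_ids {
    uint32_t threads;       /* ECX[7:0] + 1, held in a 32-bit field so 256 fits */
    uint32_t apic_id_size;  /* ECX[15:12] */
};

struct size_ids decode_size_ids(uint32_t ecx)
{
    struct size_ids ids;
    ids.threads = (ecx & 0xFFu) + 1u;       /* widen before adding one */
    ids.apic_id_size = (ecx >> 12) & 0xFu;
    assert(ids.threads >= 1 && ids.threads <= 256);
    return ids;
}
```

A reconstructed Bergamo value of 0x80FF decodes to 256 threads and an APIC ID size of 8, and a reconstructed Genoa value of 0x603F decodes to 64 threads and a size of 6, in line with the "Size Identifiers" hunk of the diff.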
November 23, 2023 [Issue 24254] LDC crash on Epyc Bergamo

https://issues.dlang.org/show_bug.cgi?id=24254

--- Comment #10 from kinke <kinke@gmx.net> ---
(In reply to Jure Pečar from comment #8)
> Keen eyes in easybuild community noticed that cpuid.d uses ubyte for numcores in a couple of places […]

Thank you, and please send those keen eyes my regards. :)

--
