Challenge to High-Speed, Memory-Efficient XML Processing
2026-01-23
Introduction
PLATEAU and other 3D city model datasets come in many formats, but PLATEAU contains a large number of huge CityGML files (XML). For that reason, being able to process them quickly is important. It is not an exaggeration to say that the cost and total processing time are largely determined by the XML reading step itself. Faster parsing lowers cost, shortens the feedback loop to users, and improves the overall experience.
In this article, I will introduce a Go library called gosax that I built with the goal of high-speed, memory-efficient XML streaming processing.
- When and why Go’s standard encoding/xml becomes slow
- What gosax sacrifices and what it prioritizes
- Which optimizations are effective, and what pitfalls to watch out for
What is gosax?
This is the library.
https://github.com/orisano/gosax
gosax is a fast, memory-efficient SAX-like streaming XML parser written in Go.
It does not use CGO (meaning it does not depend on libxml2, etc.), and it is designed to process input without expanding the entire document into memory. Performance is the primary goal, and in exchange, there are parts of the XML spec that it does not fully cover.
For easier experimentation and for migration from encoding/xml, I also provide a slightly richer package named gosax/xmlb.
Background
First, there was a discussion about using Rust’s XML parser, quick-xml (https://github.com/tafia/quick-xml), as a way to process XML quickly.
quick-xml is a high-performance XML reader/writer written in Rust. Its characteristics include a near zero-copy design, efficient memory allocation, support for various encodings and namespace resolution, and benchmark results showing performance about 50× faster than xml-rs.
Using the right tool for the job is important, but using multiple programming languages within a single system increases maintenance costs. Since our existing applications were written in Go, we decided to first explore an implementation in Go.
However, Go’s standard encoding/xml raises performance concerns. In the official issue, a user reported extremely low performance in 2017 when doing SAX-style parsing of large XML files compared to C#, and noted that the bottleneck was CPU rather than disk I/O. The issue is still labeled “NeedsInvestigation” and “Performance”, and there has been no fundamental improvement.
Beyond that issue, there are many reports of people writing their own implementations because encoding/xml is slow for certain use cases. On Reddit, someone reported developing an XML tokenizer that is 4× faster than the standard library (github.com/muktihari/xmltokenizer).
Also, github.com/tamerh/xml-stream-parser is a streaming parsing library for large XML data. It supports skipping elements, parsing attributes only, XPath queries, and more.
An article on golangleipzig.space introduces an approach to address encoding/xml performance problems via parallelism. By implementing a custom TagSplitter and splitting the XML stream into ~16MB chunks for batch processing, it improved the processing time of 327GB of PubMed XML data (about 36 million documents) from 177 minutes (sequential) to 20 minutes (parallel), roughly a 9× improvement.
There is also a detailed benchmark in Eli Bendersky’s 2019 article, which shows encoding/xml is slow. For processing a 230MB XML file, Go’s standard library took 6.24 seconds, while Python’s xml.etree.ElementTree (using the C implementation libexpat) took 3.7 seconds, and a pure C implementation (libxml’s SAX API) took 0.56 seconds. The article also provides a gosax module that wraps libxml via cgo, and after optimization it reached 3.68 seconds, comparable to the Python implementation. The implementation code and benchmark repository are also published.
This benchmark was re-run by a third party in 2024 in a Hacker News comment thread. On an M1 Mac, the reported results were: Go standard library 3.35 seconds, gosax 1.45 seconds, and C 0.38 seconds, confirming that there was no large improvement. I also ran the benchmark locally and confirmed similar results.
Benchmarking existing implementations
I created equivalent element-counting implementations, including one using quick-xml, and measured them. For a 223MB out.xml file, the results were as follows.
This benchmark is a very simple case: “count elements in a large XML.” The goal is to focus on the parser core rather than small differences in string handling. Also, with data like CityGML, this kind of scan is commonly used as a preprocessing step.
| Command | Note | Mean [s] | Min [s] | Max [s] | URL |
|---|---|---|---|---|---|
| ./go-stdlib-count ./out.xml | go 1.22.2 | 3.094 ± 0.025 | 3.045 | 3.132 | github |
| python3 etree-count.py ./out.xml | Python 3.12.3 | 2.262 ± 0.023 | 2.245 | 2.319 | github |
| python3 lxml-count.py ./out.xml | Python 3.12.3 | 2.218 ± 0.015 | 2.191 | 2.232 | github |
| ./xml-stream-parser-count ./out.xml | go 1.22.2 | 1.553 ± 0.008 | 1.548 | 1.573 | gist |
| ./eliben-gosax-count ./out.xml | go 1.22.2 | 1.146 ± 0.010 | 1.134 | 1.171 | github |
| ./quick-xml-count ./out.xml | rustc 1.81.0-nightly, release build | 0.426 ± 0.031 | 0.409 | 0.513 | gist |
| ./c-libxmlsax-count ./out.xml | Apple clang version 14.0.3, -O2 | 0.365 ± 0.004 | 0.362 | 0.374 | github |
From these results, we can see that existing Go implementations cannot match quick-xml’s performance. The starting point for this work was: “If we want to aim for quick-xml-level performance in Go, what do we need to do?”
Implementation Policy
If you want to build a fast library, you must be careful about API design and fine-grained performance details. A common approach to avoid premature optimization is to start with a straightforward implementation and then remove hotspots one by one. However, that approach is not always sufficient.
I agree with the perspective in "Hotspot performance engineering fails". I do not think there is always a single hotspot that you can fix to solve everything.
So this time, I followed this order:
- First, define constraints.
- Within those constraints, lean into a design that has a “path to winning.”
- Then, refine while profiling.
Constraints
- Avoid CGO because it complicates builds → do not depend on libxml2
- Be able to process without expanding the whole input in memory → streaming parser
- Achieve quick-xml-level speed → avoid unnecessary copies
I set quick-xml’s speed as the target and started by porting its design to Go. Although quick-xml’s parser has the advantage of not having to account for a GC, I judged that Rust’s compiler optimizations alone do not explain the gap; the design itself had to matter.
Still, Go cannot do everything in the same way as Rust, so we needed to think carefully about implementation choices for the port.
Because I had previously contributed to github.com/goccy/go-json, I was familiar with examples of building high-performance parsers in Go.
- https://dave.cheney.net/paste/gophercon-sg-2023.html
- https://dconf.org/2017/talks/schveighoffer.html
The buffer strategies for avoiding unnecessary copies, and the approach to writing a parser on top of such buffers, were heavily informed by pkg/json and goccy/go-json.
gosax design
When processing large XML, even small amounts of memory allocation, data copying, and branching can affect performance. More importantly, it is often not enough to remove one bottleneck; multiple factors tend to compound and degrade performance at the same time.
For that reason, gosax starts by clearly defining “design constraints that are likely to improve performance,” and then optimizing the implementation under those constraints. The main characteristics are:
- Expose a portion of the internal buffer directly to the user, so no copy occurs
  - The returned slice is valid only at that moment; if you want to keep the data beyond that scope, you must copy it.
  - In other words, it is possible to use it incorrectly (speed first).
- No interfaces
  - In Go, using interfaces typically forces values onto the heap.
- State management via functions
  - This reduces switch costs but prevents inlining.
- The parser grows the buffer itself
  - The parsing function becomes essentially a for-loop.
This involves some trade-offs and preferences, but for the goal of scanning huge XML quickly, these choices were effective.
The key to speed: byte sequence search
One of the most important optimizations was the byte search portion.
A streaming XML parser is ultimately dominated by searching for delimiters such as <, >, />, quotes, and whitespace. If this part is slow, no amount of clever design on top will make the parser fast.
quick-xml uses BurntSushi/memchr, a library built with significant effort for speed. Because this forms the performance foundation, it is not realistic to build and maintain an equivalent library in Go.
However, bytes.IndexByte uses SIMD effectively internally, and its logic is optimized and maintained by the Go team, so I tried to use it as much as possible. On the other hand, bytes.IndexAny does not use SIMD, so that case needs separate handling.
Another possible approach would be to write custom SIMD assembly in Go. But in my experience, I often encountered cases where it did not become faster (for example, because it cannot be inlined, or due to ABI0 overhead). Considering maintenance cost and the fact that you would need multiple implementations per environment, I did not choose that route.
SWAR (which does not require writing assembly) has little maintenance cost, so I adopted that approach. Improving this byte search significantly impacted gosax’s performance.
In other words, when writing a fast parser, you cannot fully avoid SIMD or SWAR. You need to make sure you can realize those techniques at some layer.
Results
With the various optimizations described above, gosax achieved the following results.
These are still simple cases, so the numbers do not directly translate into real production workloads. But at least, it shows that Go can reach high speed if you push in this direction.
| Command | Note | Mean [s] | Min [s] | Max [s] | Relative | URL |
|---|---|---|---|---|---|---|
| ./go-stdlib-count ./out.xml | go 1.22.2 | 3.094 ± 0.025 | 3.045 | 3.132 | 14.02 ± 0.34 | github |
| python3 etree-count.py ./out.xml | Python 3.12.3 | 2.262 ± 0.023 | 2.245 | 2.319 | 10.25 ± 0.26 | github |
| python3 lxml-count.py ./out.xml | Python 3.12.3 | 2.218 ± 0.015 | 2.191 | 2.232 | 10.05 ± 0.24 | github |
| ./xml-stream-parser-count ./out.xml | go 1.22.2 | 1.553 ± 0.008 | 1.548 | 1.573 | 7.04 ± 0.16 | gist |
| ./eliben-gosax-count ./out.xml | go 1.22.2 | 1.146 ± 0.010 | 1.134 | 1.171 | 5.19 ± 0.13 | github |
| ./quick-xml-count ./out.xml | rustc 1.81.0-nightly, release build | 0.426 ± 0.031 | 0.409 | 0.513 | 1.93 ± 0.15 | gist |
| ./c-libxmlsax-count ./out.xml | Apple clang version 14.0.3, -O2 | 0.365 ± 0.004 | 0.362 | 0.374 | 1.66 ± 0.04 | github |
| ./orisano-gosax-count ./out.xml | go 1.22.2 | 0.221 ± 0.005 | 0.218 | 0.237 | 1.00 | github |
In this simple case, it seems to have achieved the target performance.
Notes on reading benchmarks
There are many “fast” parsers across different formats. Benchmarks often show the best-case results and claim superiority.
What you need to be careful about is whether the parser is simply ignoring expensive parts of the spec. Common examples include not interpreting escape sequences (making string processing faster), not sorting results, and so on.
Of course, that is one valid design choice. Depending on the use case, those features may be unnecessary, and you may prefer performance and handle the missing responsibilities on the user side.
gosax adopts the same philosophy. To prioritize speed, gosax is designed to not take on certain responsibilities. As a result, using it with the same expectations as encoding/xml can cause problems. Users need to read the documentation and understand the correct way to use it.
And ideally, you should always re-run benchmarks on your own use case.
Even with the same library and the same usage style, the bottleneck can change depending on how namespaces are handled, what information you need (attributes, text, etc.), and whether you need escape processing. The most reliable way to decide whether gosax fits is to evaluate it with real data and profiling.
Conclusion
After achieving the target performance, I implemented various processing tasks using gosax, including PLATEAU CityGML processing. With these implementations, we were able to process XML extremely quickly. In the PLATEAU CityGML API, we built an API that reads huge CityGML files on the fly and returns results within a few seconds.
This shows that, with the right techniques, it is possible to build a fast XML parser in Go. However, to avoid unnecessary copies in Go, the only option is an unsafe implementation. I felt that languages like Rust, which can provide safe and fast APIs, are better suited to this area.
Also, do not take published benchmark results at face value. The best approach is to re-run benchmarks for your own use case. When you do, include profiling as well, and go as far as tuning library options.
If you find gosax useful, please give it a star on GitHub. It is encouraging.
Eukarya is hiring for various positions! We are looking forward to your application from everyone who can contribute to OSS!
➔ Re:Earth / ➔ Eukarya / ➔ note / ➔ GitHub
Eukarya is developing and operating a WebGIS SaaS called Re:Earth. We aim to complete all GIS-related tasks including 3D (such as publishing map applications, data management, and data conversion) on the web. Most of the source code is published on GitHub as OSS.