Adding a URL class

Sep 13, 2023

Hi all! Now that Chris & I have finished all of the chapters in the book, we’re going back through earlier chapters and trying to fix anything that turned out to be a bad idea. We’ll blog about these fixes when we can, especially because I think it can be helpful to see what code didn’t, but could have, made it into the book.

How Chapter 1 was written

Chapter 1 is pretty old—it was published in July 2020, just as the world was starting to first emerge from Covid lockdowns, and in fact a month before Chris joined the project! This means it was before we had any real sense of the tone or style of the book. For example, some of the “Go Further” blocks in that chapter were pretty strange by the standards of later chapters: a short sentence linking to a standards document, nothing more.

The presentation was also odd—it was breezier, less concerned with writing or organizing code. For example, most of the code was written as if https://example.org/index.html was the only URL you might be requesting, and it was up to the reader to actually generalize it to handle multiple URLs. That was confusing, and it was easy to get it wrong (for example, by not updating the Host header). In fact, that style wouldn’t last even a second chapter, because as the browser gets bigger and more complex, it becomes more and more important to know where each line of code lives and when it is executed.

So anyway, the point is that Chapter 1 tried out some stylistic ideas that didn’t quite work, and I was itching to rewrite it anyway.

Why one big function

Besides style, there was a key technical issue with Chapter 1. As originally published, Chapter 1 basically writes one big function: request.[1] The issue with request is that it’s quite long, which makes it awkward to talk about later in the book, when we need to add features (like cookies). The obvious choice would be to break it up into multiple steps:

parse_url to parse a URL and return pieces like the host and path
request_url to send an HTTP request
Maybe parse_response to parse the resulting HTTP response
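
As a rough sketch of what that three-way split might look like (the original code is lost, so the function bodies, the HTTP/1.0 request text, and the http-only restriction here are all assumptions for illustration):

```python
import io
import socket

def parse_url(url):
    # Split "http://host[:port]/path" into its pieces.
    scheme, rest = url.split("://", 1)
    assert scheme == "http", "this sketch only handles http"
    if "/" not in rest:
        rest = rest + "/"
    host, path = rest.split("/", 1)
    port = 80
    if ":" in host:
        host, port = host.split(":", 1)
        port = int(port)
    return scheme, host, port, "/" + path

def request_url(host, port, path):
    # Open a connection, send a GET request, and return the live socket.
    # The caller is responsible for reading from it and closing it.
    s = socket.socket(
        family=socket.AF_INET,
        type=socket.SOCK_STREAM,
        proto=socket.IPPROTO_TCP,
    )
    s.connect((host, port))
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    s.send(request.encode("utf8"))
    return s

def parse_response(response):
    # Read the status line, headers, and body from a file-like object,
    # such as s.makefile("r", encoding="utf8", newline="\r\n").
    statusline = response.readline()
    version, status, explanation = statusline.split(" ", 2)
    headers = {}
    while True:
        line = response.readline()
        if line == "\r\n":
            break
        header, value = line.split(":", 1)
        headers[header.casefold()] = value.strip()
    body = response.read()
    return status, headers, body

# parse_response only needs a file-like object, so it can be tried
# without touching the network.
fake = io.StringIO("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\nhello")
status, headers, body = parse_response(fake)
```

Note that request_url hands an open socket back to its caller, so the code that opens the connection is not the code that closes it.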

In fact, this isn’t hypothetical, because it’s how I initially wrote this chapter (sadly, pre-git, so lost to the mists of time). However, this design was pretty verbose and over time I merged pieces. For example, while splitting the HTTP request and response is nice, the object passed between the two would be a live socket connection. That seemed dangerous: different functions open and close the socket, and errors can happen in either function. So in the first version of the chapter that I committed, I had already combined the request and response logic into a single request function. So it was request and parse.

Of course, I couldn’t really name the parse_url function just parse, because browsers parse lots of things! But even parse_url was awkward. It would return a four-tuple, which I then had to pass to request. Since I didn’t want to use fancy Python features like *args, that meant every HTTP request would take up two lines (one to parse and unpack into variables, and another to call request). Plus those lines were just long and ugly. So eventually I just combined all of it into one big request function. This worked fine for the first chapter (where I could just walk through each piece of it step by step) and also for the next few, because we don’t modify the request function until Chapter 8, and because the most complex modifications don’t happen until Chapter 10.
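
To make the two-line problem concrete, here is a minimal sketch. The four-tuple layout (scheme, host, port, path) is an assumption, and request is a stand-in that only builds the request text instead of opening a socket:

```python
def parse_url(url):
    # Returns the four-tuple (scheme, host, port, path).
    scheme, rest = url.split("://", 1)
    host, path = rest.split("/", 1)
    port = 80
    if ":" in host:
        host, port = host.split(":", 1)
        port = int(port)
    return scheme, host, port, "/" + path

def request(scheme, host, port, path):
    # Stand-in for the real networking code: just format the request.
    assert scheme == "http"
    return "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)

# Without argument unpacking, every request costs two long lines.
scheme, host, port, path = parse_url("http://example.org/index.html")
text = request(scheme, host, port, path)

# The shorter form relies on * unpacking at the call site.
same_text = request(*parse_url("http://example.org/index.html"))
```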

Issues with one big function

That said, as we wrote more and more of the book, we started to dislike the one big function. First of all, over time, we changed Chapter 10 to focus more on browser-internal security policies (like CSP) and that meant adding functions like url_origin. These duplicate pieces of the URL parsing logic.
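
The duplication looks roughly like the sketch below. Only the name url_origin comes from the book; request_target is a hypothetical stand-in for the parsing at the top of the big request function, and both bodies are assumptions:

```python
def request_target(url):
    # Stand-in for the URL parsing at the top of the big request function.
    scheme, rest = url.split("://", 1)
    host, path = rest.split("/", 1)
    return host, "/" + path

def url_origin(url):
    # Repeats the same splitting, just to recover the scheme and host.
    scheme, rest = url.split("://", 1)
    host = rest.split("/", 1)[0]
    return scheme + "://" + host
```

Any change to how URLs are split, such as handling ports, then has to be made in both places.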

There was also a sense that much of the parsing code was eventually “organized” into classes like CSSParser and HTMLParser, but URL parsing ended up scattered across multiple functions called from multiple classes.

I initially tried a Networking class that could be used as a namespace for all the URL-focused classes. The issue with that was there wasn’t any state for the Networking class to represent until Chapter 10 (when cookies are added), which would make it weird to create the class in Chapter 1. (It would have only static methods, something we’ve avoided in the rest of the book.) And splitting parsing into its own function would reintroduce the issue that its return value was an ugly four-tuple.

Eventually I hit upon a variation that would work: a class to represent URLs.

A URL class

Upon reconsideration, the theme of the book is browsers, not web protocols like HTTP. So while the first chapter does focus on networking, we need to think about networking from a browser-focused perspective, which to me means a procedure to convert URLs to resources. So the key object to represent is the URL, not networking in the abstract. A URL class would fix a lot of issues.

First, the URL parser could be the constructor for URL. Instead of returning a four-tuple, it would just return a URL object with proper fields. Instead of passing all the fields to request, request could be a method on URLs. In Chapter 10 we need to pass more arguments to request, including a second URL, and distinguishing the URL being requested by making it the object whose request method is called would make that code clearer.
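
A minimal sketch of that shape follows. The field names (scheme, host, port, path), the http-only restriction, and the HTTP/1.0 request text are assumptions for illustration, not necessarily what the updated chapter does:

```python
import socket

class URL:
    def __init__(self, url):
        # The constructor is the parser: it fills in named fields
        # instead of returning a four-tuple.
        self.scheme, url = url.split("://", 1)
        assert self.scheme == "http", "this sketch only handles http"
        if "/" not in url:
            url = url + "/"
        self.host, url = url.split("/", 1)
        self.path = "/" + url
        self.port = 80
        if ":" in self.host:
            self.host, port = self.host.split(":", 1)
            self.port = int(port)

    def request(self):
        # The socket is opened and closed inside one method.
        s = socket.socket(
            family=socket.AF_INET,
            type=socket.SOCK_STREAM,
            proto=socket.IPPROTO_TCP,
        )
        try:
            s.connect((self.host, self.port))
            request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(
                self.path, self.host)
            s.send(request.encode("utf8"))
            response = s.makefile("r", encoding="utf8", newline="\r\n")
            statusline = response.readline()
            version, status, explanation = statusline.split(" ", 2)
            headers = {}
            while True:
                line = response.readline()
                if line == "\r\n":
                    break
                header, value = line.split(":", 1)
                headers[header.casefold()] = value.strip()
            body = response.read()
        finally:
            s.close()
        return headers, body

# A request is one line again (not run here, since it needs a network):
#   headers, body = URL("http://example.org/index.html").request()
url = URL("http://example.org:8080/index.html")
```

Because the Host header is built from self.host, it can no longer drift out of sync with the URL being requested.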

Things like url_origin and resolve_url could become other methods on URLs. Plus, it would become possible to test URL parsing directly by just constructing a URL and querying its fields, instead of inspecting the HTTP request. (This would help my students, who use the browser unit tests as part of their assignments.)
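
For example, the helpers and the tests might look like the sketch below. The method names origin and resolve, the choice to always include the port in the origin, and the simplified relative-URL handling are all assumptions:

```python
class URL:
    def __init__(self, url):
        self.scheme, url = url.split("://", 1)
        assert self.scheme in ["http", "https"]
        if "/" not in url:
            url = url + "/"
        self.host, url = url.split("/", 1)
        self.path = "/" + url
        self.port = 443 if self.scheme == "https" else 80
        if ":" in self.host:
            self.host, port = self.host.split(":", 1)
            self.port = int(port)

    def origin(self):
        # Scheme, host, and port identify the origin.
        return "{}://{}:{}".format(self.scheme, self.host, self.port)

    def resolve(self, url):
        # Full URLs are parsed as-is; others are resolved against self.
        if "://" in url:
            return URL(url)
        if not url.startswith("/"):
            directory, _ = self.path.rsplit("/", 1)
            url = directory + "/" + url
        return URL(self.origin() + url)

# Testing the parser is just constructing a URL and reading its fields.
base = URL("https://example.org/docs/index.html")
assert base.host == "example.org"
assert base.port == 443
assert base.origin() == "https://example.org:443"
assert base.resolve("page.html").path == "/docs/page.html"
assert base.resolve("/top.html").path == "/top.html"
```

None of these checks needs a network connection or an inspection of the HTTP request text.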

The networking chapter could also be refocused on URLs instead of networking in general. This just required moving some paragraphs around, not a full rewrite, but Chris & I felt it made the chapter more concrete. It also means that we can get to code (for URL parsing) right away, which is more in keeping with the rest of the book.

Conclusion

Anyway, we’ve now rewritten the first chapter, and the rest of the book, to introduce a URL class. The updated chapter is live on the site. We’ll be making more modifications to the book, and pushing those as we go, with blog posts when we can. And hopefully soon we’ll be able to share news about an on-paper publication sometime in 2024!

[1] Chapter 1 also introduces a show function, but that’s rapidly obsoleted, and so doesn’t really affect later chapters.

Web Browser Engineering Blog
