HTML is too complex

(This is the ninth post in a series on the publishing industry’s new product categories.)

The syntax of HTML and XML—angle brackets and closing elements—isn’t complex. It’s tedious, but it isn’t complex. If the problem lay in the basic syntax we’d have an easy time fixing it. The problem with markup complexity lies in the underlying model. Or, in the lack of one. Simply put, HTML is a mess.

This is from an email sent by Matthew Thomas to the WhatWG mailing list (that list was at the time responsible for the development of HTML5) almost ten years ago. Everything it says is still true:

In response to the proposal that HTML5 add a host of semantic elements, each with no default rendering to distinguish it from other elements, Matthew predicted the following:

  • The A-list of Web developers will begin using all the elements
    correctly on their Weblogs, and they will feel good about it.

  • A greater number of Web developers will never use most of these
    elements, but they will replace all occurrences of <div> on their
    pages with <section> because it’s more “semantic” (just like they
    did with <em> for <i> and <strong> for <b>), and they will feel good
    about it.

  • The vast majority of article producers (Weblogs and online
    newspapers) will never use <article>, because there’s no visual or
    behavioral benefit from doing so. So <article> will never become a
    reliable way of dissecting or aggregating pages.

  • The number of knowledgable HTML authors, the proportion of HTML
    pages that are valid, and therefore the overall usefulness of the
    Web, will be less than it otherwise would have been because of
    HTML’s increased complexity.

I’d argue that his prediction, ten years ago, was pretty much spot on:

  • The A-list rewrote their own sites to use fancy HTML5 semantic elements, then wrote books, presented talks, and sold workshops teaching people how to do the same.
  • The hangers-on and wannabes try a bit but don’t use any of the elements except maybe header and footer, and possibly article after that was blessed as a generic sort of standalone content container instead of section. Most of the elements are regularly used incorrectly.
  • The vast majority don’t use any of the semantic elements unless it’s by accident, such as through a thoughtless copy-paste.
  • The only reason the proportion of valid HTML files has increased is that HTML5 retroactively blessed invalid files as valid, provided they wear the HTML5 doctype.

The web remains too unstructured for article to become a good way of ‘dissecting or aggregating pages’ as originally envisioned. The HTML5 outlining algorithm isn’t used by anybody (except the A-list gurus) and, even worse, is supported by very few browsers or screenreaders.

As Matthew Thomas mentioned in the email above, unless there is an immediate visual or behavioural benefit to using an element, most people will ignore it. This is compounded by the angle-brackets mess of HTML. By completely separating design (CSS), behaviour (JS), and structure (HTML) the specification gods have taken away the context that would make it easier for us mere mortals to give our documents a meaningful structure.

That’s without getting into the problems with the syntax itself.

While the separation makes using HTML for documents and ebooks more difficult, it is essential for HTML to become an app platform, which is obviously now the web’s primary purpose.

(Most websites today are just web apps for delivering ads. They certainly aren’t made with readability in mind.)

There was a long period of time when the markup of most websites was unreadable because they used a mess of nested table tags to render the site. The markup was meaningless and complex. For a few years, though, after that, when you viewed the source of your average website, you would have seen relatively clean and nicely structured markup that most people could understand, even without specific knowledge about HTML. Google’s web crawlers loved simple, well-structured documents and so the web filled with them.

Now we’re back to seeing almost the same level of complexity and messiness in most web pages as we saw in the worst days of table-hacking. The semantic elements from HTML5 are largely unused. Those that are used, such as <header> and <footer>, are used incorrectly because people misunderstand what they mean. Every page is riddled with div elements with opaque classes and IDs nested in a document structure that is more complex than many I saw in the table-layout days.
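One rough way to put numbers on that impression is to count how many div elements a page uses compared with HTML5 semantic elements, and how deeply its elements nest. What follows is only a minimal Python sketch built on the standard library's html.parser. The names MarkupStats and markup_stats and the contents of the SEMANTIC set are my own illustrative choices, not anything defined by the HTML specification, and the depth figure assumes every non-void element is explicitly closed.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative (assumed) selection of HTML5 sectioning/semantic elements.
SEMANTIC = {"article", "section", "nav", "aside", "header", "footer", "main"}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class MarkupStats(HTMLParser):
    """Tally tag usage and track the deepest nesting level seen."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag not in VOID:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Ignore stray end tags rather than letting depth go negative.
        if tag not in VOID and self.depth > 0:
            self.depth -= 1


def markup_stats(html):
    """Return div count, semantic-element count, and maximum nesting depth."""
    parser = MarkupStats()
    parser.feed(html)
    parser.close()
    return {
        "div": parser.tags["div"],
        "semantic": sum(parser.tags[t] for t in SEMANTIC),
        "max_depth": parser.max_depth,
    }


page = ('<div id="w"><div class="c"><div class="x1">'
        '<header>Hi</header><p>Text<br></p>'
        '</div></div></div>')
print(markup_stats(page))  # {'div': 3, 'semantic': 1, 'max_depth': 4}
```

Real pages often omit closing tags for elements like p and li, so treat the depth figure as an estimate unless the markup is known to be well formed.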

This escalating complexity is arguably one of the biggest ongoing issues in web development because it makes things like authorship, search engines, discoverability, and automation more difficult than they should be.

You see, if the markup you assign to a piece of content has a specific meaning, you can write code that’s aware of this meaning. You make human meaning machine readable. This is useful if you want to make the text more searchable or if you want blind people to be able to hear it with their screenreaders. If the markup is too complex (both the underlying model and the markup syntax) to use properly, the humans won’t be able to do the markup properly, making the content’s meaning machine-opaque again. HTML5 has a big problem with markup complexity where even A-list developers have spent countless hours debating what the various new semantic elements actually mean.
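To make the point about machine-readable meaning concrete, here is a minimal Python sketch, again using only the standard library's html.parser. It pulls out the text inside article elements and nothing else. The class ArticleText, the function article_text, and both sample pages are hypothetical, invented purely for illustration.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <article> elements we are currently inside
        self.chunks = []  # non-empty text fragments found inside an article

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def article_text(html):
    """Return the text inside all <article> elements, joined by spaces."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The same content twice: once with semantic elements, once as div soup.
semantic = ("<header>Site name</header>"
            "<article><h1>Title</h1><p>Body text.</p></article>"
            "<footer>Ads</footer>")
div_soup = ('<div class="hd">Site name</div>'
            '<div class="c1"><div class="t">Title</div><div>Body text.</div></div>'
            '<div class="ft">Ads</div>')

print(article_text(semantic))        # Title Body text.
print(repr(article_text(div_soup)))  # ''
```

With the semantic markup, a dozen lines of code can find the content. With the div version, the same code finds nothing, and any replacement has to guess at what class names like c1 and t were meant to signify.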

Hint: They don’t mean what most of us assume they mean. Section, Article, Footer, Header, all of them have differences in meaning from what we’d assume from existing practice or basic understanding of English.

HTML5 is itself complex. Most developers can’t or won’t put in the effort to properly mark up their content semantically. EPUB3 and its ilk add even more complexity, more ‘semantic’ elements and attributes, all of them even more difficult to understand and harder to explain than the basic new semantic elements of HTML5.

Badly implemented complexity, such as in HTML5 and EPUB3, means we get all the pain and difficulty of escalating complexity, but with few of the benefits. Unfortunately, these are formats whose limitations we have to work around and surpass. They are a disadvantage for both the web and the ebook industry. One of the tasks publishing has ahead is to try to neutralise that disadvantage.

15 thoughts on “HTML is too complex”

  1. Great article—the human aspect of this is dead on. The machine-readable aspect may overlook what Google is likely doing behind the scenes with it, which I suspect will lead to another era of SEO service providers having something on which to hang their hats.

  2. Nobody is getting paid to deliver well structured semantic mark-up and it is, as the article suggests, a “feel good” for the author. We have evolved, the days of slicing images into tables for layout purposes are mostly gone, but then, just the other day I threw in the towel on CSS, where the right approach was using divs and floats to achieve my layout requirement, but it was just not working! 2 min. later, I converted the layout to a table and I was done.

  3. HTML is not an archival form for representing information – HTML changes over time, and you can’t document why you didn’t use “section” before it was invented, why you didn’t use “aside” or “poem” before they were added. It’s a markup language for interchanging between a Web server and a Web browser. If you want “semantics”, if you want markup that helps a human understand why the text is the way it is, and that helps a computer to process the text, use XML, and translate to HTML for rendering. Then you’re in control of your vocabulary yourself.

    • HTML, specifically its vocabulary, is the basis of all ebook formats where you aren’t in control of your vocabulary yourself. The only thing that is stored and transmitted is what HTML supports. It very much is an archival format for representing information.

      Even if you do use XML for storage that still leaves the authorship problem. Except now, instead of trying to fix authorship for a format in common use (HTML) you have to build one from scratch. You have to be really careful and meticulous if you aren’t going to end up with the same sort of mess as the one we have in HTML.

      But, yeah, authoring and storing in some other format before transforming it to HTML is definitely a part of the solution.

      It doesn’t address the fact that, with ebooks, we have been forced to use HTML as an archival format for information.

  4. I wonder if understanding of the new elements could have been enhanced if they did have different styling by default.

    My thoughts on the complexity and messiness of modern web-pages is that web-devs have this idea that:

    • either they compose all the markup of the page on the server
    • or they implement a JS driven single-page-app with downloaded templates and JSON encoded content.

    When they opt for the former then naturally they need to distinguish banners and site nav and ads from the primary content and of course these need layout hooks for CSS which used to be done with tables and so it quickly gets messy.

    No-one (literally) feels free to have just the markup for the primary content in the page and then layout / decorate to their heart’s content with DOM scripting in the browser.

    • So so many more people author HTML than just web devs. Any time somebody in marketing touches a WordPress rich text field they are authoring HTML and leaving out everything but the absolute simplest elements because authoring would become too complex with those elements included.

      And there are hundreds, if not thousands, of corporate CMSes where the users have to use HTML or a flavour of it directly.

    • When it comes to WordPress, it’s mostly about the themes, less about the plugins, but the core is html5.

    • Not sure what your point is there @baldur. “Someone in marketing” isn’t making the markup messy and complex. It’s WordPress (or equivalent CMS) and WP devs that do that.

    • This is where we disagree. The markup normal people make is messy and complex because the model underlying HTML is messy and complex. If it weren’t, normal people would have an easier time authoring it.

    • What I’ve suggested in the past is that all ebook readers have a detailed and expressive default stylesheet where all elements have a distinct look. Bonus points if the default stylesheet is good enough for most people to skip making their own styles for their ebooks.

      On the web browser side, the future of HTML belongs to programmers. Since browsers are now first an app platform and second a document platform, the only people qualified to deal with HTML’s mess are programmers. What they will have to do is use that platform to create authorship tools for the rest of the world that abstract away the mess and let people create structured text. Which is basically what is happening already.

  5. I have bagged EPUB3 – it’s a wolf in sheep’s clothing masquerading as open but really bolstering the e-reader as a walled garden. Why go the extra mile to have all the character ripped away by the likes of Apple, Amazon and even Readium?

    Instead I am developing a CMS to create html5 docs stored locally. I am starting with a simple process to create rich stand alone single documents that contain by themselves everything demanded – media, CSS and javascript. While there are some limitations, we’ve got time.

    I am sure the e-reader developers know this is coming but why not make some $$ on an extra media cycle similar to 45s->records->8-track->cassettes->CD->DVD->digital drm-> and finally where we are now.

  6. PDF is a format going back to the days when a screen layout was supposed to look like paper, because everything was printed and the paper layout was the model.

    HTML is definitely the superior technique and vastly better suited to the needs of reading on screens with different screen formats.
    PDF is technically an outdated format.

    And there is another HUGE positive aspect of HTML: it’s an open standard.

    Understandable that Adobe is quite nervous about their outdated format. I expect more stupid articles trying to discredit HTML in the future. But the facts will not change with false propaganda about how bad HTML is: HTML is perfect for reading on different screens and PDF is awful, HTML is fast and PDF is awfully slow, HTML is open and PDF is restricted.

    The scientific community is already adapting to HTML since the benefits over PDF are so dramatic.

    And HTML has another positive aspect: the NSA has no access to your documents if you choose the correct software vendor…

  7. Some people consider HTML a complex language, but it is all about describing data. No doubt there are some tags which are very difficult, but I think it is not difficult to avoid them. For a web developer it is not a big problem to deal with HTML. It’s pretty good to read your post about HTML.
