Historiana Developer and Project documentation

SEO optimisation for Historiana

Description of the current state and the planned work regarding SEO for the Historiana project.

What is SEO?

Search Engine Optimisation (SEO) is the process of improving the visibility of a website in search results, by making it easier for search engines to index its content and connect it to the queries of users.

Common practices

In the early days any URL on the web essentially pointed to an HTML document. More often than not it was a physical file on a disk of some web server somewhere on the internet.

The first definitions of metadata within an HTML document date from that period. In the HTML HEAD section one can add tags which are a free-form way of adding data that is not directly visible in the page itself.

However, these tags are included in the source code of the page and are easily parsed by the various search engines. While the meta fields are free-form, there are some common practices:

<meta name="author" content="Chris Mills">
<meta name="description" content="The MDN Web Docs Learning Area aims to provide
complete beginners to the Web with all they need to know to get
started with developing web sites and applications.">

These fields are shown in Google search results. Other companies created their own standards; Facebook introduced the Open Graph Protocol, e.g.:

<meta property="og:image" content="https://developer.cdn.mozilla.net/static/img/opengraph-logo.dc4e08e2f6af.png">
<meta property="og:description" content="The Mozilla Developer Network (MDN) provides
information about Open Web technologies including HTML, CSS, and APIs for both Web sites
and HTML5 Apps. It also documents Mozilla products, like Firefox OS.">
<meta property="og:title" content="Mozilla Developer Network">

Twitter has a similar but different set of fields, e.g.:

<meta name="twitter:title" content="Mozilla Developer Network">

The full set of Twitter fields is documented at https://developer.twitter.com/en/docs/tweets/optimize-with-cards/overview/markup

There are other parties, but these three are the most common. Apart from feeding additional data to search engines, part of this metadata is also used in environments where URLs are shared, to add an image, title and description as a preview of the site.

/Pasted_Image_20_05_2019__16_33.png (Telegram shows icon, title and description of a site)

/Pasted_Image_20_05_2019__16_31.png (WhatsApp does a similar thing, without an image)

Over the last decade much development has been done around HTML, and most pages nowadays are generated automatically. This makes adding the relevant tags less laborious than manual editing, but to support a range of services all the relevant service-specific tags still need to be added.

For a "page" to work nicely in Google, Twitter and Facebook, all the individual service-specific tags need to be added to that page.

Modern websites

With the advent of JavaScript and other modern technologies the web has changed dramatically compared to the early days. Websites were soon developed using server-side technology, where the web server generated the HTML that was sent to the browser.

Later this shifted to a more client-centred focus, as suddenly it was not just desktop computers but also mobile phones and other platforms like iPads joining the web.

Each platform has its own idiosyncrasies and requires adapting the data to optimise the usability of the information on that specific platform.

Instead of generating one "everyone gets the same data" response, a shift was made where the client requests specific versions of the data suited to its own usage. The client took control over the data: instead of serving pages, websites became "REST API endpoints" from which clients request their data and display it in a user interface.

This led to the rise of many platforms and frameworks based on the "SPA" model, the single-page application.

These kinds of apps load their page once and apply incremental updates to it, in order to improve load times and the user experience in general.

This way a page is loaded once and changed by JavaScript as additional data is loaded in response to the user's interaction with the page, thus avoiding the loading of large amounts of data that are not needed for the current view.
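
As a generic illustration of this pattern (not Historiana code; the /api/sections endpoint and the #content element are invented for the example), a minimal sketch of such incremental loading:

// Minimal SPA-style incremental update: the page shell is loaded once,
// additional data is fetched only when the user navigates.
async function showSection (name) {
  const response = await fetch(`/api/sections/${name}`)
  const section = await response.json()
  document.querySelector('#content').innerHTML =
    `<h1>${section.title}</h1><p>${section.body}</p>`
}

// Wired to links instead of full page reloads.
document.querySelectorAll('a[data-section]').forEach(link => {
  link.addEventListener('click', event => {
    event.preventDefault()
    showSection(link.dataset.section)
  })
})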

Drawbacks

This shift to a more intelligent client side brought a better user experience (UX), but it also has several drawbacks.

In the old days, when navigation actually changed to a new page, it triggered both a new entry in the access log of the web server and the loading of a new page, which could contain its own page-specific meta information.

Thus the rise of the SPA broke both website statistics and the meta information shown to search-engine bots.

SEO in the Historiana context

Initial versions of the Historiana project were Django-based and server-side rendered. However, no particular effort was made to add <meta> tags to the content, other than the occasional <title> tag.

The more recent iterations of the project use Vue and are SPA-based. Soon after the cutover to the new version we lacked statistics, and the Google search results apparently went into decline as well.

What has been done

For statistics we were using Google Analytics, which gave little insight into the usage of the various parts of the site because everything was registered as a single visit.

As an experiment we installed a local instance of Matomo, which is in many ways quite similar to Google Analytics but can be self-hosted. Matomo documents tracking SPA navigation fairly well, and we currently use its trackPageView calls in the live version of Historiana.
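
A minimal sketch of how this SPA tracking can be hooked into the Vue router, assuming the standard Matomo JavaScript tracker (window._paq) is loaded on the page; the actual hook in the Historiana codebase may look different:

// Sketch: report SPA route changes to Matomo on every navigation.
// `router` is the application's Vue Router instance (assumed to exist).
router.afterEach((to) => {
  window._paq = window._paq || []
  window._paq.push(['setCustomUrl', to.fullPath])
  window._paq.push(['setDocumentTitle', to.name || to.fullPath])
  window._paq.push(['trackPageView'])
})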

/Pasted_Image_21_05_2019__18_01.png (example of custom page-tags in Matomo statistics)

We are able to differentiate between the various sections of the site but can't track the deeper content like specific Historical Content items.

Although this mostly solves the tracking part of the SPA, it does not help with search-engine optimisation at all. Bots visiting the site for Google and other search engines are not able to navigate to specific pages, partly because some of the navigational code is in JavaScript and bots generally do not execute JavaScript; they just look for <a> links.

Another reason is that Historiana currently uses hash navigation: links to content are created with a # sign, for example https://historiana.eu/#/teaching-learning instead of https://historiana.eu/teaching-learning

This is the default behaviour of Vue, but it causes problems on the SEO side of things. Originally the # symbol is used to address a position within a page; frameworks like Vue repurpose this feature to their advantage. Search engines, however, treat everything after the # as a fragment within a single page, so these links are not indexed as separate pages.

What will be done

Fix the # issues

Vue now supports "history mode", in which the # is no longer needed. This requires some changes on the server, but more importantly also in the Historiana codebase. There are quite a few hardcoded links in the code which all need to be changed. The tracker code above also needs to be revisited, as it might break due to the lack of the # sign in the URL.
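
A sketch of what the router change might look like (Vue Router 3 syntax; the route and component shown are illustrative, not the actual Historiana routes):

import Vue from 'vue'
import VueRouter from 'vue-router'

Vue.use(VueRouter)

// 'history' mode produces clean URLs such as /teaching-learning
// instead of /#/teaching-learning. The web server must be configured to
// serve index.html for unknown paths, otherwise deep links return a 404.
const router = new VueRouter({
  mode: 'history',
  routes: [
    { path: '/teaching-learning', component: () => import('./views/TeachingLearning.vue') }
  ]
})

export default router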

Result: search-engine bots should be able to "see" the complete site with all its URLs. Due to JavaScript issues this might not solve the visibility of all content.

SSR

Server-side rendering (SSR) means that the whole JavaScript issue is circumvented by falling back to HTML that is rendered completely on the server instead of in the browser.

There are various ways of doing this; we need to work out our own approach based on the recipes outlined on the Vue SSR information page.

As stated on the Vue SSR page, most search engines are capable of handling some JavaScript, and they are still improving. We will probably end up using a mix of techniques, where some main entry points are statically rendered on the server for an optimal SEO experience, combined with adding more meta information to the deeper layers of the site.
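
For the "more meta information in the deeper layers" part, one possible approach is sketched below: updating the document head on every route change (libraries such as vue-meta offer the same idea in a more declarative way). It assumes each route definition carries a meta object with title and description, and note that tags set this way are only visible to bots that execute JavaScript.

// Sketch: update the title and description meta tag per route.
// Assumes route definitions like { path: '...', meta: { title, description } }.
router.afterEach((to) => {
  const { title, description } = to.meta || {}
  if (title) document.title = `${title} | Historiana`
  if (description) {
    let tag = document.querySelector('meta[name="description"]')
    if (!tag) {
      tag = document.createElement('meta')
      tag.setAttribute('name', 'description')
      document.head.appendChild(tag)
    }
    tag.setAttribute('content', description)
  }
})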

The exact outcome will be determined as research and development in this area progresses.

Sitemap

In the old days sites often had a "sitemap": a standard HTML page showing a tree of all the pages on the site. Some sites still have this feature, but Google is pushing an invisible version of this concept: sitemap.xml.

This is a (computer-generated) file which describes all known URLs of a website in an XML format. It is based on an open standard originally developed by Google, for example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

We will implement a utility which generates such a sitemap based on the information in the Historiana graph database. Once built, this can run daily to automatically update the map with the latest content added to the site.
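
A sketch of what such a generator could look like (Node.js; fetchAllPublicUrls() is a placeholder for the actual query against the graph database, and the example URLs and dates are invented):

// Sketch: write sitemap.xml from a list of public URLs.
const fs = require('fs')

// Placeholder for the real query against the Historiana graph database.
async function fetchAllPublicUrls () {
  return [
    { loc: 'https://historiana.eu/', lastmod: '2019-05-21' },
    { loc: 'https://historiana.eu/teaching-learning', lastmod: '2019-05-21' }
  ]
}

async function writeSitemap (path) {
  const urls = await fetchAllPublicUrls()
  const entries = urls.map(u =>
    `  <url>\n    <loc>${u.loc}</loc>\n    <lastmod>${u.lastmod}</lastmod>\n  </url>`
  ).join('\n')
  const xml = '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries + '\n</urlset>\n'
  fs.writeFileSync(path, xml)
}

writeSitemap('sitemap.xml')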

Google Search Console

While Google is not the only search engine, it is the biggest one and many people use it by default. This results in many companies buying keywords and trying to beat others to a place on the first page of search results.

This business model brings in a lot of money, and hence Google created tools that give their customers insight into their investments. The current name of the main entry point to these tools is Google Search Console; earlier it was called Webmaster Tools (Wikipedia).

We have already set up an account for the Historiana project and we will continue using this tool.

/GoogleSearchConsole.png

Once sitemap generation is in place we can use this tool to validate the sitemap. We will also monitor the statistics and usability reports it generates.

Quicker managed releases

Historiana is developed using GitLab; currently the "development" and "master" branches are used for dev.historiana.eu and historiana.eu respectively.

Currently the differences between the two versions are too big to allow for quick updates. We should work towards a quicker release cycle, as SEO feedback can only be observed on the real live site. There is no use in developing SEO tools and only publishing them on the development site.

In order to do this we will need to bring the current development site into a state in which it can be published as the live site. Then we need to set up an additional branch in which long-term development can take place. Releasing a new version should be further automated, perhaps even via the CI/CD options of GitLab.

What will NOT be done

Some currently suggest that the real solution to SEO for SPA/PWA applications is a complete server-side rendering setup using a framework like NuxtJS.

We won't go that route.

Although such a framework still uses Vue, it adds another layer of complexity to the system. While we are working on Opening Up Historiana, the more generic our web stack is, the better. Even with "just Vue" it is difficult to find partners with sufficient coding experience to add value to the project. Adding more niche technology to the project would be counterproductive.

Furthermore, as stated in the Vue SSR notes, search bots like Google's are constantly improving their JavaScript support, which also lessens the need for such a drastic change in architecture.