Getting a New Site Indexed: Observations
Date Published: 14th March 2019
As a full-time SEO who tends to work with established companies, it’s been a while since I’ve been involved in launching a new website from scratch and seeing how Google responds to it.
By new website from scratch, I don’t mean a redesign or site migration; I’m talking about a new-ish domain with 0 results currently indexed in Google.
You’d assume that as SEOs we’d all be continually testing and building new websites, but unfortunately, going by my own experience and that of colleagues and peers, this often isn’t the case.
It’s entirely possible you are a Senior SEO at a large multinational agency who has a) never launched a website or b) never started from a position of 0 pages indexed or 0 links.
Not having done this stuff doesn’t preclude you from being excellent at your job, and I’m certainly not SEO gatekeeping here at all.
I’m just making a long-winded point that it’s been too long now since I’ve personally observed how Google responds to brand new websites; a very very very basic and small part of the overall SEO picture.
My Initial Observations
Disclaimer: Obviously one swallow does not a summer make. Correlation != causation. Stuff I have observed on nicksamuel.com doesn’t guarantee it will be repeatable on a different domain.
Second Disclaimer: A lot of the points below interrelate so the subheadings aren’t perfect, but what are you going to do?
Third Disclaimer: This was mostly written 1 week after the site became indexable to Google and was submitted via Search Console. More patience is definitely needed in the real world of SEO but since this is called initial observations…
“It” has a good memory – I sort of lied when I said nicksamuel.com has zero search results and is a brand new domain. I actually used it to host a short-lived website about five years ago; however, it has long been removed and a site: search would have returned 0 results in the index.
Lo and behold, to my surprise, when I was checking out the server logs (see my post on LogFlare) I saw Mr Googlebot trying to “randomly” crawl previously existing URLs (and 404’ing hard). My old site five years ago actually used the www subdomain and, as a product of its time, defaulted to http, so I’m not sure where Google got these links from right away, e.g. https + non-www combined with old URLs.
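For anyone wanting to repeat this kind of log check, here is a rough Python sketch of how you might pull Googlebot 404s out of an access log. It assumes the common "combined" log format, and the sample lines (IPs, paths, timestamps) are made up for illustration rather than taken from my logs. It also only matches on the user agent string, which can be spoofed, so treat the output as a hint rather than proof.

```python
import re
from collections import Counter

# Assumes the "combined" access log format:
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_404s(lines):
    """Count 404s served to requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match["status"] == "404" and "Googlebot" in match["agent"]:
            hits[match["path"]] += 1
    return hits

# Hypothetical sample lines, for illustration only.
sample = [
    '66.249.66.1 - - [10/Mar/2019:10:00:00 +0000] "GET /old-page/ HTTP/1.1" 404 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Mar/2019:10:00:05 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Mar/2019:10:01:00 +0000] "GET /old-page/ HTTP/1.1" 404 512 "-" '
    '"Mozilla/5.0"',
]

for path, count in googlebot_404s(sample).most_common():
    print(count, path)
```

Sorting by count makes it easy to spot which long-dead URLs Googlebot keeps coming back for.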
We know that site: search is generally quite unreliable, and Google is happy to filter available search results based on stuff such as duplicate content.
Question: “To what extent does Google have a hidden index, or an index which isn’t publicly viewable of domains/urls/pages?”
“It” doesn’t always use the front door – Keeping an eye on server logs, I wanted to see Googlebot ripping through the site à la Screaming Frog or your cloud-based crawling tool of choice, but the reality is that regardless of your site, it’s more efficient for Googlebot to dip in and out.
Especially when the web is composed of billions, wait, probably trillions of pages. I was slightly disappointed not to see this “rapid parsing” even after the sitemap was submitted, and again when I asked for it to be recrawled, but such is life; crawl rates exist for a reason!
Question: “How often, if ever, does Googlebot do a “deep” crawl across sites? e.g. Using the homepage or sitemap as a starting point and then going several levels deep?”
“It” is out-hustled by Bingbot – The assumption is usually that Google is the “better” search engine and that this must be because it has a fresher index, i.e. crawls more often; however, this isn’t true so far for nicksamuel.com and for many other websites out there. It could well be the old cliche that Google works smarter rather than harder compared to Bing, but I thought it was interesting in this instance as I haven’t actually got round to setting up Bing Webmaster Tools yet.
Question: “Does setting up Bing Webmaster Tools mean Bing crawls your site more or less? How does Bing define crawl efficiency compared to Google?”
Site Search Is Unreliable – OK, so I touched upon this in my first point about Googlebot, but this one relates more specifically to the publicly viewable index of pages.
I’m sure we’ve all seen this in the field, where we do a site: search using all four common website prefixes (http, https, www and non-www), and then compare what the total should be with a plain site:domain.com search.
Or even do the same site:domain.com search across two days where we know content is static. It can fluctuate quite unexpectedly which means using this as a general FYI reporting metric can sometimes be misleading.
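As a small aside, the four prefix combinations can be generated rather than typed out each time. This is only a sketch: the function name is my own and the /seo-reviews/ path is a hypothetical example URL, not necessarily the real one.

```python
from urllib.parse import urlsplit, urlunsplit

def prefix_variants(url):
    """Return the http/https and www/non-www variants of a URL."""
    parts = urlsplit(url)
    # Normalise to the bare host first so "www." is never doubled up.
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return [
        urlunsplit((scheme, prefix + host, parts.path, parts.query, parts.fragment))
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]

# Hypothetical page path, for illustration only.
for variant in prefix_variants("https://nicksamuel.com/seo-reviews/"):
    print(variant)
```

The same list is handy for checking that three of the four variants redirect to the one you actually want indexed.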
Anyway, I was surprised to see this index fluctuate to the extent it did when we’re talking about a domain with 11 URLs. Yes, eleven URLs as shown in the sitemap below. So somehow only 8 were indexed, and the homepage wasn’t…until about 4 days later and several crawl and index requests.
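If you want to keep track of which sitemap URLs are still missing, something like the sketch below would do it. It assumes a standard sitemaps.org XML sitemap and that you have collected the list of indexed URLs yourself (by hand from site: searches, for instance); it doesn’t query Google in any way. The two URLs in the sample are placeholders, not my real sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

def missing_from_index(xml_text, indexed):
    """Return sitemap URLs that are absent from a hand-collected indexed list."""
    indexed = set(indexed)
    return [url for url in sitemap_urls(xml_text) if url not in indexed]

# Placeholder sitemap with two URLs, for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://nicksamuel.com/</loc></url>
  <url><loc>https://nicksamuel.com/seo-reviews/</loc></url>
</urlset>"""

print(missing_from_index(sample_sitemap, ["https://nicksamuel.com/seo-reviews/"]))
```

Running it every day or two would at least show when a missing page, such as the homepage in my case, finally turns up.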
Question: “How can webmasters ensure all URLs within a new sitemap are indexed at the same time?”
Old Sitemaps Remain – This might partially answer where Google got the “random” URLs to crawl from. When I verified the entire domain property using DNS, it was brought to my attention that there was an old sitemap rocking http and www (from five years ago as I mentioned above).
This is a fantastic example which shows the value of this new feature, as I probably wouldn’t have been arsed to re-add every property to check the old search console/settings etc.
Anyway, two points:
1) That old Search Console property was deleted a long time ago and I no longer have access to it (HTTP and WWW), so I’m not sure why Google saw fit to reference it.
2) As above, Google probably tried to guess the correct URLs by combining https + non-www with URLs on the old sitemap; however, even after they were technically able to recrawl that old sitemap (and verify it didn’t exist), they still tried to crawl old non-existent pages!
Question: “How can we remove old sitemaps and tell Google to ignore them using the domain property? (Do we still need to add the individual Search Console properties, e.g. http + www?)”
Questionable Relevancy/Understanding – As an SEO naturally I want to track some keywords for Nick Samuel to see where my site lands and what Google deems to be the most relevant pages. The two keywords I will initially monitor are simply “Nick Samuel” and “Nick Samuel SEO”.
I love seeing where a website or new piece of content lands in order to gauge a few things as a very soft acid test:
A) How relevant Google sees the page (with NO backlinks/off-page optimisation)?
B) What the level of competition is (based on key words, phrases and semantic relevancy alone)?
Anyway, for “Nick Samuel” it shot onto page 7. Not bad, I thought, considering Google still reckoned it couldn’t crawl the website…
For “Nick Samuel SEO”, it started ranking 6/7th. I cut off the top of this SERP accidentally and didn’t count at the time. I actually go into initial rankings a bit further down this post.
Main point here is that by simply adding one keyword modifier (SEO), Google deemed my website first page worthy based on a title tag, a handful of posts and probably a very light sprinkling of “Nick Samuel” + “SEO” throughout the website.
More importantly, it deemed the homepage the most relevant. However to begin with it did initially rank my “SEO Reviews” page higher than just nicksamuel.com.
This *could* indicate two things:
1) Title tags are still important, placement of keyword matters to a certain extent and lastly, the one which is very much “no shit”…
2) Getting basic on-page right will get you pretty damn far if you’re targeting long-tail keywords or anything with say, three keywords and low exact-match competition.
I mean obviously I have an exact-match domain as well, but regardless low-level SEO isn’t dead, long live SEO etc.
Question: “Where will nicksamuel.com eventually land for “Nick Samuel” with zero linkbuilding?”
Google Search Console
New Domain Properties Are Laggy – I feel like an old person making unfounded claims related to the past, but I swear Google Search Console was better back in my day.
We used to call it Webmaster Tools, it gave us lots of useful information, keyword data AND it seemed to update way quicker. After one week, I still haven’t got any solid information as to what pages have been properly indexed by Google or a reason why they wouldn’t be.
Not sure if this is since the switch to “new” Search Console or if it has always been a bit slow. In fairness with client sites, I would check back every couple of days and always have some data to examine; starting from scratch is certainly a less common scenario but regardless, it really underlines the data lag.
Question: “If Google “canonicalises” SERP data, are we still supposed to add 4 x Search Console properties?”
New Domain Properties Are Buggy – This could well be related to point one; there’s a lag in updating data but there was a serious (in my opinion) bug with the robots.txt file.
Basically it had indexed the version from staging, and despite numerous attempts, I couldn’t get it updated. It was a weird catch-22 (at least a colloquial catch-22 anyway) where Google said my page was no-indexed due to robots, but then also wouldn’t crawl the robots file due to robots.
After numerous Fetch as Google attempts throughout the day, including a dozen on the robots.txt itself, I eventually had to re-add the https non-www property to update using old Google Search Console.
However, this of course contradicted what the main property view said and whilst pages were now being crawled, there was no official acknowledgement from either property view. Uh-oh.
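One sanity check worth doing before blaming Search Console is to confirm what the live robots.txt actually allows. Below is a minimal sketch using Python’s built-in robots.txt parser. The two robots.txt bodies are my assumption of what a typical staging file and a typical live file look like, not copies of the real ones, and Python’s parser is not Google’s, so edge cases may be interpreted differently.

```python
from urllib.robotparser import RobotFileParser

def googlebot_allowed(robots_txt, url):
    """Check whether the given robots.txt text lets Googlebot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)

# Assumed contents: a blanket block on staging, and an open file on live.
staging_robots = "User-agent: *\nDisallow: /\n"
live_robots = "User-agent: *\nDisallow:\n"

print(googlebot_allowed(staging_robots, "https://nicksamuel.com/"))  # False
print(googlebot_allowed(live_robots, "https://nicksamuel.com/"))     # True
```

If the live file passes a check like this but Search Console still reports a robots block, a stale cached copy on Google’s side is the more likely explanation.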
So yeah, my advice would be for the time being to only set up GSC post-live. This should avoid any potential caching issues with robots or premature crawling issues.
I guess the main benefit to setting up GSC in advance is to ensure Google can’t index, but you could solve this by password protecting your site with .htaccess or setting up an X-Robots-Tag in the header response.
Question: “When will every option from old Google Search Console be migrated across to new Google Search Console (and why hasn’t it been already, in time for the “launch”)?”
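If you go the X-Robots-Tag route, it’s worth verifying the header is really being sent on staging (and really gone on live). Here’s a small sketch that checks a set of response headers for a blocking directive. The function name is my own, the headers are passed in as a plain dict so you can feed it from whatever HTTP client you use, and it deliberately ignores user-agent-scoped values such as “googlebot: noindex”.

```python
def blocks_indexing(headers):
    """Return True if an X-Robots-Tag header carries noindex or none."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() != "x-robots-tag":
            continue
        directives = [part.strip().lower() for part in value.split(",")]
        if "noindex" in directives or "none" in directives:
            return True
    return False

# Hypothetical response headers, for illustration only.
print(blocks_indexing({"X-Robots-Tag": "noindex, nofollow"}))
print(blocks_indexing({"Content-Type": "text/html"}))
```

The first call should report the page as blocked and the second as indexable, which is the state you want on staging and live respectively.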
OK, no points or question here, just a quick FYI and confirmation of where nicksamuel.com landed for two search terms. To use a bit of old school terminology, have I avoided the mythical Google Sandbox and what can I expect from the Google dance?
Here’s the first screenshot from the ever reliable SEMrush:
And here’s a follow-up from new-ish SERP tracker SerpWoo:
Overall a much longer post than I ever planned on writing. Not very actionable for anyone else but it will be useful for me to refer to in the future with nicksamuel.com or a similar new website.