sitemap_gen.py MemoryError

I have a site which gets a good amount of traffic, and its access log now grows by around 10GB per month. Some time ago I installed sitemap_gen.py by Google to create a sitemap for search engines (Google again). While the logs were small it was not a problem, but once they became large I started to get a MemoryError when attempting to create a sitemap from a large access log file. Well… We live in interesting times and here we go again: Google! And guess what?! There is no solution on the web for it (at least after hours of googling I have not found one). Google’s own product has a major fault which they have not rectified since 05-12-2005, and Google still recommends sitemap_gen.py on their own site https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html (this was correct at the time of writing on 14-May-2009; Google has since removed the page. Perhaps they’re ashamed to endorse their own product).

I looked around for alternatives and there are some commercial sitemap generators available, but I am an open source supporter and want to have a truly free and open system. Unfortunately I could not find any free open source sitemap generators either.

So as Gary Oldman said in The Fifth Element: if you want something done properly - do it yourself. I ran sitemap_gen.py under strace and the problem immediately came to light: the statement file.readlines() causes the WHOLE access log to be read into memory. Oops. I can understand that Google developers have a virtually limitless amount of memory at their disposal (and that’s what the word Google effectively means, by the way), but we are just pathetic mortals who have only a few gigabytes of memory, and an access log may quite often exceed that. So I used my own brains and one minute of time to replace this statement with one which makes Python read the file line by line.
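
For readers who want to see the shape of the change, here is a minimal sketch in Python 3. It is not the actual diff against sitemap_gen.py: the function names iter_log_lines and count_requests are hypothetical, and the sample log lines are made up. The point is only that looping over the file object itself reads one line at a time, whereas file.readlines() builds a list of every line first.

```python
import os
import tempfile


def iter_log_lines(path):
    """Yield access-log lines one at a time instead of loading them all."""
    # A file object is its own lazy iterator: each step reads one line
    # (through a small internal buffer), so memory use stays flat no
    # matter how large the log is.
    with open(path, "r", errors="replace") as log:
        for line in log:
            yield line.rstrip("\n")


def count_requests(path):
    """Hypothetical consumer: count non-empty log lines in a single pass."""
    return sum(1 for line in iter_log_lines(path) if line)


if __name__ == "__main__":
    # Build a tiny stand-in for a real access log (made-up sample lines).
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
        tmp.write('1.2.3.4 - - "GET /index.html HTTP/1.1" 200\n')
        tmp.write('1.2.3.4 - - "GET /about.html HTTP/1.1" 200\n')
        sample = tmp.name
    try:
        print(count_requests(sample))
    finally:
        os.remove(sample)
```

Any code that previously looped over the list returned by readlines() can usually loop over the file object (or a generator like the one above) without other changes, as long as it only needs a single pass over the lines.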

Obviously Google developers have not stress tested their own script. And despite a large number of people having this problem, nothing has been done about it for years. But hopefully the one minute I have spent (and another few minutes writing this blog entry) will save a lot of headaches for everyone who runs their own site and wants to have sitemaps.

You can download the fixed sitemap_gen.py from here: http://www.bashkirtsev.com/files/sitemap_gen.py.

9 Responses to “sitemap_gen.py MemoryError”

  1. KrisBelucci says:

    Hi, cool post. I have been wondering about this topic, so thanks for writing.

  2. Kelly Brown says:

    I really like your post. Is it copyright protected?

  3. admin says:

    If you are speaking about sitemap_gen.py - it is copyrighted by Google. My post is obviously my own copyright. But sitemap_gen.py is released under the BSD 2.0 new license, which allows modification and free use as long as the original authors are mentioned. Full text of the license is here

  4. JaneRadriges says:

    Hi, gr8 post thanks for posting. Information is useful!

  5. The article is useful for me. I’ll be coming back to your blog.

  6. GarykPatton says:

    How soon will you update your blog? I’m interested in reading some more information on this issue.

  7. admin says:

    I do not expect to alter this article as it is more or less final. However, it may be a good idea to subscribe to the RSS feed, so that if something comes up you will know about it.

  8. Derekp says:

    I think I’ve seen this somewhere before…but it’s not bad at all

  9. Cordula says:

    Good catch! Using file.readlines() was really sloppy coding.
