How to scrape AJAX trees

DISCO, the European Dictionary of Skills and Competences, offers the user a tree to be searched or browsed. Inspecting the tree nodes, we see that concepts are contained in LI elements with an liItem class. Executing $('.liItem').length in the console returns 676. Yet the site claims to collect more than 104000 concepts. A bold claim?

A better look at the tree structure reveals that some concepts have a data-loaded attribute set to true and some set to false. In particular, true denotes readily available nodes (downloaded with the initial page load) and false denotes nodes that require an AJAX call before being displayed. Leaf nodes are always of the former kind, but internal nodes can be of both kinds. Would we get those 104000 concepts if we unfolded all the false nodes?

We’ll try. Along the way we’ll also store all nodes in a different structure, something more portable than bare HTML. JSON seems a good option. Ironically, DISCO uses getJSON to download HTML snippets. To summarize, we are now going to store the whole HTML tree of DISCO in a JSON structure.

As you probably understood by reading some of my last articles, I had decided to scrape the DISCO tree by means of the support provided by jsFiddle. That was before I discovered the existence of Custom JavaScript Snippets in Google Chrome Developer Tools. Apparently they’ve been there for quite some time, almost two years now!

Too bad for me. At least it was fun to make DISCO’s page behave outside of DISCO’s server.

Here is the snippet I came up with:

(function() {
    
    no_frills();
        
    var limit = 0;              // a guard to safely try things out
    var node_count = 0;         // a counter of visited nodes
    var pending_responses = 0;  // a counter for tracing ajax calls

    $('body')
        .prepend('<div>Limit: <input id="limit" type="text" size="5" /> <a id="run" href="#">Run</a></div>');
    
    $('#run')
        .click(function (e) {
            e.preventDefault();

            // when the execution ends, window.disco contains all nodes
            window.disco = {nodes: {}};

            limit = $('#limit').val() * 1 || 1;
            node_count = 0;

            var horizontal_skills = $('.rootList > li').get(0);
            console.log(clean_text($(horizontal_skills).text()));
            download(horizontal_skills);

            var vertical_skills = $('.rootList > li').get(2);
            console.log(clean_text($(vertical_skills).text()));
            download(vertical_skills);
        });

    //---

    // outputs window.disco as soon as there are no pending responses
    function output_result_if_done() {
        if (0 < pending_responses) return;
        console.log('(result)', window.disco);
    }

    // downloads a node
    // node is one li.liItem DOM element
    function download(node) {
        if (limit-- <= 0) return;
        grab_contents(node);
        if (has_children(node)) {
            if (children_are_ready(node)) {
                var children = $(node).next().find('> ul > .liItem').get();
                download_children(node, children);
            }
            else {
                var url = get_children_url(node);
                ++pending_responses;
                $.getJSON(url, function(data) {
                    var children = $(data.html).find('.liItem').get();
                    download_children(node, children);
                    --pending_responses;
                    output_result_if_done();
                });
            }
        }
    }

    // downloads a node's children
    // children is an array of li.liItem DOM elements
    function download_children(node, children) {
        grab_children(node, children);
        for (var i = 0, iTop = children.length; i < iTop; i++) {
            download(children[i]);
        }
    }

    // returns the URL to download a node's contents
    function get_contents_url(node) {
        var template = 'ajax/ajaxCalls.php?ajaxFunction=loadTermData&term_id=--ID--&lang_id=0';
        var id = get_id(node);
        var result = template.replace('--ID--', id);
        return result;
    }

    // returns the URL to download a node's children
    function get_children_url(node) {
        var template = 'ajax/ajaxCalls.php?ajaxFunction=loadNode&prefix=node_&node=--ID--&lang_id=0&documents=false';
        var id = get_id(node);
        var result = template.replace('--ID--', id);
        return result;
    }

    // returns the id of a node
    function get_id(node) {
        var result = $(node).data('termid').match(/node_(\d+)_0/)[1] * 1;
        return result;
    }

    // adds a key / value pair to an id position into window.disco
    function store(id, key, value) {
        if ('label' == key) {
            console.log('new node ' + (++node_count));
        }
        console.log('storing ' + key + ' for ' + id);
        window.disco.nodes[id] = window.disco.nodes[id] || {};
        window.disco.nodes[id][key] = value;
    }

    // true if node has children, whether or not they are ready
    function has_children(node) {
        var result = $(node).children(':first').is('a.dummyLink');
        return result;
    }

    // true if node's children are ready to be visited
    function children_are_ready(node) {
        var next$ = $(node).next();
        if (! (next$.is('li.noBulletsLi'))) return false;

        var child$ = next$.children(':first');
        if (! (child$.is('ul.innerUl'))) return false;

        var siblings = child$.siblings();
        if (! (siblings.length == 0)) return false;

        return true;
    }

    // grabs what we want to store about a node's contents elements
    function grab_contents(node) {

        var id = get_id(node);
        var label = clean_text($(node).find('.itemToBeAdded').text());
        store(id, 'label', label);

        var url = get_contents_url(node);
        ++pending_responses;
        $.getJSON(url, function(data) {

            var html$ = $('<div>' + data.html + '</div>');  // why .wrapAll doesn't work here?

            var stuff = {};
            stuff['term']         = {
                items$: html$.find('#infoTerm'),
                key:    null,
                value:  function(li$) { return clean_text(li$.find('.infoItemTranslation').text()); }
            };
            stuff['synonyms']     = {
                items$: html$.find('h1:contains("synonyms:")').next().find('li'),
                key:    null,
                value:  function(li$) { return clean_text(li$.text()); }
            };
            stuff['translations'] = {
                items$: html$.find('#termTranslations li'),
                key:    function(li$) { return clean_text(li$.find('.langIdentifier').text()); },
                value:  function(li$) { return clean_text(li$.find('.infoItemTranslation').text()); }
            };
            stuff['phrases']      = {
                items$: html$.find('h1:contains("attached phrases:")').next().find('li'),
                key:    null,
                value:  function(li$) { return clean_text(li$.text()); }
            };
            stuff['related']      = {
                items$: html$.find('h1:contains("related terms:")').next().find('li'),
                key:    null,
                value:  function(li$) { return clean_text(li$.text()); }
            };

            var contents = {id: id};
            $.each(stuff, function(stuff_key, stuff_value) {
                if (stuff_value.key) {
                    contents[stuff_key] = {};
                    stuff_value.items$.each(function() {
                        var this$ = $(this);
                        contents[stuff_key][stuff_value.key(this$)] = stuff_value.value(this$);
                    });
                }
                else {
                    contents[stuff_key] = [];
                    stuff_value.items$.each(function() {
                        var this$ = $(this);
                        contents[stuff_key].push(stuff_value.value(this$));
                    });
                }
            });

            store(id, 'contents', contents);

            --pending_responses;
            output_result_if_done();

        });
    }

    // grabs what we want to store about a node's children elements
    function grab_children(node, data) {
        var children = $.map(data, get_id);
        var id = get_id(node);
        store(id, 'children', children);
    }

    // cleans up a text (no leading nor trailing spaces, no trailing colon)
    function clean_text(text) {
        return text.replace(/^\s+|\s+$/g, '').replace(/:$/, '');
    }

    // removes decoration elements
    function no_frills() {
        $('.mainNavi, .subNavi, #SpalteMitte, #SpalteRechts, .searchWrapper, .footer, .bannerWrapper, .contentHeader')
            .remove();
    }

})();
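
The two regex-based helpers are easy to check on their own, away from the page. What follows is only a sketch: parse_term_id is a hypothetical stand-in for get_id that takes a plain string, and the "node_<digits>_0" format is an assumption read off the pattern used in get_id.

```javascript
// Standalone copies of the two regex helpers, runnable without jQuery.
// Assumption: term ids look like "node_<digits>_0".

// strips leading and trailing whitespace, then one trailing colon
function clean_text(text) {
    return text.replace(/^\s+|\s+$/g, '').replace(/:$/, '');
}

// hypothetical helper: extracts the numeric id, or null if the string doesn't match
function parse_term_id(term_id) {
    var match = /node_(\d+)_0/.exec(term_id);
    return match ? Number(match[1]) : null;
}

console.log(clean_text('  synonyms:  '));      // synonyms
console.log(parse_term_id('node_16091_0'));    // 16091
console.log(parse_term_id('something else'));  // null
```

Note the g flag in clean_text: without it, replace stops after the first match and trailing whitespace survives whenever there is leading whitespace too.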

There are only a few things to note:

  • I’ve put a limit as a guard to safely try things out
    • with recursive structures –like trees– it’s very useful to limit actions to a small amount of nodes before going full monty
    • this limit is just how many nodes to visit, you can start with a low number like 20 or 50 and see how it works
    • you should get quite a long list of messages output to the console, and if all was fine, the last message will be the result
  • The result is a hash of node ids as keys and node objects as values
    • for example, window.disco.nodes[16091] is
      {
         "label":"aesthetic sensitivity",
         "contents":{
            "id":16091,
            "term":[
               "aesthetic sensitivity"
            ],
            "synonyms":[
               "aesthetic sense",
               "sense of aesthetics"
            ],
            "translations":{
               "CZ":"estetické cítění",
               "DE":"ästhetisches Empfinden",
               "ES":"sensibilidad estética",
               "FR":"sensibilité esthétique",
               "HU":"esztétikai érzék",
               "IT":"sensibilità estetica",
               "LT":"estetinis jautrumas",
               "PL":"wrażliwość estetyczna",
               "SK":"estetické cítenie",
               "SE":"estetisk känsla"
            },
            "phrases":[
      
            ],
            "related":[
               "tolerance of change and uncertainty"
            ]
         },
         "children":[
            15215,
            15213
         ]
      }

      which corresponds to this node in DISCO’s page
      [screenshot: the aesthetic sensitivity node in DISCO’s tree]
  • The functions download(node) and download_children(node, children) are mutually recursive
    • their arguments are coherent, i.e. node is an LI element and children is an array of LI elements
    • the latter is not integrated into the former because children need the same treatment whether they are readily available or arrive later through an AJAX call
    • they start visiting from the two roots –horizontal_skills and vertical_skills– and drill down into the tree structure
  • The UI is never updated by the snippet, instead all the state is automatically kept in memory by the recursive descent
    • if you unfolded aesthetic sensitivity (node 16091) in the tree yourself between two executions with a small number of nodes (say 20), you’d get two different results
    • the first result would (probably) not show aesthetic sensitivity children while the second result would (probably) not show the last two nodes of the first result, thus keeping the number of nodes stuck to the given limit
    • if you want to go back to the initial mint state, a simple reload won’t be enough unless you delete the session cookie first
  • Finally, you can run JSON.stringify(window.disco) and get a nice JSON string which you can copy and paste somewhere and save to a file
    • the hash to string conversion is gonna need some minutes… so many in fact that I left the browser working “indefinitely” (half an hour?)
    • the resulting string is humungous too: 3,785,133 bytes (3.8 MB on disk).
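
The interplay between the limit guard, the mutual recursion and the pending-responses counter can be tried out without touching DISCO at all. Here is a minimal sketch over an in-memory tree with a fake asynchronous fetch. TREE, fake_fetch_children and crawl are hypothetical names made up for this illustration, and the loaded flag only mimics the data-loaded attribute.

```javascript
// Same shape as download / download_children, but no jQuery and no network.
var TREE = {
    1: { loaded: true,  children: [2, 3] },
    2: { loaded: false, children: [4, 5] },   // needs a "request" first
    3: { loaded: true,  children: [] },
    4: { loaded: true,  children: [] },
    5: { loaded: true,  children: [] }
};

// stands in for $.getJSON: answers on a later tick
function fake_fetch_children(id, callback) {
    setTimeout(function () { callback(TREE[id].children); }, 0);
}

// visits at most `limit` nodes starting from root_id, then calls done(result)
function crawl(root_id, limit, done) {
    var result = { nodes: {} };
    var pending_responses = 0;
    var finished = false;

    // reports the result once, as soon as no response is pending
    function finish_if_done() {
        if (pending_responses > 0 || finished) return;
        finished = true;
        done(result);
    }

    function download(id) {
        if (limit-- <= 0) return;             // the guard
        result.nodes[id] = { children: [] };
        var node = TREE[id];
        if (node.children.length === 0) return;
        if (node.loaded) {
            download_children(id, node.children);
        } else {
            ++pending_responses;
            fake_fetch_children(id, function (children) {
                download_children(id, children);
                --pending_responses;
                finish_if_done();
            });
        }
    }

    function download_children(id, children) {
        result.nodes[id].children = children.slice();
        for (var i = 0; i < children.length; i++) {
            download(children[i]);
        }
    }

    download(root_id);
    finish_if_done();   // covers the case where nothing was asynchronous
}

crawl(1, 3, function (result) {
    console.log(Object.keys(result.nodes));   // [ '1', '2', '3' ]
});
```

With a limit of 3, node 2 still lists 4 and 5 as children even though they are never visited, because the children are recorded before the guard is checked for each of them. The real snippet behaves the same way.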

Conclusions

The execution of the above snippet with a limit of 105000 nodes takes around 3 minutes on my MacBook Air with 4GB RAM. At the end, you’ll discover that the last node was number 7380!

Wow, that’s a huge difference from the claimed “more than 104000 concepts”. How can it be?

Even considering that they provide a multilingual thesaurus with 11 languages and could have inflated the count to 7380 * 11 = 81180, the claimed figure is still around 28% higher than that. Could they have added the number of phrases? No, because they separately claim “approximately 36000 example phrases”. They could have instead added the number of synonyms.

$.map(window.disco.nodes, function(v) { 
    return v.contents.synonyms.length; 
}).reduce(function(a, b) {
    return a + b;
});

Running the above code we get 3443 synonyms, which added to 7380 concepts make for 10823 terms, which inflated 11 times make for 119053 terms in all languages.

  • 7380 * 11 + 3443 + X * 10 = 104000
  • X = (104000 - 84623) / 10
  • X = 1937.7 ≈ 56% * 3443 ≈ 26% * 7380
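
The same arithmetic as a few lines you can paste into the console. This is only a sketch of the calculation: every input is a figure quoted in this article, and X is the hypothetical average number of synonyms per non-English language needed to reach the claimed total.

```javascript
var concepts  = 7380;     // nodes visited by the snippet
var synonyms  = 3443;     // English synonyms counted above
var languages = 11;
var claimed   = 104000;   // "more than 104000 concepts"

// concepts in every language plus the English synonyms
var known = concepts * languages + synonyms;

// what each of the other 10 languages would have to contribute on average
var x = (claimed - known) / (languages - 1);

console.log(known);                                                           // 84623
console.log(x);                                                               // 1937.7
console.log(Math.round(100 * x / synonyms) + '% of the English synonyms');    // 56% ...
console.log(Math.round(100 * x / concepts) + '% of the concepts');            // 26% ...
```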

Hm, I don’t know. It seems to me that they “confused” concepts with terms. Even then, the numbers only add up if the synonym count in each of the other languages is around 26% of concepts, noticeably less than the 47% we measured for English.

All in all, 7380 concepts is a good number, but it’s only 7% of what they claim.

My Horizontal Skills

What follows is a list of my horizontal skills, in no particular order, according to the terminology of DISCO, the European Dictionary of Skills and Competences. They all come from its “non domain specific skills and competences” category.

  • artistic skills and competences
    • aesthetic sensitivity
      • feeling for form and space
      • sense of color
    • creativity
  • computer skills and competences
    • ability to learn new software applications
    • database management system
      • Lotus Domino
      • MySQL
    • office software knowledge
      • spreadsheet knowledge
      • word-processor knowledge
    • special operating systems and server software
      • Apache server
      • Apple operating systems
        • Mac OS X
      • Linux
        • LAMP system
        • Ubuntu
  • driving licenses
    • motorcycles and light motorcycles
      • driving license category A
      • driving license category A1
    • private cars
      • driving license category B
  • languages
    • English
      • business English
      • technical English
    • Italian (mother tongue)
    • Spanish
  • managerial and organizational skills
    • ability to coordinate
    • ability to organize oneself
    • decision making competence
    • entrepreneurial thinking
    • leadership skills
    • risk-taking behavior
    • self-sufficiency
    • setting standards
    • setting targets
    • strategic planning
    • supervision skills
  • personal skills and competences
    • cognitive skills and problem solving abilities
      • ability to concentrate
      • ability to learn new languages
      • analytical thinking
      • application of professional techniques
      • general technical skills
      • information gathering
      • intellectual curiosity
      • inventiveness
      • learning ability
      • logical thinking
      • powers of discernment
      • powers of observation
      • problem solving ability
        • problem identification
      • resourcefulness
      • self assessment
      • systematic approach
      • technical understanding
      • understanding of working environments
      • willingness to learn
    • personal skills and abilities
      • ability to cooperate
      • ability to cope with pressure
      • ability to work in a team
      • absence of phobias
        • absence of claustrophobia
        • tolerance of heights
      • assertiveness
      • balanced personality
      • carefulness
      • competitive mentality
      • courage
      • credibility
      • discretion
      • enthusiasm
      • flexibility
      • friendliness
      • honesty
      • hygiene
      • independence
      • invention
      • judgement
      • loyalty
      • open-mindedness
      • originality
      • personal initiative
      • politeness
      • preferences and aversions
        • preference for indoor work
        • preference for working unsupervised
        • preference for working with people
        • willingness to travel
      • reliability
      • safety awareness
      • self-confidence
      • selflessness
      • sense of humour
      • sense of responsibility
      • spatial orientation
      • tenacity
      • thoughtfulness
      • tolerance of change and uncertainty
      • tolerance of emotional stress
      • tolerance of frustration
      • willingness to accept personal responsibility
    • physical attributes and abilities
      • absence of allergies
      • absence of perception defects
        • absence of colour blindness
        • absence of limiting medical conditions
        • good hearing
        • good vision (with glasses)
        • keen sense of smell
        • keen sense of taste
        • keen sense of touch
      • agility
      • general physical fitness
      • sense of balance
      • sure-footedness
  • social and communication skills and competences
    • basic competence in verbal expression
      • correct spelling
      • correct use of grammar
    • clear and distinct diction
    • communication in English
    • communication in foreign languages
    • competence in professional communication
    • competence in verbal expression
    • competence in written expression
      • clear writing style
      • creative writing style
      • writing drafts
      • writing technical information and documents
    • effective questioning
    • fostering contacts
    • intercultural competence
    • listening comprehension
    • negotiation skills
      • ability to find a compromise
      • conflict resolution
      • persuasiveness
    • oral comprehension
    • participate actively in discussions
    • teaching ability