Responsible Web Scraping
Dr Alan Hanna discusses the ethical considerations behind the new module ‘Python for Finance’ on our BSc Finance programme.
The word ‘hack’ and its derivatives have been partially rehabilitated in recent years. A ‘life hack’ is considered a useful shortcut to boost efficiency or wellbeing. ‘Hackathons’ are collaborative events that bring developer communities together to learn, share, and create solutions. This is a departure from the more sinister use of the word ‘hacker’ to describe a cybercriminal, or its more derogatory use for a less-than-professional developer (akin to its use in golf). Like most skills, coding can be used for good or ill.
At Queen’s Management School, we have recently added a new module ‘Python for Finance’ to our BSc Finance programme. Since most of the students are new to coding, the module can be viewed as an introduction to the programming language itself and to the universe of possibilities that it opens. Part of the appeal of Python is that it allows novice developers to accomplish significant results with just a few lines of relatively simple code by building on an ever-expanding collection of freely available packages.
With these new-found skills also comes a potential minefield of ethical issues and professional responsibilities in areas such as privacy and data-driven decision making. These are issues that can all too easily be overlooked and for which students need some guidance. An excellent starting point is the Association for Computing Machinery’s (ACM) Code of Ethics, which reminds computing professionals to ‘act responsibly’ and to ‘reflect upon the wider impacts of their work, consistently supporting the public good’.
Web scraping
Extracting data from the internet, particularly in an automated fashion, is referred to as web scraping. As an industry, finance has an insatiable appetite for information, and with free-to-download libraries like Beautiful Soup, Selenium, and Scrapy, one can easily write a few lines of code to acquire data. The benefits are clear and immediate: through automation one can replace tedious manual processes and increase the speed and volume of data acquisition to realise huge productivity gains.
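To give a flavour of how little code is involved, the short sketch below fetches a page and pulls out its table cells using the requests and Beautiful Soup libraries. The URL and the assumption that the data sits in an HTML table are purely illustrative.

    # A minimal scraping sketch: fetch a page and extract its table cells.
    # The URL is an illustrative placeholder, not a real data source.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/prices"
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    cells = [td.get_text(strip=True) for td in soup.find_all("td")]
    print(cells)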
Such an approach, though, is not always welcomed by the organisation behind the website. Recognising the value of their content, some websites try to make wholesale harvesting of data and automated website access difficult. For this, we all pay the small cost of having to prove from time to time that we are in fact ‘not a robot’ via the CAPTCHA system. Blade Runner-style tests aside, other techniques include limiting the number of search results per page, restricting the frequency of requests from a single IP address (throttling), and generating content in non-HTML formats. Thus, while it may now be technically possible to extract data from a website, one should always pause to ask whether it is legitimate to do so.
To begin with, one should consult the website’s terms of service. These can clarify, for example, whether personal, educational, or commercial usage is permitted. If in doubt, consider reaching out to the company to check. Most domains also include a robots.txt file (see for example https://www.yahoo.com/robots.txt). While primarily aimed at search engine crawlers, this can indicate parts of a website where automated requests are unwelcome.
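Python’s standard library can even read this file for you. The sketch below, using the Yahoo file mentioned above and an illustrative path, asks whether a generic crawler is permitted to fetch a given page.

    # Consult robots.txt before making automated requests.
    # The page being checked is an illustrative example.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://www.yahoo.com/robots.txt")
    parser.read()

    allowed = parser.can_fetch("*", "https://www.yahoo.com/finance")
    print("Permitted for generic crawlers:", allowed)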
Some websites are happy to share their data to the point of facilitating information requests via an application programming interface (API), often with accompanying documentation and sample code. These define protocols for requesting data (or performing other operations) and allow companies to better marshal such requests. Where available, these should be the preferred mode of access.
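A request to such an API typically looks something like the sketch below; the endpoint, parameters, and key are hypothetical placeholders rather than any particular provider’s interface.

    # Requesting data through a documented API rather than scraping pages.
    # The endpoint, parameters, and API key are hypothetical placeholders.
    import requests

    url = "https://api.example.com/v1/quotes"
    params = {"symbol": "AAPL", "apikey": "YOUR_KEY"}

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    print(response.json())  # documented APIs usually return structured JSON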
A responsible web scraper can choose to share additional information via the user-agent request header. This allows servers to check the application (normally a browser), version, and operating system from where requests have been made. Some automated requests can be distinguished (and blocked) in this way. To promote transparency, the header can be customised to provide additional information (such as a contact email address) that would allow the domain owner to understand or query unexpected usage.
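With the requests library this is a one-line change; the header text and contact address below are, of course, only illustrative.

    # Send a transparent User-Agent so the site owner can identify and contact you.
    # The header text and email address are illustrative placeholders.
    import requests

    headers = {"User-Agent": "QUB-Finance-Teaching/1.0 (contact: someone@example.ac.uk)"}
    response = requests.get("https://example.com/data", headers=headers, timeout=10)
    print(response.status_code)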
A further consideration is fair usage of a shared resource. While a single user running a single process is unlikely to overload a server, sending too many requests could impact the service available to others. Taken to an extreme, this could result in a situation similar to a denial-of-service (DoS) attack, rendering the website unavailable to users. A simple solution is to slow the speed of requests by adding sleep commands to periodically pause the execution of code.
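In Python this can be as simple as the loop sketched below; the pages and the one-second pause are illustrative and should be adjusted to the site’s own guidance.

    # Pace automated requests so they do not overload a shared server.
    # The URLs and the one-second delay are illustrative choices.
    import time
    import requests

    urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

    for url in urls:
        response = requests.get(url, timeout=10)
        print(url, response.status_code)
        time.sleep(1)  # pause before the next request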
Post-scraping, the developer is also faced with the responsible storage, processing, and interpretation of the data. One should also consider how the data itself will be used. For example, if content is subject to copyright, can it be reshared in raw or derived formats, what attribution is required, and is commercial exploitation permitted?
As the infamous web-crawler Peter Parker was reminded, with great power comes great responsibility. This is true not only of our students with their new-found coding skills, but also for those who impart such knowledge.