Elixir and Web Scraping
The Go-to for Web Scraping
For those being introduced to the topic, the concept in broad terms is basically making requests to one or more URLs and working on the result, translating the obtained content into the language’s structures. This makes it easy to manipulate in various ways, allowing for information to be sent to another service, stored locally, among other things.
Setting Up the Project
Our project “stack”:
- Elixir
- Libraries
- Tesla -> HTTP Client
- Floki -> HTML Parser
If Elixir is not installed: Install Elixir
Creating via Mix
Using Mix, we create the initial project structure which allows us to easily: install libraries, run tests, create executables, among other things.
$ mix new elscrap
Output:
* creating README.md
* creating .formatter.exs
* creating .gitignore
* creating mix.exs
* creating lib
* creating lib/elscrap.ex
* creating test
* creating test/test_helper.exs
* creating test/elscrap_test.exs
Your Mix project was created successfully.
You can use "mix" to compile it, test it, and more:
cd elscrap
mix test
Run "mix help" for more commands.
Library Installation
tesla
Tesla is an HTTP client loosely based on Faraday. It embraces the concept of middleware when processing the request/response cycle.
The Tesla library is very simple to use and highly extensible: excellent for simple requests and for complex clients built out of middleware.
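To illustrate the middleware idea (we won't need it for this tutorial), a Tesla client module can stack middleware with plug. The module name and the middleware choices below are just an example:

defmodule GithubClient do
  use Tesla

  # each plug adds one middleware to the request/response pipeline
  plug Tesla.Middleware.BaseUrl, "https://github.com"
  plug Tesla.Middleware.Headers, [{"user-agent", "elscrap"}]
  plug Tesla.Middleware.FollowRedirects

  # requests made through this module pass through the stack above
  def home, do: get("/")
end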
floki
Floki is a simple HTML parser that enables search for nodes using CSS selectors.
Developed by our dear Philipe Sampaio, Floki is a simple and performant HTML parser that lets us manipulate HTML elements via CSS selectors, making it very easy to retrieve data from pages.
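As a quick taste of the API we will rely on (a throwaway snippet, run in iex), Floki turns markup into a tree and lets us query it:

iex> {:ok, doc} = Floki.parse_document(~s(<a href="https://github.com">GitHub</a>))
{:ok, [{"a", [{"href", "https://github.com"}], ["GitHub"]}]}
iex> doc |> Floki.find("a") |> Floki.attribute("href")
["https://github.com"]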
Initially, we will edit the deps function inside the mix.exs file:
defp deps do
  [
    {:tesla, "~> 1.3.0"},
    {:floki, "~> 0.29.0"}
  ]
end
After running mix deps.get, the dependencies will be installed.
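If you want to sanity-check the install before writing any code, both libraries can be exercised from iex -S mix; the exact status will depend on the site you hit:

iex> {:ok, response} = Tesla.get("https://github.com")
iex> response.status
200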
Features
Our tool will have four basic features, covered one at a time below: parsing CLI arguments, making the request, extracting links, and saving the results.
Parse CLI Arguments
Before parsing arguments from the command line, we will configure our application so that we can use escript to generate a binary that runs on any system with Erlang installed (there are other binary distribution methods I want to explore in future posts).
- First, we will modify the project function in our mix.exs:
def project do
  [
    app: :elscrap,
    version: "0.1.0",
    elixir: "~> 1.11",
    start_permanent: Mix.env() == :prod,
    deps: deps(),
    escript: escript() # added this call to the escript/0 function
  ]
end
- Next, implement the new escript/0 function in the same file:
defp escript do
  [
    main_module: Elscrap.Cli, # our main module
    path: "bin/elscrap" # output path for our binary
  ]
end
With this done, we can create our main module in lib/cli.ex. As we already have our entry point defined, we can start implementing the main/1 function, which receives the arguments passed via the CLI:
defmodule Elscrap.Cli do
  def main(args \\ []) do
    IO.inspect(args) # debug
  end
end
For testing purposes, we can now create a binary and check if the arguments are passed correctly:
$ mix escript.build
$ ./bin/elscrap --extract-links --url "https://github.com" --save
["--extract-links", "--url", "https://github.com", "--save"]
These are the arguments we want our finished tool to accept; the list printed above shows that they reach main/1 exactly as typed.
To take advantage of these arguments, we can use the OptionParser.parse/2 function, which gives us various input options. Similar to the example in the escript documentation, we will define our --extract-links flag as a boolean to simplify the operations we have in mind. Thus, our main/1 function changes, and we implement the parse_args/1 function:
def main(args \\ []) do
  args
  |> parse_args
end

defp parse_args(args) do
  {opts, _value, _} =
    args
    |> OptionParser.parse(switches: [extract_links: :boolean]) # here we define the --extract-links parameter as a boolean
  opts
end
:warning: off-topic
In the parse_args/1 snippet above, there are some expressions that you might find unusual if you are not familiar with Elixir… (there are many articles about this, but I will write something about it soon). The first thing to highlight is the pipe operator (|>), which simply creates an execution flow between functions, passing the result of one call as the first argument of the next. We can create something like:
[1, 2, 3]
|> Enum.map(fn x -> x * 2 end)    # [2, 4, 6]
|> Enum.filter(fn x -> x > 2 end) # [4, 6]
|> Foo.bar()                      # Foo.bar/1 is a placeholder; it receives [4, 6]
Each result flows into the next function as its first argument (it is always good to remember that in Elixir every function returns its last expression). To explore more: Pipe Operator - Elixir School
Another thing to highlight is the use of pattern matching in the following expression:
{opts, _value, _} = ...
Here, we use it as a "destructuring expression" to keep the comparison with other languages simple, but pattern matching is much more than that, and I strongly recommend the docs if you are not familiar with it: Pattern Matching - Elixir School.
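For a minimal illustration: matching on {:ok, value} both asserts the shape of the result and binds the part we care about, and it raises when the shapes disagree:

iex> {:ok, value} = {:ok, 42}
{:ok, 42}
iex> value
42
iex> {:ok, value} = {:error, :boom}
** (MatchError) no match of right hand side value: {:error, :boom}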
Request
Now we move on to the request to the URL passed through the url parameter. Our main/1 function gains an extra call, and we introduce the scrap/1 and request/1 functions:
def main(args \\ []) do
  args
  |> parse_args
  |> scrap # new call
end

defp scrap(opts) do
  url = opts[:url] # nil when --url is not given

  unless url do
    IO.puts("Url required")
    System.halt(1) # exit with a non-zero status to signal the error
  end

  if opts[:extract_links] do
    links = request(url) # we will put links to use in the next step
  end
end

defp request(url) do
  IO.puts("Extracting urls from: #{url}\n")
  {:ok, response} = Tesla.get(url)
  response.body
end
Making a request with the Tesla library is indeed simple:
{:ok, response} = Tesla.get(url)
response.body
By matching on the :ok atom we assert that the request succeeded, and we return the body of the response. For now, we will use the IO.inspect/1 function to check that the return actually works:
{:ok, response} = Tesla.get(url)
IO.inspect(response.body)
Test:
$ mix escript.build
...
$ ./bin/elscrap --extract-links --url "https://github.com"
Extracting urls from: https://github.com
"\n\n\n\n\n<!DOCTYPE html>\n<html lang=\"en\" class=\"html-fluid\">\n <head>\n <meta charset=\"utf-8\">\n <link rel=\"dns-prefetch\" href=\"https://github.githubassets.com\">\n <link rel=\"dns-prefetch\" href=\"https://avatars0.githubusercontent.com\">\n
...
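One caveat about the happy path: the bare {:ok, response} match raises a MatchError if the request fails (DNS error, timeout, and so on). A more defensive variant of request/1 could look like the sketch below; it is not part of the tutorial's code, just an option to keep in mind:

defp request(url) do
  IO.puts("Extracting urls from: #{url}\n")

  case Tesla.get(url) do
    {:ok, response} ->
      response.body

    {:error, reason} ->
      IO.puts("Request failed: #{inspect(reason)}")
      System.halt(1)
  end
end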
We can then move on to extracting links.
Extracting Links
The scrap function now evolves:
if opts[:extract_links] do
  links =
    request(url)
    |> extract_links

  IO.puts(Enum.join(links, "\n"))
end
We make the request via the request/1 function, pass the result to extract_links/1, which returns a list, and then print the list line by line. The implementation of the extract_links/1 function:
defp extract_links(response_body) do
  {:ok, document} = Floki.parse_document(response_body)

  links =
    document
    |> Floki.find("a")
    |> Floki.attribute("href")
    |> Enum.filter(fn href -> String.trim(href) != "" end)
    |> Enum.filter(fn href -> String.starts_with?(href, "http") end)
    |> Enum.uniq()

  links
end
Some things happen here:

{:ok, document} = Floki.parse_document(response_body)
# here we only expect a successful parse -> :ok

links =
  document # the document obtained from the parsing above
  |> Floki.find("a") # search for <a> tags within the HTML
  |> Floki.attribute("href") # from the found tags, take the contents of href (the link)
  |> Enum.filter(fn href -> String.trim(href) != "" end) # keep only non-empty strings
  |> Enum.filter(fn href -> String.starts_with?(href, "http") end) # keep only strings starting with http
  |> Enum.uniq() # return a new list with no repeated URLs
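Note that the String.starts_with?(href, "http") filter silently drops relative links such as /about. If you wanted to keep them, you could resolve them against the base URL instead; the helper below is hypothetical and not part of the tool:

# hypothetical helper: turn a relative href into an absolute URL
defp absolutize(href, base_url) do
  base_url
  |> URI.merge(href)
  |> URI.to_string()
end

# URI.merge("https://github.com", "/about") |> URI.to_string()
# => "https://github.com/about"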
Currently, our module looks like this:
defmodule Elscrap.Cli do
  def main(args \\ []) do
    args
    |> parse_args
    |> scrap
  end

  defp parse_args(args) do
    {opts, _value, _} =
      args
      |> OptionParser.parse(switches: [extract_links: :boolean])
    opts
  end

  defp scrap(opts) do
    url = opts[:url]

    unless url do
      IO.puts("Url required")
      System.halt(1)
    end

    if opts[:extract_links] do
      links =
        request(url)
        |> extract_links

      IO.puts(Enum.join(links, "\n"))
    end
  end

  defp request(url) do
    IO.puts("Extracting urls from: #{url}\n")
    {:ok, response} = Tesla.get(url)
    response.body # the debug IO.inspect/1 from the previous step is no longer needed
  end

  defp extract_links(response_body) do
    {:ok, document} = Floki.parse_document(response_body)

    links =
      document
      |> Floki.find("a")
      |> Floki.attribute("href")
      |> Enum.filter(fn href -> String.trim(href) != "" end)
      |> Enum.filter(fn href -> String.starts_with?(href, "http") end)
      |> Enum.uniq()

    links
  end
end
Now we can rebuild our application and voilĂ :
$ mix escript.build
...
$ ./bin/elscrap --extract-links --url "https://github.com"
Extracting urls from: https://github.com
https://docs.github.com/articles/supported-browsers
https://github.com/
https://lab.github.com/
https://opensource.guide
https://github.com/events
https://github.community
https://education.github.com
https://stars.github.com
https://enterprise.github.com/contact
https://enterprise.github.com/contact?ref_page=/&ref_cta=Contact%20Sales&ref_loc=billboard%20launchpad
https://www.npmjs.com
https://apps.apple.com/app/github/id1477376905?ls=1
https://play.google.com/store/apps/details?id=com.github.android
https://desktop.github.com/
https://cli.github.com
https://docs.github.com/github/managing-security-vulnerabilities/configuring-dependabot-security-updates
https://docs.github.com/discussions
https://enterprise.github.com/contact?ref_page=/&ref_cta=Contact%20Sales&ref_loc=footer%20launchpad
https://resources.github.com
https://github.com/github/roadmap
https://docs.github.com
http://partner.github.com/
https://atom.io
http://electronjs.org
https://services.github.com/
https://githubstatus.com/
https://github.com/contact
https://github.com/about
https://github.blog
https://socialimpact.github.com/
https://shop.github.com
https://twitter.com/github
https://www.facebook.com/GitHub
https://www.youtube.com/github
https://www.linkedin.com/company/github
https://github.com/github
Thus, we obtain all the URLs present on the GitHub page.
Save Data
As it stands, it's easy to save the result to a file with shell redirection (./bin/elscrap .. >> result.txt), but to explore a bit more, we can implement a simple function that creates an output file when needed. Within our scrap/1 function, we add an if that checks whether the --save parameter was provided:
if opts[:extract_links] do
  links =
    request(url)
    |> extract_links

  IO.puts(Enum.join(links, "\n"))

  if opts[:save], do: save_links(url, links)
end
And now the implementation of the save_links/2 function:
defp save_links(url_id, links) do
  IO.puts("Saving links")
  file = "output/links.txt" # assumes the output/ directory already exists

  content =
    links
    |> Enum.join("\n")

  case File.write(file, content) do
    :ok -> IO.puts("[#{url_id}] Links saved to: #{file}")
    {:error, reason} -> IO.puts("Error saving links: #{reason}")
  end
end
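If you would rather not create the folder by hand, a small variation is to ensure it exists right before writing; File.mkdir_p!/1 creates the directory (and any missing parents) and is a no-op when it is already there. Add this line above the File.write/2 call:

File.mkdir_p!(Path.dirname(file)) # creates output/ if it does not exist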
We create the output folder, rebuild the tool, and run the script with the new flag:
./bin/elscrap --extract-links --url "https://github.com" --save
Extracting urls from: https://github.com
...
Saving links
[https://github.com] Links saved to: output/links.txt
Conclusion
Just like in other languages, it's very simple to create a web scraping tool in Elixir, with the added advantage that asynchronous, performant functionality can be implemented simply and effectively, on top of the language's expressiveness. That's it…
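As a glimpse of that (a sketch, not part of Elscrap; the URLs and options are illustrative), fetching several pages concurrently is a few lines with Task.async_stream/3:

["https://github.com", "https://elixir-lang.org"]
|> Task.async_stream(
  fn url ->
    {:ok, response} = Tesla.get(url)
    response.body
  end,
  max_concurrency: 4,
  timeout: 15_000
)
|> Enum.each(fn {:ok, body} -> IO.puts(byte_size(body)) end)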
The tool repo is available on GitHub: Elscrap