JCactus collectively refers to a package of tools developed at ICASA that perform various
types of web data collection. These tools are all developed in Java, and work under a common
philosophy: given a user specification of one or more "starting points", mechanically explore
proximal content, retrieving and storing it (while also possibly applying some simple filters
or lexical analysis). The JCactus tools are focused on mass collection (with the option for
some basic filtering), and do not perform complex content analyses, such as the indexing and
semantic analysis employed by Internet search engines.
[Figure: Architectural overview of the JCactus crawler]
The first tool in the JCactus suite is a web content crawler and scraper. As a crawler, it
navigates through structures of web pages by following hyperlinks it encounters; as a scraper,
it also collects the content of pages it crawls. The JCactus crawler/scraper is designed in a
modular fashion, such that different combinations of modules can be used for different
collection tasks. Furthermore, this allows for quicker adaptation to new tasks, as a developer
can author a new module for use, rather than having to make subtle changes to a monolithic
code base. The figure above shows the modular structure of the crawler/scraper. The following
definitions will support further discussion of these capabilities:
- Seed: A particular web page from which a web crawl begins. A web crawl requires at least one
seed to begin, but may have multiple. Seeds can be sourced in a variety of ways.
- Focus: The optional process of restricting which pages are crawled and/or scraped based on
user-specified criteria. These criteria can be based on any of various aspects of a page,
such as its URL or its content.
- (Un)focused Crawl: A web crawl/scrape executed with(out) focus criteria.
In the diagram, the execution begins at the top with "seeders", or modules that supply seeds
based on input. The URL List Seeder simply reads in seed URLs from a user-specified text file.
The Web Search Seeder provides a small list of seeds by querying a search engine for user-
specified queries. The Backlinks Seeder finds the pages that link to existing seeds and uses
those as additional seeds. The seeds are used to initiate the crawler, which first goes down the
left path in the figure. Here, a queue of URLs to be crawled is maintained, and entries are
pulled from the head of the queue by multiple "spiders" that retrieve and process the pages.
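To make the queue-and-spider arrangement concrete, the following is a minimal sketch of such a worker in Java; the names Spider, fetchPage, and extractLinks are illustrative assumptions, not JCactus's actual code, and visited-URL tracking and filtering are omitted:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of a spider worker; class and method names are assumed.
public class Spider implements Runnable {
    private final BlockingQueue<String> crawlQueue; // URLs awaiting a visit

    public Spider(BlockingQueue<String> crawlQueue) {
        this.crawlQueue = crawlQueue;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String url = crawlQueue.take();   // pull from the head of the queue
                String html = fetchPage(url);     // retrieve the page over HTTP
                for (String link : extractLinks(html)) {
                    crawlQueue.put(link);         // enqueue discovered hyperlinks
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // allow a clean shutdown
        }
    }

    private String fetchPage(String url) { /* HTTP GET, elided */ return ""; }
    private List<String> extractLinks(String html) { /* parse <a href>, elided */ return Collections.emptyList(); }
}
```

Several such spiders can share one queue, which is what allows multiple pages to be retrieved concurrently.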
The spiders work with the filtering modules in the lower left of the figure (as applicable) to
accomplish crawl focus. A page encountered by a spider must be accepted by every applied filter
to be considered; otherwise, its content will not be stored by the scraper, nor will the pages
to which it links be crawled. The Lexical Matcher searches for instances of words in each
page; at least one word from a user-provided list must be present in each page for it to be
accepted. The Internal/External Crawl Filter either accepts only pages with the same domain as
one of the seeds (internal crawl) or with a domain different from all the seeds (external
crawl). The Domain Filter rejects any page matching a user-specified blacklist of domains.
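One plausible way to structure such filters is sketched below; PageFilter, Page, and the helper class are assumed names for illustration, not JCactus's API:

```java
import java.util.List;
import java.util.Set;

// Assumed minimal page representation for illustration.
record Page(String url, String domain, String rawHtml, String text) {}

interface PageFilter {
    boolean accepts(Page page);  // true if the page passes this filter
}

// Lexical Matcher: at least one word from the user's list must appear.
class LexicalMatcher implements PageFilter {
    private final Set<String> words;
    LexicalMatcher(Set<String> words) { this.words = words; }
    public boolean accepts(Page page) {
        return words.stream().anyMatch(w -> page.text().contains(w));
    }
}

// Domain Filter: reject any page on the user-specified blacklist.
class DomainFilter implements PageFilter {
    private final Set<String> blacklist;
    DomainFilter(Set<String> blacklist) { this.blacklist = blacklist; }
    public boolean accepts(Page page) {
        return !blacklist.contains(page.domain());
    }
}

class FilterChain {
    // A page is kept only if every applied filter accepts it.
    static boolean accepted(List<PageFilter> filters, Page page) {
        return filters.stream().allMatch(f -> f.accepts(page));
    }
}
```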
When a page has been crawled by a spider, and passed all filters present, it is moved to the
process on the right side of the figure, starting with the crawled pages queue. At this point,
page content has been scraped and processed, and is stored in memory. An I/O Manager pulls
these page structures from the queue and distributes them to all active I/O modules; typically
only one is used at a time, but it is possible to use many. The Database I/O module inserts
page data into an SQL database with a defined schema, which includes provisions for page
content (both raw HTML and just rendered text), links, and various metadata. The File System
I/O module simply writes pages to files in a specified folder, mirroring the pages crawled.
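As an illustration of the Database I/O module's role, the following sketch inserts a scraped page via JDBC, reusing the Page record from the filter sketch above; the table and column names are assumptions rather than the actual schema, and the link and metadata tables are omitted:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of a Database I/O module; "pages" and its columns are assumed names.
class DatabaseIO {
    private final Connection conn;

    DatabaseIO(Connection conn) { this.conn = conn; }

    void store(Page page) throws SQLException {
        String sql = "INSERT INTO pages (url, raw_html, rendered_text) VALUES (?, ?, ?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, page.url());
            stmt.setString(2, page.rawHtml());  // raw HTML, per the schema's provisions
            stmt.setString(3, page.text());     // rendered text only
            stmt.executeUpdate();
        }
    }
}
```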
Another JCactus tool is built around automated collection of Twitter content and data using
the APIs Twitter provides. Three separate collection tools are implemented:
Slurper — One Twitter API provides a continual sampling of public tweets
in near real time. The JCactus Twitter Slurper simply collects the content from each sampled
tweet and stores it in an SQL database, along with attributes such as its timestamp and the
name of the posting user. Twitter offers this sampling at three volume levels, referred to as
the spritzer, garden hose, and fire hose; the JCactus
Twitter Slurper is designed to work with the spritzer feed, which represents approximately 1%
of Twitter's volume.
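The sampling loop can be illustrated with the open-source twitter4j library, which wraps the streaming API; this is a sketch of the general approach, not the JCactus implementation, and it assumes credentials are configured in twitter4j.properties:

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

// Sketch of consuming the sampled stream; JCactus's own code may differ.
public class SlurperSketch {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // In the actual tool, each sampled tweet would be inserted
                // into the SQL database along with its attributes.
                System.out.printf("%s @%s: %s%n",
                        status.getCreatedAt(),
                        status.getUser().getScreenName(),
                        status.getText());
            }
        });
        stream.sample();  // subscribe to the ~1% "spritzer" sample stream
    }
}
```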
Tweet Search — Through the use of another API, this tool automatically
executes searches for tweets containing any of a provided list of terms. The results are
returned as plain text, but include author and timestamp information, as well as unique tweet
identifiers.
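A sketch of such a term search, again using twitter4j for illustration rather than the actual JCactus code:

```java
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

// Sketch of term-based tweet search; search terms come from the command line.
public class TweetSearchSketch {
    public static void main(String[] args) throws TwitterException {
        Twitter twitter = new TwitterFactory().getInstance();
        for (String term : args) {                       // user-provided list of terms
            QueryResult result = twitter.search(new Query(term));
            for (Status tweet : result.getTweets()) {
                System.out.printf("%d %s @%s: %s%n",
                        tweet.getId(),                   // unique tweet identifier
                        tweet.getCreatedAt(),
                        tweet.getUser().getScreenName(),
                        tweet.getText());
            }
        }
    }
}
```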
Follower Graph Builder — Rather than collecting Twitter content, this tool
explores relationships between Twitter users. It works in a manner similar to the JCactus web
crawler, starting with a list of "seed" users and using the Twitter API to request the lists of
users following and followed by each user. Each new user found by such requests will go
through the same process, out to a user-specified distance from the seeds. The results are
stored in an SQL database, including additional data about each user such as their screen
name, preferences and so forth.
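The breadth-first expansion can be sketched as follows, using twitter4j for illustration; cursor-based paging of long follower lists and the symmetric requests for followed users are elided:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import twitter4j.IDs;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

// Sketch of breadth-first exploration of the follower graph.
public class FollowerGraphSketch {
    public static void main(String[] args) throws TwitterException {
        Twitter twitter = new TwitterFactory().getInstance();
        int maxDepth = 2;                       // user-specified distance from the seeds
        long seedId = Long.parseLong(args[0]);  // a seed user's numeric ID

        Set<Long> visited = new HashSet<>();
        Queue<long[]> frontier = new ArrayDeque<>();  // entries are {userId, depth}
        frontier.add(new long[] {seedId, 0});
        visited.add(seedId);

        while (!frontier.isEmpty()) {
            long[] entry = frontier.poll();
            if (entry[1] >= maxDepth) continue;       // stop at the distance limit
            IDs followers = twitter.getFollowersIDs(entry[0], -1);  // first page only
            for (long id : followers.getIDs()) {
                if (visited.add(id)) {
                    // In the actual tool, the follower edge and user details
                    // would be written to the SQL database here.
                    frontier.add(new long[] {id, entry[1] + 1});
                }
            }
        }
    }
}
```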
The JCactus Twitter tools are subject to limitations enforced on the APIs they use. Namely,
certain types of API requests may only be made a limited number of times within a given time
window. The tools are constructed to respect these limits by ceasing requests until the limit
window resets. As a result, tasks such as Follower Graph construction accrue substantial
amounts of "waiting" time once executed beyond a particular size. Also, the "spritzer" stream
mentioned above limits the statistical properties of the samples that can be collected.
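The wait-and-resume behavior might look like the following sketch, again using twitter4j as a stand-in for the actual implementation:

```java
import twitter4j.IDs;
import twitter4j.RateLimitStatus;
import twitter4j.Twitter;
import twitter4j.TwitterException;

// Sketch of ceasing requests until the rate limit window resets.
class RateLimitedCall {
    static IDs followersWithBackoff(Twitter twitter, long userId)
            throws TwitterException, InterruptedException {
        while (true) {
            try {
                return twitter.getFollowersIDs(userId, -1);
            } catch (TwitterException e) {
                if (!e.exceededRateLimitation()) throw e;  // some other failure
                RateLimitStatus status = e.getRateLimitStatus();
                // Sleep until requests are permitted again, then retry.
                Thread.sleep(status.getSecondsUntilReset() * 1000L);
            }
        }
    }
}
```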
Blog webpages contain more detailed and structured information than a general webpage does.
For instance, a post contains a block of content associated with the blog author at a given
timestamp, followed by a number of comments, which are separate blocks of content associated
with other authors at their own timestamps. The JCactus blog scraper uses blog service APIs to
collect this structured information and store it in a hierarchically designed SQL database
schema. The user of this tool provides a list of blog URLs, with collection start and end
timestamps for each. At present, support for WordPress and Google's Blogger (a.k.a. Blogspot)
is implemented, but additional services could be added, assuming they have sufficient API
support.
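As one illustration of what such API-based collection looks like, a self-hosted WordPress site exposing the standard REST API (wp-json/wp/v2) can return posts within a date window; the blog URL below is a placeholder, and parsing the JSON into the hierarchical schema is elided:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of fetching posts in a date window from the WordPress REST API.
public class BlogFetchSketch {
    public static void main(String[] args) throws Exception {
        String blog = "https://blog.example.com";         // user-provided blog URL
        String url = blog + "/wp-json/wp/v2/posts"
                + "?after=2020-01-01T00:00:00"            // collection start timestamp
                + "&before=2020-12-31T23:59:59";          // collection end timestamp
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Each post's comments live at /wp-json/wp/v2/comments?post=<id>,
        // preserving the post/comment hierarchy described above.
        System.out.println(response.body());              // JSON array of posts
    }
}
```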
As with the Twitter tools, the blog service APIs impose some limits on their usage. The blog
collection tools also respect these limits, and therefore blog collection
beyond a certain volume can become very time-consuming.