JCactus collectively refers to a package of tools developed at ICASA that perform various types of web data collection. These tools are all developed in Java, and work under a common philosophy: given a user specification of one or more "starting points", mechanically explore proximal content, retrieving and storing it (while also possibly applying some simple filters or lexical analysis). The JCactus tools are focused on mass collection (with the option for some basic filtering), and do not perform complex content analyses, such as the indexing and semantic analysis employed by Internet search engines.
The first tool in the JCactus suite is a web content crawler and scraper. As a crawler, it navigates through structures of web pages by following hyperlinks it encounters; as a scraper, it also collects the content of pages it crawls. The JCactus crawler/scraper is designed in a modular fashion, such that different combinations of modules can be used for different collection tasks. Furthermore, this allows for quicker adaptation to new tasks, as a developer can author a new module for use, rather than having to make subtle changes to a monolithic code base. The figure above shows the modular structure of the crawler/scraper. The following definitions will support further discussion of these capabilities:
In the diagram, execution begins at the top with "seeders", modules that supply seeds based on input. The URL List Seeder simply reads in seed URLs from a user-specified text file. The Web Search Seeder provides a small list of seeds by submitting user-specified queries to a search engine. The Backlinks Seeder finds pages that link to existing seeds, and uses those as seeds as well. The seeds are used to initiate the crawler, which first goes down the left path in the figure. Here, a queue of URLs to be crawled is maintained, and entries are pulled from the head of the queue by multiple "spiders" that retrieve and process the pages.
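The queue-and-spiders arrangement described above can be sketched roughly as follows. This is an illustrative sketch only, not the JCactus implementation: the class name `SpiderPool`, the `fetchLinks` function (which stands in for retrieving a page and extracting its hyperlinks), and the `pending` counter used to detect the end of the crawl are all assumptions made for the example.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: several "spiders" pull URLs from the head of a shared queue.
final class SpiderPool {

    /** Crawls from the seeds and returns every URL that was discovered. */
    static Set<String> crawl(List<String> seeds,
                             Function<String, List<String>> fetchLinks,
                             int spiders) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet();
        // Number of URLs that are queued or currently being processed.
        AtomicInteger pending = new AtomicInteger();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                pending.incrementAndGet();
                queue.add(seed);
            }
        }

        ExecutorService pool = Executors.newFixedThreadPool(spiders);
        for (int i = 0; i < spiders; i++) {
            pool.execute(() -> {
                try {
                    while (pending.get() > 0) {
                        String url = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (url == null) {
                            continue; // queue momentarily empty; re-check pending
                        }
                        try {
                            for (String link : fetchLinks.apply(url)) {
                                if (seen.add(link)) { // enqueue each URL only once
                                    pending.incrementAndGet();
                                    queue.add(link);
                                }
                            }
                        } finally {
                            pending.decrementAndGet(); // this URL is finished
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

In this sketch a spider stops once no URL is queued or in progress; a real crawler would also need politeness delays, error handling, and a cap on crawl size.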
The spiders work with the filtering modules in the lower left of the figure (as applicable) to keep the crawl focused. A page encountered by a spider must be accepted by every applied filter to be considered; otherwise, its content will not be stored by the scraper, nor will the pages to which it links be crawled. The Lexical Matcher searches for instances of words in each page; at least one word from a user-provided list must be present in a page for it to be accepted. The Internal/External Crawl Filter accepts either only pages with the same domain as one of the seeds (internal crawl) or only pages with a domain different from all the seeds (external crawl). The Domain Filter rejects any page matching a user-specified blacklist of domains.
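The rule that a page must pass every active filter can be illustrated with a short sketch. The names here (`CrawlFilters`, `PageFilter`, `Page`) are hypothetical and do not come from JCactus. The sketch also makes two simplifying assumptions: the Lexical Matcher is approximated by a case-insensitive substring test, and "domain" is approximated by the URL's host name.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the filter chain; none of these names come from JCactus.
final class CrawlFilters {

    /** A fetched page: its host name (lower-cased) and its rendered text. */
    static final class Page {
        final String host;
        final String text;

        Page(URI url, String text) {
            String h = url.getHost();
            this.host = (h == null) ? "" : h.toLowerCase(Locale.ROOT);
            this.text = text;
        }
    }

    /** A filter either accepts or rejects a page. */
    interface PageFilter {
        boolean accepts(Page page);
    }

    /** Accepts a page containing at least one listed word (substring match, an assumption). */
    static PageFilter lexicalMatcher(List<String> words) {
        return page -> {
            String text = page.text.toLowerCase(Locale.ROOT);
            return words.stream()
                    .anyMatch(w -> text.contains(w.toLowerCase(Locale.ROOT)));
        };
    }

    /** Internal crawl: host must equal a seed host. External crawl: it must equal none. */
    static PageFilter internalExternal(Set<String> seedHosts, boolean internal) {
        return page -> seedHosts.contains(page.host) == internal;
    }

    /** Rejects any page whose host appears on the blacklist. */
    static PageFilter domainFilter(Set<String> blacklist) {
        return page -> !blacklist.contains(page.host);
    }

    /** A page is kept only if every active filter accepts it. */
    static boolean acceptedByAll(List<PageFilter> filters, Page page) {
        return filters.stream().allMatch(f -> f.accepts(page));
    }
}
```

Because each filter is an independent object behind one small interface, a new filter can be added to the list without touching the others, which mirrors the modular design described above.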
When a page has been crawled by a spider, and passed all filters present, it is moved to the process on the right side of the figure, starting with the crawled pages queue. At this point, page content has been scraped and processed, and is stored in memory. An I/O Manager pulls these page structures from the queue and distributes them to all active I/O modules; typically only one is used at a time, but it is possible to use many. The Database I/O module inserts page data into an SQL database with a defined schema, which includes provisions for page content (both raw HTML and just rendered text), links, and various metadata. The File System I/O module simply writes pages to files in a specified folder, mirroring the pages crawled.
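A rough sketch of how an I/O manager might hand each crawled page to every active I/O module is given below. The names `IoManagerSketch`, `IoModule`, and `CrawledPage` are hypothetical. The file module here writes one flat file per page, named by URL-encoding the page URL; that naming scheme is an assumption for the example and is simpler than mirroring the crawled site's layout. A database module is omitted because it would depend on a particular schema and driver.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of an I/O manager; names are not taken from JCactus.
final class IoManagerSketch {

    /** In-memory structure for a page that has passed all filters. */
    static final class CrawledPage {
        final String url;
        final String html;

        CrawledPage(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }

    /** An output destination for crawled pages. */
    interface IoModule {
        void store(CrawledPage page);
    }

    /** Writes each page to one file in the given folder (file naming is an assumption). */
    static IoModule fileSystemModule(Path folder) {
        return page -> {
            try {
                Files.createDirectories(folder);
                String name = URLEncoder.encode(page.url, StandardCharsets.UTF_8) + ".html";
                Files.writeString(folder.resolve(name), page.html);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    /** Pulls pages until the queue is empty, giving each to all modules; returns the count. */
    static int drain(Queue<CrawledPage> crawled, List<IoModule> modules) {
        int stored = 0;
        CrawledPage page;
        while ((page = crawled.poll()) != null) {
            for (IoModule module : modules) {
                module.store(page);
            }
            stored++;
        }
        return stored;
    }
}
```

Passing a list with a single module corresponds to the typical case mentioned above; passing several sends every page to each of them.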
Another JCactus tool is based around automated collection of Twitter content and data by using Twitter's provided APIs. There are three separate collection tools implemented:
Slurper — One Twitter API provides a continual sampling of public tweets in near real time. The JCactus Twitter Slurper simply collects the content from each sampled tweet and stores it in an SQL database, along with attributes such as its timestamp and the name of the posting user. Twitter has three levels of volume in the sampling they provide, referred to as the spritzer, garden hose, and fire hose; the JCactus Twitter Slurper is designed to work with the spritzer feed, which represents approximately 1% of Twitter's volume.
Tweet Search — Through the use of another API, this tool automatically executes searches for tweets containing any of a provided list of terms. The results are returned as plain text, but include author and timestamp information, as well as a unique identifier number.
Follower Graph Builder — Rather than collecting Twitter content, this tool explores relationships between Twitter users. It works in a manner similar to the JCactus web crawler, starting with a list of "seed" users, and using the Twitter API to request the lists of users following and followed by each user. Each new user found by such requests goes through the same process, out to a user-specified distance from the seeds. The results are stored in an SQL database, including additional data about each user such as their screen name, preferences and so forth.
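The exploration performed by the Follower Graph Builder resembles a breadth-first search with a depth limit. The sketch below illustrates that idea under stated assumptions: `FollowerGraphSketch` is a hypothetical name, and the `neighbors` function stands in for the Twitter API requests that return the users following and followed by a given user. No real API call is made.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of depth-limited exploration from seed users.
final class FollowerGraphSketch {

    /** Returns each discovered user mapped to its distance from the nearest seed. */
    static Map<String, Integer> explore(Collection<String> seeds,
                                        Function<String, Collection<String>> neighbors,
                                        int maxDistance) {
        Map<String, Integer> distance = new LinkedHashMap<>();
        Deque<String> frontier = new ArrayDeque<>();

        for (String seed : seeds) {
            if (distance.putIfAbsent(seed, 0) == null) {
                frontier.add(seed);
            }
        }

        while (!frontier.isEmpty()) {
            String user = frontier.remove();
            int d = distance.get(user);
            if (d == maxDistance) {
                continue; // at the limit: record the user but do not expand further
            }
            for (String other : neighbors.apply(user)) {
                // putIfAbsent returns null only the first time a user is seen
                if (distance.putIfAbsent(other, d + 1) == null) {
                    frontier.add(other);
                }
            }
        }
        return distance;
    }
}
```

Since every expanded user costs at least one API request, the number of requests grows quickly with the distance limit.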
The JCactus Twitter tools are subject to limits that Twitter enforces on the APIs they use. Namely, certain types of API requests may only be made a certain number of times within a given period. The tools are constructed to respect these limits by pausing requests until they are permitted again. As a result, certain tasks such as Follower Graph construction accrue substantial "waiting" time once they grow beyond a particular size. Also, because the "spritzer" stream mentioned above is only a sample of roughly 1% of tweets, it limits the statistical conclusions that can be drawn from the Slurper results.
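One simple way to pause until requests are permitted again is a fixed-window limiter, sketched below. This is an assumption about how such waiting could be implemented, not a description of the JCactus code. The class name `WindowRateLimiter` is hypothetical, and the request count and window length are placeholders rather than Twitter's actual limits. The clock and the sleep function are passed in so that the logic can be exercised without real waiting.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical sketch: allow at most maxRequests per window, sleeping when exhausted.
final class WindowRateLimiter {
    private final int maxRequests;
    private final long windowMillis;
    private final LongSupplier clock;   // current time in milliseconds
    private final LongConsumer sleeper; // blocks for the given number of milliseconds
    private long windowStart;
    private int used;

    WindowRateLimiter(int maxRequests, long windowMillis,
                      LongSupplier clock, LongConsumer sleeper) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.sleeper = sleeper;
        this.windowStart = clock.getAsLong();
    }

    /** Convenience factory that uses the system clock and Thread.sleep. */
    static WindowRateLimiter system(int maxRequests, long windowMillis) {
        return new WindowRateLimiter(maxRequests, windowMillis,
                System::currentTimeMillis,
                ms -> {
                    try {
                        Thread.sleep(ms);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
    }

    /** Waits if the current window is exhausted, then counts one request. Returns ms waited. */
    synchronized long acquire() {
        long now = clock.getAsLong();
        long waited = 0;
        if (now - windowStart >= windowMillis) {
            windowStart = now; // the old window has expired; start a fresh one
            used = 0;
        }
        if (used == maxRequests) {
            waited = windowStart + windowMillis - now;
            sleeper.accept(waited); // cease requests until the window resets
            windowStart = clock.getAsLong();
            used = 0;
        }
        used++;
        return waited;
    }
}
```

A caller would invoke `acquire()` before each API request. With a limit of, say, a few hundred requests per window, a graph of many thousands of users spends most of its running time inside that wait.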
Blog webpages contain more detailed and structured information than general webpages do. For instance, a post contains a block of content associated with the blog author at a given timestamp, followed by a number of comments, which are separate blocks of content associated with other authors at their own timestamps. The JCactus blog scraper uses blog service APIs to collect this structured information and store it in a hierarchically designed SQL database schema. The user of this tool provides a list of blog URLs, with collection start and end timestamps for each. At present, support for WordPress and Google's Blogger (a.k.a. Blogspot) is implemented, but additional services could be added, assuming they have sufficient API support.
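The post-and-comment hierarchy described above can be modeled with a pair of small classes, shown here as an illustrative sketch. The names `BlogModelSketch`, `Post`, and `Comment` are hypothetical and do not reflect the actual JCactus schema. Treating the collection window as including the start timestamp and excluding the end timestamp is also an assumption.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the blog data hierarchy: a post owns its comments.
final class BlogModelSketch {

    /** A block of content by some author at some time, attached to a post. */
    static final class Comment {
        final String author;
        final Instant postedAt;
        final String content;

        Comment(String author, Instant postedAt, String content) {
            this.author = author;
            this.postedAt = postedAt;
            this.content = content;
        }
    }

    /** A blog post: the author's content plus the comments that follow it. */
    static final class Post {
        final String author;
        final Instant postedAt;
        final String content;
        final List<Comment> comments;

        Post(String author, Instant postedAt, String content, List<Comment> comments) {
            this.author = author;
            this.postedAt = postedAt;
            this.content = content;
            this.comments = comments;
        }
    }

    /** Keeps posts with start <= postedAt < end (the window bounds are an assumption). */
    static List<Post> within(List<Post> posts, Instant start, Instant end) {
        return posts.stream()
                .filter(p -> !p.postedAt.isBefore(start) && p.postedAt.isBefore(end))
                .collect(Collectors.toList());
    }
}
```

In a relational schema, this nesting would typically become a posts table and a comments table linked by a foreign key to the parent post.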
Similar to the case of the Twitter tools, the blog service APIs have some limits on their usage. The blog collection tools also respect such limits, and therefore blog collection beyond a certain volume can become very time-consuming.