IDENTIFICATION AND CATEGORIZATION
OF ONLINE PRODUCT SPECIFICATIONS
BACHELOR THESIS
STEFAN PAAP
10288279
SUPERVISOR: H. AFSARMANESH
27-06-2014
Abstract
When designing and constructing complex products that are highly customized, such as solar plants, the sub-products that constitute the complex product and the services that enhance it are usually not provided by a single supplier. Suppliers often provide specifications of their products on their websites. Manually obtaining and classifying this information is impractical, so an approach is needed to identify and classify documents containing product specifications. The problem is even more challenging considering the wide heterogeneity in the categories defined at different sites, and the need to discover product specifications specific to complex products. As a first step, this approach scans a website for product specifications (HTML and PDF files). When a specification page is found, as a second step the URL or breadcrumb structure is used as the categorization of the product. As a final step, this categorization is compared to the EU standards NACE and PRODCOM, to eliminate the problem of wide heterogeneity and to determine whether a product is of any interest to a user.
For the recognition of product specifications, multiple constraints are used, such as the presence of a table structure and of a colon in the line of a unit. For comparing the breadcrumb structure against the EU standards, the content of the breadcrumb structure and the descriptions of the standards are compared to each other using the Levenshtein distance.
Table of contents
ABSTRACT
INTRODUCTION
    GLONET
    RESEARCH QUESTION
APPROACH
METHOD
    OPEN SOURCE TOOLS
    PYTHON SCRIPT
        HTM(L) Files
        PDF Files
    HADOOP
    TEST SETUP
RESULTS
    HTM(L) FILES
    PDF FILES
    NACE & PRODCOM
DISCUSSION
CONCLUSION
REFERENCES
Introduction
When designing networks of organizations for collaborative manufacturing, it is important to analyze the market for existing products and services. There are two main reasons for this need. First, the organization is interested in seeing what products or services it can use to build its own more complicated product or service. Second, it is interested in knowing the alternative products or services in the market that compete with its own product or service. Essentially, the organization requires information on what products or services are offered by other organizations, whether competing or completing. The focus of this research lies on the specific types of products that are needed as components for the specification of complex products, such as solar plants and intelligent buildings. These types of products usually require a large variety of competencies and resources, which are rarely available within a single enterprise. This calls for collaboration among several companies and individuals. The GloNet EU-funded project focuses on this problem, and the scope of this research lies within that project. This research therefore focuses on developing an empirical approach to identify product specifications and their categorizations.
GloNet
The scope of this research lies within the GloNet EU-funded project. The aim of the project is to design, develop and deploy an agile environment for networks that are involved in designing and constructing complex products, products that are highly customized and service-enhanced, through end-to-end collaboration with local suppliers and customers. GloNet intends to implement "the glocal enterprise notion with value creation from global networked operations and involving global supply chain management, product-service linkage, and management of distributed manufacturing units" (GloNet, n.d.).
Mass customization, a growing trend in manufacturing, is the term for highly customized, usually one-of-a-kind products. The term refers to "a customer co-design process of products and services which meet the needs/choices of each individual customer with regard to the variety of different product features" (GloNet, n.d.). Important challenges arise from the requirements of complex technical infrastructures within this manufacturing domain, such as alternative energy, security infrastructures and illumination systems in large public buildings or urban equipment, but they can also be found in more traditional complex products, for example customized kitchens. Within this scope, this research intends to study the process of identifying existing specifications for sub-products and enhancing services suitable for composing a complex product.
Research question
The process of identifying product and service specifications consists of three main steps: first, identifying the relevant documents that are specifications on an organization's website; second, identifying how these products and services are categorized by the organization; and third, gaining an understanding of what type of product or service each existing specification belongs to.
Manually identifying and classifying product specifications from other organizations that can or may be involved in the design and construction of a complex product, through their websites, is impractical. This is because complex products such as solar power plants not only consist of a multitude of sub-products, but these sub-products and their enhancing services are also provided by hundreds of different organizations.
Moreover, since the websites of organizations, and the specifications on them, change dynamically, one might need to update the information about an organization whenever its website is updated. For example, whenever a new page is added or an existing one is edited, the information on a website/page can change, and the information previously captured about the specifications of an organization might become invalid. So, in order to always have valid information, one might need to update the information about an organization multiple times per day. This information gathering can become very time consuming if done manually.
Coming back to the need for retrieving categories related to sub-products and services, the wide heterogeneity in the categories defined within different organizations, which is also reflected in their websites, adds to the complication. For example, the latest Samsung smartphone could be categorized under 'Telephones' on one website, but on another it can be under 'Phones'. Although the texts differ, the semantics behind them are the same. We as humans, most of the time, interpret the two phrases as similar, but for a computer system this is difficult. In order to tackle this problem we will use an empirical approach. We have defined the following research question:
How can we develop an empirical approach to identify and categorize complex product/service specifications from websites of organizations?
To be able to answer the research question, we have divided the main research question into multiple sub-questions as follows:
1. How can we identify the files with the data about product/service specifications?
2. What data from a file is and is not needed for identifying and classifying specifications?
3. How are specifications related in a hierarchy, and how could two hierarchies be compared?
   a. How are product specifications structured in a website?
   b. How can the derived structure for product specifications of an organization be compared to NACE/PRODCOM?
Approach
Nowadays, companies use their websites to promote and provide information about their products and services. Part of the information that they provide is product specifications. Although previous research has been performed on extracting product information (Dave et al., 2003; Matsuda & Fukushima, 1999; Davulcu et al., 2003; Lam & Gong, 2003), our focus in this research is more specific: we are after identifying specification sheets/pages related to sub-products of complex products and their enhancing services. Accordingly, in this research we have performed an empirical study to come up with an approach for finding this specific type of specification. Specifically, we are looking for specifications related to intelligent buildings and solar plants.
Organizational websites not only contain product/service specifications but also include a variety of other types of information, such as promotional, contact and support pages. As such, the first step in acquiring product/service specification sheets/pages from an organization's website is to be able to distinguish the relevant files from the other files on the website. To do so, we have developed a two-step approach to distinguish the data we want from the other data.
First, the types of files that we look up are narrowed down. Since we are looking for product specifications, this data is often presented on webpages or in attached files, especially in Portable Document Format (PDF) files. Although there might be product specifications in other formats, for the sake of this research and to simplify the situation we have assumed that product specifications are limited to the following file types: webpages with the extensions .htm and .html, and .pdf files.
In order to get hold of the available files in the mentioned formats, a piece of software known as a crawler walks through a website, starting at the website's homepage, and saves the links it comes across on each webpage. The software then follows these links to other pages and checks the links on those pages as well. This process provides an overview of a website's structure, which can be used for, for example, analysing, monitoring or extracting data. In most modern crawlers one can limit the scope of the crawling to specific extensions. For the sake of this research we have limited crawling to .htm, .html and .pdf files, as explained before.
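As an illustration, a minimal breadth-first crawler of this kind could look as follows. This is only a sketch, assuming the third-party requests and BeautifulSoup (bs4) packages; a production crawler would additionally respect robots.txt and request rate limits:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, extensions=('.htm', '.html', '.pdf')):
    # Breadth-first walk over one domain, collecting links to files
    # with the given extensions.
    domain = urlparse(start_url).netloc
    seen, queue, found = set(), [start_url], []
    while queue:
        url = queue.pop(0)
        if url in seen or urlparse(url).netloc != domain:
            continue
        seen.add(url)
        if url.lower().endswith(extensions):
            found.append(url)
        if url.lower().endswith('.pdf'):
            continue  # binary files are not parsed for further links
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            queue.append(urljoin(url, a['href']))
    return found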
The second step is to determine which files contain relevant information (i.e. contain product/service specifications). To do this, the content of each file is scanned and checked for specific keywords (e.g. units) and structural elements (e.g. HTML tags). If a file satisfies specific constraints, it becomes a candidate product/service specification. After complying with these constraints, the pages are further scanned with specific regular expressions, which look for certain textual structures and/or patterns. Tsay et al. (2009) did research on how to integrate product specifications from different websites. To find product specification pages, they used the specification patterns proposed in Tsay et al. (2007) and found them effective. For example, the values of features in a product specification are often presented as a number of digits followed by a unit (e.g. cm, mm, etc.). In the upcoming sections we describe the constraints and regular expressions that we used in more detail.
After identifying product/service specification pages/files, we would like to be able to classify them. To do so, we need to capture specific information from the page. Since we want to know the categorization of products and services, information about the location of a file in the site's hierarchy comes in handy. We have therefore looked for three types of elements: the Uniform Resource Locator (URL) of the file, Uniform Resource Identifiers (URIs) existing in the file, and the so-called breadcrumbs. URIs, such as ISSN numbers, are commonly known, pre-defined structures, so finding a URI for a specification would provide us with a pre-defined structure to classify it. However, in an initial test we were not able to acquire any URIs from the websites of the industrial partners involved in the GloNet project, nor from common EPCs such as Siemens. The next most reliable source is the breadcrumbs: a breadcrumb is a trail on each page of a website that represents the current location within the website's hierarchy, and it is therefore a good representation of that hierarchy. Figure 1 shows an example of these breadcrumbs on the website of Siemens.
Figure 1: example of the breadcrumb structure on Siemens.com
When a breadcrumb is missing, the URL can be a useful alternative. The URL is most useful when it is in a search engine friendly (SEF) format. A SEF URL is a good candidate as an alternative for the product structure because, similar to a breadcrumb, it can represent a website's structure in a meaningful manner. Figure 2 provides an example of a SEF URL, while figure 3 demonstrates a non-SEF URL.
Figure 2: example of Search Engine Friendly URL
Figure 3: example of regular URL (non search engine friendly)
After identifying a candidate structure for product/service specification classification, one would be interested in comparing the products and services of two organizations. This can only be achieved if the classification structures are somehow comparable. We propose comparing the candidate structures against an accepted standard structure, in order to classify the product specifications in a common manner. Using this classification, we can then identify similar products of the two organizations. In this research we have chosen to compare the derived structures against the European standards NACE and PRODCOM. This tackles and eliminates the problem of heterogeneous categories.
The standards used for the classification of economic activities are the Nomenclature statistique des activités économiques dans la Communauté européenne (English: Statistical Classification of Economic Activities in the European Community), commonly known as NACE, and PRODCOM, a sub-list made specifically for the products of the Manufactured Goods section of Industry, Trade and Services. Both are European standard classification systems for industry (European Commission, 2010). The NACE coding system is a 6-digit coding system, while PRODCOM codes are 8 digits. The NACE code represents the main activity of a business; based on these classifications, taxes and laws, for example, are assigned to certain types of industry.
For this research, the standards are used to see whether a product belongs to a category that is of interest to the designer of a network of organizations for collaborative manufacturing. If a product belongs to a category of interest, for example as a competing product, the user can anticipate it. The same holds for searching for sub-products of a complex product: one can look for a certain category of sub-products required for designing and constructing the complex product.
In order to categorize each specification within the NACE or PRODCOM structures, the words that make up the breadcrumb or URL are compared against the NACE and PRODCOM category descriptions. This determines to which category a certain specification belongs (figure 4).
The comparison of these structures is done by extracting the content of the breadcrumbs and placing that content in a sentence. This sentence is then compared to every category description in the NACE/PRODCOM structures, and a distance is calculated. This distance is normalized by dividing it by the combined length of the breadcrumb content and the description of the NACE/PRODCOM code. The normalized distance is used to rank the relevance of a category to a specification and to determine to which categories a product belongs.
Figure 4: example of how a website's structures can be compared to the NACE/PRODCOM structures
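A minimal sketch of this comparison step is given below. The Levenshtein implementation is the standard dynamic-programming formulation, and the (code, description) tuple format for the category list is an assumption for illustration:

def levenshtein(a, b):
    # Standard dynamic-programming edit distance: insertions,
    # deletions and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def rank_categories(breadcrumb_text, categories):
    # categories: assumed list of (code, description) tuples. The
    # distance is divided by the combined length of both strings so
    # that long descriptions are not unfairly penalized.
    ranked = []
    for code, description in categories:
        d = levenshtein(breadcrumb_text.lower(), description.lower())
        ranked.append((d / (len(breadcrumb_text) + len(description)), code))
    return sorted(ranked)

Calling rank_categories('renewable energy wind power', nace_list)[0] would then yield the normalized distance and code of the best-matching NACE category, nace_list being the parsed category list described in the method section.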
Method
For the purpose of this research, the Energy website of Siemens.com has been used as an example case:
http://www.energy.siemens.com
The Siemens website contains a lot of usable data: product specifications on webpages, PDF files containing product specifications, and many other webpages that are not of interest. This makes it a good example case when designing a crawling approach, since the crawler has to crawl through many different types of data and pick out the desired ones.
Open Source Tools
The open source community offers a wide range of free, open source web crawlers, such as crawler4j (Crawler4j, n.d.), Apache Nutch (Apache Nutch, 2014) and Scrapy (Scrapy, n.d.). Crawler4j and Apache Nutch are both Java based; Scrapy is Python based and was used as the tool to retrieve the desired data from the website.
Scrapy is a pre-built web crawler in which a user can simply declare a website, its domain and the items to extract. The following example (code 1) gives a better insight:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import Spider
from scrapy.selector import Selector
from thesis.items import ThesisItem

class SiemensSpider(Spider):
    name = "thesis"
    allowed_domains = ["siemens.com"]
    start_urls = [
        "http://www.siemens.com/entry/cc/en/",
    ]

    rules = (Rule(SgmlLinkExtractor(allow=('/')), follow=True),
             Rule(SgmlLinkExtractor(allow=('/')), callback='parse'),)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []
        for site in sites:
            item = ThesisItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            items.append(item)
        return items  # returning the items lets Scrapy export them, e.g. to JSON

Code 1: example of Scrapy setup

As can be seen in code 1, the first step is to import the libraries necessary to execute the script. Then the class SiemensSpider is created, in which a spider named 'thesis' is declared. The domain on which the spider has to operate is declared in allowed_domains. The start_urls option is used to declare the webpage or pages on which you want the spider to start. Usually the homepage of a website is declared here, so the spider starts at the top of the hierarchy and works its way down. However, when you want to crawl a specific page and move on from there (e.g. a sitemap), you can specify this in the same variable.
The example of code 1 retrieves two elements from the webpages of Siemens: links located within <ul><li> elements, and their titles. If the crawler finds an <a> element within the <ul><li> elements, it saves the requested information to a JSON file.
The first tests on the Siemens website were performed with this script. To check whether it worked, the crawler was released; after a minute, the terminal in which the Scrapy file was being executed showed it had already crawled over 800 files. After another 30 seconds, lots of error messages began to appear in the terminal. At first this was believed to be a glitch on the website or a bug in the script, but it turned out that Siemens had blacklisted the crawler. This made it impossible to crawl the entire website. We therefore chose to download the website for offline use, to avoid the problem of being blacklisted.
After the files of the Siemens website were ready for offline use, the next problem became clear: since these files were no longer located at a URL, Scrapy could not reach them. Running a local server could solve this problem, but it would not take away another one: Scrapy uses a template approach for crawling, meaning the script is tied to the elements of a specific website. As shown in code 1, we first dive into the <ul><li> elements and then grab the <a> elements. This works for the Siemens website, but when crawling another website with a different structure, the outcome could be empty, and a completely new crawler would have to be written for that particular website.
Since the approach of this research needs to work for many more websites than just Siemens', the decision was made to drop Scrapy and write a simple script ourselves that crawls through the website documents and analyzes them for the required data, without a template approach.
Python Script
The script made for this research is Python based. A small development set of 5 HTML files from the Siemens website was made to develop the first step of the approach: getting an overview of the website's files. Since it was no longer necessary to build a crawler that collects all the available links, it was possible to simply create a function that loops through all the files in a local, offline directory and adds them to a list. This was done for the .htm(l) files. The use of multiple levels of directories was no problem, so the structure derived from the Siemens website could easily be used, as seen in figure 5.
Figure 5: example of the offline, structured files set from www.energy.siemens.com
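A minimal sketch of such a function, using Python's standard os.walk to traverse the nested (language) directories:

import os

def collect_files(root, extensions=('.htm', '.html')):
    # Walk the offline mirror, including nested language directories,
    # and collect the paths of all files with the given extensions.
    paths = []
    for directory, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.append(os.path.join(directory, name))
    return paths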
HTM(L) Files
The HTM(L) files were scanned for two separate elements: the title tag and the content of the webpage.
Title
The <title> tag was chosen because of the possible value it has for a webpage: it is often a very small abstract of the page and can therefore be useful to scan (Xue et al., 2007). For example, a title could be presented as 'Product Specification Product A'. By scanning the <title> tag for 7 words or combinations of words, coverage of common words for a product specification page is obtained, as can be seen in code 2.
When the script loops through all the files, the BeautifulSoup (Richardson, 2014) Python library is used to extract the title tag from each document. First, a path to a file is picked from the list of all documents. Second, with BeautifulSoup the header of the webpage is scanned for a title tag. If the tag is found, it is returned by the function so that it can be used for other purposes, i.e. being scanned for the words in code 2.
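Sketched in code, this step could look as follows (the file handling is illustrative; BeautifulSoup's html.parser backend is assumed):

from bs4 import BeautifulSoup

def extract_title(path):
    # Parse a stored HTML file and return the text of its <title> tag,
    # or None when the tag is absent.
    with open(path, 'rb') as f:
        soup = BeautifulSoup(f, 'html.parser')
    if soup.title is None:
        return None
    return soup.title.get_text()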
The scanning is done using regular expressions (Clarke & Cormack, 1997). The text of the title tag is obtained with the .get_text() function provided by BeautifulSoup and converted to the UTF-8 format. This is done to avoid problems with different text encodings. The following regular expression is used (code 3):
A variable w is given to this regular expression, which in this case is one of the 7 variables listed in code 2. This is a simple expression which checks whether the given variable w is present, with any type of content around it. If the variable is found, a True value is returned; if not, None is returned.
Content
The content was derived with the .get_text() function of BeautifulSoup, in the same way as the title tag. The content within the body tags was also converted to the UTF-8 format to avoid formatting problems.
This text is scanned using a regular expression. The expression is as follows (code 4):
The expression checks for the presence of units. It allows two possibilities: a digit followed by a (white)space and a unit, or a digit immediately followed by a unit. After the unit, there are three possible endings: a whitespace, an end of string, or a non-alphanumeric character. This last option avoids counting digits followed by normal words as a hit.
def titleWords():
    l = ['specification', 'specifications', 'spec', 'specs', 'technical', 'data', 'technical data']
    return l

Code 2: the function that returns the list with the variables the title content will be scanned for

r'\b({0})\b'.format(w)

Code 3: the regular expression for scanning title content

r'(?P<metric>\d{0}(\s|\Z|\W)|\d\s{0}(\s|\Z|\W))'.format(w)

Code 4: the regular expression for scanning page content for units
Again, a variable w is given to the regular expression to check whether it is present in the text. The list of variables is an external file called metrics.txt, with 38 entries. The variables are standard units, such as Hertz (Hz), meter (m) and Volt (V), and their variants (such as mm and GHz). The list was loaded into the script, and a for loop was used to check whether the units were present in the content. If a hit was found, the function registered a True result and continued to check for the other units; if a certain unit wasn't found, the next unit was simply loaded into the for loop to check its presence.
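A sketch of this loop is shown below, assuming metrics.txt holds one unit per line; counting the distinct units found also feeds the minimum-unique-units constraint used later:

import re

def count_unique_units(content, metrics_path='metrics.txt'):
    # metrics.txt is assumed to hold one unit per line (Hz, mm, V, ...).
    with open(metrics_path) as f:
        units = [line.strip() for line in f if line.strip()]
    hits = 0
    for w in units:
        # The metric pattern from code 4; re.escape is our addition, to
        # guard against regex metacharacters inside a unit name.
        pattern = r'(?P<metric>\d{0}(\s|\Z|\W)|\d\s{0}(\s|\Z|\W))'.format(re.escape(w))
        if re.search(pattern, content):
            hits += 1
    return hits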
PDF files
To be able to crawl the content of the PDF files, the files first had to be converted. Since PDF is a presentation-oriented rather than content-oriented format, the flow of the text is not continuous. A PDF file is simply formatted to look good to the eye, but its technical structure can be a disaster. Although the PDF format has become an ISO (International Organization for Standardization) standard (Lazarte, 2008), there are still third-party applications that create and process PDF files (Castiglione et al., 2010). It is possible that they use their own formatting, which looks the same but can be structurally completely different. Therefore, it is impractical to analyze a PDF document directly.
Because of this problem, the PDF files had to be converted to a structure that can be analyzed, such as text or HTML files. Initially the choice fell on HTML files, but this option did not provide the expected results, so the focus shifted to text files.
As a first step in the process, the choice was made to convert the PDF files to HTML files. With the open source Python program PDFMiner (Shinyama, 2013), the PDF files were converted into HTML. It turned out that the software converts every piece of text into a span containing that text. So every line of text, every table element present in the PDF, gets converted into a span with a style that, for example, mimics one of the lines needed to draw a table.
Since the span couldn’t be used for the recognition of specification tables, we decided to drop the conversion to HTML pages and converted the PDF files to text files instead. The software provided this option as well. The sizes of the files were smaller, which is an advantage when handling large amounts of files. For example, a 735 KB PDF file was converted into a 406 KB HTML file. The text file was only 4 KB, which is significant decrease, especially when you’re working with thousands of files and should consider the overhead and storage of the program.
Now that the switch was made to text files, the approach was slightly different. Whereas for the HTML files we were searching for table structures, this cannot be done in text files, which made it impossible to use the table-structure requirement as a constraint for them. Therefore, we had to think of other characteristics to recognize a specification page, since the text files only contain lines of text. When looking at the lines of specification files, technical data is often not presented in long text lines; there are often very short lines containing a couple of words. So, as a constraint, we used the maximum length of a line. Since we want to know the optimal maximum length, we use a variable for the maximum number of words per line. With this new variable, the PDF scanning process has two variables: the minimum number of unique units per file and the maximum number of words per line. The maximum number of words per line applies only to lines in which a unit is located.
PDFMiner converted the PDF files into text files. The miner simply scraped the text from a PDF file and printed it, in that same format, into a text file. This way, none of the textual layout was lost.
The conversion process had to be done by hand. A bash script was written that looped through a directory and converted the PDF files it found into text files. The content of each text file was only the text from the PDF file, with the column layout of the PDF preserved. So, for example, when there is a column with multiple lines, that same column structure is visible in the text file, as shown in figures 6 and 7 (a Python version of this conversion loop is sketched after the figures).
Figure 6: example of the PDF layout structure (Siemens, n.d.)
Figure 7: example of the text layout structure
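The same conversion loop, sketched in Python rather than bash; it shells out to PDFMiner's pdf2txt.py command-line tool, which performs the actual conversion (paths and error handling are illustrative):

import os
import subprocess

def convert_pdfs(root):
    # Walk the offline mirror and run pdf2txt.py for every PDF,
    # writing a .txt file next to the original document.
    for directory, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.pdf'):
                src = os.path.join(directory, name)
                dst = os.path.splitext(src)[0] + '.txt'
                subprocess.run(['pdf2txt.py', '-o', dst, src], check=False)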
This is good, because this structure can be used to identify possible representations of product specifications. For example, when describing the maximum temperature for a certain product, you often see mark-‐ups like this:
Maximum temperature: 35°C
This way, we can see that the line contains only a small number of words, and that it contains a colon. The constraints used to develop the restrictions for a specification page are as follows:
• A minimum number of unique units per page (both file types)
• A maximum number of words in the line of a unit (PDF files)
• A table structure must be present in the page (HTML files)
• A colon present in the line of a unit, or in the previous line (both file types)
Mixtures of these constraints, combined with AND and OR statements, were made and tested on the test setup, as sketched below.
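One possible AND-combination of these constraints for the converted text files is sketched here; the threshold values are illustrative defaults, not the values finally chosen in this research:

import re

def is_specification(lines, units, min_units=3, max_words=6):
    # lines: the converted text file split into lines; units: the list
    # read from metrics.txt. A line counts when it contains a unit, is
    # short enough, and has a colon in the same or the previous line.
    found = set()
    previous = ''
    for line in lines:
        for w in units:
            pattern = r'(?P<metric>\d{0}(\s|\Z|\W)|\d\s{0}(\s|\Z|\W))'.format(re.escape(w))
            if (re.search(pattern, line)
                    and len(line.split()) <= max_words
                    and (':' in line or ':' in previous)):
                found.add(w)
        previous = line
    return len(found) >= min_units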
When a file satisfied the conditions for identifying a specification page, it was added to a list. This list was then used for a function that looked for the breadcrumb structure on the page.
The first step in getting the breadcrumb structure was to use a template approach. The advantage of a template approach is that it is very effective: it is possible to extract precisely the data you want. To do this, we looked at the HTML source code of the Siemens website and examined which HTML elements contained the data we wanted. The following piece of code was found (figure 8):
Figure 8: example of the breadcrumb structure of Siemens. Source: http://www.energy.siemens.com/hq/en/renewable-energy/wind-power/

As we can see, the breadcrumb structure is contained within a div with the id 'breadcrumb-zone'. Within this div, many links are present. However, when looking at the breadcrumb trail on the website, only three links can be seen (figure 9).
Figure 9: visual representation of the breadcrumb trail. Source: http://www.energy.siemens.com/hq/en/renewable-energy/wind-power/
All the other links are dropdown menus that appear when hovering over Home, Energy or Renewable Energy. All these links are redundant for our purpose, so they can be ignored. We can see that the wanted URLs are presented within a dd element and an a element that has the id 'link'. However, looking at figure 8, the last element, which contains the Wind Power crumb, doesn't have a link but a span element, so this element has to be retrieved independently. All the retrieved link contents are placed in a list. The structure of this list can be seen in figure 10:
Figure 10: example of the derived breadcrumb structure
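A sketch of this template approach, with the element names ('breadcrumb-zone', dd/a/span and the id 'link') taken from the markup in figure 8:

from bs4 import BeautifulSoup

def siemens_breadcrumbs(html):
    # Template approach: only valid for pages using Siemens'
    # 'breadcrumb-zone' markup as shown in figure 8.
    soup = BeautifulSoup(html, 'html.parser')
    zone = soup.find('div', id='breadcrumb-zone')
    if zone is None:
        return []
    crumbs = [a.get_text(strip=True) for a in zone.find_all('a', id='link')]
    last = zone.find('span')  # the current page is a span, not a link
    if last is not None:
        crumbs.append(last.get_text(strip=True))
    return crumbs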
The big disadvantage of this approach is that it only works on the Siemens website, or on any website that uses the exact same structure for its breadcrumb trail. For example, on the website of Samsung Netherlands we see the following structure and code (figures 11 and 12):
Figure 11: example of the breadcrumb trail of Samsung Netherlands' website. Source: http://www.samsung.com/nl/consumer/mobile-phone/mobile-phones/tab/SM-T330NZWAPHN
Figure 12: example of the source code of the breadcrumb trail of Samsung Netherlands' website. Source: http://www.samsung.com/nl/consumer/mobile-phone/mobile-phones/tab/SM-T330NZWAPHN

Samsung uses another structure for its breadcrumbs, and it is likely that other websites do too. Therefore, it is necessary to use another approach, since the template approach will not work on every website.
The initial idea was to look for symbols like '>', '//' and '-'. These symbols are often used as separators (Instone, 2002), as can be seen in figures 9 and 11. But when looking at the source code of these websites, the symbols could not be found as text. It turned out that the symbols were images inserted into the webpage by CSS code. This made it impossible to extract the breadcrumb trail with this approach.
But, what can be observed is that both websites use an element that has a class/id with the word ‘breadcrumb’ in it. When looking at other websites such as apple.com (figure 13 & 14) and uva.nl (figure 15 & 16), we can see the same occurrence:
Figure 13: example of the breadcrumb trail on Apple.com. Source: http://www.apple.com/iphone-5s/app-store/
Figure 14: example of the breadcrumb structure on Apple.com. Source: http://www.apple.com/iphone-5s/app-store/
Figure 15: example of the breadcrumb trail on uva.nl. Source: http://www.uva.nl/onderwijs/bachelor/bacheloropleidingen/item/informatiekunde.html
Figure 16: example of the breadcrumb structure on uva.nl. Source: http://www.uva.nl/onderwijs/bachelor/bacheloropleidingen/item/informatiekunde.html

Both structures contain an id that has the word 'breadcrumb' in it; in the case of Apple it is the plural version, 'breadcrumbs'. With this knowledge, we decided that a possible approach is to look for elements that have a class and/or id containing the word breadcrumb. A simple regular expression was used (code 5), which matches the word within any surrounding content:
outcome = soup.body.find(x, {"class": re.compile('breadcrumb')})

Code 5: the expression used to find elements whose class contains the word breadcrumb

This piece of code searches a given variable for elements that have a class with the word breadcrumb in it. The given variable is in this case everything between the body tags of an HTML page; the soup variable contains the entire HTML page, generated with the BeautifulSoup package for Python. The script runs the same expression for id instead of class, and x takes seven different values (HTML elements): 'span', 'div', 'p', 'ul', 'ol', 'li', 'nav'. These are all common elements in HTML files. For the purpose of this research we used these seven, but more elements can be added for further use. However, because a for loop is used for every one of these elements, each added element makes the script take longer to run on a file set. When running the script on a large data set with thousands of files, this can affect the running time negatively.
For this research, breadcrumbs were available on the webpages. If a webpage doesn't contain breadcrumbs, the URL can be used as well: remove the base of the website's URL and use '/' as a separator. The result is a structured sentence that can be used for comparison. Figure 17 shows an example:
Figure 17: example of a search engine-friendly URL of the Siemens website
The structure derived here is: 'hq en fossil power generation gas turbines'. It may not be as accurate as a breadcrumb structure, but it is a decent replacement when a breadcrumb is not available.
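A sketch of this fallback, assuming search engine-friendly URLs in which hyphens separate words within a path segment; for the URL in figure 17 it reproduces the sentence above:

from urllib.parse import urlparse

def structure_from_url(url):
    # Strip the base of the URL and use '/' as the separator between
    # hierarchy levels; hyphens within a segment are treated as spaces.
    path = urlparse(url).path
    parts = [p.replace('-', ' ') for p in path.split('/') if p]
    if parts and '.' in parts[-1]:
        parts = parts[:-1]  # drop a trailing file name such as index.htm
    return ' '.join(parts)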
When a breadcrumb structure was found, its content was extracted with the .get_text() function from BeautifulSoup. The next step in the approach was to compare this text to the NACE and PRODCOM structures. We approached this by comparing the complete breadcrumb content to each description of a NACE/PRODCOM code. The NACE/PRODCOM lists were downloaded from the Eurostat European Commission website: an Excel sheet with all the information was downloaded, and the NACE/PRODCOM codes and their descriptions were placed in two text files, named nace.txt and prodcom.txt. The codes and descriptions were separated by a hyphen, which made it easier to separate the codes from the descriptions for extraction purposes.
As soon as the script started running, the NACE and PRODCOM text files were read with Python's .readlines() function. Each line was then placed in a global list that could be used in every function. The reason for this was that multiple functions needed the information in the list, and by making it a global variable it could be used in multiple functions without creating a lot of overhead. With a for loop it was possible to iterate over the list.
The suggestions for the NACE/PRODCOM categories are made by calculating the Levenshtein distance and normalizing this distance by dividing it by the sum of the number of characters in the breadcrumbs and in the description. The Levenshtein distance between two sentences or words is the minimum number of character edits, such as insertions, substitutions and removals, required to change one sentence or word into the other. The normalization is done to eliminate the disadvantage for NACE/PRODCOM codes with long descriptions.
The calculations resulted in 996 distances for the NACE list, since there are 996 entries in that list, and 5569 distances for the PRODCOM list, since there are 5569 entries in the PRODCOM list. Out of those normalized distances, the smallest ones were chosen and printed. This way a user can see both suggestions (NACE & PRODCOM), including the percentage of difference (normalized value x 100%).
Hadoop
Although we used this method, another method is possible: Hadoop, in particular the Hadoop MapReduce framework. The biggest asset of this method is that it can handle and process vast amounts of data (data sets of multiple terabytes) in parallel on large clusters. This is useful when analyzing many websites for product specifications. The algorithm divides the tasks over multiple machines that can run their tasks independently of each other. So, for example, each machine can crawl a website on its own, return the results and, when it is finished, start with a new website, while another machine is still crawling its current website.
Hadoop uses two functions: a map function and a reduce function. The basic idea is that the map function sorts and filters data, after which it returns a list of keys and values. The reduce function performs a summary operation and reduces this list to a minimum, such as a single key and value. This way, you can collect a large amount of data without making the outcome very complex.
The code written for this research uses the same method: the first step is to crawl a website, analyze its pages and decide whether each is a specification page. If so, the path to the file is added to a list; this is the key. The value is the sentence with the derived NACE or PRODCOM code from the breadcrumb section. Together they form a key and a value, and that is the mapping part.
For the reducing part, each value of a key is scanned, and the smallest value is returned. The smallest value means the smallest normalized distance, and the smallest normalized distance gives the NACE and PRODCOM code we want to know.
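In plain Python, the two steps could be sketched as follows (rank_categories refers to the comparison sketch in the approach section; real Hadoop jobs would express the same logic through the MapReduce API):

def map_page(path, breadcrumb_text, categories):
    # Map step: for a page that passed the specification constraints,
    # emit (path, (normalized distance, code)) pairs.
    for norm_dist, code in rank_categories(breadcrumb_text, categories):
        yield (path, (norm_dist, code))

def reduce_page(path, values):
    # Reduce step: keep only the smallest normalized distance per key,
    # i.e. the best NACE/PRODCOM suggestion for that file.
    return (path, min(values))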
To get a complete view of all the other products by analyzing data, you need to scan a lot of data. But as pointed out, for this research we needed to download a website before we could scan it. If we wanted to download the entire web, it would take an enormous amount of time, data storage and money. There are, however, organizations that have done this for us. One of them is Common Crawl, which has created a dataset containing more than 6 billion pages (over 100 terabytes) that is updated regularly. The dataset can be accessed and analyzed by everyone (Common Crawl, n.d.). They offer different types of files:
• WARC files: store the raw crawl data
• WAT files: store computed metadata for the data stored in the WARC
• WET files: store extracted plaintext from the data stored in the WARC
We are especially interested in the WARC files. These are the files with the original page structures, possibly including breadcrumb structures.
The MapReduce method can be applied to this dataset, making it possible to analyze its content in a very short period of time.
Test setup
To download the Siemens website files, DeepVacuum (DeepVacuum, n.d.) for Mac OS X was used. This program made it possible to download a website for offline use. We used this tool to slowly download a part of the Energy division of the Siemens website, located at http://www.energy.siemens.com. We set the program to download files at a rate of 50 kB/sec and to make requests at a 5-second interval. Only files with the extensions .htm, .html and .pdf were downloaded. This resulted in a multi-day download session and a 10.45 GB data set. This is not the complete website, since the download limit was set at 10 GB. The reason for this large size is the different languages: each language is located in a different directory, all containing the same files, but in the respective native language.
This dataset contained 13,142 files: 7,188 HTM(L) files and 5,954 PDF files. Of those files, 110 were manually checked and placed in a test set. The test set contained 65 HTML files and 45 PDF files. Of the 65 HTML files, 15 were product specifications; of the 45 PDF files, 15 were product specifications. In total, 30 of the 110 files were product specifications.
The PDF files were converted into text files. To keep the files apart, the text files and HTM(L) files were separated and placed in different directories. The text files had numeric file names. The specification files were placed in a folder 'yes', the non-specification files in a folder 'no'. This made it easier to identify possible true/false positives/negatives. The file names of the HTM(L) specification pages all contained the letters 'sgt', and the non-specification pages didn't. This information was used to distinguish the possible true/false positives/negatives for the HTML files.
When a file was scanned and marked as a specification file, its path was added to a list. With a regular expression, this path was checked for the word 'yes' or 'no' for the PDF files, and for 'sgt' for the HTM(L) files. To give an example: of the 65 HTM(L) files, 15 were specification pages. If an execution of the script returned 20 results of which 12 contained 'sgt' in the path, we could say there were 12 true positives, 8 false positives and 3 false negatives. This way we could easily see how certain changes to the script performed. These results (the true/false positives/negatives) were used to calculate the F-measure.
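For completeness, a sketch of the F-measure calculation from these counts, using the example numbers above:

def f_measure(tp, fp, fn):
    # Standard F1 score computed from the counts derived via the
    # 'sgt'/'yes' markers in the file paths.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# The example above: 20 results, 12 true positives, 8 false positives,
# 3 false negatives.
print(f_measure(12, 8, 3))  # ~0.69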
To make sure the regular expressions for finding specification patterns didn't miss any specifications or recognize non-specifications as specifications, a test file was made with multiple variations of specifications. The regular expressions were run on this test file and each discovery was printed to the screen. This way, the robustness was tested and validated.
Results
The results of the executions of the script are presented in this chapter. A series of different constraints was tested for the HTM(L) and text files. First these results are presented, followed by the results for the NACE/PRODCOM classification.
HTM(L) Files
The first execution of the script used no constraints: if a single unit was found in the file, it was marked as a specification page. This gave the following results (table 1):

Minimum # units | Hits | Total positives | True positives (out of 15) | False positives | False negatives | True negatives
1               | 65   | 28              | 15                         | 13              | 0               | 37

Table 1: results for the HTM(L) pages with a minimum of 1 unit per page
As we can see, all 15 specification pages have been marked correctly, but 13 non-specification pages have also been marked as specification pages. This is too many, so the constraints presented in the method section will be used for the recognition of a specification page. The first constraint is a minimum number of unique units that must be present on a page for it to be marked as a specification page. These results are presented in graph 1:
Graph 1: number of hits per minimum number of unique units per page (x-axis: minimum # unique units, 1 to 6; y-axis: # hits, 0 to 30; series: total positives, true positives (out of 15), false positives)