Sphider-plus    version 4.2024d  -  The PHP Search Engine




All required information.

[ Documentation Summary ]

Preamble: The info presented here is valid only for the latest release of Sphider-plus.

At present, version 4.2024d, published October 04, 2024, is the current release.




[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is available for Sphider-plus, separated into different submenus:

Sites:

- Add Site

- Index only the new

- Re-index all

- Re-index only preferred URLs

- Erase & Re-index (available also for individual URLs)

- Import/export URL list

- Approve sites

- Banned domains


Categories:

- Add, edit, delete

- Create new subcategory under


Index:

- Basic indexing options

- Advanced options


Clean:

- Clean keywords not associated with any link

- Clean links not associated with any site

- Clean Category table not associated with any site

- Clean Media links

- Clear Temp table

- Clear Search log

- Clear 'Most Popular Page Links' log

- Clear 'Most Popular Media Links' log

- Clear Spider log, separate and bulk delete

- Clear Thumbnail images, separate and bulk delete

- Clear Text cache

- Clear Media cache

- Clear IDS log file

- Clear flood attempts log file

- Clear all entries in addurl or banned table

- Truncate all tables in database


Settings:

- General Settings

- Index Log Settings

- Spider Settings

- Search Settings

- Order of Result listing

- Suggest Options

- Page Indexing Weights


Database:

- Configure up to 5 databases with an unlimited number of table sets

- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'

- Backup / Restore

- Copy / Move

- Optimize

  

Templates:

In order to enable customers to integrate Sphider-plus into existing sites, HTML templates are prepared for:

Search form

Text result listing

Media result listing

Most popular queries

etc.

Three different designs are offered, which may be selected in the 'Settings' submenu. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files:

.../templates/My_template/adminstyle.css

.../templates/My_template/userstyle.css

  

Statistics output:

- Top keywords (Top 50 with hit counter).

- All indexed thumbnails with ID3 and EXIF info.

- Largest pages, offering link URL and file size.

- Most Popular Searches for text links offering:

Link addr., total clicks, last clicked, last query (Top 50)

- Most Popular Searches for media links offering:

Link addr., total clicks, last clicked, last query (Top 50)

- Most Popular Links (click counter).

- Search log offering:

Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)

- Index log offering:

File-name, index date and delete option

- sitemap log offering:

sitemap.xml output

sitemap list offering file/page suffixes

- IDS log offering:

IP, host, query, impact, involved tags, date and time of intrusion.

- Flood attempts log offering:

IP, query, date and time of flood attempt.

- Auto Re-index log file

- Server info offering:

Server software, environment, MySQL, PDF-converter, image functions, php.ini file

PHP integration, PHP security info. Each item holding lists of details.

  

All text links, media links and thumbnails are active links.


As stated in the Introduction chapter, this search engine uses some PHP libraries and extensions. When opening the Settings interface, the existence of these libraries is tested by the software, and if a library is not part of the server environment, the corresponding option is not presented in the Settings interface. For example, if the 'rar' extension is not available, it will not be possible to index RAR archives and the corresponding checkbox will not be presented in 'Spider Settings'. In order to check the availability of all required libraries and extensions, the Debug mode will present the corresponding messages.




2. Indexing

2.1 Various options

As part of the Admin Site settings you may select the following options:


Full:

Indexing continues until there are no further (permitted) links to follow.


To depth:

Indexes to a given depth, where depth means how many "clicks" away the page can be from the starting page. Depth 0 means that only the starting page is indexed, depth 1 indexes the starting page and all the pages linked from it etc.


Re-index:

By selecting this mode, indexing is forced even if the page has already been indexed. Re-index only detects changes in the pages to be re-indexed; modifications in the Admin settings are not recognized.


Erase & Re-index:

By selecting this mode, indexing is forced even if the page has already been indexed.

Additionally, this mode will

clear the Sphider-plus database before the re-index process. It will leave the following untouched:

- Categories

- Query log

- Sites and all their options: spider depth, last indexed, can leave domain, title, description, URL Must include, URL must Not include.

If settings have been modified in the Admin section, this mode should be selected to update the database.


Spider can leave domain:

By default, Sphider never leaves a given domain, so links from domain.com pointing to domain2.com are not followed. By checking this option, Sphider can leave the domain; however, in this case it is highly advisable to define proper must include / must not include string lists to prevent the spider from going too far. This option must be activated if an .htaccess file is used for redirect directives.


Must include / must not include:

Explained in chapter: 4.2 Must include / must not include string list.

 

Multithreaded indexing:

Explained in chapter: 2.6 Multithreaded indexing.

 

Follow and create sitemap files. See below for details.

 

Word stemming. See below for details.



2.2 Allow other hosts in same domain

This Admin-selectable option allows indexing other hosts with the same domain name; it also ignores the TLD, SLD and 'www'.

If e.g. calling from http://www.sphider-plus.eu links like:

- http://sphider-plus.eu                 (without www.)

- http://www.info.sphider-plus.eu    (additional subdomain)

- http://www.sphider-plus.com        (different TLD)

- http://www.sphider-plus.tec.eu     (additional SLD)

will be followed if this option is activated in Admin settings.

There are 2 different options available in the Admin settings to cover this feature. The first one follows all links found during the index procedure. The second one only follows links to other hosts if the found links are redirected.



2.3 Word stemming

Sphider-plus offers language-specific stemming algorithms for 15 languages:

Bulgarian, Chinese, Czech, Dutch, English, Finnish, French, German,

Greek, Hungarian, Italian, Portuguese, Russian, Spanish and Swedish

Stemming is to be activated individually for the language that needs to be indexed. The corresponding common word list (holding the stop words not to be stored in the database) will automatically be activated together with the stemming language. For Chinese, Greek and Russian, the corresponding language support is additionally activated automatically. These additional features remain activated even if word stemming is later reset to 'none', and they need to be deselected manually.

On the other hand, if activated for indexing, the stemming selection must remain activated, because the query input must also be stemmed. As only the etymons are stored in the database during the index procedure, this will create results independent of whether the query 'walk', 'walks', 'walked' or 'walking' is entered (for English stemming).


2.4 Periodical Re-indexing

This mode offers automatic re-indexing, either of all sites or site-specific, started periodically at a defined time interval.

In Admin backend the time interval is selectable to

3 hours, 12 hours, 1 day, 1 week or 1 month.

Also, the number of periodically performed re-indexing procedures can be defined in the Admin backend.

Once started, the re-index procedures will silently work in the background without creating monitor output but, like in all other indexing modes, writing the index results into log files. Additionally, a log file showing the status of the periodical indexer is created, presenting all dates and times when the re-index procedures were started, as well as the count. This additional log file is available for the Admin in the 'Statistics' menu as 'Auto Re-index log file'.

The periodical re-indexer can be started and aborted in the Admin backend by selecting the 'Periodical Re-index' submenu in the 'Sites' view.

For site-individual re-indexing, the periodical re-indexer can instead be started and aborted in the "Options" menu of each site.


2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while level 5 should be used for non-prioritized sites. If new URLs are not manually supplied with any priority, the level will automatically be set to '1'.

When invoking the option 'Re-index only preferred sites', the admin may select a suitable level for the next index procedure. Thus, only those URLs assigned the corresponding level will be re-indexed.


2.6 Multithreaded indexing

The Admin setting:

Define number of threads allowed for index procedures (max. 10)

activates parallel indexing. For multiple-site indexing, this option will speed up the procedure significantly. If this option is activated, browser output of logging data, as well as real-time output in a second browser window (tab), is suppressed. Nevertheless, all index results will be stored in log files in the subfolder .../admin/log/

The names of the log files look like:

db2_100524-21.47.56_1.html (log file of first thread)

db2_100524-21.48.12_2.html (log file of second thread)

and are built from the following items:

db2 - Number of database.

100524 - Date (May 24, 2010)

21.47.56 - Time when this thread was started (hours.minutes.second).

1 - ID-number, which will be incremented by each thread.

If multithreaded indexing is not activated, the ID will be set to '0'.

 

The individual threads will be activated by means of the Admin dialog. Multithreaded indexing is foreseen for:

- Re-index all

- Erase & Re-index all

So, if 'Erase & Re-index' is selected, after the 'Erasing' dialog the threads can be started in sequence. It is not necessary to invoke all predefined threads. The dialog (browser) window will always present the last started thread. It is strongly recommended not to close this window or to use the 'Return' button of the browser. When a thread has finished indexing, a 'Ready' message will be shown and the 'Back to admin' button is presented. Nevertheless, a previously started thread might still be busy indexing another site. This can be seen in the Admin 'Sites' view by the 'Unfinished' message at the corresponding site. Refreshing the 'Sites' window shows the successful end of all threads by replacing the 'Unfinished' message with the date of the last index.

During multithreaded indexing, browser options like caching, pre-fetching and turbo mode should be disabled.

Multithreaded indexing for command line operation is presented below.


2.7 Create thumbnails during index procedure

This Admin-selectable option is supplied by a web service with an individual key, which needs to be signed up for by the Sphider-plus admin. The web shots will be presented as thumbnails with a size of 130 x 174 pixels as part of the text result listing, individually for each result. It should be taken into account that this option will slow down the index procedure significantly. As it is delivered by a web service, this option is not available for Intranet and all 'localhost' applications.

In the 'Sites' view of the Sphider-plus admin backend you may check the availability of this web service. The time (in seconds) required to create one thumbnail is also presented, so you may decide whether you want to use this web service for all pages to be indexed.

More details are described in the readme.pdf documentation.

 

2.8 Prevent indexing of known malware and phishing pages

This feature is a service provided by Google that enables applications to check Internet URLs against Google's constantly updated lists of suspected phishing and malware pages. Sphider-plus uses version 1 of this service, verifying each URL to be indexed online. Thus the admin does not need to update any database and will receive the verification results at the most current level from the Google database. As it is delivered by a web service, this option is not available for Intranet and all 'localhost' applications.

To be activated in the Admin backend, this service additionally requires an individual key, to be signed up for at Google.

More details are described in the readme.pdf documentation.

 

2.9 Follow Sitemap file

If activated in the Admin settings, Sphider-plus will use the links found in sitemap.xml or sitemap.xml.gz files. This significantly increases the speed of index and re-index, because the links do not have to be searched for in the text part of each page.

This option will also force Sphider-plus to re-index only links that are:

- New and not yet known in Sphider's link table

- Links with a 'last modified' date, which is newer than Sphider's 'last indexed' date in database.

Sitemaps are always expected in the root folder of the site to be indexed and must be named

sitemap.xml or sitemap.xml.gz
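For orientation, a minimal sitemap.xml in the standard sitemaps.org format might look like the following (URLs and dates are only illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2024-10-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/news.html</loc>
    <lastmod>2024-10-03</lastmod>
  </url>
</urlset>

The 'lastmod' value is what gets compared against Sphider's 'last indexed' date to decide whether a page needs re-indexing.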

If 'Follow sitemap.xml' is activated and a valid sitemap was found, the log output

Links found: 0 - New links: 0

is no longer shown, because all links are delivered from the sitemap file and new links are not searched for during index / re-index.

 

If a Sitemap index is detected in a sitemap.xml file, and if multiple Sitemap files are available, Sphider-plus will process the secondary Sitemaps and extract all links for index / re-index. Gzip-compressed files (Sitemap index files as well as the Sitemap files) will also be processed, independent of their file suffix.

A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, the Sitemap index file must be UTF-8 encoded.

For individual Sitemaps with different names and/or Sitemaps that are stored in subfolders, Sphider-plus offers the option of defining their URL and name in the 'Add site' menu, as well as in the 'Edit site' menu. Nevertheless, links in these individual Sitemaps need to follow the rules defined at http://www.sitemaps.org/, are always treated as absolute links and must be from a single host. RSS (Really Simple Syndication) 2.0 and Atom 0.3 or 1.0 feed Sitemaps are currently not supported by Sphider-plus.

Extensions of the Sitemaps protocol, like creating a custom namespace (<!-- namespace extension -->), are also not supported by Sphider-plus.


2.10 Use private sitemap instead of global sitemap.

Foreseen to hold a subset of the global sitemap, this optional selection in the 'Settings' menu of the admin backend may be used to index only frequently modified pages of a site. It is useful to speed up the index procedure by concentrating only on new page content.

The private Sphider-plus sitemaps are always expected in the root folder of the site to be indexed and must be named

sp-sitemap.xml  or  sp-sitemap.xml.gz

Overall, links in these individual sitemaps need to follow the rules defined at http://www.sitemaps.org/, are always treated as absolute links and must be from a single host. RSS (Really Simple Syndication) 2.0 and Atom 0.3 or 1.0 feed sitemaps are currently not supported by Sphider-plus. Extensions of the sitemaps protocol, like creating a custom namespace, are also not supported by Sphider-plus.

2.11 Create Sitemap file

Creating a Sitemap during index / re-index can be activated in the Admin settings. This option offers the following features:

- Compatible with http://www.sitemaps.org/schemas/sitemap/0.9 this module automatically creates a sitemap.xml file.

- In Admin settings the folder name for the Sitemaps can be defined.

- The xml files will be individually named like 'sitemap_www.abc.de.xml'

- When running 'Re-index', 'Re-index all' or 'Erase & Re-index', existing Sitemaps will be overwritten with the current data set.

- Additional option: Use a unique name (sitemap.xml) for all created sitemap files.

  Could be selected, if only one single Site is to be indexed.

  To be used in conjunction with selecting the destination folder for the sitemap files.

Additionally, a list file can be created, sorted alphabetically and also offering all file/page suffixes.



3. Using the indexer from command line

(Subject to re-definition and re-coding; currently without guarantee of proper function)

For more information, please visit the Sphider-plus forum and have a look in section 'Tips & Tricks & Mods'

 

3.1 All options

It is possible to spider web pages from the command line, using the syntax:

php spider.php <options>

where <options> are:

-all Reindex everything in the database.

-eall Erase database and afterward re-index all.

-new Index all new URLs in the database which have not yet been indexed.

-erase Erase the content of the database.

-erased Index all meanwhile erased sites.

-preferred <level> Index with respect to preference level.

-preall Set 'Last indexed' date and time to 0000.

-u <url> Set the URL to index.

-f Set indexing depth to full (unlimited depth).

-d <num> Set indexing depth to <num>.

-l Allow spider to leave the initial domain.

-r Set spider to reindex a site.

-m <string> Set the string(s) that a URL must include (use \\n as a delimiter between multiple strings).

-n <string> Set the string(s) that a URL must not include (use \\n as a delimiter between multiple strings).

For example, for spidering and indexing http://www.domain.com/test.html to depth 2, use:

php spider.php -u http://www.domain.com/test.html -d 2

If you want to reindex the same URL, use:

php spider.php -u http://www.domain.com/test.html -r

 

3.2 Multithreaded indexing

For command line operation, parallel indexing has no restriction on the number of threads; it is limited only by the server resources. Parallel indexing is enabled for several different methods, as described below.

3.2.1 Index only the new

Index all new URLs in the database which have not yet been indexed <-new>

Simply start several threads and add individual IDs to the option parameter like

php spider.php -new1

php spider.php -new2

etc.

The IDs will be added to the name of the corresponding log files like:

db2_100524-21.47.56_ID1.html (log file of first thread)

db2_100524-21.48.12_ID2.html (log file of second thread)

IDs can be defined according to personal requirements, but the limitations for file names with respect to the OS should be taken into consideration. There is no auto-increment of IDs as for the multithreaded indexing that is initialized by the Admin dialog.

If IDs are not added, it is obligatory to delay the start of each thread by one second, because the names of the log files are created with a resolution of one second. If threads are started too early, several of them will write into one log file; completely unsynchronized, the resulting log file will be unreadable.

3.2.2 Re-index all

To be invoked by once preparing the database with the command

php spider.php -preall

This will reset all 'Last indexed' entries to '0000', but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5 sum) is still available for a fast re-index procedure.

Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter, like:

php spider.php -erased1

php spider.php -erased2

etc.

The IDs will be added to the names of the log files as described above

3.2.3 Index erased sites

Index all meanwhile erased sites <-erased> will index only those sites that have been individually or bulk erased. Multithreaded indexing can be invoked by starting several threads and adding individual IDs to the option parameter, like:

php spider.php -erased1

php spider.php -erased2

etc.

The IDs will be added to the names of the log files as described above



4. Keeping pages, words and files from being indexed

4.1 robots.txt

The most common way to prevent pages from being indexed is using the robots.txt standard, by either putting a robots.txt file into the root directory of the server, or adding the necessary Meta tags into the page headers.
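As a simple illustration (the paths and rules are only examples), a robots.txt in the server root might contain:

User-agent: *
Disallow: /private/
Disallow: /tmp/

and the equivalent Meta tag inside the <head> of a single page would be:

<meta name="robots" content="noindex,nofollow">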

This directive can be temporarily overridden, site-specifically, for the next index procedure by the advanced option:

Temporary ignore 'robots.txt'

 

4.2 Must include / must not include string list

A powerful option Sphider-plus supports is defining a 'Must include / Must not include' string list for a site (to be found in Sites / Options / Edit). Any URL containing a string in the 'URL must Not include' list is ignored. Any URL that does not contain any string in the 'URL Must include' list is likewise ignored.

In any case, all strings are treated as case-sensitive.

All strings in the string list should be separated by a new line (Enter). For example, to prevent a forum in your site from being indexed, you might add /forum to the 'URL must Not include' list. This means that all URLs containing the string /forum will be ignored and won't be indexed.

Using Perl-style regular expressions instead of literal strings is also supported, but only a string starting with a '*' is considered to be a regular expression, so that

*/[a]+/

denotes a string with one or more letter a in it.
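As an illustration (these entries are examples, not defaults), a 'URL must Not include' list might look like:

/forum
/print/
*/session[0-9a-f]+/

Each row is either a literal substring or, if it starts with */, a regular expression; any URL matching one of the rows is skipped.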

In case that all sites should contain a 'URL Must include' or 'URL must Not include' rule, the strings could be placed into the files:

.../include/common/must_include.txt

.../include/common/must_not_include.txt

In the default Sphider-plus download, the two files 'must_include.txt' and 'must_not_include.txt' are empty.

While calling 'Settings' in the admin backend and enabling, in section 'General Settings', the option:

Store global values of string lists 'Must include' and 'Must Not include' for all URLs

and afterwards pressing any 'Save' button, the content of these files is transferred into the corresponding option fields of all sites.

This is done only once, so that later on individual values can be added to each URL in the 'Sites' view. See 'Options' => 'Edit' for each URL individually.

 

4.3 Ignoring links

Sphider-plus respects the rel="nofollow" attribute in <a href..> tags, so for example the link foo.html in

<a href="foo.html" rel="nofollow">

is ignored. Also, if the nofollow flag is set in the header of a site, its links will not be followed.

This directive can be temporarily overridden, site-specifically, for the next index procedure by the advanced option:

Temporary ignore 'nofollow' directive

 

4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in certain parts of most pages (like a header, footer or a menu).

Any part of a page between

<!--sphider_noindex--> and <!--/sphider_noindex-->

tags is not indexed; however, links in it are followed.
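For example, a footer that should not contribute keywords could be wrapped like this (the surrounding markup is only illustrative):

<!--sphider_noindex-->
<div class="footer">
<a href="imprint.html"> Imprint </a> | <a href="contact.html"> Contact </a>
</div>
<!--/sphider_noindex-->

The footer text is then excluded from the index, while the links imprint.html and contact.html are still followed.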

 

4.5 Ignoring parts of a page by <ul id='abc'>

Ignoring parts of a page by the <!--sphider_noindex--> tags requires direct access to the page, because the tags need to be added (edited) to the page.

A more flexible method, which does not require direct access, is enabled by the Admin setting:

'Use list of ul classes to ignore the complete ul content during index/re-index'

If enabled in Admin settings, the values defined in the list file …/include/common/uls_not.txt

will be used to delete the content between <ul class='abc'> and </ul>.

The same applies to <ul id='abc'> and </ul>.

Also, the global attribute 'inert' will be used to ignore the element content.

Values in this common list may end with a wildcard, so that 'menu*' will work for classes like

menu1, menu2, menu_left, etc.

Multiple ul tags will be handled.

For even more flexibility, the file …/include/common/uls_not.txt may alternately contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.

Example: */menu[0-5]/

As of Sphider-plus v.4.2024d, class values together with id values will also be obeyed. For example:

<ul class="submenu2" id="dropdown-africa"> text to be deleted </ul>

For this example the list-file …/include/common/uls_not.txt needs to contain in one row separated by a blank character:

submenu2 dropdown-africa

 

4.6 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <!--sphider_noindex--> tags requires direct access to the page, because the tags need to be added (edited) to the page.

A more flexible method, which does not require direct access, is enabled by the Admin setting:

'Use list of div ids or classes to ignore the complete div content during index/re-index'

If enabled in Admin settings, the values as defined in the list-file .../include/common/divs_not.txt will be used to delete the content between

<div id='abc'> and </div>

Multiple and nested divs will be handled. Alternatively, this option can also be used for <div class='abc'>.

Also, the global attribute 'inert' will be used to ignore the div content.

 

As of Sphider-plus v.4.2024d, multiple class or id values will also be obeyed. For example:

<div class="composer rachmaninov"> text to be ignored </div>

For this example the list-file …/include/common/divs_not.txt needs to contain in one row separated by a blank character:

composer rachmaninov

 

4.7 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values as defined in the list-file .../include/common/divs_use.txt will be used to index only the content between

<div id='abc'> and </div>

Nevertheless, links outside of the div tags will be followed. Values in this common list may end with a wildcard,

so that 'menu*' will work for ids like

menu1, menu2, menu_left, etc.

Multiple and nested divs will be handled.

For even more flexibility, the file …/include/common/divs_use.txt may alternatively contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.

Example:    */table[0-5]/

 

4.8 Ignore HTML elements defined by <tagname> . . . . </tagname>

This option is foreseen to cooperate with the new HTML5 elements like

section, nav, aside, hgroup, article, header, footer, etc.

HTML elements are written with a start tag, with an end tag, with the content in between:

<tagname> this content </tagname>

For more details, please notice the HTML element and tag references at

http://www.w3schools.com/html/html_elements.asp

http://www.w3schools.com/tags/

If enabled in Admin settings, the values as defined in the list-file …/include/common/elements_not.txt will be used to delete the content between <tagname> and </tagname> .

Nevertheless, links inside the tags will be followed. Values in this common list automatically get a wildcard appended, so that 'aside' will work for HTML elements like

aside1, aside2, aside_left, etc.

Also for elements like

<nav class="menu">

<ul>

<li> <a href="#"> Start </a></li>

<li><a href="#"> About us </a></li>

<li> <a href="#"> Contact </a></li>

</ul>

</nav>

only the name of the element (which is nav) needs to be added into the list-file.

For even more flexibility, the file …/include/common/elements_not.txt may alternately contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.

Example: */nav[0-5]/

Please keep in mind that element names placed in …/include/common/elements_not.txt will be processed case-sensitive.

 

4.9 Index only HTML elements defined by <tagname> . . . . </tagname>

This is the inverse of the function described in the chapter above.

If enabled in Admin settings, the values as defined in the list-file …/include/common/elements_use.txt will be used to index only the content between <tagname> and </tagname> .

Please keep in mind that element names placed in …/include/common/elements_use.txt will be processed case-sensitive.

 

4.10 Ignored words

Beginning with version 1.7, Sphider-plus offers the capability to prepare language-specific common files. Common words that are not to be indexed can be placed into individual files. The names of these files must start with 'common_' and end with the suffix '.txt', like "common_eng.txt". The files must be placed into the folder:

.../include/common/

The common word files should not be used if 'phrase search' is the standard type of search, as Sphider-plus will have problems finding complete phrases. Therefore, in Admin / Settings / Spider settings, the use of common word files may be activated / deactivated by the checkbox:

Use commonlist for words to be ignored during index / re-index?

Take notice that the 'Ignored words' function is not case-sensitive for most languages, so you only need to include one spelling in the common_xyz.txt file.

However, the common word list is case-sensitive for the following languages:

- Arabic

- Chinese

- Cyrillic

 

4.11 Use of Whitelist

Sphider-plus offers the capability to control the index / re-index procedure by a list of words called 'whitelist'. Only if the text of the page contains words from the whitelist will the page be indexed / re-indexed. The list is placed in the file .../include/common/whitelist.txt
Text content is defined by the Admin settings by means of what to index: full text, title, keywords etc.
The content of links (URLs) is controlled separately by the 'Must include / must not include' string list.

The use of the whitelist may be activated / deactivated by two different checkboxes in Admin / Settings/ Spider settings:

- Use whitelist in order to index / re-index only those pages
   that include ANY of the words in whitelist

- Use whitelist in order to index / re-index only those pages
   that include ALL the words in whitelist

Take notice that these functions are not case-sensitive, so you only need to include one spelling in the whitelist.txt file.

The content of the whitelist is treated as 'words'. So the word 'kinder' in your whitelist will not accept pages that only contain the word 'kindergarten'.

Be aware not to place blank rows into the whitelist. Also the list should end with the last word; not with a line feed or a blank row.

- Each word in list must be in a separate row.

- One word per row.

- No blank rows.

- No blank row at the end of the file.
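Following these rules, an illustrative whitelist.txt (example words only) would simply be:

sphider
search
indexing

A page is then indexed only if its text contains any of these words (first checkbox) or all of them (second checkbox).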

 

4.12 Use of Blacklist

Sphider-plus offers the capability to control the index / re-index procedure by a list of words called 'blacklist'. If the content of the page contains one word of the blacklist, it will not be indexed / re-indexed. The list is placed in the file .../include/common/blacklist.txt

In Admin / Settings/ Spider settings, the use of the blacklist may be activated / deactivated by the checkbox:

Use blacklist to prevent index / re-index of pages that contain any of the words in blacklist?


A second setting in the same settings section enables the rejection of queries that contain a word from the blacklist, even if the evil word is only part of the query.

If the checkbox:

Use blacklist to delete queries that contain any of the words in blacklist?

is activated, the complete query is deleted and a blank search is performed.


Please keep in mind that 'Use of Blacklist' is implemented in a different way than 'Use of Whitelist'. The blacklist interprets its content as strings. So the word 'kinder' in the blacklist will also prevent indexing of a page containing the word 'kindergarten'.

Be aware not to place blank rows into the blacklist. Also the list should end with the last word; not with a line feed or a blank row.

- Each word in list must be in a separate row.

- One word per row.

- No blank rows.

- No blank row at the end of the file.

 

4.13 Ignored files

The list of file types that are not checked for indexing is placed in .../include/common/ext.txt. This file holds all file suffixes for those types of files that are to be ignored during the index / re-index procedure.

The 'ext.txt' file is independent of the media files to be indexed. All file types not to be followed for text indexing must be placed in 'ext.txt'. It is to be seen as a blacklist for file suffixes.

In contrast, the files

- image.txt

- audio.txt

- video.txt

are whitelists that include the suffixes of files to be indexed, according to the type of media.


    4.14 Canonical <link> tag

    As defined by Google, Microsoft and Yahoo! in February 2009, Sphider-plus will also follow the instruction of a rel="canonical" link. You may simply add this <link> tag to specify your preferred page version:

    <link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

    inside the <head> section of all the duplicate content URLs:

    http://www.example.com/product.php?item=swedish-fish&category=gummy-candy

    http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

    and Sphider-plus will understand that the duplicates all refer to the canonical URL:

    http://www.example.com/product.php?item=swedish-fish.

    The duplicate pages will be ignored and not indexed. Sphider-plus takes the rel="canonical" as a directive, not a hint. The canonical link may also be a relative path, but is not allowed to refer to a different domain. Unfortunately the creation of canonical link tags needs to be done manually. So special care has to be taken that other directives like robots.txt or rel="nofollow" will not prevent the crawling of the canonical origin.



    5. UTF-8 Support and 'Preferred charset'

    Starting with version 1.2, Sphider-plus provides Unicode assistance, and starting with version 2.1 the conversion is obligatory. In consequence, the impact is significant.

    First of all, the complete full text and all header information like the title, keywords and description tags need to be converted into Unicode. The consequence is an increase in the time required for indexing.

    As also suggested by Yiannes [pikos], three steps are integrated to realize this procedure:

    1. Detect the charset of site, page or file.

        This information is normally presented as part of the HTML header.

        If not available, or for files without header like .doc, .rtf, .pdf, .xls and .ptt files,

        the 'Preferred charset' (as defined in Admin settings) will be used to convert the file into Unicode.

        In other words: it is not possible to convert DOCs, PDFs etc. that are coded in 'foreign' charset.

        Only those with your personal charset will be converted correctly. Also it is not possible

        to convert a Chinese and a Cyrillic coded PDF document at the same time. It is necessary to

        adapt the 'Preferred charset' before invoking the index procedure for the sites and their

        links to these documents.

    2. By means of the PHP function 'iconv()' all texts will be converted into UTF-8.

        This step is successful, if the required charset (for the content to be converted)

        is part of your local PHP installation. In order to find out which charsets are available

        in your installation, please notice the files in server folder:

        .../apache/bin/iconv/

        Depending on the installation you will find about 200 charset files that iconv()

        is able to use for converting.

    3. If the PHP function fails, finally the class 'ConvertCharset' is invoked. This class,

        originally designed by Mikolaj Jedrzejak, enables converting for a lot of charsets.

        But it takes more time than the compiled PHP function 'iconv()'.
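    The three steps can be summarized in a small PHP sketch (the function below and the method call on the converter class are simplified illustrations, not the actual Sphider-plus code):

    <?php
    // Simplified illustration of the conversion chain described above.
    function to_utf8($text, $declared_charset, $preferred_charset)
    {
        // Step 1: use the charset declared in the page header,
        // otherwise fall back to the 'Preferred charset' Admin setting.
        $charset = ($declared_charset !== '') ? $declared_charset : $preferred_charset;

        // Step 2: try the compiled iconv() extension first (fast path).
        $converted = @iconv($charset, 'UTF-8//IGNORE', $text);
        if ($converted !== false) {
            return $converted;
        }

        // Step 3: fall back to the pure-PHP ConvertCharset class (slower);
        // the method name used here is hypothetical.
        $converter = new ConvertCharset();
        return $converter->Convert($text, $charset, 'utf-8');
    }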

    As result of the charset conversion, the user is enabled to search also for words that contain non-Latin characters.

    In order to enable converting all charsets into UTF-8, upper- and lowercase characters are kept distinct. So (normally) the query 'html' will not deliver results for sites and files that contain the string 'HTML'. Both are different keywords and are stored separately in the Sphider-plus database.

    Starting with version 1.6 Sphider-plus offers the additional option:

    'Enable distinct results for upper- and lowercase queries'

    If enabled in Admin settings, everything remains as described above. But if this checkbox is unchecked, the result listing will deliver all results, independent of the query input: HTML, html or even hTmL queries will deliver the same (all) results.

    The checkbox for this option is placed with full intention in section 'Spider settings', as activating and also deactivating always requires an 'Erase & Re-index' procedure.

    The following 63 charsets are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:


    WINDOWS
    windows-1250 - Central Europe
    windows-1251 - Cyrillic
    windows-1252 - Latin I
    windows-1253 - Greek
    windows-1254 - Turkish
    windows-1255 - Hebrew
    windows-1256 - Arabic
    windows-1257 - Baltic
    windows-1258 - Viet Nam
    cp874 - Thai - this file is also for DOS


    DOS
    cp437 - Latin US
    cp737 - Greek
    cp775 - BaltRim
    cp850 - Latin1
    cp852 - Latin2
    cp855 - Cyrillic
    cp857 - Turkish
    cp860 - Portuguese
    cp861 - Iceland
    cp862 - Hebrew
    cp863 - Canada
    cp864 - Arabic
    cp865 - Nordic
    cp866 - Cyrillic Russian (this is the one used in IE as "Cyrillic (DOS)")
    cp869 - Greek2


    MAC (Apple)
    x-mac-cyrillic
    x-mac-greek
    x-mac-icelandic
    x-mac-ce
    x-mac-roman

    ISO
    iso-8859-1
    iso-8859-2
    iso-8859-3
    iso-8859-4
    iso-8859-5
    iso-8859-6
    iso-8859-7
    iso-8859-8
    iso-8859-9
    iso-8859-10
    iso-8859-11
    iso-8859-12
    iso-8859-13
    iso-8859-14
    iso-8859-15
    iso-8859-16


    MISCELLANEOUS
    gsm0338 (ETSI GSM 03.38)
    cp037
    cp424
    cp500
    cp856
    cp875
    cp1006
    cp1026
    koi8-r (Cyrillic)
    koi8-u (Cyrillic Ukrainian)
    nextstep
    us-ascii
    us-ascii-quotes

    DSP implementation for NeXT
    stdenc
    symbol
    zdingbat

    And specially for old Polish programs:
    mazovia

     


    This list is to be read only as a complement to the list of charsets to be found in the /iconv/ subfolder of your server.



    6. Search modes

    Besides the original Sphider search queries like:

    - Search for a single word

    - AND and OR search

    - Search for a phrase

    Sphider-plus offers 7 additional modes to enter queries:

    - Search with wildcard

    - Strict search

    - Tolerant search

    - Link search

    - Media search

    - Search only in one domain

    - Search in suggested categories

    Wildcard, strict and tolerant search modes are available only for single query word input.

     

    Search with wildcards *

    This mode enhances the Sphider-plus capabilities to search also for parts of a word. The mode is invoked by entering a * as a wildcard for the unknown part of the search query.

    Implemented for single-word queries, wildcards could be used like:

    *searchme

    *searchall*

    *search*more*

    Depending on the Sphider-plus database, a lot more results may appear using this search mode. In order to limit the amount in the result listing, there is an option in the admin backend named

    Define maximum count of result hits for queries with wildcards

    To be found in ‘Settings’ => ‘Search Settings’.

    If you want to see the multiple words found in the database that will be highlighted:

    In your editor open the script:

    .../include/searchfuncs.php

    Find the row containing the text:

    Multiple words found in the database to be highlighted: '$hi'

    Uncomment this row. Now, if there is more than one result word, all the multiple words found in the database will be presented on top of the result listing.

     

    Strict search !

    This variant is invoked by entering a ! as the first character of the search query. If you search for '!plus', only results for the word 'plus' will be presented in the result pages. No results for words like 'spider-plus' or 'spiderplustec' will be shown. This is the reverse function of 'Search for part of a word by means of * wildcards'. Strict search only delivers results from the text part of the indexed pages and will respond only to a single-word query. Strict search overrides the 'Phrase Search' option, which is nullified.

     

    Tolerant search

    This mode enables a tolerant search for Sphider. Selectable in the search form like AND, OR and Phrase Search, a new item "Tolerant Search" is added.

    If this item is selected, query input "perdida" will also deliver results for all sites that contain the word "pérdida". Inverse function is also implemented: "pérdida" input will deliver all results for "perdida".

    If enabled, this mode equalizes search input for e=é=è=ê and all the other vowels like ä=a=à=â, ü=u, o=ö etc. Upper-case letters like Ä=A are also taken into account. Tolerant search overrides the 'Distinct results for upper- and lowercase queries' setting and will mark all results.

    Originally developed to deliver as many results as possible for queries with entities and accents, and also to simplify user input, this mode also delivers results that are "like" the query input. So something like a "Did you mean" facility is already an integral part of this search method.

     

    Link search

    Invoked by starting the query input with ' site: ', the user is enabled to search for all pages of a domain. It is not necessary to enter the full domain address. For example if you enter 'site:sphider-plus.eu' you will get a list of all pages that belong to the domain http://www.sphider-plus.eu

    If the search query is part of more than one domain address in Sphider's site table, a list of these domains will be presented as an intermediate result. If you then click on the desired domain in this list, all links (pages) of this domain will be presented as the final result listing.

     

    Media search

    Media search is invoked by an additional checkbox in the Search Form. Media will be found individually for:

    - Images

    - Audio

    - Video

    Entering 'media:' (without quotes) will present all media stored in the database. For more details please notice the chapter Media Search

     

    Search only in one domain

    This mode is invoked by entering:

    site:www.abcd.de query

    and will present only results delivered by the (example) domain http://www.abcd.de. The search input does not need a blank between site: and the URL of the domain, but it does need one between the URL and the query. In contrast to the 'Link search', this mode requires the full URL to be entered into the search form (including http://).

    Beside www domains, this mode is usable also for localhost applications. The Admin settings 'Address to localhost document root' is used to rebuild the basic address of the local domains.

     

    Search in categories

    After the Admin has assigned the indexed sites to different categories, the user of the search engine may narrow the result listing down to manually selectable categories, or follow the suggestions of Sphider-plus, offering alternative categories and subcategories that would also deliver results.

    For more details please notice the chapter Search in categories

     

    Greek language support

    In order to support international languages, Sphider-plus stores all data UTF-8 coded in the database. But some languages require special attention. For example, the Greek language is supported as follows.

    In Admin backend there are some special settings, controlling the Greek language support:

    - Support Greek language. Offers correct support for upper and lower case Greek letters.

    - Convert all kind of Greek accents (upper and lower case) into their basic vowels

    Will present e.g. the same query results, when searching with the letter α as well as using:

    ἀ, ἁ, ἂ, ἃ, ἄ, ἅ, ἆ, ἇ, ὰ, ά, ά, ᾀ, ᾁ, ᾂ, ᾃ, ᾄ, ᾅ, ᾆ, ᾇ, ᾰ, ᾱ, ᾲ, ᾳ, ᾴ, ᾶ or ᾷ

    - Transliterate queries with Latin characters into their Greek equivalents (Supports the old and new Greek alphabet)


    By activating the above 3 options, the following behaviour is available:


    Query input (with/without accents) | Latin transliteration, old (Perseus) | Latin transliteration, new | Greek word found
    κυπρος | - | kypros | κύπρος
    ῥύσασθαι, ρυσασθαι | rusasqai | rusasthai | ῥύσασθαι
    βαπτίσματος, βαπτισματος | baptismatos | baptismatos | βαπτίσματος
    ἀλλὰ, αλλα | alla | alla | ἀλλὰ
    ψυχρω | - | psuchrw | ψυχρῷ
    αμφοτερα | amfotera | amphotera | ἀμφότερα
    θέλημά | qelhma | thelhma | θέλημά
    δοξα | doca | doxa | δόξα
    προσεύχεσθε | - | proseuchesthe | προσεύχεσθε
    απορρήξας | aporrhcas | aporrhxas | ἀποῤῥήξας

    Queries with wildcards are also followed:
    anthrwp* finds ἄνθρωπος, άνθρωποι, Άνθρωπος
    *noia* finds Ομόνοια, έννοια

    Additional remarks:

    Searching for "αλλα" will not present any results for words written in Latin characters.

    Instead searching for "alla" will present results for:

    alla, ἀλλα, ἀλλὰ, Αλλα, Άλλα, etc.

    Depending on which result was found first in the full text, the corresponding text extract will show the first hit. In order to see all available results in the result listing (Latin + transliterated Greek), it is suggested to increase the value of the Admin setting:

    Define maximum count of result hits per page, displayed in search results.

    If a "strict" search is invoked (!ἀλλὰ), the conversion of Greek accents into their basic vowels will not be obeyed. Also the "strict" search (!alla) will overwrite the transliteration option, so that αλλα, ἀλλὰ, Αλλα, Άλλα, etc. will not be found for the query "!alla".

    If the option " Transliterate queries with Latin characters into their Greek equivalents" is activated, only Greek suggestions will be offered. It is assumed that this option is activated, because Greek results are preferred.




    Block queries

    Three different methods of blocking queries are offered:

    - Block all queries sent by harvesters, bots and known evil user-agents (about 1.000 UAs)

    - Block all queries sent by Meta search engines like Google, MSN, Amazon, etc. (about 4.000 IPs)

    - Block all queries sent by known spammers (about 190.000 IPs)

    Each option needs to be activated separately in admin backend.

    The lists holding the bad bots, as well as the Meta search engines are editable .txt files.

    In contrast, the spammer IPs that persist in abusing forums and blogs with their scams, ripoffs, exploits and other annoyances, are automatically updated every 24 hours by means of a web service, and are added to the Meta search engine IPs.



     

    7. Chronological order for result listing

     

    7.1 Text result listing

    Sphider-plus offers 11 methods of sorting the text results:

    - By relevance (weight %)

    - By count of hits in full text

    - By last indexed (date and time)

    - Main URLs (domains) on top

    - By URL names

    - By file suffix

    - Only top 2 per URL (like Google)

    - Most Popular Links on top

    - Promoted domains on top

    - Pages holding a catchword on top

    - Single result per page (ignoring arguments in URL)


    The current selection of 'order for result listing' is visible to the users as an additional headline on all result pages. If not desired, this option can be de-selected in the Admin setting:

    'Show mode of chronological order for result listing as headline'.

    In order not to confuse the user, for the 3 methods

    - By URL names and then weight

    - Only top 2 per URL (like Google)

    - Most Popular Links on top

    the output of relevance (weight %) is suppressed in result listing.

    For the method 'By URL names' an additional setting is available called:

    'Define number of results shown per domain in result listing'

    Using this option, the result listing will present an output similar to 'Like Google', but the count of links added to the main domain is selectable. 'Like Google' itself will always present one additional link beyond the main domain.

    For the 'Most Popular Links on top' method, Sphider-plus uses the previously learned link acceptance. If a user leaves the result listing by clicking on any of the offered links, Sphider-plus will memorize this decision. The user is temporarily redirected to the script .../include/click_counter.php, which stores the user's link decision, last query, time and date before leading the user to the real destination.

    This link specific 'best click' counter is used as teach-in to define the chronological order of result listing. In order to prevent promoted clicks on a specific link, there is a delay timer before the next user click will be accepted. To be set in Admin /Settings/ Index Log Settings, the setting defines the idle-time in seconds.

    If there are more results than rated links, the rest of the result listing will be presented by relevance (weight), using the weighting of the last index / re-index.

    The 'Most Popular Links on top' item overrides all other orderings of the result listing.

    For the result ordering method 'By relevance (weight %)', the weight is calculated. The situation changes if, in the Admin settings, the item 'Instead of weighting %, show count of query hits in full text' is activated. Then only the hits in the full text are used to calculate the order of the result listing. Keyword hits in the URL name, path, title tag etc. are not taken into consideration.

    The weighting of:

    - Word in web page title tag

    - Word in the domain name

    - Word in the path name

    - Word in web page keywords tag

    may be influenced individually according to personal preferences. The corresponding settings can be made in the section 'Page indexing weights' of the Admin settings.
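    As a simplified illustration only (this is not the actual Sphider-plus formula, and the weight values are just examples), such per-location weights could contribute to a page score like this:

    <?php
    // Illustrative only: combine keyword hits from different page locations
    // with configurable 'Page indexing weights'.
    $weights = array(
        'title'   => 10, // Word in web page title tag (example value)
        'domain'  => 20, // Word in the domain name (example value)
        'path'    => 5,  // Word in the path name (example value)
        'keyword' => 3,  // Word in web page keywords tag (example value)
    );
    $hits = array('title' => 1, 'domain' => 0, 'path' => 1, 'keyword' => 2);

    $score = 0;
    foreach ($weights as $location => $weight) {
        $score += $weight * $hits[$location];
    }
    echo $score; // 10*1 + 20*0 + 5*1 + 3*2 = 21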

    For the method 'Single result per page', result listing will ignore arguments in URL. This will present only one result for URLs like:

    . . . /pizza-restaurants.html

    . . . /pizza-restaurants.html?page=1

    . . . /pizza-restaurants.html?page=2

     

    Sorting search results by means of 'file suffix' allows sorting all results with respect to the page/file suffix.

    So, for example, it could be selected, that all results from .html pages are shown first, all results from .php pages second, and finally all results from pdf documents at the end of the result listing.

    Controlled by the file

    .../include/common/file_suffix.txt

    which contains a list of suffixes to be expected in the result listing. The order of the suffixes in this list determines their importance in the result listing. The list is freely editable.
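    An illustrative file_suffix.txt matching the example above (assuming one suffix per row, most important first) would be:

    html
    php
    pdf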

    It might be helpful to select, in admin backend => Settings => Spider Settings, the option:

    'Additionally create a list file, sorted alphabetically by file suffixes'

    before indexing your site, as this list contains all suffixes found during the index procedure. It can be viewed under

    admin backend => Statistics => List of sitemaps

    As stated, the file

    .../include/common/file_suffix.txt

    is freely editable and can be created from the suffixes presented in the admin backend.

     

    Additionally it is possible to create two different methods for 'promoted sorting' of the result listing:

    1. As part of the Admin settings, a domain name or part of the name could be entered. All search results belonging to this domain will be placed on top of result listing.

    2. All pages containing a catchword will be displayed on top of the search result listing. As part of the Admin settings, the catchword could be entered.

    Both methods of promoted sorting can be combined. If domain name and also the catchword are entered in Admin settings, both conditions must be fulfilled to become a promoted link in result listing.


    7.2 Media result listing

    Independent from sorting the text results, 5 different modes of sorting the media results are Admin definable:

    - By title (alphabetic)

    - By file suffix

    - By image size

    - By 'Last queried'

    - By 'Most popular'



    8. PDF converter

    Starting with version 4, a new PDF converter is implemented in Sphider-plus. Realized as a pure PHP script, the new converter no longer requires the definition of an individual path.

    The new converter indexes text and images in non-encrypted PDFs.



    9. Clean resources during index / re-index.

    In order to prevent performance problems and memory overflow for a large number of URLs, Sphider-plus may clean unused resources during index / re-index. Selectable in the Admin settings, this item will periodically:

    - Free memory that is allocated to unused MySQL resources.

    - Unset PHP variables, which are no longer required.

    As this clearing work is done several times during index / re-index of every URL, additional capacity is required. Consequently, the overall indexing time will increase. So this item should be selected only for a huge number of URLs. Depending on

    - Memory size allocated to PHP

    - Total number of URLs

    - Number of internal and external links

    - Size of text to be indexed for each page

    - CPU clock rate

    - System RAM

    there will be an individual limit for when to enable this feature. Following the discussion on the Sphider forum, this feature should be activated only if more than 100 sites are to be indexed, or when Sphider-plus dies a silent death during the index procedure, not indexing any more sites.

    Please take notice of the FAQ chapter:

    Error message: "Unable to flush table 'addurl'

    and

    Error message: " Access denied; you need the RELOAD privilege. . .



    10. Enable real-time output of logging data

    Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because:

    - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.

    - Server modules for Apache do buffering of their own that will cause flush() to not result in data being sent immediately to the client.

    - Browsers may buffer their input before displaying it. Netscape, for example, buffers text until it receives an end-of-line or the beginning of a tag, and it won't render tables until the </table> tag of the outermost table is seen.

    - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.

    As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in the Admin settings together with the update interval (1 - 10 seconds), AJAX technology was the approach used to realize this feature.

    Pressing one of the 'Start index / re-index' buttons, three additional scripts are involved.

    ( onclick=\"window.open('real_log.php')\" )

    .../admin/real_log.php

    By opening a new browser window / tab, this script takes over to display the latest logging data. Requesting fresh data from the JavaScript file 'real_ping.js', all new logging data will always be placed into <div id='realLogContainer' />. So, better not to press the 'Reload' button of your browser; the current <div /> might already be empty.

    .../admin/real_ping.js

    Script that transfers requests from HTML client to PHP server script and vice versa. Handling refresh for real-time logging during index and re-index procedure by means of asynchronous requests (AJAX) to the server.

    .../admin/real_get.php

    This script delivers the 'refresh rate' and the latest 'logging data' requested by the JavaScript file 'real_ping.js'. It also performs the reset of the 'real_log' table in Sphider's database.


    The latest logging data is delivered by the .../admin/messages.php script which, besides writing into the normal log file, feeds the table 'real_log' in Sphider's database. This is the buffer for the latest logging data.

    Prerequisites are the enabled 'Log spidering dates' and 'Log file format = HTML'. When activating the real-time output, both pre-conditions are automatically selected.



    11. Error messages and Debug mode

    Starting with version 1.7, Sphider-plus offers the capability to enable / disable the output of MySQL error messages as well as PHP error messages, warnings and notices. To be activated in Admin / Settings / Admin Settings, this capability should only be used for debug purpose. It is recommended to disable the output of these messages for production systems, as they could reveal sensitive information.

    Debug modes are individually available for the Admin backend as well as for the 'Search User'.

    Selection of the 'Debug mode' is implemented in the Admin settings. If the Debug mode is enabled, the found links and keywords are presented for all indexed pages in the log file output and also in the real-time log file output. Take into consideration that only the new links and keywords found on the respective page will be presented. Links and keywords already stored in the Sphider-plus database (because they were already detected on a former page) will not be presented again.

    The 'Debug mode' adds a comma and a blank to each keyword. So, debug output will be something like:

    New keywords found here:

    abc, defg, hijklm, nop, . . .

    As Sphider-plus also indexes special characters like commas and dots, keywords like defg, and hijklm. will be presented like:

    New keywords found here:

    abc, defg,, hijklm., nop, . . .

    The 'Debug mode' only modifies the log files. Sphider-plus database remains unaffected and will hold the same values as indexing without Debug mode. In other words, activating / deactivating this mode has no effect on the later search results.

    When 'Debug mode' is activated, the output of MySQL and PHP error messages is activated as well; Debug mode overrides the corresponding setting. When ending a debug session, the output of error messages must therefore be disabled manually.

    In order to check the availability of all required libraries and extensions Sphider-plus is using, the Debug mode will present the corresponding messages on top of the 'Settings' menu.

    If 'Debug mode' is enabled for the 'Search User', the cache activity is presented above the result listing in the form of status messages, as well as automatically performed mode settings for 'Strict search' and 'Search with wildcards'.

     

    Top

    12. Delete secondary characters

    This feature was implemented in order to remove unimportant (secondary) characters at the beginning and at the end of words. If activated in Admin / Settings / Spider Settings, the query input 'sphider' will deliver results for all keywords like:

    sphider

    sphider:

    "sphider"

    (sphider).

    sphider.

    sphider?

    sphider!

    The following characters in front of words are deleted:

    "  (

    Also, if at the end of words, these characters are deleted:

    )  "  ).  ,  .  :  ?  !

    If placed at the end of words that contain only digits, dots are not deleted (e.g. '27.'). So the search for

    27. November 2008

    remains available.

    For personal requirements the following two rows in .../admin/spiderfuncs.php may be edited.

    $file = preg_replace('/, |[^0-9]\. |! |\? |" |: |\) |\), |\)./', " ", $file); // kill characters at the end

    $file = preg_replace('/ "| \(/', " ", $file); // kill special characters in front of words

     

    Warning: This option should be used with special care and should not be activated for non-ISO-8859 charsets.
    Some special characters at the end of words might be erased by accident.

     

    Top

    13. Media search for images, audio streams and videos


    13.1 Media indexing

    Indexing of media files is enabled by separate Admin settings for:

    - Images

    - Audio streams

    - Videos

    Three separate files in subfolder .../include/common/ that are named

    image.txt

    audio.txt

    video.txt

    hold a list of associated file suffixes. Only media files with the corresponding suffix will be taken into account during index / re-index procedure. These three files may be edited for personal purpose.

    For images, a minimum width and height (H x V pixels) required for indexing may additionally be defined in Admin settings. Image size is observed for the following image types:

    .bmp .gif .j2c .j2k .jp2 .jpc .jpeg .jpeg2000 .jpg .jpx .png .swc .tif .tiff .wbmp

    Admin settings also allow selecting whether embedded and nested media files should be indexed. This was implemented because some servers hide their media files as embedded objects.

    Another Admin setting enables indexing of externally hosted media content. When linked from the currently indexed page, external media files will be indexed as well. This setting is independent of the Sites / Advanced Options setting 'Sphider can leave domain', which applies to text links only.

    Depending on the installed GD library, Sphider-plus will create thumbnails during the index / re-index procedure for the following image types:

    gif, png, jpg, jpeg, jif, jpe, gd, gd2 and wbmp

    Details about the currently installed GD-library (as part of the PHP environment) and the supported image formats are available at:

    Admin / Statistics / Server Info / Image funcs.

    Thumbnails are created as 'gif' or 'png' files, selectable in Admin settings. The gif files have a lower quality, but reduce the required memory by about 50%. The original images are re-sampled, and the thumbnail size is limited to a maximum of 160 x 100 pixels. In the result listing these stored thumbnails are used as previews.
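
    A minimal sketch, assuming the GD library is available, of how such a thumbnail could be generated; the file names and the exact resizing logic are illustrative and not the Sphider-plus implementation:

    // Create a thumbnail (max. 160 x 100 pixels) from a JPEG file with GD.
    $src = 'original.jpg';                         // illustrative source image
    $dst = 'thumb.png';                            // illustrative thumbnail file
    list($w, $h) = getimagesize($src);
    $scale = min(160 / $w, 100 / $h, 1);           // never enlarge the image
    $tw = (int) round($w * $scale);
    $th = (int) round($h * $scale);
    $in  = imagecreatefromjpeg($src);
    $out = imagecreatetruecolor($tw, $th);
    imagecopyresampled($out, $in, 0, 0, 0, 0, $tw, $th, $w, $h);
    imagepng($out, $dst);                          // or imagegif() for smaller, lower-quality files
    imagedestroy($in);
    imagedestroy($out);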

    As far as available, ID3 metadata and, for images, EXIF information is indexed and thereby becomes searchable. In the Admin backend, indexing and searching EXIF and ID3 info are separately selectable.

    In order to create thumbnails and to index ID3 and EXIF information, it is necessary to download the media file. For pages with multiple media content, the time for index /re-index procedure may increase dramatically.

    As ID3 information is not available for all audio and video files, a minimum play time required for indexing has not yet been implemented.

    In order to save memory resources, Sphider-plus does not store the media content. Only the links, thumbnails and Meta information are stored.

    The limit in Admin settings 'Max. links to be followed for each Site' is not taken into account for media links; only page links are counted, and the limitation applies to page links only.



    13.2 Not supported media content


    The following examples demonstrate the currently existing limitations for media data that will not be indexed:

    - If inserted in documents like pdf, doc, ppt, etc.

    - Media embedded in Java applets, as well as direct applet implementations.

    - Server-side or client-side image maps.



    13.3 Search for media content


    The search mode is enabled by the checkbox

    'Beside text results also show media results in result page'

    in Admin / Settings / Search Settings

    Once activated, the result listing for each keyword match will be separated into the 4 sections:

    - text results

    - image results

    - audio results

    - video results

    Each section is marked with an according thumbnail. Result listing will present only those sections that contain results.

    Each section will present result number, media title and the page address (link) at which the media was found. The text section will show the results as previous with highlighted keywords and surrounding text.

    The image result section additionally presents a thumbnail, the image size (H x V pixel) and a link to EXIF information for each found image. Clicking on the thumbnails will open the original image in a new window / tab.

    Video and audio results are presented with title, play time and a link to ID3 information. Media content will be opened with the belonging software by clicking on the media title.

    As the media sections are presented separately for each keyword match, an additional link called 'All media' is shown. Clicking it forces Sphider-plus to present all media results of the corresponding page (link). In order to return to the standard search mode, the section thumbnails can be clicked.

    The search function at first will look for text results (keyword match) and receive the according pages (links). Afterward media files are searched for the pages defined by the text results. So, only those media results that also generate text results will be presented in result listing.

    To get all media results (independent of the text results) another search mode is available:

    If in Admin / Settings / Search Settings the checkbox

    'Advanced search? (Shows 'AND/OR/PHRASE/TOLERANT/MEDIA' etc.)'

    is activated, the Search Form will present the additional checkbox

    'Search only Media'

    If this checkbox is activated, only media results will be presented in result listing, while possible text results will be ignored.

    Media search follows the rules of pre-defined categories. If 'Search only in category xyz' is selected in Search Form, media results will be presented only as found in the particular category.

    Search input for media queries is always interpreted as tolerant. So the query 'logo' will present results e.g. for the image 'sphider-logo.gif', while the input 'gif' will show all available gif files.

    Additionally the AND, OR and TOLERANT modes are selectable for media search, while the PHRASE mode will be interpreted as an AND search.

    The query 'media:' (without quotes) forces Sphider-plus to search for all media stored in its database. If used together with a category selection, all media content of the particular category will be presented.

    If the checkbox 'Search only Media' is activated in the Search Form, the suggest framework will also present only media suggestions, taking into account any pre-selected limitation for category search.

    In the Admin backend, searching for media not only by 'title' but also by EXIF and ID3 info is selectable.

    An additional Admin setting in section 'Suggest options' allows selecting whether suggestions should also be taken from EXIF info and ID3 tags. Nevertheless, the suggested keywords will always be the title of the media file.

    For media search the Admin setting 'Enable distinct results for upper- and lower-case queries' is also taken into account.

    Additionally there is an Admin setting called

    'If found on different pages, index also duplicate media content'

    If activated, all images, audio and video streams will be presented in the result listing. Otherwise, only the first occurrence (page/link) will be presented.

    5 different modes of sorting the media result listing are Admin selectable:

    - By title (alphabetic)

    - By file suffix

    - By image size

    - By 'Last queried'

    - By 'Most popular'



    13.4 Statistics for media content

    In Admin / Statistics the following tables are available:

    'Most Popular Media' presenting:

    - Thumbnail

    - Details like 'Title' and 'Found at'

    - Total clicks

    - Last clicked

    - Query input

    'Indexed Image Thumbnails' presenting:

    - Thumbnail 150 x 100 pixel

    - Image details like title, file name, size of the original image, link ID and thumb ID

    - Option to delete the thumbnail

    In order to open the media files all tables contain active links.

    Media results are also stored in 'Search log', and are presented like the keyword results with:

    - Query

    - Result count

    - Queried at

    - Time taken

    - User IP

    - User's country code

    - User's host name


    Top

    14. Feed support


    14.1 XML product feeds

    If activated in 'Spider Settings' menu of the admin backend, XML product feeds are indexed.

    At present Sphider-plus supports:

    Content API v2 (XML)  like    <condition>used</condition>

    Currently not supported:

    <product ID="1234">used</product>

    <condition name="deliveryCosts">used</condition>

    Content API v2 (JSON)  like    "condition" : "used"

    Basic design and rules for product feeds are explained here.

    But as there is no well-defined specification for product feeds, each user may define their own feed details for their specific requirements. Sphider-plus accepts various attribute names, and the number of attributes describing one product may vary between the feeds to be indexed.

    The Sphider-plus file

    .../include/common/xml_product_feeds.txt

    contains the list of attributes to be indexed, one attribute per line. Each line (attribute) may contain various names for the same product attribute, separated by commas.

    As an example of the content for this file:

    id, ID, product_id, produktid

    title, name, product_name, Name, produktnavn

    description, Bescheibung, beskrivelse

    link, URL, productUrl, URL Hersteller, billedurl

    link reseller, URL Wiederverkäufer, vareurl

    image_link, image, images, imageUrl, Big_Image

    availability, Verfügbarkeit, Lagerbestand, lagerantal

    price, Neupreis, nypris

    . . .

    etc.

    The content of this list is freely editable, so that individually (differently) created XML product feeds can be indexed together in one index procedure. In order to speed up indexing, only the attribute names actually involved, and only the existing product attributes, should become part of this list.

    It should also be taken into account that only the first attribute name (in each line) will be used in the search result listing of Sphider-plus.

    So even if an attribute name like 'product_id' is indexed (because the corresponding product feed used it), search results will present 'id' as the name of this attribute.
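
    As an illustration of how such an alias list could be read and mapped to canonical attribute names, here is a minimal sketch (the relative file path follows the documentation above; this is not the Sphider-plus implementation):

    // Build a map: alternative attribute name => first (canonical) name of its line.
    $map = array();
    $lines = file('include/common/xml_product_feeds.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        $names = array_map('trim', explode(',', $line));
        $canonical = $names[0];                    // the first name is used in the result listing
        foreach ($names as $alias) {
            $map[$alias] = $canonical;
        }
    }
    // With the example list above: $map['product_id'] === 'id', $map['produktnavn'] === 'title'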


    In order to control the result listing of product feeds, there are 2 relevant options selectable in 'Search Settings' menu of the admin backend:

    First option:

    For results of XML product feeds, present the text extract of 350 characters as part of product attributes.

    If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products.


    The second option should be set to a high value, if multiple search results are to be expected in the feeds.

    Define maximum count of result hits per page, displayed in search results


    It should also be taken into consideration that a case-sensitive search is not supported for XML product feeds; the often individually molded product feeds would otherwise suppress several results.

    If you would like to search for prices like '379.00' dollars, do not search for '379', because this query will also deliver all other existing results found as part of links containing the string '379'. Please follow the suggest framework, which offers '379.00' directly below the search form.



    Top

    14.2 RDF, RSD, RSS and Atom feeds

    To be activated in Admin / Settings / section 'Spider settings', the content of the following feeds will be indexed / re-indexed:

    RDF (v.1.0)    RSD (v.1.0)    RSS (v.0.91 / v.0.92 / v.2.0)    Atom (v.1.0)

    Feed content is indexed, and the links found as part of the feeds are followed as well. Before indexing the feeds, a validation check for well-formed XML is performed. Corresponding log output is generated to inform the admin.


    Depending on the feed content, the following tags are indexed and thereby become searchable:


    - For RDF and RSS feeds the following standard tags are processed:

    Channeltags: 'title', 'link', 'description', 'language', 'copyright', 'managingEditor', 'webMaster', 'pubDate', 'lastBuildDate', 'category', 'generator', 'rating', 'docs'.

    Itemtags: 'title', 'link', 'description', 'author', 'category', 'comments', 'enclosure', 'guid', 'pubDate', 'source'.

    Textinputtags: 'title', 'description', 'name', 'link'.

    Imagetags: 'title', 'url', 'link', 'width', 'height'.

    Additional remark for RSS feeds: the optional sub-elements of the CATEGORY element (that identifies a categorization taxonomy) are currently not supported.
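
    As an illustration, a minimal sketch of how such channel and item tags could be read with PHP's SimpleXML, including a simple well-formedness check (the feed URL is illustrative; this is not the Sphider-plus code):

    // Load an RSS 2.0 feed and read a few standard channel and item tags.
    libxml_use_internal_errors(true);                  // collect XML errors instead of emitting warnings
    $xml = simplexml_load_file('https://example.com/feed.rss');
    if ($xml === false) {
        foreach (libxml_get_errors() as $err) {
            echo 'Feed is not well-formed: ' . trim($err->message) . "\n";
        }
    } else {
        echo (string) $xml->channel->title, "\n";      // channel tag
        foreach ($xml->channel->item as $item) {       // item tags
            echo (string) $item->title, ' => ', (string) $item->link, "\n";
        }
    }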


    - For RDF feeds, the following individual tags are additionally processed:

    Dublin Core tags: 'dc:', 'sy:', 'prn:'.

    Personal channel tags: 'publisher', 'rights', 'date'.

    Personal item tags: 'country', 'coverage', 'contributor', 'date', 'industry', 'language', 'publisher', 'state', 'subject'.


    - For Atom feeds the following tags are processed:

    Metatags: 'author', 'category', 'contributor', 'title', 'subtitle', 'link', 'id', 'published', 'updated', 'summary', 'rights', 'generator', 'icon','logo'.

    Entrytags: 'author', 'category', 'contributor', 'title', 'link', 'id', 'published', 'updated', 'summary', 'rights'.

    Authortags: 'name', 'uri', 'email'.

    Contributortags: 'name', 'uri', 'email'.

    Categorytags: 'term', 'scheme', 'label'.

    Generatortags: 'uri', 'version'.

    Additional remark for Atom feeds: SOURCE elements are currently not supported


    - For RSD feeds the following tags are processed:

    Service tags: 'engineName', 'engineLink', 'homePageLink'.

    API tags: 'name', 'apiLink', 'blogID'.

    Settings: 'docs', 'notes'.

    There is an Admin setting for handling CDATA tags, called 'Follow CDATA directives'. If the checkbox remains blank, the CDATA directives in RSS and RDF feeds are ignored.

    An additional Admin setting enables/disables, whether 'Dublin Core' and other individually marked tags in RDF feeds should be indexed.

    Another Admin setting allows defining that the 'preferred' directive in RSD feeds should be followed. If activated in Admin settings, only those API tags with 'preferred = true' will be indexed. If the checkbox remains blank, all API tags will be indexed, even if 'preferred = false' is encountered.

    Feed links are treated like standard page links, so the limit in Admin settings 'Max. links to be followed for each Site' is also affected by feed links (they count).

    After indexing the feeds, they are treated like other (HTML) pages. The suggest framework will offer keyword proposals. Also pre-selection of categories is taken into account.



    Top

    15. Cache for text and media queries

    To be activated in Admin settings, section 'Search Settings', the cache stores the results of the 'Most Popular Queries'. Before connecting to the database, each query first asks the cache for results. If available, results are presented extremely fast. On the other hand, each query that has to fetch its results from the database will automatically store its result in the cache.

    Individual cache results are stored following the different Search selections (AND, OR, Phrase, Tolerant). Also individualized cache results are stored for each category and all-sites search requests.

    Text and media queries use different caches. The size of each cache is definable in Admin settings [MByte]. On overflow of a cache, the least important result is deleted from the cache, while 'Most Popular Queries' is updated with each search input.

    If in Admin settings the 'Debug mode' is enabled, cache activity is presented on top of the result listing in the form of status messages. Text cache and media cache could be manually cleaned in Admin 'Clean' section, also offering the count of files in each cache and the consumed memory space separately for each cache. Another selectable cache setting allows automatic cache reset, performed on 'Erase & Re-index' procedures.

    The required cache size depends on personal preferences. There is a conflict between two opposing requirements: the cache should hold as many 'Most Popular Queries' as possible, but not consume too many resources by managing hundreds of files in a large memory. As a first assumption, the size per result can be estimated at 2 KByte. Multiplied by the matches in the database (e.g. found in 20 pages), each query then requires approximately 40 KByte of RAM, so a cache of 2 MByte could hold the results for 40 to 50 'Most Popular Queries'. After some time of usage, it might be helpful to observe the information given in the 'Clear' section of the Admin backend, where the count of result files in each cache and the consumed memory space are presented. Depending on personal preferences, consumed result size and count of query hits in x pages, it might be necessary to adapt the size of the text and media cache.
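
    As an illustration of the basic lookup-then-store pattern such a query cache follows, here is a minimal sketch with hypothetical file names and a hypothetical search callback; Sphider-plus organizes its cache differently:

    // Hypothetical file-based cache: one serialized result file per query key.
    function cached_search($query, $mode, $cacheDir, $searchFn) {
        $key  = md5($mode . '|' . strtolower($query));     // AND/OR/Phrase/Tolerant are cached separately
        $file = rtrim($cacheDir, '/') . '/' . $key . '.cache';
        if (is_file($file)) {
            return unserialize(file_get_contents($file));  // cache hit: no database access needed
        }
        $results = $searchFn($query, $mode);               // cache miss: query the database
        file_put_contents($file, serialize($results));     // store for the next identical query
        return $results;
    }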



    Top

    16. Multiple database support

    16.1 Overview

    Starting with version 2.0, Sphider-plus offers the capability to cooperate with multiple databases. Currently prepared to work with up to five databases, the development was done under the following aims:


    Independent allocation of different databases for the tasks:

    - Admin

    - Search user

    - Suggest URL user

    This offers the capability to assign the 'Search' user to database 1 and let him use the search engine while the Admin re-indexes database 2. Adding new sites and indexing them into database 2 is likewise performed by the Admin without disturbing the 'Search' user, and backup, restore and copy functions can be carried out by the Admin without affecting the availability of the search engine. Later on, the Admin may switch the 'Search' user to the updated database, or copy the fresh database content into the 'Search' user database.


    The Sphider-plus scripts automatically create individual settings for each database. These settings may be customized per database according to personal requirements.


    As Sphider-plus also has to survive in shared hosting environments, there are some limitations for multiple database support:

    - It is not possible to cooperate with a cluster of databases.

    - Master/Slave Replications are not supported,

       because the MySQL configuration file my.cnf is not accessible.

    - Sharding by scaling data-tables is not supported.

    - Dynamic allocation as a pro-shared assignment is not possible.


    Sphider-plus Admin interface offers the management of multiple databases. There are different menus in section 'Database' as described below.


    16.2 Definition and configuration

    Sphider-plus version 2.0 (and later) no longer requires the install_all.php script. Database assignment and table installation are integrated into the Admin interface.

    The menu for database definition and configuration is protected by an additional login. Independent of the Admin login, a username and password are required to enter this section. Username and password are defined in the file .../admin/auth_db.php. In the default download, username and password are both set to 'admin'.

    When entering this section for the first time, several warning messages will be shown. At minimum, one database has to be defined by:

    Name of database

    Username

    Password

    Database host

    Prefix for Tables

    Pressing the 'Save' button will assign Sphider-plus to these database definitions. Nevertheless, the warning message 'Tables are not installed for database x' will remain in the Database settings overview.

    'Install all tables for database x' is an independent procedure, which has to be invoked by the Admin after the database has been allocated. The chapter 'Enhancing functionality of multiple database support' describes the reason for these two independent steps.

    If the database is allocated and the tables are installed, the message 'Database x settings are okay.' is displayed in the settings overview, showing the situation separately for each of the five databases.

    If the application should work with only one or two databases, the settings for the non-required databases may remain blank. A corresponding message will be displayed:

    Mysql server for database 3 is not available!

    Trying to reconnect to database 3 . . .

    Cannot connect to this database.

    Never mind if you don't need it.

    So the Admin may assign up to five databases, as required for the application. Assigning another (the next) database is only possible if the settings for the previous database are okay and the tables are installed. Further database setting fields are suppressed until then.


    16.3 Activate / Disable databases

    The next step to get multiple databases working is the activation of the databases. This section of the Database Management presents only those databases which are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'.

    There are four settings available in the 'Activate / Disable' section:

    - Select active database for Admin

    - Select active database for 'Search' user

    - Select all databases that should deliver search results

    - Select active database for 'Suggest URL' user

    These settings enable independent use of different databases for:

    Admin

    Search User

    Suggest URL User

    'Select all databases that should deliver search results' offers the additional capability to fetch results from more than one database. In any case the active database for the 'Search' user will be used to fetch results, as this database is defined to be the default user database. Searching for results in several databases is available for text and media search and all search modes, taking into account any pre-defined categories. Search results are logged with respect to the database that delivered the results. Consequently, the table 'Most popular searches' at the bottom of the result listing offers results for the currently allocated databases, so that clicking on any of these most popular searches will again deliver results from one or several of the currently available databases.

    If multiple sets of tables are available, because they have been created for a database before, you will be able to activate any of these table sets by selecting the corresponding prefix. The selection will be presented below the 'Store all selections' button for all databases containing more than one table set. The selected prefix will be commonly used for Admin, Search user, suggest framework etc.

    If the table prefix is modified as described in 'Enhancing functionality of multiple database support', this modification is valid for all databases which are activated to deliver results. In other words, if the prefix is manually modified in .../templates/html/020_search-form.html, all databases that are used to deliver results need to contain table sets with the same prefix names.

    Consequently the corresponding settings are activated with respect to the database and the activated set of tables.

    After activating the databases for the different tasks, multiple database support is ready to use. The currently activated database and the prefix (name) of the currently selected table set (for the Admin) are displayed in the 'Sites' table like:

    Database 1 with table prefix 'search1_' - Displaying URLs 1 - 10 from 25

    If the 'Debug' mode is activated in Admin settings, also the result listing will inform the user about the actual situation:

    Results from database 2

    When 'Store all selections' is activated to complete the database activation procedure, also the text cache and media cache will be cleared.


    16.4 Backup & Restore of databases

    This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration.

    This section enables the Admin to create backups of the current state of a selectable database. Vice versa, the backup files may be restored into the database.

    Backup files are compatible with the phpMyAdmin structure and contain the table prefix as well as the date and time of creation as part of their file names. Backup files are stored in subfolders (.../admin/backup/dbx), separate for each database.

    Backup files can only be restored into the database that was originally used to create them. The current content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.


    16.5 Copy & Move

    This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration.

    This section allows copying the content from one database to another. By selecting:

    - Source database

    - Destination database

      and

    - Define Copy or Move utility

    it is possible to copy / move the content from one database to any other database. Besides the table content, both utilities inevitably also copy the table prefix (of the source db) into the destination database. If tables with the same prefix already exist in the destination database, the content of these tables will be overwritten. Besides the table content, the corresponding thumbnails will also be copied.

    In contrast to the 'Copy' utility, the 'Move' function additionally will clear the source database and delete the corresponding (source) thumbnails.


    16.6 Enhancing functionality of multiple database support

    1. 'Backup & Restore' as well as the 'Copy / Move' function will always work with all tables of a selected database. In contrast to these gloabl actions, the 'Import / Export URL list' function is only acting with the currently (for the Admin) activated table prefix. This allows a selective import and export of only those URLs, used for the activated tables as defined by the prefix. The name of the exported URL list contains the (source) database number, the table prefix and the date of creation. Crossover usage of URL lists is enabling to import any URL list (created from database x) into database y


    2. When configuring databases, it is strongly recommended to create and use prefixes for the tables. Table prefixes are the key for creating new sets of tables in each database. As described in chapter 'Definition and configuration', the tables need to be installed separately, after the configuration of the database has been saved. Once these settings are finished and the database is assigned, the Admin may use this database and index sites into the database tables with the given table prefix.

    It is evident that one database can be configured with several table prefixes. That is the key for additional 'virtual' databases. By configuring the given database with a new table prefix, the Admin is able to install another set of tables into the same database. This set of tables (with the new prefix) may be used to index another set of sites into the same database, without destroying the content of the previously used tables.


    3. The above allows adding quasi-additional databases without actually creating new databases. It was also mentioned before that Sphider-plus has to survive in 'Shared Hosting' environments. Consequently, the Admin may assign only one database to the 'Search' user.

    But there is a feature integrated into Sphider-plus to bypass this restriction. Assume that the result listing should be offered in two (or even more) versions: for example, in English and another language, one result listing for global users and another for registered users, or one info result listing and one shopping result listing, etc.

    To enable such a feature, the search form of Sphider-plus contains two hidden variables called 'db' and 'prefix':

    <input type="hidden" name="db" value="$user_db" />

    <input type="hidden" name="prefix" value="$user_prefix" />

    As long as the variables are set to '0' (how to alter them, see below), the search script will use the setting defined in the Admin settings:

    "Select database for 'Search' user". This standard setting may be used for the first search form, offering the results of the first set of tables (which e.g. holds the English results). For a second search form, the value for 'prefix' may be set to the name of another set of tables that holds the results of the second language. The setting of the second search form will (temporarily) override the Admin settings for its own result listing.

    Selection of different sets of tables could be performed in the Database => Activate / Disable menu. If multiple sets of tables are available, because they have been created for a database before, you will be able to activate any of these table sets by selecting the corresponding prefix. The selection will be presented for all databases containing more than one table set.

    The selected prefix will be commonly used for Admin, Search user, suggest framework etc.

    Multiple database enhancements are assisted by the fact that Sphider-plus is supporting multiple settings. Each database and each set of tables contains an individual Admin setting.

    The selected set of tables can be overridden individually for a search form by modifying the variable:

    $user_prefix = "0";

    Enter here the prefix name of the table set that should be used instead of the default table set.

    The selected database can be overridden individually for a search form by modifying the variable:

    $user_db = "0";

    Enter here the number of the database that should be used instead of the default db.

    Both variables are located in the script    .../search_ini.php    and must be changed there if necessary.

    This implementation could be interpreted as a super-category feature, not requiring the selection of a category, or even a subcategory, by the 'Search' user. This does not mean that the normal category function is lost by use of multiple database support and its extended features.

    Another useful application for multiple table sets is the support of several languages. By indexing language-specific sites into different sets of tables, the two hidden fields in the search form define which language is presented in the result listing.

    Usually the language for the user dialog is defined in the Admin backend and could be automatically adapted to the user's language by an additional Admin setting.

    In order to overwrite these Admin settings individually for one search form and the according result listing, there is a variable placed in the script   .../search_ini.php

    $user_lng = "0";

    As long as the value is set to "0", the Admin settings will be used. Entering "fr", "it", etc., will force this search form to use the languages French, Italian, etc.
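
    As a hypothetical example (database number, prefix name and language code are illustrative assumptions), a second search form could be bound to its own table set and language by editing .../search_ini.php like this:

    // Illustrative values only - adapt to your own database number, table prefix and language.
    $user_db     = "2";       // use database 2 instead of the default 'Search' user database
    $user_prefix = "shop_";   // use the table set installed with the prefix 'shop_'
    $user_lng    = "fr";      // present the user dialog of this search form in French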


    Top

    17. Search in categories

    17.1 Hierarchical structure

    In order to prepare Sphider-plus for category search, the categories may be defined in Admin 'Categories' menu. Different categories (top level) as well as subcategories (Create new subcategory under . . .) are added here.

    The second step in preparing Sphider-plus is assigning the sites to the different categories and subcategories, to be found at Sites / Options / Edit / Category.

    Assigning and even changing category affiliation may also be done after the index procedure.

    Third step is to activate one or both of the Admin settings:

    - Show category selection in search form.

    - If available, user may select 'More results of category . . . ' at each result in results listing.

    The first setting will present all prior defined categories and their subcategories as part of the search form and the user may select one top-level category or a subcategory to limit the search results.

    The Admin setting

    'If available, user may select 'More results of category . . . ' at each result in results listing'

    will present all additional categories that would also deliver results for the search query. Presented under the link URL, the user may click on any of these suggestions to automatically perform a new search in the suggested category.

    The new result listing will present all results of that category and, if available, will also suggest subcategories that would be able to deliver results for the current query. Again the user may narrow down the result listing: clicking on any suggested subcategory will again perform a search for the query, now in the selected subcategory.

    As an additional headline of the result listing, the user is informed about the source of the results, like:

    Presented results are captured from category: abc

    In order to return to the standard search without category-selection, the user only needs to activate the checkbox

    'Search in all sites'

    as part of the Search form.


    17.2 Parallel structure

    Starting with version 3.0, Sphider-plus offers the feature to restrict the search results by means of up to 5 categories simultaneously. The titles (names) for the 5 categories could be defined in the Settings menu of the Admin backend (see section 'General Settings').

    If the title for category0 or category1 is defined with any value, the hierarchical category structure is disabled, and the search results will be restricted by the parallel category structure. After defining the names of all categories that should be used to restrict the search results, any of the 'Save' buttons in the 'Settings' menu of the admin backend needs to be activated. Afterward the category names are available for all Sphider-plus scripts.

    Categories whose title is left blank remain undefined; they are ignored by the search algorithm and are not presented in the search form.

    The relationship between URL and all involved categories is to be defined in a file, which is imported into Sphider-plus by means of the 'Import / export URL list' feature of the Admin backend.

    As per default, the import file needs to be placed into

    .../admin/urls/

    as a subfolder of the Sphider-plus installation. If the import file is located in another folder, the corresponding address needs to be edited in the script

    .../admin/configset.php

    The import file can be a plain text file or an Excel table. The name of the file is user-defined, but must contain at least one underscore. Examples:

    2012.12.07_import.txt

    My_friends.xml

    More details regarding the import of parallel usable categories are described in the readme.pdf documentation.

    One category variable can be used to build up a range selection (min. - max. values) in the search form. This might be useful for postal codes, etc. The search form will present two selection fields for $cat_sel0, which can be used by the user of the search engine to define the minimum and maximum values. All results matching this range will be presented in the result listing. Additionally, the categories $cat_sel1 - $cat_sel4 can be used to refine the search results.




    18. User suggested sites

    Reachable via a link at the bottom of result listing, a form is presented that allows users to suggest URLs to become indexed by the search engine. The user needs to enter:

    - URL

    - Title

    - Short description

    - Dispatcher e-mail account.

    In order to prevent spam proposals, the form optionally will present a Captcha.

    The admin of the search engine may

    - approve

    - reject

    - ban

    the suggested sites by means of a menu, presented in Admin backend. A corresponding e-mail is automatically generated and sent to the dispatcher.

    All features of 'User suggested sites' are optional and could be defined as part of the Admin settings.

    An additional option offers the function:

    Suggested sites require authentication tags

    If activated, all suggested sites will need an additional meta tag in their header. This authentication tag needs to be written as:

    <meta name='Sphider-plus' content='1234'>

    The content value (here e.g. 1234) is defined by the administrator of the search engine. As part of the approval form, an additional field needs to be filled in by the admin. So, individual values could be defined for each suggested site. The text of the automatically generated acknowledgment e-mail, sent to the dispatcher, is altered to:

    Your suggestion was accepted by the system administrator and will be indexed shortly.

    Please add the following tag into the header of the suggested site:

    <meta name='Sphider-plus' content='1234'>

    In order to enable indexing of your site, this tag is mandatory

    and is tested periodically by the indexer of Sphider-plus.

    We appreciate your help and effort in building this search engine.

    This mail was automatically generated by Sphider-plus.

    The meta tag needs to be implemented only in the suggested URL. It is not necessary to add this tag to all pages of the site. Only the header of the suggested URL will be verified for the existence of the tag and the correct content value.

    The authentication value may be altered by the admin of the search engine later on.

    In

    Sites view => Site Options => Edit

    an additional input field is presented. If the value is left empty, the site will be indexed without verification of the header tag. The dispatcher will not be informed about any modification done by the admin.

    The additional input field to enter/modify the authentication value is offered for all sites stored in the database of Sphider-plus, so that an authentication value could be added also subsequently by the admin.

    If the tag is missing or contains an invalid authentication value, a corresponding warning message is created during the index procedure. The complete site with all its pages will be skipped by the index procedure, but the former content as well as the known links will remain part of the Sphider-plus database. This behavior allows the admin to reactivate the site later on.
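
    A minimal sketch of how such a header tag could be verified with plain PHP (the URL and the expected value are illustrative; the actual Sphider-plus check may differ):

    // Illustrative check of the authentication meta tag on a suggested URL.
    $expected = '1234';                                   // value defined by the administrator
    $html = @file_get_contents('http://www.example.com/');
    $ok = false;
    if ($html !== false) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);                           // suppress warnings about sloppy HTML
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if ($meta->getAttribute('name') === 'Sphider-plus'
                && $meta->getAttribute('content') === $expected) {
                $ok = true;                               // tag found with the correct content value
                break;
            }
        }
    }
    echo $ok ? "Authentication tag found - site may be indexed.\n"
             : "Tag missing or invalid - site is skipped.\n";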




    Top

    19. Vulnerability protection

    19.1 Prevent queries from Meta search engines and crawlers known to be evil

    In order to reduce Internet traffic and server load, there are several settings available in Admin backend

    in section 'Search Settings' called:

    - Block all queries sent by harvester, bots and known evil user-agents

    - Block all queries sent by Meta search engines like Google, MSN, Amazon, etc.

    - Block all queries sent by known spammer (IPs) that persist in abusing forums and blogs

    with their scams, ripoffs, exploits and other annoyances

    - Block all queries, which could cause an XSS attack, shell execution, tag inclusion,

    SQL injections, directory traversals, XSRF attacks, or a JavaScript execution

    If activated, the corresponding search queries will be rejected.

    The first option is controlled by the file

    .../include/common/black_uas.txt

    which holds the list of user agents (UAs) known to be evil.

    Additionally, there is a list of well-known benign bots, which is stored in

    .../include/common/white_uas.txt

    If the user UA is part of this white list, the comparison with all the black listed UAs will be skipped.


    Meta search engines are identified by their IP. The IPs can be entered as single IPs as well as IP ranges into the file

    .../include/common/black_ips.txt

    Prevented queries are answered with the text 'No results found'.

    If, however, the option 'Enable Debug mode for User interface' is activated in the Admin backend, the IP that caused the query is also presented as a result. If an evil user agent sent the query, the client's user agent string is presented.
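
    As an illustration of the underlying idea, here is a minimal sketch of a user agent check against these lists (the matching logic and the relative file paths are assumptions, not the Sphider-plus code):

    // Compare the client user agent with the white and black lists.
    $ua    = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $white = file('include/common/white_uas.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $black = file('include/common/black_uas.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $whitelisted = false;
    foreach ($white as $good) {                           // brave bots skip the blacklist comparison
        $good = trim($good);
        if ($good !== '' && stripos($ua, $good) !== false) { $whitelisted = true; break; }
    }
    $blocked = false;
    if (!$whitelisted) {
        foreach ($black as $evil) {                       // evil UAs are rejected
            $evil = trim($evil);
            if ($evil !== '' && stripos($ua, $evil) !== false) { $blocked = true; break; }
        }
    }
    if ($blocked) {
        exit('No results found');                         // prevented queries receive this answer
    }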



    19.2 Basic input validation against vulnerability attacks

      The following protections are implemented:

    - Prevent SQL-injections

    - Prevent XSS-attacks

    - Prevent Shell-executes

    - Suppress JavaScript executions

    - Suppress Tag inclusions

    - Prevent Directory Traversal attacks

    - Delete input if query contains any word of (editable) blacklist

    - Prevent buffer overflow errors.

    - Suppress JavaScript execution and tag inclusions masked as XSS attacks.

    - Prevent C-function 'format-string' vulnerability.

    As the protections against XSS attacks, shell executions and tag inclusions, as well as the suppression of JavaScript executions, disallow some words in the search query, a special Admin setting is used to activate this protection. The setting is found in section 'Search Settings' and is called:

    Block all queries, which could cause an XSS attack, Shell execution, Tag inclusion, or a JavaScript execution
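
    A minimal sketch of this kind of basic input validation (the patterns and the parameter name 'query' are illustrative assumptions and far from complete; they are not the Sphider-plus rule set):

    // Reject query strings containing typical attack fragments.
    function query_is_suspicious($q) {
        $patterns = array(
            '/<\s*script/i',                      // JavaScript execution / tag inclusion
            '/union\s+select|;\s*drop\s+table/i', // simple SQL injection fragments
            '/\.\.\//',                           // directory traversal
            '/[;&|`]\s*(cat|ls|rm)\b/i',          // shell execution attempts
        );
        foreach ($patterns as $p) {
            if (preg_match($p, $q)) {
                return true;
            }
        }
        return false;
    }

    $query = isset($_GET['query']) ? $_GET['query'] : '';   // 'query' is an assumed parameter name
    if (query_is_suspicious($query)) {
        exit('No results found');
    }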




    19.3 Admin backend protection against remote access

    In order to protect the admin backend of Sphider-plus against use other than as intended, you may prevent usage of the admin backend for remote operation. This means:

    Logging in as admin will only be possible if the installation address (IP) of Sphider-plus and the IP used by the admin login request are equal.

    In section 'General Settings' an option is available to enable/prevent remote access to the admin backend.




    19.4 Log file reporting attempts to abuse Sphider-plus

    Each attempt to intrude into the user interface is stored in a log file. If the corresponding option for vulnerability protection is activated in the admin backend, the following events will trigger a new entry in the log file:

    - Queries sent by harvester, bots and known evil user-agents (UAS).

    - Queries sent by Meta search engines like Google, MSN, Amazon, etc. (IPs).

    - Queries sent by known spammer (IPs) that persist in abusing forums and blogs with their scams.

    - Queries, which could cause an XSS attack, shell execution, tag inclusion,

    - SQL injection, directory traversal, XSRF attack, or a JavaScript execution.

    - Attempts to flood the search form by too many queries per unit of time.

    - Blocked Internet traffic of IP's, which already caused intrusion attempts (IDS).

    The log file is available at the admin backend in menu 'Statistics' => 'Report log file', offering all details about each event. The log file could be flushed in menu 'Clear'.

    There is an additional option called:

    On occurrence, send e-mail report about above attempts to harm Sphider-plus

    If this option is activated, each event will be reported by e-mail to the administrator account as defined at:

    Administrator e-mail address




    More details about vulnerability protection of Sphider-plus are available in the readme.pdf documentation.



    Top

    20. Bound database

    Entering a word into the search form forces Sphider-plus to scan the database for all links offering a result for this query. A limit to present only x results was already integrated; nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x results. This option is most useful for huge databases holding the content of many links.

    The setting in section 'Search Settings' called:

    Define max. amount of results presented in result listing:

    will define the limitation for the database volume. In order to activate this limitation the 'Clean' menu presents the option:

    Bound database

    When activated, all keyword / link relationships stored during the index procedure will be reduced to the amount defined above. All excess relationships will be deleted from the database. Consequently, all further search enquiries will be answered much faster, because only the relevant amount of results is available.

    It is up to the admin to define how many results are relevant for the application Sphider-plus is integrated into. The 'Top keywords' table, as part of the 'Statistics' menu, could be helpful to define the limit. Once the database is bounded, also this table will only show the 'bounded' available results.

    The 'Bound database' option should not be invoked if the order of the result listing is set to 'By hit counts in full text' or to 'By index date', because the limitation of the database is performed by weighting.

    Some patience is required for this option. Once activated, the following steps need to be carried out:

    - Get all keywords from database.

    - Get all results from db for each keyword.

    - Bound the results of each keyword to the defined limitation.

    - Delete all possible results, exceeding the limitation (for each keyword).




    Top

    21. Suggest framework

    Sphider-plus offers an auto-complete function as part of the search form. Starting with version 3, the suggest framework was switched over from 'Prototype' to the JavaScript library 'jQuery'.

    Suggestions are presented for single word queries, as well as for phrases, and also for media search. The suggest framework is configurable in Admin backend. As part of the 'Settings' menu in section 'Suggest Options', the following items are definable:

    - Minimum count of query letters in order to get a suggestion.

    - Search for suggestions in query log.

    - Search for suggestions in keywords.

    - Build suggestions also for 'Phrase' search.

    - Obey the list of words not to be suggested.

    - For 'Media' search get suggestions also from EXIF info and ID3 tags.

    - Limit count of suggestions.

    The option 'Obey the list of words not to be suggested' is based on a list of words placed in the file

    .../include/common/suggest_not.txt

    All words placed in this file will be used to suppress the corresponding suggestions; all other suggestions will be presented. Even if a stop word in the list is only part of a suggestion, that suggestion will also be suppressed. Example: if the list contains the word 'kinder', 'kindergarten' will also not be presented as a suggestion.
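
    A minimal sketch of this substring-based suppression (the file path follows the documentation above; the candidate list and the filter logic are illustrative, not the Sphider-plus implementation):

    // Remove every suggestion that contains one of the stop words as a substring.
    $stop = array_map('trim', file('include/common/suggest_not.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    $suggestions = array('kindergarten', 'kindness', 'sphider');   // illustrative candidates
    $filtered = array_filter($suggestions, function ($s) use ($stop) {
        foreach ($stop as $word) {
            if ($word !== '' && stripos($s, $word) !== false) {
                return false;                                       // suppress this suggestion
            }
        }
        return true;
    });
    // With 'kinder' in the stop list, 'kindergarten' is suppressed as well.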

    With respect to the different templates of Sphider-plus, the design of the suggest field is also adapted to the respective template. Responsible for this adaptation is the style sheet file

    /templates/your_template/jquery-ui-1.10.2.custom.css

    These style sheet files are based on jQuery themes and are edited individually for the Sphider-plus templates 'Pure', 'Slade' and 'Sphider-plus'. Each of these style sheet files contains a link (see row 4) to the jQuery 'ThemeRoller'. Thus, individual adaptations to personalized Sphider-plus templates are possible by simply editing the style sheet .../templates/your_template/jquery-ui-1.10.2.custom.css by means of the jQuery 'ThemeRoller'.




    Top

    22. Integration of Sphider-plus into existing sites


    There are 2 different ways of integrating Sphider-plus into existing sites:

    1. Use layout and templates of Sphider-plus.

    2. Embed the search engine into existing HTML code.


    22.1 Integration into existing site by use of Sphider-plus templates

    This mode is simply invoked by calling the script 'search.php' in the root folder of the Sphider-plus installation.

    Assuming that the search engine is placed in a subfolder called 'sphider-plus', the according call would be something like:

    http://www.abc.de/sphider-plus/search.php

    Once called, the search engine will build up a complete HTML page with

    - Headline

    - Search form

    - Result listing

    - Footer

    The design of this page is defined by one of the 3 templates delivered together with Sphider-plus. They are named:

    - Pure (close to Google design)

    - Slade (dark shadow design)

    - Sphider-plus (default, as on this project page)

    and are selectable in the Admin backend of Sphider-plus. Usually none of these 3 templates will fulfil the requirements of an existing site design. Consequently, the (activated) style sheet

    .../templates/Pure/thisstyle.css

    needs to be individualized.

    In order to create a new template it might be useful to copy one of the Sphider-plus template folders completely with all files, rename the folder with a new name and afterward edit the personal style sheet

    .../templates/your_template_name/thisstyle.css

    The new template will also be presented in the Admin settings, as one of the available selections.


    22.2 Embed the search engine into existing HTML code

    This mode extends the capabilities as described in the above chapter. There is an Admin setting called:

    - Embed 'Search form' and 'Result listing' into an existing HTML page

    If this checkbox is activated, the search engine will not create a complete HTML page, but needs to be embedded into an existing HTML code.

    As the Sphider-plus scripts are based on the charset UTF-8, the scripts for search form, result listing, etc. can only be embedded into UTF-8 coded HTML pages. Indexing content with other encodings is no problem, but embedding is possible only on the basis of charset UTF-8.

    As described in above chapter, the script 'search.php' is used to embed the search engine into the existing page. In general the script 'search.php' only consists of several include directives.

    The 'search.php' script contains complete documentation (comments) on how and where to use its different include directives, so this need not be repeated here.

    In any case the link

    <link rel='stylesheet' href='http://abc.com/search/templates/Pure/userstyle.css' type='text/css'>

    should be placed into the HTML head tag, before an already existing stylesheet.css file is called. This approach will ensure that the existing css will automatically overwrite the Sphider-plus style sheet. Consequently only the search engine specific settings need to be modified in the Sphider-plus userstyle.css.

    Replace in above example

    http://abc.com/search/

    with your personal address to the Sphider-plus installation folder on your server.

    On the same subject another Admin setting might be also helpful:

    Name of search script

    By means of this option, an individual script containing the required includes can be defined and used to control the search engine. Sphider-plus will automatically reference this script when searching and presenting the results.


    The Sphider-plus functions

    - Intrusion Detection System

    - User may suggest URLs form

    - Admin backend

    always use the template defined in the Sphider-plus Admin backend and (at present) work non-embedded.


    22.3 The different style sheet files.

    Sphider-plus is delivered with two style sheet files:

    - adminstyle.css

    - userstyle.css

    Both files are part of each template design (Pure, Slade and Sphider-plus). In order to adapt one of the three templates to an existing site design, only the userstyle.css file needs to be individualized. So the Admin backend remains stable and usable even during development of the final design for the user interface.

    The design of the suggest field is also adapted to the respective template by means of the style sheet .../templates/your_template/jquery-ui-1.10.2.custom.css, as already described in chapter 21 'Suggest framework'.




    Top

    23. JSON, XML and RSS result output

    Usually the result listing is presented as HTML output for the client that sent the query to the Sphider-plus scripts. Additional output is available as JSON and XML files, as well as an RSS feed. If requested via the search_ini.php script, the results will be written as separate files in the subfolder .../xml/

    With respect to the query type, JSON, XML and RSS files will contain text, media, or link results. The according output files are called:

    text_results.txt => JSON output file

    text_results.xml => XML output file

    text_results.rss => RSS output feed

    media_results.txt

    media_results.xml

    media_results.rss

    link_results.txt

    link_results.xml

    multiple_link_results.txt

    multiple_link_results.xml

    As media results could be images, audio streams, as well as videos, the media type is marked by a type tag in the output files.

    The additional output files are activated by the variable $out, to be found in the script .../search_ini.php. In order to create the additional output files, this variable needs to be set to 'xml'.

    The same script also offers the variable $xml_name (above set to 'result'), which may be used to define individual names for the output files, individually for each search form. The names are always completed by one of the prefixes

    text, media, link, multiple_link

    so that the complete name will become

    text_your-choice.xml

    media_your-choice.txt

    etc.
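
    As a hypothetical example (the name 'your-choice' is purely illustrative), the relevant variables in .../search_ini.php would then be set like this:

    // Illustrative settings for the additional JSON, XML and RSS output files.
    $out      = "xml";           // create the additional output files in subfolder .../xml/
    $xml_name = "your-choice";   // produces e.g. text_your-choice.xml, media_your-choice.txt, ...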

    If a new query is sent to Sphider-plus, first of all the old JSON, XML and RSS result files are deleted from the subfolder .../xml. This is performed before searching the database for possible results. When new results are found in the db, the search script stores the results in new JSON, XML and RSS files, and also presents the results as HTML output in the client's browser.

    Additionally, all JSON, XML and RSS output files are saved in the subfolder

    .../xml/stored/

    The file names in this folder additionally contain date and time of storage like

    _2014.03.02_04-47-10_PM_text_results.xml

    The files in the subfolder .../xml/stored/ will not be deleted automatically, but remain available until manually deleted via the admin backend. This is performed in menu 'Clear' by means of the item:

    Delete all files in 'XML' sub folder


    An XML media result file may look like:


    <?xml version="1.0" encoding="utf-8" ?>

    <media_results>

    <query>warp</query>

    <ip>127.0.0.1</ip>

    <host_name>www.007guard.com</host_name>

    <query_time>2014-03-02 04:47:10 PM;</query_time>

    <consumed>0.043</consumed>

    <total_results>2</total_results>

    <media_result>

    <num>1</num>

    <type>image</type>

    <url>http://www.abc.de/index.php</url>

    <link>http://www.abc.de/images/warp.gif</link>

    <title>warp.gif</title>

    <x_size>635</x_size>

    <y_size>98</y_size>

    </media_result>

    <media_result>

    <num>2</num>

    <type>audio</type>

    <url>http://www.abc.de/index.php</url>

    <link>http://www.abc.de/my_music/warp.m4a</link>

    <title>Warp.m4a</title>

    </media_result>

    </media_results>




    Instead a JSON text file may contain the following content:

    {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . .  Preamplifier and driver for connecting cable. HDTV receiver for digital TV and radio. DVB-S\/S2 and DVB-T, digital colour TV set with 24" LED back light illuminated display. 7 Exterior Views 8 9 Interior Views Former, the cow and pig stable were placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of  . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . .  cable. Additional active driven DVB-T antenna. HDTV receiver for digital TV and radio. DVB-S\/S2 and DVB-T digital colour TV set. First floor Upper floor Plot: Size: about 6,2 ha Hilly side, partly wooded View over the hills of the Chianti D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the  . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}




    Top

    The Sphider-plus honeybee