Preamble: The info presented here is valid only for the latest release of Sphider-plus.
The current release is version 4.2024d, published October 04, 2024.
1. Settings, customizing and statistics
If you want to change the settings, behavior, and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is available for Sphider-plus, separated into different submenus:
Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase & Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains
Categories:
- Add, edit, delete
- Create new subcategory under an existing category
Index:
- Basic indexing options
- Advanced options
Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database
Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights
Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize
Templates:
To enable customers to integrate Sphider-plus into existing sites, HTML templates are prepared for:
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in the 'Settings' submenu. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files:
.../templates/My_template/adminstyle.css
.../templates/My_template/userstyle.css
Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages, offering link URL and file size.
- Most Popular Searches for text links offering:
Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering:
Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering:
Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering:
File-name, index date and delete option
- Sitemap log offering:
sitemap.xml output
sitemap list offering file/page suffixes
- IDS log offering:
IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering:
IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering:
Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds a list of details.
All text links, media links and thumbnails are actively linked.
As stated in the Introduction chapter, this search engine uses some PHP libraries and extensions. When the Settings interface is opened, the existence of these libraries is tested by the software, and if a library is not part of the server environment, the corresponding option is not presented in the Settings interface. For example, if the 'rar' extension is not available, it will not be possible to index RAR archives, and the corresponding checkbox will not be presented in 'Spider Settings'. To check the availability of all required libraries and extensions, the Debug mode will present the corresponding messages.
2. Indexing
2.1 Various options
As part of the Admin Site settings you may select the following options:
Full:
Indexing continues until there are no further (permitted) links to follow.
To depth:
Indexes to a given depth, where depth means how many "clicks" away the page can be from the starting page. Depth 0 means that only the starting page is indexed, depth 1 indexes the starting page and all the pages linked from it etc.
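The depth rule can be illustrated as a breadth-first traversal over a small, purely hypothetical link graph (this is a sketch of the concept, not Sphider-plus code):

```python
from collections import deque

def crawl_to_depth(links, start, max_depth):
    """Collect pages reachable from `start` within `max_depth` clicks.
    `links` maps a page to the pages it links to (hypothetical data)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

site = {"start": ["a", "b"], "a": ["c"], "c": ["d"]}
print(sorted(crawl_to_depth(site, "start", 0)))  # ['start']
print(sorted(crawl_to_depth(site, "start", 1)))  # ['a', 'b', 'start']
```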
Re-index:
By selecting this mode, indexing is forced even if the page has already been indexed. Re-index only detects changes in the pages to be re-indexed; modifications in the Admin settings are not recognized.
Erase & Re-index:
By selecting this mode, indexing is forced even if the page has already been indexed.
Additionally, this mode will clear the Sphider-plus database before the re-index process. It will leave the following untouched:
- Categories
- Query log
- Sites and all their options: spider depth, last indexed, can leave domain, title, description, URL Must include, URL must Not include.
If settings have been modified in the Admin section, this mode should be selected to update the database.
Spider can leave domain:
By default, Sphider never leaves a given domain, so links from domain.com pointing to domain2.com are not followed. By checking this option, Sphider can leave the domain; however, in this case it is highly advisable to define proper must include / must not include string lists to prevent the spider from going too far. This option must be activated if a .htaccess file is used for redirect directives.
Must include / must not include:
Explained in chapter: 4.2 Must include / must not include string list.
Multithreaded indexing:
Explained in chapter: 2.6 Multithreaded indexing.
Follow and create sitemap files. See below for details.
Word stemming. See below for details.
2.2 Allow other hosts in same domain
This Admin-selectable option allows indexing other hosts with the same domain name; it also ignores the TLD, SLD and www.
If, for example, indexing is called from http://www.sphider-plus.eu, links like:
- http://sphider-plus.eu (without www.)
- http://www.info.sphider-plus.eu (additional subdomain)
- http://www.sphider-plus.com (different TLD)
- http://www.sphider-plus.tec.eu (additional SLD)
will be followed if this option is activated in Admin settings.
There are two different options available in the Admin settings to cover this feature. The first one follows all links found during the index procedure. The second one only follows the links to other hosts if the found links are redirected.
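A rough sketch of the host comparison: strip 'www.' and the trailing TLD/SLD parts, then compare the remaining core label. This is a simplification for illustration only; the actual Sphider-plus implementation may differ:

```python
from urllib.parse import urlparse

def core_label(url):
    """Reduce a URL's host to its core label, ignoring 'www.' and the
    trailing TLD/SLD parts -- a simplification for illustration only."""
    parts = [p for p in urlparse(url).netloc.split(".") if p and p != "www"]
    return parts[0] if parts else ""

base = core_label("http://www.sphider-plus.eu")
for link in ("http://sphider-plus.eu",           # without www.
             "http://www.sphider-plus.com",      # different TLD
             "http://www.sphider-plus.tec.eu"):  # additional SLD
    print(core_label(link) == base)              # True in every case
```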
2.3 Word stemming
Sphider-plus offers language-specific stemming algorithms for 15 languages:
Bulgarian, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Portuguese, Russian, Spanish and Swedish
These are to be activated individually for the language that needs to be indexed. The corresponding common word list (holding the stop words not to be stored in the database) is activated automatically together with the stemming language. For Chinese, Greek and Russian, the corresponding language support is additionally activated automatically. These additional features remain activated even if word stemming is later reset to 'none', and need to be deselected manually.
On the other hand, if activated for indexing, the stemming selection must remain activated, because the query input must be stemmed as well. As only the etymons are stored in the database during the index procedure, this produces the same results whether the query 'walk', 'walks', 'walked' or 'walking' is entered (for English stemming).
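The effect can be shown with a tiny suffix-stripping sketch; this stands in for the real language-specific stemmers shipped with Sphider-plus and is an illustration only:

```python
def naive_stem(word):
    """Tiny English suffix stripper -- an illustration only, not the
    actual stemming algorithm shipped with Sphider-plus."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# The same stemmer must run on both the indexer and the query side,
# so all four inputs hit the same stored etymon:
print({naive_stem(w) for w in ("walk", "walks", "walked", "walking")})  # {'walk'}
```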
2.4 Periodical Re-indexing
This mode offers automatic re-indexing of all sites, or of specific sites, started periodically at a defined time interval.
In the Admin backend the time interval is selectable as
3 hours, 12 hours, 1 day, 1 week or 1 month.
The count of periodically performed re-indexing procedures can also be defined in the Admin backend.
Once started, the re-index procedures will silently work in the background without creating monitor output but, as in all other indexing modes, writing the index results into log files. Additionally, a log file showing the status of the periodical indexer is created, presenting all dates and times when the re-index procedures were started, as well as the count. This additional log file is available for the Admin in the 'Statistics' menu, called 'Auto Re-index log file'.
The periodical re-indexer can be started and aborted in the Admin backend by selecting the 'Periodical Re-index' submenu in the 'Sites' view.
For site-individual re-indexing, the periodical re-indexer can instead be started and aborted in the 'Options' menu of each site.
2.5 Preferred Re-indexing
Each new URL added in the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while level 5 should be used for non-prioritized sites. If new URLs are not manually supplied with a priority, the level will automatically be set to '1'.
When invoking the option 'Re-index only preferred sites', the admin may select a suitable level for the next index procedure. Thus, only those URLs with the corresponding level will be re-indexed.
2.6 Multithreaded indexing
The Admin setting:
Define number of threads allowed for index procedures (max. 10)
activates parallel indexing. For multiple-site indexing, this option will speed up the procedure significantly. If this option is activated, browser output of logging data, as well as real-time output in a second browser window (tab), is suppressed. Nevertheless, all index results will be stored in log files in the subfolder .../admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread was started (hours.minutes.second).
1 - ID-number, which will be incremented by each thread.
If multithreaded indexing is not activated, the ID will be set to '0'.
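The naming scheme can be reproduced with a short sketch (database number and thread ID here are the illustrative values from the example above):

```python
from datetime import datetime

def log_file_name(db_no, thread_id, when):
    """Build a log file name: db<no>_<yymmdd>-<HH.MM.SS>_<id>.html"""
    return "db{}_{}_{}.html".format(
        db_no, when.strftime("%y%m%d-%H.%M.%S"), thread_id)

started = datetime(2010, 5, 24, 21, 47, 56)
print(log_file_name(2, 1, started))  # db2_100524-21.47.56_1.html
```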
The individual threads are activated by means of the Admin dialog. Multithreaded indexing is foreseen for:
- Re-index all
- Erase & Re-index all
So, if 'Erase & Re-index' is selected, the threads can be started in sequence after the 'Erasing' dialog. It is not necessary to invoke all predefined threads. The dialog (browser) window will always present the last started thread. It is strongly recommended not to close this window or to use the browser's 'Return' button. When a thread has finished indexing, a 'Ready' message is shown and the 'Back to admin' button is presented. Nevertheless, a previously started thread might still be busy indexing another site; this can be seen in the Admin 'Sites' view by the 'Unfinished' message at the corresponding site. Refreshing the 'Sites' window shows the successful end for all threads by replacing the 'Unfinished' message with the date of the last index.
During multithreaded indexing, browser options like caching, pre-fetching and turbo mode should be disabled.
Multithreaded indexing for command line operation is presented below.
2.7 Create thumbnails during index procedure
This Admin-selectable option is supplied by a web service with an individual key, which the Sphider-plus admin needs to sign up for. The web shots will be presented as thumbnails with a size of 130 x 174 pixels as part of the text result listing, individually for each result. It should be taken into account that this option will slow down the index procedure significantly. As it is delivered by a web service, this option is not available for Intranet and all 'localhost' applications.
In the 'Sites' view of the Sphider-plus admin backend you may check the availability of this web service. The time (in seconds) required to create one thumbnail is also presented, so you may decide whether you would like to use this web service for all pages to be indexed.
More details are described in the readme.pdf documentation.
2.8 Prevent indexing of known malware and phishing pages
This feature is a service provided by Google that enables applications to check Internet URLs against Google's constantly updated lists of suspected phishing and malware pages. Sphider-plus uses version 1 of this service, verifying each URL to be indexed online. Thus the admin does not need to update any database and will receive the verification results at the most current level from the Google database. As it is delivered by a web service, this option is not available for Intranet and all 'localhost' applications.
To be activated in the Admin backend, this service additionally requires an individual key, to be signed up for at Google.
2.9 Follow Sitemap file
If activated in Admin settings, Sphider-plus will use the links found in sitemap.xml or sitemap.xml.gz files. This significantly increases the speed of index and re-index, because the links do not have to be searched for in the text part of each page.
This option will also force Sphider-plus to re-index only links that are:
- New and not yet known in Sphider's link table
- Links with a 'last modified' date, which is newer than Sphider's 'last indexed' date in database.
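The decision rule for the two cases above can be sketched as follows; the function and table names are hypothetical, and the lastmod dates are assumed to have been parsed from the sitemap:

```python
from datetime import date

def needs_reindex(url, lastmod, link_table):
    """Return True if `url` is new or modified since it was last indexed.
    `link_table` maps known URLs to their last-indexed date (hypothetical)."""
    if url not in link_table:          # new, not yet in the link table
        return True
    return lastmod > link_table[url]   # modified after the last index

known = {"http://example.com/a.html": date(2024, 9, 1)}
print(needs_reindex("http://example.com/a.html", date(2024, 10, 1), known))   # True
print(needs_reindex("http://example.com/a.html", date(2024, 8, 1), known))    # False
print(needs_reindex("http://example.com/new.html", date(2024, 8, 1), known))  # True
```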
Sitemaps are always expected in the root folder of the site to be indexed and must be named
sitemap.xml or sitemap.xml.gz
If 'Follow sitemap.xml' is activated and a valid sitemap was found, the log output
Links found: 0 - New links: 0
is no longer shown, because all links are delivered by the sitemap file and new links are not searched for during index / re-index.
If a <sitemapindex> tag is detected in a sitemap.xml file, and if multiple Sitemap files are available, Sphider-plus will process the secondary Sitemaps and extract all links for index / re-index. Gzip-compressed files (Sitemap index files as well as the Sitemap files themselves) will also be processed, independent of their file suffix.
A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, the Sitemap index file must be UTF-8 encoded.
For individual Sitemaps with different names and/or Sitemaps that are stored in subfolders, Sphider-plus offers the option of defining their URL and name in the 'Add site' menu, as well as in the 'Edit site' menu. Nevertheless, links in these individual Sitemaps need to follow the rules as defined at http://www.sitemaps.org/; they are always treated as absolute links and must be from a single host. RSS (Really Simple Syndication) 2.0 or Atom 0.3 or 1.0 feed Sitemaps are currently not supported by Sphider-plus.
Extensions of the Sitemaps protocol, such as creating a custom namespace, are also not supported by Sphider-plus.
2.10 Use private sitemap instead of global sitemap
Intended to hold a subset of the global sitemap, this optional setting in the 'Settings' menu of the admin backend may be used to index only frequently modified pages of a site. This is useful to speed up the index procedure by concentrating only on new page content.
The private Sphider-plus sitemaps are always expected in the root folder of the site to be indexed and must be named
sp-sitemap.xml or sp-sitemap.xml.gz
Overall, links in these individual sitemaps need to follow the rules as defined at http://www.sitemaps.org/; they are always treated as absolute links and must be from a single host. RSS (Really Simple Syndication) 2.0 or Atom 0.3 or 1.0 feed sitemaps are currently not supported by Sphider-plus. Extensions of the sitemaps protocol, such as creating a custom namespace, are also not supported by Sphider-plus.
2.11 Create Sitemap file
Creating a Sitemap during index/re-index can be activated in Admin settings. This option offers the following features:
- Compatible with http://www.sitemaps.org/schemas/sitemap/0.9 this module automatically creates a sitemap.xml file.
- In Admin settings the folder name for the Sitemaps can be defined.
- The xml files will be individually named like 'sitemap_www.abc.de.xml'
- When running a 'Re-index', 'Re-index all' or 'Erase & Re-index' existing Sitemaps will be overwritten with the actual data set.
- Additional option: Use a unique name (sitemap.xml) for all created sitemap files.
Could be selected, if only one single Site is to be indexed.
To be used in conjunction with selecting the destination folder for the sitemap files.
Additionally, a list file can be created, sorted alphabetically and also offering all file/page suffixes.
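The generated files follow the sitemaps.org 0.9 schema referenced above; a minimal writer sketch (the URL data here is purely hypothetical, and the real module's output may contain further fields):

```python
from xml.sax.saxutils import escape

def write_sitemap(urls):
    """Emit a minimal sitemaps.org 0.9 urlset for (loc, lastmod) pairs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for loc, lastmod in urls:
        lines.append("  <url><loc>{}</loc><lastmod>{}</lastmod></url>"
                     .format(escape(loc), lastmod))
    lines.append("</urlset>")
    return "\n".join(lines)

xml = write_sitemap([("http://www.abc.de/index.html", "2024-10-04")])
print(xml)
```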
3. Using the indexer from command line
(Subject to re-definition and re-coding; currently without guarantee of proper function)
For more information, please visit the Sphider-plus forum and have a look in section 'Tips & Tricks & Mods'
3.1 All options
It is possible to spider web pages from the command line, using the syntax:
php spider.php <options>
where <options> are:
-all Reindex everything in the database.
-eall Erase database and afterward re-index all.
-new Index all new URLs in database which have not yet been indexed.
-erase Erase the content of the database.
-erased Index all meanwhile erased sites.
-preferred <level> Index with respect to preference level.
-preall Set 'Last indexed' date and time to 0000.
-u <url> Set the URL to index.
-f Set indexing depth to full (unlimited depth).
-d <num> Set indexing depth to <num>.
-l Allow spider to leave the initial domain.
-r Set spider to reindex a site.
-m <string> Set the string(s) that an URL must include (use \\n as a delimiter between multiple strings).
-n <string> Set the string(s) that an URL must not include (use \\n as a delimiter between multiple strings).
For example, for spidering and indexing http://www.domain.com/test.html to depth 2, use:
php spider.php -u http://www.domain.com/test.html -d 2
If you want to reindex the same URL, use:
php spider.php -u http://www.domain.com/test.html -r
3.2 Multithreaded indexing
For command line operation, parallel indexing has no restriction on the count of threads; it is limited only by the server resources. Parallel indexing is enabled for several different methods as described below.
3.2.1 Index only the new
Index all new URLs in database which have not yet been indexed <-new>
Simply start several threads and add individual IDs to the option parameter like
php spider.php -new1
php spider.php -new2
The IDs will be added to the name of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be defined according to personal requirements, but the OS limitations for file names should be taken into consideration. There is no auto-increment of IDs as there is for multithreaded indexing initialized by the Admin dialog.
If IDs are not added, it is obligatory to delay the start of each thread by one second, because the names of the log files are created with a resolution of one second. If started too early, several threads will write into one log file; unsynchronized, the resulting log file will be unreadable.
3.2.2 Re-index all
Invoke this by first preparing the database once with the command

php spider.php <-preall>
This will reset all 'Last indexed' entries to '0000', but will not erase the content of the other tables, so the check whether the content of a page has changed (MD5 sum) is still available for a fast re-index procedure.
Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php spider.php -erased2
The IDs will be added to the names of the log files as described above.
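The MD5 shortcut mentioned above can be sketched like this; the function name and the idea of a stored checksum column are assumptions for illustration:

```python
import hashlib

def page_changed(page_content, stored_md5):
    """Compare the MD5 sum of the freshly fetched page with the stored one;
    an unchanged page can skip the expensive full re-index step."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest() != stored_md5

old = hashlib.md5("hello world".encode("utf-8")).hexdigest()
print(page_changed("hello world", old))  # False -> page unchanged, skip
print(page_changed("hello mars", old))   # True  -> re-index this page
```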
3.2.3 Index erased sites
Index all meanwhile erased sites <-erased> will index only those sites that have been individually or bulk erased. Multithreaded indexing can be invoked by starting several threads and adding individual IDs to the option parameter, as described above.
4. Keeping pages, words and files from being indexed
4.1 robots.txt
The most common way to prevent pages from being indexed is using the robots.txt standard, by either putting a robots.txt file into the root directory of the server, or adding the necessary Meta tags into the page headers.
This directive can be temporarily overridden, site-specifically for the next index procedure, by the advanced option:
Temporary ignore 'robots.txt'
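For reference, a minimal robots.txt in the server's root directory might look like this (the disallowed paths are purely illustrative):

```
User-agent: *
Disallow: /forum/
Disallow: /tmp/
```

Alternatively, individual pages can opt out via a Meta tag such as <meta name="robots" content="noindex"> in the page header.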
4.2 Must include / must not include string list
A powerful option Sphider-plus supports is defining a 'Must include / Must not include' string list for a site (to be found in Sites / Options / Edit). Any URL containing a string in the 'URL must Not include' list is ignored. Any URL that does not contain any string in the 'URL Must include' list is likewise ignored.
In any case, all strings are treated as case-sensitive.
All strings in the string list should be separated by a new line (Enter). For example, to prevent a forum in your site from being indexed, you might add /forum to the 'URL must Not include' list. This means that all URLs containing the string /forum will be ignored and won't be indexed.
Perl-style regular expressions may also be used instead of literal strings, but only a string with a leading '*' is considered to be a regular expression, so that
*/[a]+/
denotes a string with one or more letter a in it.
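The matching rule (literal substring unless the entry starts with '*') can be sketched as follows; this is an approximation of the behavior described above, not the actual Sphider-plus code:

```python
import re

def url_blocked(url, must_not_include):
    """Return True if `url` matches an entry of the 'URL must Not include'
    list. An entry starting with '*' is treated as a regular expression
    (leading '*' and surrounding slashes removed); every other entry is a
    literal, case-sensitive substring."""
    for entry in must_not_include:
        if entry.startswith("*"):
            if re.search(entry[1:].strip("/"), url):
                return True
        elif entry in url:
            return True
    return False

rules = ["/forum", "*/[a]+/"]
print(url_blocked("http://site.tld/forum/topic1", rules))  # True: literal match
print(url_blocked("http://site.tld/data", rules))          # True: regexp match
print(url_blocked("http://site.tld/blog", rules))          # False
```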
If all sites should share a 'URL Must include' or 'URL must Not include' rule, the strings can be placed into the files:
.../include/common/must_include.txt
.../include/common/must_not_include.txt
As per the default Sphider-plus download, the two files 'must_include.txt' and 'must_not_include.txt' are empty.
While calling 'Settings' in admin backend and enabling in section 'General Settings' the option:
Store global values of string lists 'Must include' and 'Must Not include' for all URLs
And afterwards pressing any 'Save' button, the content of these files is transferred into the corresponding option fields of all sites.
This is used only one time, so that later on individual values can be added to each URL in the 'Sites' view; see 'Options' => 'Edit' for each URL individually.
4.3 Ignoring links
Sphider-plus respects the rel="nofollow" attribute in <a href..> tags, so for example the link foo.html in
<a href="foo.html" rel="nofollow">
is ignored. Also, if the nofollow flag is set in the header of a site, its links will not be followed.
This directive can be temporarily overridden for the next index procedure by the advanced option:

Temporary ignore 'nofollow' directive
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu).
Any part of a page between
<!--sphider_noindex--> and <!--/sphider_noindex-->
tags is not indexed, however links in it are followed.
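The two-pass idea (links from the whole page, words only from the unmarked part) can be sketched with regular expressions; the HTML snippet is hypothetical:

```python
import re

PAGE = """<p>Welcome. <a href="page1.html">one</a></p>
<!--sphider_noindex-->
<p>Menu text <a href="page2.html">two</a></p>
<!--/sphider_noindex-->"""

# Links are extracted from the complete page, including the excluded span...
links = re.findall(r'href="([^"]+)"', PAGE)

# ...while words are indexed only from the page with the marked span removed.
indexable = re.sub(r"<!--sphider_noindex-->.*?<!--/sphider_noindex-->",
                   "", PAGE, flags=re.S)

print(links)                 # ['page1.html', 'page2.html']
print("Menu" in indexable)   # False
```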
4.5 Ignoring parts of a page by <ul id='abc'>
Ignoring parts of a page by the <!--sphider_noindex--> tags requires direct access to the page, because the tags need to be added (edited) to the page.
A more flexible method, which does not require direct access, is enabled by the Admin setting:
'Use list of ul classes to ignore the complete ul content during index/re-index'
If enabled in Admin settings, the values as defined in the list-file …/include/common/uls_not.txt
will be used to delete the content between <ul class='abc'> and </ul>.
The same applies for <ul id='abc'> and </ul>.
Also, the global attribute 'inert' will be used to ignore the ul content.
Values in this common list may end with a wildcard, so that 'menu*' will work for classes like
menu1, menu2, menu_left, etc.
Multiple ul tags will be handled.
For even more flexibility, the file …/include/common/uls_not.txt may alternatively contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */menu[0-5]/
As of Sphider-plus v.4.2024d also class values together with id values will be obeyed. For example
<ul class="submenu2" id="dropdown-africa"> text to be deleted </ul>
For this example the list-file …/include/common/uls_not.txt needs to contain in one row separated by a blank character:
submenu2 dropdown-africa
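How one entry of uls_not.txt could be matched against a class or id value can be sketched in Python; this is illustrative only (entry_matches is not a Sphider-plus function): a trailing '*' acts as a wildcard, and */pattern/ entries are treated as a regular expression.

```python
import re

def entry_matches(entry, value):
    """Match one uls_not.txt entry against a ul class or id value."""
    if entry.startswith('*/') and entry.endswith('/'):
        return re.fullmatch(entry[2:-1], value) is not None   # regexp entry
    if entry.endswith('*'):
        return value.startswith(entry[:-1])                   # wildcard entry
    return entry == value                                     # literal entry
```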
4.6 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <!--sphider_noindex--> tags requires direct access to the page, because the tags need to be added (edited) to the page.
A more flexible method, which does not require direct access, is enabled by the Admin setting:
'Use list of div ids or classes to ignore the complete div content during index/re-index'
If enabled in Admin settings, the values as defined in the list-file .../include/common/divs_not.txt will be used to delete the content between
<div id='abc'> and </div>
Multiple and nested divs will be handled. Alternatively, this option can also be used for <div class='abc'>.
Also, the global attribute 'inert' will be used to ignore the div content.
As of Sphider-plus v.4.2024d also multiple class or id values will be obeyed. For example
<div class="composer rachmaninov"> text to be ignored </div>
For this example the list-file …/include/common/divs_not.txt needs to contain in one row separated by a blank character:
composer rachmaninov
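Because divs may be nested, the matching </div> cannot be found with a simple pattern alone. A minimal Python sketch of nesting-aware removal (not the actual Sphider-plus implementation; strip_div is hypothetical and matches only a single-valued id or class attribute):

```python
import re

def strip_div(html, name):
    """Remove every <div> whose id or class equals `name`, respecting nesting."""
    open_tag = re.compile(r"<div\b[^>]*(?:id|class)=['\"]%s['\"][^>]*>" % re.escape(name), re.I)
    any_div = re.compile(r'<div\b|</div>', re.I)
    while True:
        m = open_tag.search(html)
        if not m:
            return html
        depth, pos = 1, m.end()
        for t in any_div.finditer(html, m.end()):
            depth += 1 if t.group().lower().startswith('<div') else -1
            if depth == 0:
                pos = t.end()   # position just after the matching </div>
                break
        html = html[:m.start()] + html[pos:]
```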
4.7 Indexing only parts of a page by <div id='abc'>
If enabled in Admin settings, the values as defined in the list-file .../include/common/divs_use.txt will be used to index only the content between
<div id='abc'> and </div>
Nevertheless, links outside of the div tags will be followed. Values in this common list may end with a wildcard,
so that 'menu*' will work for ids like
menu1, menu2, menu_left, etc.
Multiple and nested divs will be handled.
For even more flexibility, the file …/include/common/divs_use.txt may alternatively contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */table[0-5]/
4.8 Ignore HTML elements defined by <tagname> . . . . </tagname>
This option is foreseen to cooperate with the new HTML5 elements like
section, nav, aside, hgroup, article, header, footer, etc.
HTML elements are written with a start tag and an end tag, with the content in between:
<tagname> this content </tagname>
For more details, please see the HTML element and tag references at
http://www.w3schools.com/html/html_elements.asp
http://www.w3schools.com/tags/
If enabled in Admin settings, the values as defined in the list-file …/include/common/elements_not.txt will be used to delete the content between <tagname> and </tagname> .
Nevertheless, links inside of the tags will be followed. Values in this common list are automatically added with a wildcard, so that 'aside' will work for HTML elements like
aside1, aside2, aside_left, etc.
Also for elements like
<nav class="menu">
<ul>
<li> <a href="#"> Start </a></li>
<li><a href="#"> About us </a></li>
<li> <a href="#"> Contact </a></li>
</ul>
</nav>
only the name of the element (which is nav) needs to be added into the list-file.
For even more flexibility, the file …/include/common/elements_not.txt may alternatively contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */nav[0-5]/
Please keep in mind that element names placed in …/include/common/elements_not.txt are processed case-sensitively.
4.9 Index only HTML elements defined by <tagname> . . . . </tagname>
This is the inverse of the function described in the chapter above.
If enabled in Admin settings, the values as defined in the list-file …/include/common/elements_use.txt will be used to index only the content between <tagname> and </tagname> .
Please keep in mind that element names placed in …/include/common/elements_use.txt are processed case-sensitively.
4.10 Ignored words
Beginning with version 1.7, Sphider-plus offers the capability to prepare language-specific common files. Common words that are not to be indexed can be placed into individual files. The names of these files must start with 'common_' and end with the suffix '.txt', like 'common_eng.txt'. The files must be placed into the folder:
.../include/common/
The common word files should not be used if 'phrase search' is the standard type of search, as Sphider-plus will have problems finding complete phrases. Therefore, in Admin / Settings / Spider settings, the use of common word files may be activated / deactivated by the checkbox:
Use commonlist for words to be ignored during index / re-index?
Take notice that the 'Ignored words' function is not case-sensitive for many languages, so you only need to include one spelling in the common_xyz.txt file.
However, the common word list is case-sensitive for the following languages:
- Arabic
- Chinese
- Cyrillic
4.11 Use of Whitelist
Sphider-plus offers the capability to control the index / re-index procedure by a list of words called 'whitelist'. A page will only be indexed / re-indexed if its text contains words of the whitelist. The list is placed in the file .../include/common/whitelist.txt. Text content is defined by Admin settings by means of what to index: full text, title, keywords etc. The content of links (URLs) is controlled separately by the 'Must include / must not include' string lists.
The use of the whitelist may be activated / deactivated by two different checkboxes in Admin / Settings/ Spider settings:
- Use whitelist in order to index / re-index only those pages that include ANY of the words in whitelist
- Use whitelist in order to index / re-index only those pages that include ALL the words in whitelist
Take notice that these functions are not case-sensitive, so you only need to include one spelling in the whitelist.txt file.
The content of the whitelist is treated as 'words'. So the word 'kinder' in your whitelist will not accept pages that contain only the word 'kindergarten'.
Be aware not to place blank rows into the whitelist. Also, the list should end with the last word, not with a line feed or a blank row.
- Each word in list must be in a separate row.
- One word per row.
- No blank rows.
- No blank row at the end of the file.
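In Python terms, the two whitelist modes could look like this (an illustrative sketch; page_passes is not a Sphider-plus function). Note the word-based, case-insensitive matching, so 'kinder' does not accept 'kindergarten':

```python
import re

def page_passes(text, whitelist, require_all=False):
    """ANY mode: at least one whitelist word occurs; ALL mode: every word occurs."""
    words = set(re.findall(r'\w+', text.lower()))
    wanted = {w.lower() for w in whitelist}
    return wanted <= words if require_all else bool(wanted & words)
```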
4.12 Use of Blacklist
Sphider-plus offers the capability to control the index / re-index procedure by a list of words called 'blacklist'. If the content of the page contains one word of the blacklist, it will not be indexed / re-indexed. The list is placed in the file .../include/common/blacklist.txt
In Admin / Settings/ Spider settings, the use of the blacklist may be activated / deactivated by the checkbox:
Use blacklist to prevent index / re-index of pages that contain any of the words in blacklist?
A second setting in the same section enables the rejection of queries that contain a word of the blacklist, even if that word is only part of the query.
If the checkbox:
Use blacklist to delete queries that contain any of the words in blacklist?
is activated, the complete query is deleted and a blank search is performed.
Please keep in mind that 'Use of Blacklist' is implemented differently from 'Use of Whitelist'. The blacklist interprets its content as strings, so the word 'kinder' in the blacklist will also prevent indexing of a page containing the word 'kindergarten'.
Be aware not to place blank rows into the blacklist. Also the list should end with the last word; not with a line feed or a blank row.
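The contrast to the whitelist can be made explicit with a small Python sketch (page_blocked is a hypothetical name): the blacklist matches substrings, not whole words.

```python
def page_blocked(text, blacklist):
    """Substring matching: 'kinder' also blocks a page containing 'kindergarten'."""
    low = text.lower()
    return any(word.lower() in low for word in blacklist)
```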
4.13 Ignored files
The list of file types that are not checked for indexing is placed in .../include/common/ext.txt. This file holds all file suffixes for those types of files that are to be ignored during the index / re-index procedure.
The 'ext.txt' file is independent from the media files to be indexed. All file types not to be followed for text indexing must be placed in 'ext.txt'; it is to be seen as a blacklist for file suffixes.
In contrast, the files
image.txt
audio.txt
video.txt
are whitelists that contain the suffixes of files to be indexed, according to the type of media.
4.14 Canonical <link> tag
Following the convention defined by Google, Microsoft and Yahoo! in February 2009, Sphider-plus honours the rel="canonical" link. You may simply add this <link> tag to specify your preferred page version:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
inside the <head> section of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The duplicate pages will be ignored and not indexed. Sphider-plus treats rel="canonical" as a directive, not a hint. The canonical link may also be a relative path, but it is not allowed to refer to a different domain. Unfortunately, the creation of canonical link tags needs to be done manually, so special care has to be taken that other directives like robots.txt or rel="nofollow" do not prevent the crawling of the canonical origin.
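The canonical handling described above (a directive, relative paths allowed, different domains rejected) can be sketched in Python; canonical_url is an illustrative name, not part of Sphider-plus:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class CanonicalFinder(HTMLParser):
    """Collect the href of a <link rel="canonical"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.href = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'link' and (a.get('rel') or '').lower() == 'canonical':
            self.href = a.get('href')

def canonical_url(page_url, html):
    p = CanonicalFinder()
    p.feed(html)
    if not p.href:
        return None
    target = urljoin(page_url, p.href)                # relative paths are allowed
    if urlsplit(target).netloc != urlsplit(page_url).netloc:
        return None                                   # different domain: ignored
    return target
```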
5. UTF-8 Support and 'Preferred charset'
Starting with version 1.2, Sphider-plus provides Unicode assistance, and starting with version 2.1 the conversion is obligatory. As a consequence, the impact is considerable.
First of all, the complete full text and all header information like title, keywords and description tags need to be converted into Unicode. The consequence is an increase of the time required for indexing.
As also suggested by Yiannes [pikos], three steps are integrated to realize this procedure:
1. Detect the charset of site, page or file.
This information is normally presented as part of the HTML header.
If not available, or for files without a header like .doc, .rtf, .pdf, .xls and .ppt files,
the 'Preferred charset' (as defined in Admin settings) will be used to convert the file into Unicode.
In other words: it is not possible to convert DOCs, PDFs etc. that are coded in 'foreign' charset.
Only those with your personal charset will be converted correctly. Also it is not possible
to convert a Chinese and a Cyrillic coded PDF document at the same time. It is necessary to
adapt the 'Preferred charset' before invoking the index procedure for the sites and their
links to these documents.
2. By means of the PHP function 'iconv()' all texts will be converted into UTF-8.
This step is successful, if the required charset (for the content to be converted)
is part of your local PHP installation. In order to find out which charsets are available
in your installation, please check the files in the server folder:
.../apache/bin/iconv/
Depending on the installation you will find about 200 charset files that iconv()
is able to use for converting.
3. If the PHP function fails, finally the class 'ConvertCharset' is invoked. This class,
originally designed by Mikolaj Jedrzejak, enables converting for a lot of charsets.
But it takes more time than the compiled PHP function 'iconv()'.
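In Python terms, the three-step fallback could be sketched like this (the PHP original uses iconv() and the ConvertCharset class instead; to_utf8 and the permissive last resort are assumptions for illustration):

```python
def to_utf8(raw, declared=None, preferred='iso-8859-1'):
    """Try the declared charset, then the Admin 'Preferred charset', then give up gracefully."""
    for charset in (declared, preferred):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass                                     # unknown or failing charset
    return raw.decode('utf-8', errors='replace')     # last resort
```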
As result of the charset conversion, the user is enabled to search also for words that contain non-Latin characters.
In order to enable converting of all charsets into UTF-8, upper and lower case characters are required. So (normally) the query 'html' will not deliver results for sites and files that contain the string 'HTML'. Both are different keywords and stored separately in the Sphider-plus database.
Starting with version 1.6 Sphider-plus offers the additional option:
'Enable distinct results for upper- and lowercase queries'
If enabled in Admin settings, everything remains as described above. But if this checkbox is unchecked, result listing will deliver all results, independent of the query input: HTML, html or even hTmL queries will deliver the same (all) results.
The checkbox for this option is placed with full intention in section 'Spider settings', as activating and also deactivating always requires an 'Erase & Re-index' procedure.
The following 63 charsets are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:
WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai (this file is also for DOS)
DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2
MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman
ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16
MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes (DSP implementation for NeXT)
stdenc
symbol
zdingbat
And specially for old Polish programs: mazovia
This list is to be read only as a completion to the list of charsets found in the subfolder /iconv/ of your server.
6. Search modes
Beside the original Sphider search queries like:
- Search for a single word
- AND and OR search
- Search for a phrase
Sphider-plus offers 7 additional modes to enter queries:
- Search with wildcard
- Strict search
- Tolerant search
- Link search
- Media search
- Search only in one domain
- Search in suggested categories
Wildcard, strict and tolerant search modes are available only for single query word input.
Search with wildcards *
This mode enhances the Sphider-plus capabilities to search also for parts of a word. It is invoked by entering a * as wildcard for the unknown part of the search query.
Implemented for single-word queries, wildcards could be used like:
*searchme
*searchall*
*search*more*
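A wildcard query can be translated into a pattern by treating each '*' as 'any run of characters'; a Python sketch of the idea (wildcard_to_regex is an illustrative name; how Sphider-plus actually matches keywords is not shown here):

```python
import re

def wildcard_to_regex(query):
    """'*search*more*' -> pattern matching any word containing 'search' then 'more'."""
    parts = [re.escape(p) for p in query.split('*')]
    return re.compile('^' + '.*'.join(parts) + '$')
```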
Depending on the Sphider-plus database, a lot more results may appear using this search mode. In order to limit the amount in the result listing, there is an option in the admin backend named
Define maximum count of result hits for queries with wildcards
It is to be found in 'Settings' => 'Search Settings'.
If you would like to know the multiple words found in the database that are to be highlighted:
In your editor open the script:
.../include/searchfuncs.php
Find the row containing the text:
Multiple words found in the database to be highlighted: '$hi'
Uncomment this row. Now, if there is more than one result word, all the multiple words found in the database will be presented on top of the result listing.
Strict search !
This variant is invoked by entering a ! as the first character of the search query. If you search for '!plus', only results for the word 'plus' will be presented in the result pages. No results for words like 'spider-plus' or 'spiderplustec' will be shown. This is the reverse function of 'Search for part of a word by means of * wildcards'. Strict search only finds results in the text part of the indexed pages and will respond only to a single-word query. Strict search overwrites the 'Phrase Search' option, which is nullified.
Tolerant search
This mode enables a tolerant search for Sphider. Selectable in the search form like AND, OR and Phrase Search, a new item "Tolerant Search" is added.
If this item is selected, query input "perdida" will also deliver results for all sites that contain the word "pérdida". Inverse function is also implemented: "pérdida" input will deliver all results for "perdida".
If enabled, this mode equalizes search input for e=é=è=ê and all the other vowels like ä=a=à=â, ü=u, o=ö etc. Upper-case letters like Ä=A are also taken into account. Tolerant search overwrites the 'Distinct results for upper- and lowercase queries' setting and will mark all results.
Natively developed to deliver the most possible results for queries with entities and accents, and also to simplify user input, this mode also delivers results that are "like" the query input. So something like the "Did you mean" facility is already an integral part of this search method.
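The accent equalisation (e=é=è=ê, ä=a, Ä=A, …) corresponds to stripping combining marks after Unicode decomposition; a Python sketch of the idea (fold is a hypothetical helper, not a Sphider-plus function):

```python
import unicodedata

def fold(word):
    """Decompose to NFD, drop combining marks, lowercase: 'Pérdida' -> 'perdida'."""
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(c for c in decomposed if not unicodedata.combining(c)).lower()
```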
Link search
Invoked by starting the query input with 'site:', the user is enabled to search for all pages of a domain. It is not necessary to enter the full domain address. For example, if you enter 'site:sphider-plus.eu' you will get a list of all pages that belong to the domain http://www.sphider-plus.eu
If the search query is part of more than one domain address in Sphider's site table, a list of these domains will be presented as an intermediate result. If you then click on the desired domain in this list, all links (pages) of this domain will be presented as the final result listing.
Media search
Media search is invoked by an additional checkbox in the Search Form. Media will be found individually for:
- Images
- Audio
- Video
Entering 'media:' (without quotes) will present all media stored in the database. For more details please see the chapter Media Search.
Search only in one domain
This mode is invoked by entering:
site:www.abcd.de query
and will present only results delivered by the (example) domain http://www.abcd.de. The search input does not need a blank between site: and the URL of the domain, but does between the URL and the query. In contrast to the 'Link search', this mode requires the full URL to be entered into the search form (including http://).
Beside www domains, this mode is usable also for localhost applications. The Admin settings 'Address to localhost document root' is used to rebuild the basic address of the local domains.
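Parsing the two 'site:' query forms described above can be sketched as follows (parse_site_query is an illustrative name, not a Sphider-plus function):

```python
def parse_site_query(raw):
    """Return (domain, query): link search if query is empty, plain search if domain is None."""
    if not raw.startswith('site:'):
        return None, raw
    rest = raw[5:].strip()
    if ' ' in rest:
        domain, query = rest.split(' ', 1)   # domain restriction plus query
        return domain, query.strip()
    return rest, ''                          # link search only
```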
Search in categories
After the Admin has assigned the indexed sites to different categories, the user of the search engine may narrow the result listing to manually selectable categories, or follow the suggestions of Sphider-plus, offering alternate categories and subcategories that would also deliver results.
For more details please see the chapter Search in categories
Greek language support
In order to support international languages, Sphider-plus stores all data UTF-8 coded in the database. But some languages require special attention. For example, the Greek language is supported as follows.
In Admin backend there are some special settings, controlling the Greek language support:
- Support Greek language. Offers correct support for upper and lower case Greek letters.
- Convert all kinds of Greek accents (upper and lower case) into their basic vowels
Will present e.g. the same query results, when searching with the letter α as well as using:
ἀ, ἁ, ἂ, ἃ, ἄ, ἅ, ἆ, ἇ, ὰ, ά, ά, ᾀ, ᾁ, ᾂ, ᾃ, ᾄ, ᾅ, ᾆ, ᾇ, ᾰ, ᾱ, ᾲ, ᾳ, ᾴ, ᾶ or ᾷ
- Transliterate queries with Latin characters into their Greek equivalents (Supports the old and new Greek alphabet)
By activating the above 3 options, the following behaviour is available:
Additional remarks:
Searching for "αλλα" will not present any results for words written in Latin characters.
Instead searching for "alla" will present results for:
alla, ἀλλα, ἀλλὰ, Αλλα, Άλλα, etc.
Depending on which result was found first in the full text, the according text extract will show the first hit. In order to see all available results in the result listing (Latin + transliterated Greek), it is suggested to increase the value in the Admin setting:
Define maximum count of result hits per page, displayed in search results.
If a "strict" search is invoked (!ἀλλὰ), the conversion of Greek accents into their basic vowels will not be obeyed. Also the "strict" search (!alla) will overwrite the transliteration option, so that αλλα, ἀλλὰ, Αλλα, Άλλα, etc. will not be found for the query "!alla".
If the option "Transliterate queries with Latin characters into their Greek equivalents" is activated, only Greek suggestions will be offered. It is assumed that this option is activated because Greek results are preferred.
Block queries
Three different methods of blocking queries are offered:
- Block all queries sent by harvester, bots and known evil user-agents (about 1.000 UAs)
- Block all queries sent by Meta search engines like Google, MSN, Amazon, etc. (about 4.000 IPs)
- Block all queries sent by known spammer (about 190.000 IPs)
Each option needs to be activated separately in admin backend.
The lists holding the bad bots, as well as the Meta search engines are editable .txt files.
In contrast, the spammer IPs, which persist in abusing forums and blogs with their scams, ripoffs, exploits and other annoyances, are automatically updated every 24 hours by means of a web service and are added to the Meta search engine IPs.
7. Chronological order for result listing
7.1 Text result listing
Sphider-plus offers 11 methods of sorting the text results:
- By relevance (weight %)
- By count of hits in full text
- By last indexed (date and time)
- Main URLs (domains) on top
- By URL names
- By file suffix
- Only top 2 per URL (like Google)
- Most Popular Links on top
- Promoted domains on top
- Pages holding a catchword on top
- Single result per page (ignoring arguments in URL)
The current selection of 'order for result listing' is visible for the users as additional headline on all result pages. If not desired, this option can be de-selected in Admin setting:
'Show mode of chronological order for result listing as headline'.
In order not to confuse the user, for the 3 methods
- By URL names and then weight
- Only top 2 per URL (like Google)
- Most Popular Links on top
the output of relevance (weight %) is suppressed in the result listing.
For the method 'By URL names' an additional setting is available called:
'Define number of results shown per domain in result listing'
Using this option, the result listing will present an output similar to 'Like Google', but the count of links added to the main domain is selectable, whereas 'Like Google' will always present one additional link beyond the main domain.
For the 'Most Popular Links on top' method, Sphider-plus uses the previously learned link acceptance. If a user leaves the result listing by clicking on any of the offered links, Sphider-plus will memorize this decision. The user is temporarily redirected to the script .../include/click_counter.php, which stores the user's link decision, last query, time and date before leading the user to the real destination.
This link-specific 'best click' counter is used as teach-in to define the chronological order of the result listing. In order to prevent promoted clicks on a specific link, there is a delay timer before the next user click will be accepted. To be set in Admin / Settings / Index Log Settings, the setting defines the idle time in seconds.
If there are more results than rated links, the rest of the result listing will be presented by relevance (weight), using the weighting of the last index / re-index.
Note that the 'Most Popular Links on top' item overrides all other orders of result listing.
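The ordering rule just described (clicked links first by click count, the remainder by weight) can be sketched in Python; the field names 'clicks' and 'weight' are assumptions for illustration, not the real database schema:

```python
def order_results(results):
    """'Most Popular Links on top': rated links first, rest by relevance weight."""
    clicked = [r for r in results if r.get('clicks', 0) > 0]
    rest = [r for r in results if r.get('clicks', 0) == 0]
    clicked.sort(key=lambda r: r['clicks'], reverse=True)
    rest.sort(key=lambda r: r.get('weight', 0), reverse=True)
    return clicked + rest
```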
For the result-ordering method 'By relevance (weight %)' the weight is calculated. The situation changes if in Admin settings the item 'Instead of weighting %, show count of query hits in full text' is activated: now only the hits in the full text are used to calculate the order of the result listing. Keyword hits in the URL name, path, title tag etc. are not taken into consideration.
The weighting of:
- Word in web page title tag
- Word in the domain name
- Word in the path name
- Word in web page keywords tag
may be influenced individually for personal preferences. The according settings could be performed in section 'Page indexing weights' of the Admin settings.
For the method 'Single result per page', result listing will ignore arguments in URL. This will present only one result for URLs like:
. . . /pizza-restaurants.html
. . . /pizza-restaurants.html?page=1
. . . /pizza-restaurants.html?page=2
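The deduplication behind 'Single result per page' amounts to comparing URLs without their query string; a Python sketch (dedupe_by_path is an illustrative name):

```python
from urllib.parse import urlsplit

def dedupe_by_path(urls):
    """Keep the first URL per (host, path); arguments after '?' are ignored."""
    seen, kept = set(), []
    for url in urls:
        key = (urlsplit(url).netloc, urlsplit(url).path)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```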
Sorting search results by 'file suffix' allows sorting all results with respect to the page/file suffix.
So, for example, it can be selected that all results from .html pages are shown first, all results from .php pages second, and finally all results from .pdf documents at the end of the result listing.
This is controlled by the file
.../include/common/file_suffix.txt
which contains a list of suffixes to be expected in the result listing. The order of the suffixes in this list determines their importance in the result listing. The list is freely editable.
It might be helpful to select in admin backend => Settings => Spider Settings the option
'Additionally create a list file, sorted alphabetically by file suffixes'
before indexing your site, as this list contains all suffixes found during the index procedure. It can be observed via
admin backend => Statistics => List of sitemaps
As stated, the file .../include/common/file_suffix.txt is freely editable and can be created from the suffixes presented in the admin backend.
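Sorting by file suffix then reduces to ranking each result by the position of its suffix in file_suffix.txt; a Python sketch (sort_by_suffix is an illustrative name; placing unlisted suffixes last is an assumption):

```python
def sort_by_suffix(urls, suffix_order):
    """Order results by the suffix priority list; unlisted suffixes go last."""
    rank = {s: i for i, s in enumerate(suffix_order)}
    def key(url):
        suffix = url.rsplit('.', 1)[-1].lower()
        return rank.get(suffix, len(suffix_order))
    return sorted(urls, key=key)
```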
Additionally, two different methods of 'promoted sorting' of the result listing are available:
1. As part of the Admin settings, a domain name or part of the name could be entered. All search results belonging to this domain will be placed on top of result listing.
2. All pages containing a catchword will be displayed on top of the search result listing. As part of the Admin settings, the catchword could be entered.
Both methods of promoted sorting can be combined. If the domain name and also the catchword are entered in Admin settings, both conditions must be fulfilled for a link to become promoted in the result listing.
7.2 Media result listing
Independent from sorting the text results, 5 different modes of sorting the media results are Admin-definable:
- By title (alphabetic)
- By file suffix
- By image size
- By 'Last queried'
- By 'Most popular'
8. PDF converter
Starting with version 4, a new PDF converter is implemented in Sphider-plus. Realized as a pure PHP script, the new converter no longer requires the definition of an individual path.
The new converter indexes text and images in unencrypted PDFs.
9. Clean resources during index / re-index
In order to prevent performance problems and memory overflow for large amounts of URLs, Sphider-plus may clean unused resources during index / re-index. Selectable in Admin settings, this item will periodically:
- Free memory that is allocated to unused MySQL resources.
- Unset PHP variables, which are no longer required.
As this clearing work is done several times during index / re-index of every URL, additional capacity is required, and consequently the overall indexing time will increase. So this item should be selected only for a huge amount of URLs. Depending on
- Memory size allocated to PHP
- Total number of URLs
- Number of internal and external links
- Size of text to be indexed for each page
- CPU clock rate
- System RAM
there will be an individual limit for when to enable this feature. Following the discussion in the Sphider forum, this feature should be activated only if > 100 sites are to be indexed, or when Sphider-plus dies a silent death during the index procedure, not indexing any more sites.
Please take notice of the FAQ chapter:
Error message: "Unable to flush table 'addurl'"
and
Error message: "Access denied; you need the RELOAD privilege . . ."
10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- Server modules for Apache do buffering of their own that causes flush() to not send data to the client immediately.
- The browser may buffer its input before displaying it. Netscape, for example, buffers text until it receives an end-of-line or the beginning of a tag, and it won't render tables until the </table> tag of the outermost table is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. AJAX technology was the approach chosen to realize this feature; it is selectable in the Admin settings together with the update interval (1 - 10 seconds).
Pressing one of the 'Start index / re-index' buttons involves three additional scripts.
( onclick=\"window.open('real_log.php')\" )
.../admin/real_log.php
By opening a new browser window / tab, this script takes over to display the latest logging data. Requesting fresh data from the JavaScript file 'real_ping.js', all new logging data will always be placed into <div id='realLogContainer' />. So, better not to press the 'Reload' button of your browser; the current <div /> might already be empty.
.../admin/real_ping.js
Script that transfers requests from the HTML client to the PHP server script and vice versa. It handles the refresh for real-time logging during the index and re-index procedure by means of asynchronous requests (AJAX) to the server.
.../admin/real_get.php
This script delivers the 'refresh rate' and the latest 'logging data', requested by the JavaScript file 'real_ping.js'. It also performs the reset of the 'real_log' table in Sphider's database.
The latest logging data is delivered by the .../admin/messages.php script which, besides writing into the normal log file, feeds the table 'real_log' in Sphider's database. This is the buffer for the latest logging data.
Prerequisites are the enabled 'Log spidering dates' and 'Log file format = HTML'. When activating the real-time output, both preconditions are selected automatically.
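The polling mechanism described above can be sketched in a few lines of browser-side JavaScript. This is an illustrative sketch, not the actual content of real_ping.js; the endpoint name real_get.php and the container id realLogContainer are taken from the description above, while the appendLog helper and the fixed 2-second interval are assumptions of this sketch.

```javascript
// Illustrative sketch of the AJAX polling loop described above.
// 'real_get.php' and 'realLogContainer' are named in the text;
// everything else is a hypothetical simplification.

// Pure helper: merge freshly received log lines into the text shown so far.
function appendLog(current, fresh) {
  if (!fresh) return current;           // nothing new in this interval
  return current ? current + "\n" + fresh : fresh;
}

// Browser wiring (skipped when run outside a browser, e.g. under Node):
if (typeof window !== "undefined") {
  const container = document.getElementById("realLogContainer");
  const refreshSeconds = 2;             // Admin-definable: 1 - 10 seconds

  setInterval(async () => {
    // real_get.php delivers the latest logging data from the 'real_log' table
    const response = await fetch("real_get.php");
    const fresh = await response.text();
    container.textContent = appendLog(container.textContent, fresh);
  }, refreshSeconds * 1000);
}
```

Because the container is refilled on every poll, reloading the page discards everything accumulated so far, which is why the text above recommends not pressing the browser's 'Reload' button.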
11. Error messages and Debug mode
Starting with version 1.7, Sphider-plus offers the capability to enable / disable the output of MySQL error messages as well as PHP error messages, warnings and notices. To be activated in Admin / Settings / Admin Settings, this capability should only be used for debug purpose. It is recommended to disable the output of these messages for production systems, as they could reveal sensitive information.
Debug modes are individually available for the Admin backend, as well as for the 'Search User'.
Selection of the 'Debug mode' is implemented in the Admin settings. If the Debug mode is enabled, the links and keywords found on every indexed page are presented in the log file output and also in the real-time log output. It has to be taken into consideration that only the new links and keywords found on the respective page will be presented. Links and keywords already stored in the Sphider-plus database (because they were already detected on a former page) will not be presented again.
The 'Debug mode' adds a comma and a blank to each keyword. So, debug output will be something like:
New keywords found here:
abc, defg, hijklm, nop, . . .
As Sphider-plus also indexes special characters like commas and dots, keywords like 'defg,' and 'hijklm.' will be presented like:
New keywords found here:
abc, defg,, hijklm., nop, . . .
The 'Debug mode' only modifies the log files. Sphider-plus database remains unaffected and will hold the same values as indexing without Debug mode. In other words, activating / deactivating this mode has no effect on the later search results.
With an activated 'Debug mode', the output of MySQL and PHP error messages is also activated; Debug mode overrides the corresponding setting. After finishing a debug session, the output of error messages must therefore be disabled manually.
In order to check the availability of all required libraries and extensions Sphider-plus is using, the Debug mode will present the corresponding messages on top of the 'Settings' menu.
If 'Debug mode' is enabled for the 'Search User', the cache activity is presented above the result listing in the form of status messages, as well as automatically performed mode settings for 'Strict search' and 'Search with wildcards'.
12. Delete secondary characters
This feature was implemented in order to kill unimportant (secondary) characters in front of words and also as trailing characters of words. If activated in Admin / Settings / Spider Settings, the query input sphider will deliver results for all keywords like:
sphider
sphider:
"sphider"
(sphider).
sphider.
sphider?
sphider!
The following characters in front of words are deleted:
" (
Also, if at the end of words, these characters are deleted:
) " ). , . : ? !
If placed at the end of words that contain only digits, the dots are not deleted ( 27. ). So the search for
27. November 2008
remains available.
For personal requirements, the following two lines in .../admin/spiderfuncs.php may be edited:
$file = preg_replace('/, |[^0-9]\. |! |\? |" |: |\) |\), |\)./', " ", $file); // kill characters at the end
$file = preg_replace('/ "| \(/', " ", $file); // kill special characters in front of words
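The effect of these two replacements can be illustrated with an equivalent sketch in JavaScript (the actual implementation is the PHP preg_replace shown above; the patterns here are a simplified re-creation, not a literal translation):

```javascript
// Sketch of the secondary-character stripping described above.
// Simplified re-creation of the PHP preg_replace rules, for illustration only.

function stripSecondary(text) {
  // Kill characters at the end of words; a dot after a digit survives,
  // so a date fragment like '27.' stays searchable.
  text = text.replace(/([^0-9 ])[").,:?!]+(?= |$)/g, "$1");
  // Kill special characters in front of words.
  text = text.replace(/(^| )["(]+/g, "$1");
  return text;
}

console.log(stripSecondary('(sphider). "sphider" sphider? 27. November'));
// → sphider sphider sphider 27. November
```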
Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters at word endings might be erased by accident.
13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin settings for:
- Images
- Audio streams
- Videos
Three separate files in subfolder .../include/common/ that are named
image.txt
audio.txt
video.txt
hold a list of associated file suffixes. Only media files with the corresponding suffix will be taken into account during index / re-index procedure. These three files may be edited for personal purpose.
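The suffix gate described above can be sketched as follows. The file names image.txt, audio.txt and video.txt come from the text; the exact file format (one suffix per line here) and the helper names are assumptions of this sketch:

```javascript
// Sketch of the media-suffix filter described above.
// The format of image.txt etc. is assumed to be one suffix per line.

function parseSuffixList(fileContent) {
  return fileContent
    .split(/\r?\n/)
    .map(s => s.trim().toLowerCase())
    .filter(s => s.length > 0);
}

function shouldIndexMedia(url, suffixes) {
  // Only media files whose suffix appears in the list are indexed.
  const match = url.toLowerCase().match(/\.([a-z0-9]+)$/);
  return match !== null && suffixes.includes(match[1]);
}

const imageSuffixes = parseSuffixList("jpg\npng\ngif\n");
console.log(shouldIndexMedia("http://example.com/logo.png", imageSuffixes)); // true
console.log(shouldIndexMedia("http://example.com/logo.svg", imageSuffixes)); // false
```

Editing the suffix files for personal purposes then amounts to adding or removing lines from these lists.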
In order for images to be indexed, additionally a minimum width and height (H x V pixel) may be defined in the Admin settings. The image size will be observed for the following image types:
.bmp .gif .j2c .j2k .jp2 .jpc .jpeg .jpeg2000 .jpg .jpx .png .swc .tif .tiff .wbmp
Admin settings also allow selecting whether embedded and nested media files should be indexed. This was implemented because some servers hide their media files as embedded objects.
Another Admin setting is used to enable indexing of external hosted media content. When linked by the currently indexed page, also external media files will be indexed. This setting is independent from the Sites / Advanced Option setting 'Sphider can leave domain', which is used for text links only.
Depending on the installed GD library, during the index / re-index procedure Sphider-plus will create thumbnails for the following image types:
gif, png, jpg, jpeg, jif, jpe, gd, gd2 and wbmp
Details about the currently installed GD-library (as part of the PHP environment) and the supported image formats are available at:
Admin / Statistics / Server Info / Image funcs.
Thumbnails will be created as 'gif' or 'png' files. Selectable in the Admin settings, the gif files have a lower quality, but reduce the required memory by about 50%. Re-sampling the original images, the thumbnail size is limited to a maximum of 160 x 100 pixel. In the result listing these stored thumbnails are used as preview.
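Fitting an image into the 160 x 100 pixel thumbnail box can be sketched as below. The box size comes from the text; preserving the aspect ratio and never enlarging small images are assumptions of this sketch, not a statement about the actual GD calls used:

```javascript
// Sketch of fitting an image into the 160 x 100 pixel thumbnail box.
// Assumption: aspect ratio is preserved and small images are not upscaled.

function thumbSize(width, height, maxW = 160, maxH = 100) {
  const scale = Math.min(maxW / width, maxH / height, 1); // never upscale
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

console.log(thumbSize(800, 600));   // { width: 133, height: 100 }
console.log(thumbSize(1600, 400));  // { width: 160, height: 40 }
```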
Where available, the ID3 meta data and, for images, the EXIF information is indexed and thereby becomes searchable. In the admin backend, indexing and searching EXIF and ID3 info is separately selectable.
In order to create thumbnails and to index ID3 and EXIF information, it is necessary to download the media file. For pages with multiple media content, the time for the index / re-index procedure may increase dramatically.
As ID3 information is not available for all audio and video files, a minimum play time as an indexing criterion has not yet been implemented.
In order to save memory resources, Sphider-plus does not store the media content. Only the links, thumbnails and Meta information are stored.
The limit in Admin settings "Max. links to be followed for each Site" is not applied to media links. Only page links are counted, and the limitation is valid only for page links.
13.2 Not supported media content
The following examples demonstrate the currently existing limitations for media data that will not be indexed:
- If inserted in documents like pdf, doc, ppt, etc.
- If inserted in Java or applets and also direct applet implementations.
- Image maps that are server-side or client-side included.
13.3 Search for media content
The search mode is enabled by the checkbox
'Beside text results also show media results in result page'
in Admin / Settings / Search Settings
Once activated, the result listing for each keyword match will be separated into the 4 sections:
- text results
- image results
- audio results
- video results
Each section is marked with an according thumbnail. Result listing will present only those sections that contain results.
Each section will present the result number, the media title and the page address (link) at which the media was found. The text section will show the results as before, with highlighted keywords and surrounding text.
The image result section additionally presents a thumbnail, the image size (H x V pixel) and a link to EXIF information for each found image. Clicking on the thumbnails will open the original image in a new window / tab.
Video and audio results are presented with title, play time and a link to ID3 information. Media content will be opened with the associated software by clicking on the media title.
As the media sections are presented separately for each keyword match, an additional link called 'All media' is shown. Clicking here will force Sphider-plus to present all media results of the corresponding page (link). In order to return to the standard search mode, the section thumbnails can be clicked.
The search function will at first look for text results (keyword match) and receive the according pages (links). Afterwards, media files are searched on the pages defined by the text results. So, only those media results that also generate text results will be presented in the result listing.
To get all media results (independent of the text results) another search mode is available:
If in Admin / Settings / Search Settings the checkbox
'Advanced search? (Shows 'AND/OR/PHRASE/TOLERANT/MEDIA' etc.)'
is activated, the Search Form will present the additional checkbox
'Search only Media'
If this checkbox is activated, only media results will be presented in result listing, while possible text results will be ignored.
Media search follows the rules of pre-defined categories. If 'Search only in category xyz' is selected in Search Form, media results will be presented only as found in the particular category.
Search input for media queries is always interpreted as tolerant. So the query 'logo' will present results e.g. for the image 'sphider-logo.gif', while the input 'gif' will show all available gif files.
Additionally the AND, OR and TOLERANT modes are selectable for media search, while the PHRASE mode will be interpreted as an AND search.
The query 'media:' (without quotes) forces Sphider-plus to search for all media stored in its database. If used together with a category selection, all media content of the particular category will be presented.
If in the Search Form the checkbox 'Search only Media' is activated, the suggest framework will also present only media suggestions, taking into account any pre-selected limitation for category search.
In the admin backend, searching for media not only by 'title', but also by EXIF and ID3 info, is selectable.
An additional Admin setting in section 'Suggest options' allows selecting whether suggestions should also be taken from EXIF info and ID3 tags. Nevertheless, suggested keywords will always be the title of the media file.
For media search the Admin setting 'Enable distinct results for upper- and lower-case queries' is also taken into account.
Additionally there is an Admin setting called
'If found on different pages, index also duplicate media content'
If activated, all images, audio and video streams will be presented in the result listing. Otherwise only the first occurrence (page/link) will be presented.
5 different modes of sorting the media result listing are Admin selectable:
- By title (alphabetic)
- By file suffix
- By image size
- By 'Last queried'
- By 'Most popular'
13.4 Statistics for media content
In Admin / Statistics the following tables are available:
'Most Popular Media' presenting:
- Thumbnail
- Details like 'Title' and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
In order to open the media files, all tables contain active links.
Media results are also stored in 'Search log', and are presented like the keyword results with:
- Query
- Result count
- Queried at
- Time taken
- User IP
- User's country code
- User's host name
14. Feed support
14.1 XML product feeds
If activated in 'Spider Settings' menu of the admin backend, XML product feeds are indexed.
At present Sphider-plus supports:
Content API v2 (XML) like <condition>used</condition>
Currently not supported:
<product ID="1234">used</product>
<condition name="deliveryCosts">used</condition>
Content API v2 (JSON) like "condition" : "used"
Basic design and rules for product feeds are explained here.
But as there is no well-defined specification for product feeds, each user may define their own feed details for their specific requirements. Sphider-plus accepts various attribute names, and the number of attributes describing one product may vary between the feeds to be indexed.
The Sphider-plus file
.../include/common/xml_product_feeds.txt
contains the list of attributes to be indexed, each attribute on a separate line. Each line (attribute) may contain various names for the same product attribute, separated by commas.
As an example of the content for this file:
id, ID, product_id, produktid
title, name, product_name, Name, produktnavn
description, Beschreibung, beskrivelse
link, URL, productUrl, URL Hersteller, billedurl
link reseller, URL Wiederverkäufer, vareurl
image_link, image, images, imageUrl, Big_Image
availability, Verfügbarkeit, Lagerbestand, lagerantal
price, Neupreis, nypris
. . .
The content of this list is freely editable, so that individually (differently) created XML product feeds can be indexed together in one index procedure. In order to speed up the indexation, only the actually involved attribute names, and only the existing product attributes, should become part of this list.
It should also be taken into account that only the first attribute name (in each line) will be used in the search result listing of Sphider-plus.
So, even if an attribute name like 'product_id' is indexed (because the according product feed used it), search results will present 'id' as the name of this attribute.
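The synonym handling described above amounts to mapping every listed name onto the first (canonical) name of its line. A sketch of that lookup, with the file format taken from the example above and the helper name being hypothetical:

```javascript
// Sketch of the attribute-name mapping described above. Each line of
// xml_product_feeds.txt lists synonyms for one attribute; the first name
// is the canonical one shown in the search result listing.

function buildAttributeMap(fileContent) {
  const map = new Map();
  for (const line of fileContent.split(/\r?\n/)) {
    const names = line.split(",").map(s => s.trim()).filter(Boolean);
    if (names.length === 0) continue;
    const canonical = names[0];            // name used in search results
    for (const name of names) map.set(name, canonical);
  }
  return map;
}

const attrMap = buildAttributeMap(
  "id, ID, product_id, produktid\ntitle, name, product_name, Name"
);
console.log(attrMap.get("product_id")); // "id"
console.log(attrMap.get("Name"));       // "title"
```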
In order to control the result listing of product feeds, there are 2 relevant options selectable in 'Search Settings' menu of the admin backend:
First option:
For results of XML product feeds, present the text extract of 350 characters as part of product attributes.
If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products.
Second option:
Define maximum count of result hits per page, displayed in search results
This option should be set to a high value if multiple search results are to be expected in the feeds.
It should also be taken into consideration that a case-sensitive search is not supported for XML product feeds. The often individually molded product feeds would otherwise suppress several results.
If you like to search for prices like '379.00' Dollar, do not search for '379', because this query will also deliver all other results found as part of links containing the string '379'. Please follow the suggest framework, which offers '379.00' directly below the search form.
14.2 RDF, RSD, RSS and Atom feeds
To be activated in Admin / Settings / section 'Spider settings', the content of the following feeds will be indexed / re-indexed:
RDF (v.1.0)
RSD (v.1.0)
RSS (v.0.91 / v.0.92 / v.2.0)
Atom (v.1.0)
The feed content will be indexed, and the links found as part of the feeds will be followed. Before indexing the feeds, a validation check for well-formed XML is performed. Corresponding log output is generated to inform the admin.
Depending on the feed content, the following tags are indexed and thereby become searchable:
- For RDF and RSS feeds the following standard tags are processed:
Channeltags: 'title', 'link', 'description', 'language', 'copyright', 'managingEditor', 'webMaster', 'pubDate', 'lastBuildDate', 'category', 'generator', 'rating', 'docs'.
Itemtags: 'title', 'link', 'description', 'author', 'category', 'comments', 'enclosure', 'guid', 'pubDate', 'source'.
Textinputtags: 'title', 'description', 'name', 'link'.
Imagetags: 'title', 'url', 'link', 'width', 'height'.
Additional remark for RSS feeds: the optional sub-elements of the CATEGORY element (that identifies a categorization taxonomy) are currently not supported.
- For RDF feeds, the following individual tags are additionally processed:
Dublin Core tags: 'dc:', 'sy:', 'prn:'.
Personal channel tags: 'publisher', 'rights', 'date'.
Personal item tags: 'country', 'coverage', 'contributor', 'date', 'industry', 'language', 'publisher', 'state', 'subject'.
- For Atom feeds the following tags are processed:
Metatags: 'author', 'category', 'contributor', 'title', 'subtitle', 'link', 'id', 'published', 'updated', 'summary', 'rights', 'generator', 'icon','logo'.
Entrytags: 'author', 'category', 'contributor', 'title', 'link', 'id', 'published', 'updated', 'summary', 'rights'.
Authortags: 'name', 'uri', 'email'.
Contributortags: 'name', 'uri', 'email'.
Categorytags: 'term', 'scheme', 'label'.
Generatortags: 'uri', 'version'.
Additional remark for Atom feeds: SOURCE elements are currently not supported.
- For RSD feeds the following tags are processed:
Service tags: 'engineName', 'engineLink', 'homePageLink'.
API tags: 'name', 'apiLink', 'blogID'.
Settings: 'docs', 'notes'.
There is an Admin setting to strip CDATA tags, called 'Follow CDATA directives'. A blank checkbox in the Admin settings will ignore the CDATA directives in RSS and RDF feeds.
An additional Admin setting enables/disables, whether 'Dublin Core' and other individually marked tags in RDF feeds should be indexed.
Another Admin setting allows defining that the 'preferred' directive in RSD feeds should be followed. If activated in the Admin settings, only those API tags with 'preferred = true' will be indexed. If the checkbox remains blank, all API tags will be indexed, even if 'preferred = false' is encountered.
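The 'preferred' handling amounts to a simple filter over the parsed API entries. A sketch, where the api objects are a hypothetical simplification of the parsed RSD feed:

```javascript
// Sketch of the 'preferred' handling for RSD API tags. The api objects
// used here are a hypothetical simplification of the parsed feed.

function selectApis(apis, followPreferred) {
  // Checkbox set: index only APIs with preferred = "true".
  // Checkbox blank: index all APIs, regardless of the preferred attribute.
  return followPreferred ? apis.filter(a => a.preferred === "true") : apis;
}

const apis = [
  { name: "MetaWeblog", preferred: "false" },
  { name: "WordPress",  preferred: "true"  },
];
console.log(selectApis(apis, true).map(a => a.name));  // [ 'WordPress' ]
console.log(selectApis(apis, false).length);           // 2
```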
Feed links are treated like standard page links, so that the limit in Admin settings "Max. links to be followed for each Site" is influenced by feed links as well (they count toward the limit).
After indexing the feeds, they are treated like other (HTML) pages. The suggest framework will offer keyword proposals. Also pre-selection of categories is taken into account.
15. Cache for text and media queries
To be activated in Admin settings, section 'Search Settings', the cache will store the results of the 'Most Popular Queries'. Before connecting to the database, each query will first request the cache for results. If available, results are presented extremely fast. On the other hand, each query that needs to get its results from the database will automatically store its result into the cache.
Individual cache results are stored following the different Search selections (AND, OR, Phrase, Tolerant). Also individualized cache results are stored for each category and all-sites search requests.
Text and media queries use separate caches. The size of each cache is definable in the Admin settings [MByte]. On overflow of a cache, the least important result is deleted from the cache, while 'Most Popular Queries' is updated with each search input.
If in Admin settings the 'Debug mode' is enabled, cache activity is presented on top of the result listing in the form of status messages. Text cache and media cache could be manually cleaned in Admin 'Clean' section, also offering the count of files in each cache and the consumed memory space separately for each cache. Another selectable cache setting allows automatic cache reset, performed on 'Erase & Re-index' procedures.
Definition of the required cache size depends on personal preferences. There is a conflict between two opposed requirements: the cache should hold as many 'Most Popular Queries' as possible, but not consume too many resources by controlling hundreds of files in a big memory. For a first assumption, the size per result may be estimated at 2 KByte. Multiplied by the matches in the database (e.g. found in 20 pages), each result requires approximately 40 KByte of RAM. So, a cache of 2 MByte could hold the results for 40 to 50 'Most Popular Queries'. After some time of usage, it might be helpful to observe the information given in the 'Clean' section of Admin. The count of result files in the cache and the consumed memory space are presented. Depending on personal preferences, consumed result size and count of query hits in x pages, it might be necessary to adapt the size for the text and media cache.
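The capacity estimate above can be reproduced with simple arithmetic. The 2 KByte per result and the 20 matching pages come from the text; the helper name is hypothetical:

```javascript
// Reproducing the cache-capacity estimate from the text: roughly 2 KByte
// per result, multiplied by the number of pages matched per query.

function cachedQueryCapacity(cacheMBytes, kBytesPerResult, pagesPerQuery) {
  const kBytesPerQuery = kBytesPerResult * pagesPerQuery; // e.g. 2 * 20 = 40
  return Math.floor((cacheMBytes * 1024) / kBytesPerQuery);
}

console.log(cachedQueryCapacity(2, 2, 20)); // 51, i.e. 40-50 queries fit into 2 MByte
```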
16. Multiple database support
16.1 Overview
Starting with version 2.0, Sphider-plus offers the capability to cooperate with multiple databases. Currently prepared to work with up to five databases, the development was done under the following aims:
Independent allocation of different databases for the tasks:
- Admin
- Search user
- Suggest URL user
This offers the capability to assign the 'Search' user to database1 and let him use the search engine, while the Admin re-indexes database2. Adding new sites and indexing them into database2 is also performed by the Admin without disturbing the 'Search' user. Backup, restore and copy functions can likewise be performed by the Admin without influence on the availability of the search engine. Later on, the Admin may switch the 'Search' user to the updated database, or copy the fresh database content into the 'Search' user database.
With respect to the database, the Sphider-plus scripts automatically create individual settings. These settings may be individualized for each database with respect to the personal requirements.
As Sphider-plus also has to survive in shared hosting applications, there are some limitations for multiple database support:
- It is not possible to cooperate with a cluster of databases.
- Master/Slave replications are not supported, because the MySQL configuration file my.cnf is not accessible.
- Sharding by scaling data-tables is not supported.
- Dynamic allocation as a pro-shared assignment is not possible.
Sphider-plus Admin interface offers the management of multiple databases. There are different menus in section 'Database' as described below.
16.2 Definition and configuration
Sphider-plus version 2.0 (and following) does not require the install_all.php script any longer. Database assigning and table installation is integrated into the Admin interface.
The menu for database definition and configuration is protected by an additional login. Independent from the Admin login, a username and password is required to enter into this section. Username and password are defined in the file .../admin/auth_db.php. As per default download, username and password are both set to 'admin'.
When entering this section for the first time, there will be several warning messages. At minimum one database has to be defined by:
Name of database
Username
Password
Database host
Prefix for Tables
Pressing the 'Save' button will assign Sphider-plus to these database definitions. Nevertheless, the warning message 'Tables are not installed for database x' will remain in the Database settings overview.
The 'Install all tables for database x' is an independent procedure, which has to be invoked by the Admin after the database has been allocated. The chapter 'Enhancing functionality of multiple database support' describes the reason for these two independent steps.
If the database is allocated and the tables are installed, the message 'Database x settings are okay.' is displayed in the settings overview, showing the situation separately for each of the five databases.
If the application should work with only one or two databases, the settings for the non-required databases may remain blank. A corresponding message will be displayed:
Mysql server for database 3 is not available!
Trying to reconnect to database 3 . . .
Cannot connect to this database.
Never mind if you don't need it.
So the Admin may assign up to five databases, as required for the application. Assigning another (the next) database will only be possible if the settings for the previous database are okay and the tables are installed. Further database setting fields are suppressed.
16.3 Activate / Disable databases
Next step to get multiple databases to work will be the activation of the databases. This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration.
There are four settings available in the 'Activate / Disable' section:
- Select active database for Admin
- Select active database for 'Search' user
- Select all databases that should deliver search results
- Select active database for 'Suggest URL' user
These settings enable independent use of different databases for:
- Admin
- Search User
- Suggest URL User
'Select all databases that should deliver search results' offers the additional capability to fetch results from more than one database. In any case the active database for the 'Search' user will be activated to fetch results, as this database is defined to be the default user database. Searching for results in several databases is available for text and media search and all search modes, taking into account any pre-defined categories. Search results are logged with respect to the database that delivered them. Consequently, the table 'Most popular searches' at the bottom of the result listing offers results for the currently allocated databases, so that clicking on any of these most popular searches will again deliver results from one or several of the currently available databases.
If multiple sets of tables are available, because they have been created for a database before, you will be able to activate any of these table sets by selecting the corresponding prefix. The selection will be presented below the 'Store all selections' button for all databases containing more than one table set. The selected prefix will be commonly used for Admin, Search user, suggest framework etc.
If the table prefix is modified as described in 'Enhancing functionality of multiple database support', this modification is valid for all databases that are activated to deliver results. In other words, if the prefix is manually modified in .../templates/html/020_search-form.html, all databases that are used to deliver results need to contain table sets with the same prefix names.
Consequently the corresponding settings are activated with respect to the database and the activated set of tables.
After activating the databases for the different tasks, multiple database support is ready to use. The currently activated database and the prefix (name) of the actual selected table set (for the Admin) is displayed in 'Sites' table like:
Database 1 with table prefix 'search1_' - Displaying URLs 1 - 10 from 25
If the 'Debug' mode is activated in Admin settings, also the result listing will inform the user about the actual situation:
Results from database 2
When 'Store all selections' is activated to complete the database activation procedure, also the text cache and media cache will be cleared.
16.4 Backup & Restore of databases
This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration.
This section enables the Admin to create backups of the current state of a selectable database. Vice versa, the backup files may be restored into the database.
Backup files are compatible with the phpMyAdmin structure and contain the table prefix and date + time of creation as part of the file names. Backup files are stored in subfolders (.../admin/backup/dbx), separated for each database.
Backup files can only be restored into the database that was used to create them. The current content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.
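The naming convention can be sketched as follows. The exact file-name layout used by Sphider-plus is an assumption here; the point is that the table prefix and the date + time of creation are both encoded in the name, so a backup can later be matched to its source table set:

```python
from datetime import datetime

def backup_filename(prefix: str, when: datetime) -> str:
    # Hypothetical layout: <prefix>backup_<date>_<time>.sql
    return f"{prefix}backup_{when.strftime('%Y.%m.%d_%H.%M.%S')}.sql"

name = backup_filename("search1_", datetime(2024, 10, 4, 12, 30, 0))
# name == "search1_backup_2024.10.04_12.30.00.sql"
```

Because the prefix is part of the name, only tables with that prefix are touched when such a file is restored.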
16.5 Copy & Move
This section allows copying the content from one database to another. By selecting:
- Source database
- Destination database
- Define Copy or Move utility
it is possible to copy / move the content from one database to any other database. Besides the table content, both utilities will inevitably also copy the table prefix (of the source db) into the destination database. If tables with the same prefix already exist in the destination database, the content of these tables will be overwritten. Besides the table content, the corresponding thumbnails will also be copied.
In contrast to the 'Copy' utility, the 'Move' function additionally will clear the source database and delete the corresponding (source) thumbnails.
16.6 Enhancing functionality of multiple database support
1. 'Backup & Restore' as well as the 'Copy / Move' function will always work with all tables of a selected database. In contrast to these global actions, the 'Import / Export URL list' function acts only on the currently (for the Admin) activated table prefix. This allows a selective import and export of only those URLs used for the activated tables as defined by the prefix. The name of the exported URL list contains the (source) database number, the table prefix and the date of creation. Crossover usage of URL lists makes it possible to import any URL list (created from database x) into database y.
2. When configuring databases, it is strongly recommended to create and use prefixes for the tables. Table prefixes are the key to creating new sets of tables in each database. As described in chapter 'Definition and configuration', the tables need to be installed separately, after the configuration of the database has been saved. Once these settings are finished and the database is assigned, the Admin may use this database and index sites into the database tables with the given table prefix.
It is evident that one database could be configured with several table prefixes. That is the key for additional 'virtual' databases. By configuring the given database with a new table prefix, the Admin is able to install another set of tables into the same database. This set of tables (with the new prefix) may be used to index another set of sites into the same database. This is performed without destroying the content of the previously used tables.
3. The above allows adding quasi-additional databases without actually creating new databases. It was also mentioned before that Sphider-plus has to survive in 'Shared Hosting' applications. Consequently the Admin may assign one database to the 'Search' user.
But there is a feature integrated into Sphider-plus to bypass this restriction. Assume that the result listing should be offered in two (or even more) versions: for example in English and another language, one result listing for global users and another for registered users, one info result listing and one shopping result listing, etc.
To enable such a feature, the search form of Sphider-plus contains two hidden variables called 'db' and 'prefix':
<input type="hidden" name="db" value="$user_db" />
<input type="hidden" name="prefix" value="$user_prefix" />
As long as the variables are set to '0' (how to alter them, see below), the search script will use the settings as defined in the Admin settings:
"Select database for 'Search' user". This standard setting may be used for the first search form, offering the results of the first set of tables (which e.g. holds the English results). But for a second search form, the value for 'prefix' may be set to the name of another set of tables that holds the results of the second language. The setting of the second search form will (temporarily) override the Admin settings for its own result listing.
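Extending the hidden fields shown above, the second search form could then carry a fixed prefix value; the prefix name 'fr_' is a hypothetical example for a French table set:

```html
<input type="hidden" name="db" value="0" />
<input type="hidden" name="prefix" value="fr_" />
```

The first form keeps both values at '0' and thus follows the Admin settings.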
Selection of different sets of tables could be performed in the Database => Activate / Disable menu. If multiple sets of tables are available, because they have been created for a database before, you will be able to activate any of these table sets by selecting the corresponding prefix. The selection will be presented for all databases containing more than one table set.
The selected prefix will be commonly used for Admin, Search user, suggest framework etc.
Multiple database enhancements are assisted by the fact that Sphider-plus is supporting multiple settings. Each database and each set of tables contains an individual Admin setting.
The selected set of tables could be overridden individually for a search form by modifying the variable:
$user_prefix = "0";
Here the prefix name of the table set, which should be used instead of the default table set, needs to be entered.
The selected database could be overridden individually for a search form by modifying the variable:
$user_db = "0";
Here the number of the database, which should be used instead of the default db, needs to be entered.
Both variables are to be found, and must be changed if necessary, in the script .../search_ini.php
This implementation could be interpreted as a super-category feature, not requiring the selection of a category, or even a subcategory, by the 'Search' user. This does not mean that the normal category function is lost by use of multi-database support and its extended features.
Another useful application for multiple table sets would be the support of several languages. By indexing language specific sites into different sets of tables, the 2 hidden fields in the search form will define, which language is presented in result listing.
Usually the language for the user dialog is defined in the Admin backend and could be automatically adapted to the user's language by an additional Admin setting.
In order to overwrite these Admin settings individually for one search form and the according result listing, there is a variable placed in the script .../search_ini.php
$user_lng = "0";
As long as the value is set to "0", the Admin settings will be used. Entering "fr", "it", etc., will force this search form to use the languages French, Italian, etc.
17. Search in categories
17.1 Hierarchical structure
In order to prepare Sphider-plus for category search, the categories may be defined in Admin 'Categories' menu. Different categories (top level) as well as subcategories (Create new subcategory under . . .) are added here.
The second step to prepare Sphider-plus is assigning the sites to the different categories and sub-categories. This is to be found at Sites / Options / Edit / Category.
Assigning and even changing the category affiliation may also be done after the index procedure.
Third step is to activate one or both of the Admin settings:
- Show category selection in search form.
- If available, user may select 'More results of category . . . ' at each result in results listing.
The first setting will present all prior defined categories and their subcategories as part of the search form and the user may select one top-level category or a subcategory to limit the search results.
The Admin setting
'If available, user may select 'More results of category . . . ' at each result in results listing'
will present all additional categories that would also deliver results for the search query. Presented under the link URL, the user may click on any of these suggestions to automatically perform a new search in the suggested category.
The new result listing will present all results of that category and, if available, will also suggest subcategories that would be able to deliver results for the current query. Again the user may narrow down the result listing. Clicking on any suggested subcategory will again perform a search for the query, now in the selected subcategory.
As an additional headline of the result listing, the user will be informed about the source of the results, like:
Presented results are captured from category: abc
In order to return to the standard search without category-selection, the user only needs to activate the checkbox
'Search in all sites'
as part of the Search form.
17.2 Parallel structure
Starting with version 3.0, Sphider-plus offers the feature to restrict the search results by means of up to 5 categories simultaneously. The titles (names) for the 5 categories could be defined in the Settings menu of the Admin backend (see section 'General Settings').
If the title for category0 or category1 is defined with any value, the hierarchical category structure is disabled, and the search results will be restricted by the parallel category structure. After defining the names of all categories that should be used to restrict the search results, any of the 'Save' buttons in the 'Settings' menu of the admin backend needs to be activated. Afterward the category names are available for all Sphider-plus scripts.
Undefined categories, whose title is left blank, will be ignored by the search algorithm and will not be presented in the search form.
The relationship between URL and all involved categories is to be defined in a file, which is imported into Sphider-plus by means of the 'Import / export URL list' feature of the Admin backend.
As per default, the import file needs to be placed into
.../admin/urls/
as a subfolder of the Sphider-plus installation. If the import file is to be found in another folder, the regarding address needs to be edited in the script
.../admin/configset.php
The import file could be a pure text file or an Excel table. The name of the file is user-defined, but must contain at minimum one underscore. Examples:
2012.12.07_import.txt
My_friends.xml
More details regarding the import of parallel usable categories are described in the readme.pdf documentation.
One category variable could be used to build up a range selection (min. – max. values) in the search form. This might be useful for postal codes, etc. The search form will present 2 selection fields for $cat_sel0, which could be used by the user of the search engine to define the minimum and maximum values. All results matching this range will be presented in the result listing. Additionally, the categories $cat_sel1 - $cat_sel4 could be used to improve the search results.
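The range restriction can be sketched like this; the field names and data layout are illustrative, not the actual Sphider-plus implementation:

```python
def in_range(results, low, high):
    # Keep only results whose category0 value (e.g. a postal code)
    # falls within the user-selected min/max range.
    return [r for r in results if low <= r["cat0"] <= high]

rows = [{"url": "a.html", "cat0": 10115},
        {"url": "b.html", "cat0": 80331}]
hits = in_range(rows, 10000, 20000)   # only a.html matches
```

The remaining category variables would then be applied as plain equality filters on top of this range selection.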
18. User suggested sites
Reachable via a link at the bottom of result listing, a form is presented that allows users to suggest URLs to become indexed by the search engine. The user needs to enter:
- URL
- Title
- Short description
- Dispatcher e-mail account.
In order to prevent spam proposals, the form optionally will present a Captcha.
The admin of the search engine may
- approve
- reject
- ban
the suggested sites by means of a menu, presented in Admin backend. A corresponding e-mail is automatically generated and sent to the dispatcher.
All features of 'User suggested sites' are optional and could be defined as part of the Admin settings.
An additional option offers the function:
Suggested sites require authentication tags
If activated, all suggested sites will need an additional meta tag in their header. This authentication tag needs to be written as:
<meta name='Sphider-plus' content='1234'>
The content value (here e.g. 1234) is defined by the administrator of the search engine. As part of the approval form, an additional field needs to be filled in by the admin. So, individual values could be defined for each suggested site. The text of the automatically generated acknowledgment e-mail, sent to the dispatcher, is altered to:
Your suggestion was accepted by the system administrator and will be indexed shortly.
Please add the following tag into the header of the suggested site:
<meta name='Sphider-plus' content='1234'>
In order to enable indexing of your site, this tag is mandatory
and is tested periodically by the indexer of Sphider-plus.
We appreciate your help and effort in building this search engine.
This mail was automatically generated by Sphider-plus.
The meta tag needs to be implemented only into the suggested site. It is not necessary to add this tag into all pages of the site. Only the header of the suggested URL will be verified for existence of the tag and correct content value.
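The verification of the header tag can be sketched as follows. This is only a sketch: the actual Sphider-plus check may differ, and the regular expression here is a simplified assumption:

```python
import re

def has_auth_tag(html: str, expected: str) -> bool:
    # Look for <meta name='Sphider-plus' content='...'> in the page
    # and compare the content value with the one stored by the admin.
    m = re.search(
        r"<meta\s+name=['\"]Sphider-plus['\"]\s+content=['\"]([^'\"]+)['\"]",
        html, re.IGNORECASE)
    return bool(m) and m.group(1) == expected

page = "<html><head><meta name='Sphider-plus' content='1234'></head></html>"
# has_auth_tag(page, "1234") -> True; has_auth_tag(page, "9999") -> False
```

Only the page behind the suggested URL itself would be fetched and checked this way, matching the behavior described above.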
The authentication value may be altered by the admin of the search engine later on.
In
Sites view => Site Options => Edit
an additional input field is presented. If the value is left empty, the site will be indexed without verification of the header tag. The dispatcher will not be informed about any modification done by the admin.
The additional input field to enter/modify the authentication value is offered for all sites stored in the database of Sphider-plus, so that an authentication value could be added also subsequently by the admin.
If the tag is missing or contains an invalid authentication value, a corresponding warning message is created during the index procedure. The complete site with all its pages will be skipped by the index procedure, but the former content as well as the known links will remain part of the Sphider-plus database. This behavior offers the capability to reactivate the site later on.
19. Vulnerability protection
19.1 Prevent queries from Meta search engines and crawlers known to be evil
In order to reduce Internet traffic and server load, there are several settings available in the Admin backend in section 'Search Settings', called:
- Block all queries sent by harvester, bots and known evil user-agents
- Block all queries sent by Meta search engines like Google, MSN, Amazon, etc.
- Block all queries sent by known spammer (IPs) that persist in abusing forums and blogs
with their scams, ripoffs, exploits and other annoyances
- Block all queries, which could cause an XSS attack, shell execution, tag inclusion,
SQL injections, directory traversals, XSRF attacks, or a JavaScript execution
If activated, the corresponding search queries will be rejected.
The first option is controlled by the file
.../include/common/black_uas.txt
which holds a list of user agents (UAs) known to be evil. Here the well-known evil bot UAs are stored.
Additionally there is a list of well-known brave bots, which is stored in
.../include/common/white_uas.txt
If the user's UA is part of this white list, the comparison with all the blacklisted UAs will be skipped.
Meta search engines are identified by their IP. The IPs could be entered as single IPs, as well as IP ranges, into the file
.../include/common/black_ips.txt
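Matching a client IP against such a list can be sketched as below. Treating ranges as CIDR notation is an assumption for this sketch; the actual file format of black_ips.txt may differ:

```python
import ipaddress

def ip_blocked(ip: str, blacklist: list[str]) -> bool:
    # Entries may be single IPs ("203.0.113.9") or ranges written in
    # CIDR notation ("198.51.100.0/24"); both are matched here.
    addr = ipaddress.ip_address(ip)
    for entry in blacklist:
        if "/" in entry:
            if addr in ipaddress.ip_network(entry, strict=False):
                return True
        elif addr == ipaddress.ip_address(entry):
            return True
    return False

rules = ["203.0.113.9", "198.51.100.0/24"]
```

A query from a blocked IP would then simply be answered with 'No results found', as described below.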
Prevented queries are answered with the text 'No results found'.
However, if the option 'Enable Debug mode for User interface' is activated in the Admin backend, the IP (which caused the query) is also presented as a result. If an evil user agent sent the query, the client user agent string is presented.
19.2 Basic input validation against vulnerability attacks
The following protections are implemented:
- Prevent SQL-injections
- Prevent XSS-attacks
- Prevent Shell-executes
- Suppress JavaScript executions
- Suppress Tag inclusions
- Prevent Directory Traversal attacks
- Delete input if query contains any word of (editable) blacklist
- Prevent buffer overflow errors.
- Suppress JavaScript execution and tag inclusions masked as XSS attacks.
- Prevent C-function 'format-string' vulnerability.
As the protections against XSS attacks, shell execution and tag inclusion, as well as the suppression of JavaScript executions, reject some words in the search query, a special Admin setting is used to activate this protection. The setting is to be found in section "Search Settings" and is called:
Block all queries, which could cause an XSS attack, Shell execution, Tag inclusion, or a JavaScript execution
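A much simplified sketch of this kind of input validation follows; the patterns are illustrative only, the real filters in Sphider-plus are more elaborate:

```python
import re

# One pattern per attack class named above (sketch, not exhaustive).
SUSPICIOUS = [
    re.compile(r"<\s*script", re.IGNORECASE),                 # XSS / JavaScript
    re.compile(r"union\s+select|;\s*drop\s", re.IGNORECASE),  # SQL injection
    re.compile(r"\.\./"),                                     # directory traversal
    re.compile(r"[;&|`]\s*(cat|rm|wget)\b"),                  # shell execution
]

def query_is_clean(query: str) -> bool:
    return not any(p.search(query) for p in SUSPICIOUS)
```

A query failing such a check would be rejected before it ever reaches the database.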
19.3 Admin backend protection against remote access
In order to protect the admin backend of Sphider-plus against misuse, you may prevent usage of the admin backend for remote operation. This means:
'Log in' as admin will only be possible if the installation address (IP) of Sphider-plus and the IP used by the admin login request are equal.
In section 'General Settings' an option is available to enable/prevent remote access to the admin backend.
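The underlying check can be sketched as a simple IP comparison. This is a sketch only; the actual script would read both addresses from the server environment:

```python
def remote_login_allowed(client_ip: str, server_ip: str,
                         allow_remote: bool) -> bool:
    # With remote access disabled, the admin login is accepted only
    # when the request originates from the server's own address.
    return allow_remote or client_ip == server_ip

remote_login_allowed("203.0.113.5", "198.51.100.1", False)  # denied
```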
19.4 Log file reporting attempts to abuse Sphider-plus
Each attempt to intrude into the user interface is stored in a log file. If the corresponding option for vulnerability protection is activated in the admin backend, the following events will trigger a new entry in the log file:
- Queries sent by harvester, bots and known evil user-agents (UAS).
- Queries sent by Meta search engines like Google, MSN, Amazon, etc. (IPs).
- Queries sent by known spammer (IPs) that persist in abusing forums and blogs with their scams.
- Queries, which could cause an XSS attack, shell execution, tag inclusion,
SQL injection, directory traversal, XSRF attack, or a JavaScript execution.
- Attempts to flood the search form by too many queries per unit of time.
- Blocked Internet traffic of IPs, which already caused intrusion attempts (IDS).
The log file is available at the admin backend in menu 'Statistics' => 'Report log file', offering all details about each event. The log file could be flushed in menu 'Clear'.
There is an additional option called:
On occurrence, send e-mail report about above attempts to harm Sphider-plus
If this option is activated, each event will be reported by e-mail to the administrator account as defined at:
Administrator e-mail address
More details about vulnerability protection of Sphider-plus are available in the readme.pdf documentation.
20. Bound database
Entering a word into the search form will force Sphider-plus to scan the database for all links offering a result for this query. A limit to present only x results had already been integrated. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x results. This option will be most useful especially for huge databases holding the content of many links.
The setting in section 'Search Settings' called:
Define max. amount of results presented in result listing:
will define the limitation for the database volume. In order to activate this limitation the 'Clean' menu presents the option:
Bound database
When activated, all keyword / link relationships - stored during the index procedure - will be reduced to the above-defined amount. All excess relationships will be deleted from the database. Consequently, all further search enquiries will be answered much faster, because only the relevant amount of results is available.
It is up to the admin to define how many results are relevant for the application Sphider-plus is integrated into. The 'Top keywords' table, as part of the 'Statistics' menu, could be helpful to define the limit. Once the database is bounded, also this table will only show the 'bounded' available results.
The 'Bound database' option should not be invoked, if the chronological order of result listing is defined to 'By hit counts in full text' or to 'By index date', because limitation of the database is performed by weighting.
Some patience is required for this option. Once activated, the following steps need to be carried out:
- Get all keywords from database.
- Get all results from db for each keyword.
- Bound the results of each keyword to the defined limitation.
- Delete all possible results, exceeding the limitation (for each keyword).
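The four steps above can be sketched with an in-memory index; a dictionary stands in for the database tables, and the weights and limit are illustrative:

```python
# keyword -> list of (link, weight) pairs, as built during indexing
index = {
    "sphider": [("a.html", 0.9), ("b.html", 0.5), ("c.html", 0.7)],
    "search":  [("a.html", 0.4), ("d.html", 0.8)],
}

def bound_index(index, limit):
    # For each keyword: fetch its results, keep only the 'limit'
    # best-weighted entries and delete the rest.
    for word, links in index.items():
        links.sort(key=lambda lw: lw[1], reverse=True)
        del links[limit:]
    return index

bound_index(index, 2)   # "sphider" now keeps only a.html and c.html
```

This mirrors why the option conflicts with result ordering 'By hit counts in full text' or 'By index date': the surviving entries are chosen by weighting alone.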
21. Suggest framework
Sphider-plus offers an auto-complete function as part of the search form. Starting with version 3, the suggest framework was switched over from 'Prototype' to the JavaScript library 'jQuery'.
Suggestions are presented for single word queries, as well as for phrases, and also for media search. The suggest framework is configurable in Admin backend. As part of the 'Settings' menu in section 'Suggest Options', the following items are definable:
- Minimum count of query letters in order to get a suggestion.
- Search for suggestions in query log.
- Search for suggestions in keywords.
- Build suggestions also for 'Phrase' search.
- Obey the list of words not to be suggested.
- For 'Media' search get suggestions also from EXIF info and ID3 tags.
- Limit count of suggestions.
The option 'Obey the list of words not to be suggested' is based on a list of words placed in the file
.../include/common/suggest_not.txt
Any suggestion containing a word from this file will be suppressed; all other suggestions will be presented. Even if the stop word in the list is only part of a suggestion, the suggestion will also be suppressed. Example: If the list contains the word 'kinder', 'kindergarten' will also not be presented as a suggestion.
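The suppression rule, including the substring behavior from the 'kinder' example, can be sketched as:

```python
STOP = {"kinder"}   # words loaded from the suggest_not.txt stop list

def filter_suggestions(candidates):
    # A suggestion is suppressed when any stop word occurs anywhere
    # inside it, so 'kinder' also removes 'kindergarten'.
    return [c for c in candidates
            if not any(stop in c.lower() for stop in STOP)]

filter_suggestions(["kindergarten", "kinetics"])   # -> ["kinetics"]
```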
With respect to the different templates of Sphider-plus, the design of the suggest field is also adapted to the corresponding template. Responsible for this adaptation is the style sheet file
/templates/your_template/jquery-ui-1.10.2.custom.css
These style sheet files are based on the themes of jQuery. They are edited individually for the Sphider-plus templates 'Pure', 'Slade' and 'Sphider-plus'. Each of these style sheet files contains a link (see row 4) to the jQuery 'ThemeRoller'. Thus individual adaptations to personalized Sphider-plus templates are enabled by just editing the style sheet …/templates/your_template/jquery-ui-1.10.2.custom.css by means of the jQuery 'ThemeRoller'.
22. Integration of Sphider-plus into existing sites
There are 2 different ways of integrating Sphider-plus into existing sites:
1. Use layout and templates of Sphider-plus.
2. Embed the search engine into existing HTML code.
22.1 Integration into existing site by use of Sphider-plus templates
This mode is simply invoked by calling the script 'search.php' in the root folder of the Sphider-plus installation.
Assuming that the search engine is placed in a subfolder called 'sphider-plus', the according call would be something like:
http://www.abc.de/sphider-plus/search.php
Once called, the search engine will build up a complete HTML page with
- Headline
- Search form
- Result listing
- Footer
The design of this page is defined by one of the 3 templates delivered together with Sphider-plus. They are named:
- Pure (close to Google design)
- Slade (dark shadow design)
- Sphider-plus (default, as on this project page)
and are selectable in the Admin backend of Sphider-plus. Usually none of these 3 templates will fulfil the requirements of an existing site design. Consequently the (activated) style sheet
.../templates/Pure/thisstyle.css
needs to be individualized.
In order to create a new template it might be useful to copy one of the Sphider-plus template folders completely with all files, rename the folder with a new name and afterward edit the personal style sheet
.../templates/your_template_name/thisstyle.css
The new template will also be presented in the Admin settings, as one of the available selections.
22.2 Embed the search engine into existing HTML code
This mode extends the capabilities as described in the above chapter. There is an Admin setting called:
- Embed 'Search form' and 'Result listing' into an existing HTML page
If this checkbox is activated, the search engine will not create a complete HTML page, but needs to be embedded into an existing HTML code.
As the Sphider-plus scripts are based on charset UTF-8, the scripts for search form, result listing, etc. can only be embedded into UTF-8 coded HTML pages. Indexing content with other encodings is no problem, but embedding is possible only on the basis of charset UTF-8.
As described in above chapter, the script 'search.php' is used to embed the search engine into the existing page. In general the script 'search.php' only consists of several include directives.
The 'search.php' script contains complete documentation (comments) on how and where to use the different include directives of this script, so it need not be repeated here.
In any case the link
<link rel='stylesheet' href='http://abc.com/search/templates/Pure/userstyle.css' type='text/css'>
should be placed inside the HTML head tag, before any already existing stylesheet.css file is called. This ordering ensures that the existing CSS automatically overrides the Sphider-plus style sheet. Consequently, only the search-engine-specific settings need to be modified in the Sphider-plus userstyle.css.
In the above example, replace
http://abc.com/search/
with your personal address to the Sphider-plus installation folder on your server.
On the same subject, another Admin setting might also be helpful:
Name of search script
By means of this option, an individual script can be defined and used to control the search engine by containing the required 'includes'. Sphider-plus will automatically refer to this script when searching and presenting the results.
The Sphider-plus functions
- Intrusion Detection System
- User may suggest URLs form
- Admin backend
always use the template defined in the Sphider-plus Admin backend and (at present) work non-embedded.
22.3 The different style sheet files
Sphider-plus is delivered with two style sheet files:
- adminstyle.css
- userstyle.css
Both files are part of each template design (Pure, Slade and Sphider-plus). In order to adapt one of the three templates to an existing site design, only the userstyle.css file needs to be individualized. This way the Admin backend remains stable and executable even while the final design for the user interface is being developed.
.../templates/your_template/jquery-ui-1.10.2.custom.css
23. JSON, XML and RSS result output
Usually the result listing is presented as HTML output to the client that sent the query to the Sphider-plus scripts. Additional output is available as JSON and XML files, as well as an RSS feed. If requested by the search_ini.php script, the results are written as separate files into the subfolder .../xml/
Depending on the query type, the JSON, XML and RSS files contain text, media, or link results. The corresponding output files are called:
text_results.txt => JSON output file
text_results.xml => XML output file
text_results.rss => RSS output feed
media_results.txt
media_results.xml
media_results.rss
link_results.txt
link_results.xml
multiple_link_results.txt
multiple_link_results.xml
As media results can be images, audio streams, or videos, the media type is marked by a type tag in the output files.
The additional output files are activated by the variable $out, to be found in the script .../search_ini.php. In order to create the additional output files, this variable needs to be set to 'xml'.
The same script also offers the variable $xml_name (above set to 'result'), which may be used to define individual names for the output files, separately for each search form. The names are always completed by one of the prefixes
text, media, link, multiple_link
so that the complete name will become
text_your-choice.xml
media_your-choice.txt
etc.
When a new query is sent to Sphider-plus, first of all the old JSON, XML and RSS result files are deleted from the subfolder .../xml/. This is performed before the database is searched for possible results. When new results are found in the db, the search script stores them in new JSON, XML and RSS files and also presents them as HTML output to the client's browser.
Additionally, all JSON, XML and RSS output files are saved in the subfolder
.../xml/stored/
The file names in this folder additionally contain the date and time of storage, e.g.
_2014.03.02_04-47-10_PM_text_results.xml
The files in the subfolder .../xml/stored/ are not deleted automatically, but remain available until manually deleted via the Admin backend. This is performed in menu 'Clean' by means of the item:
Delete all files in 'XML' sub folder
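The timestamp prefix of the stored file names follows a fixed pattern, as the example in the next paragraph shows. A small sketch (the function name is invented; the format string is inferred from that example name) can recover the storage time from such a name:

```python
from datetime import datetime

def parse_stored_name(filename: str) -> datetime:
    """Extract the storage timestamp from a stored-file name such as
    '_2014.03.02_04-47-10_PM_text_results.xml'."""
    stamp = filename[:23]  # '_YYYY.MM.DD_HH-MM-SS_AM' is 23 characters long
    return datetime.strptime(stamp, "_%Y.%m.%d_%I-%M-%S_%p")

print(parse_stored_name("_2014.03.02_04-47-10_PM_text_results.xml"))
# → 2014-03-02 16:47:10
```

Sorting stored files by this timestamp makes it easy to decide which ones to delete manually.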
An XML media result file may look like:
<?xml version="1.0" encoding="utf-8" ?>
<media_results>
  <query>warp</query>
  <ip>127.0.0.1</ip>
  <host_name>www.007guard.com</host_name>
  <query_time>2014-03-02 04:47:10 PM</query_time>
  <consumed>0.043</consumed>
  <total_results>2</total_results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <url>http://www.abc.de/index.php</url>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
    <link>http://www.abc.de/my_music/warp.m4a</link>
    <title>Warp.m4a</title>
  </media_result>
</media_results>
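An XML media result file in this format can be consumed with standard XML tooling. The following sketch uses Python's standard library on a reduced, hypothetical sample that mimics the structure shown above:

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical sample in the media_results format shown above.
sample = """<?xml version="1.0" encoding="utf-8" ?>
<media_results>
  <query>warp</query>
  <total_results>2</total_results>
  <media_result>
    <num>1</num><type>image</type>
    <link>http://www.abc.de/images/warp.gif</link><title>warp.gif</title>
  </media_result>
  <media_result>
    <num>2</num><type>audio</type>
    <link>http://www.abc.de/my_music/warp.m4a</link><title>Warp.m4a</title>
  </media_result>
</media_results>"""

root = ET.fromstring(sample)
print(root.findtext("query"))          # → warp
for result in root.iter("media_result"):
    # the <type> tag distinguishes image, audio and video results
    print(result.findtext("type"), result.findtext("link"))
```

In production the string would instead be read from .../xml/media_results.xml (or the individually named file configured via $xml_name).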
A JSON text result file, in turn, may contain the following content:
{
  "query": "colour",
  "ip": "::1",
  "host_name": "guard_007-hoster",
  "query_time": "2016-03-18 10:56:23 AM",
  "consumed": 0.016,
  "total_results": 2,
  "num_of_results": 2,
  "from": 1,
  "to": 2,
  "text_results": [
    {
      "num": 1,
      "weight": "100.0",
      "url": "http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
      "title": " Info_eng",
      "fulltxt": " . . . Preamplifier and driver for connecting cable. HDTV receiver for digital TV and radio. DVB-S\/S2 and DVB-T, digital colour TV set with 24\" LED back light illuminated display. 7 Exterior Views 8 9 Interior Views Former, the cow and pig stable were placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ",
      "page_size": "5,661.6kb",
      "domain_name": "www.english.le-piaggie.info"
    },
    {
      "num": 2,
      "weight": "50.0",
      "url": "http:\/\/www.english.le-piaggie.info\/html\/description.html",
      "title": " Description",
      "fulltxt": " . . . cable. Additional active driven DVB-T antenna. HDTV receiver for digital TV and radio. DVB-S\/S2 and DVB-T digital colour TV set. First floor Upper floor Plot: Size: about 6,2 ha Hilly side, partly wooded View over the hills of the Chianti D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ",
      "page_size": "26.4 kb",
      "domain_name": "www.english.le-piaggie.info"
    }
  ]
}
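A JSON result file of this shape can be loaded with any JSON parser. The sketch below uses a reduced, hypothetical sample (note that JSON permits the \/ escape for forward slashes, which decodes to a plain /):

```python
import json

# Reduced, hypothetical sample of a text_results JSON file.
raw = ('{"query":"colour","total_results":1,'
       '"text_results":[{"num":1,"weight":"100.0",'
       '"url":"http:\\/\\/www.english.le-piaggie.info\\/Info_eng.pdf",'
       '"title":" Info_eng"}]}')

data = json.loads(raw)
print(data["query"])                   # → colour
for hit in data["text_results"]:
    # each hit carries its rank number, relevance weight and URL
    print(hit["num"], hit["weight"], hit["url"])
```

In production the string would be read from .../xml/text_results.txt (or the individually named file configured via $xml_name).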