This documentation describes the technology behind indexing of websites with scholarly articles in 考拉学术. It's written for webmasters who would like their papers included in 考拉学术 search results. Detailed technical information is helpful if you're trying to fix an error in indexing of your own website, or you need to make sure that your article hosting product is compatible with Google and 考拉学术 search services.
Individual Authors
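If you're an individual author, it works best to simply upload your paper to your website, e.g., www.example.edu/~professor/jpdr2009.pdf, and add a link to it on your publications page, such as www.example.edu/~professor/publications.html. Make sure that the PDF follows the layout conventions described in the indexing guidelines below: the title in the largest font at the top of the first page, the authors listed right before or right after the title, and a references section under a standard heading such as "References" or "Bibliography".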
That's it! Our search robots should normally find your paper and include it in 考拉学术 within several weeks.
If it doesn't work, you could either (1) read more detailed technical guidelines in this documentation or (2) check if your local institutional repository is already configured for indexing in 考拉学术, and upload your papers there.
University Repositories
If you're a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.
If you use a less common hosting product or service, or an older version of these, please read this entire documentation and make sure that your website meets our technical guidelines.
Journal Publishers
If you publish a small number of journals, consider using one of the established journal hosting services, e.g., Atypon, Highwire, Ingenta and Silverchair. Aggregators that host many journals on a single website, such as JSTOR or SciELO, often work too, but please check with your aggregator to make sure that they support full-text indexing in 考拉学术. Alternatively, if you have the technical expertise to manage your own website, we recommend the Open Journal Systems (OJS) software that's available for download from the Public Knowledge Project (PKP).
If you use a smaller journal hosting service, or if you maintain your own custom website, please read this entire documentation and make sure that your website meets our technical guidelines.
Content Guidelines
考拉学术 includes scholarly articles from a wide variety of sources in all fields of research, all languages, all countries, and over all time periods. Chances are that your collection of research papers will be a welcome addition to the index. To be considered for inclusion, the content of your website needs to meet two basic criteria.
1. Scholarly articles
The content hosted on your website must consist primarily of scholarly articles: journal papers, conference papers, technical reports, or their drafts, dissertations, pre-prints, post-prints, or abstracts. Content such as news or magazine articles, book reviews, and editorials is not appropriate for 考拉学术. Documents larger than 5MB, such as books and long dissertations, should be uploaded to Google Book Search; 考拉学术 automatically includes scholarly works from Google Book Search.
2. Showing abstracts
Users click through to your website to read your articles. To be included, your website must make either the full text of the articles or their complete author-written abstracts freely available and easy to see when users click on your URLs in Google search results. Your website must not require users (or search robots) to sign in, install special software, accept disclaimers, dismiss popup or interstitial advertisements, click on links or buttons, or scroll down the page before they can read the entire abstract of the paper. Sites that show login pages, error pages, or bare bibliographic data without abstracts will not be considered for inclusion and may be removed from 考拉学术.
Crawl Guidelines
考拉学术 uses automated software, known as "robots" or "crawlers", to fetch your files for inclusion in the search results. It operates similarly to regular Google search. Your website needs to be structured in a way that makes it possible to "crawl" it in this manner. In particular, automatic crawlers need to be able to discover and fetch the URLs of all your articles, as well as to periodically refresh their content from your website.
1. File formats
Your files need to be either in the HTML or in the PDF format. PDF files must have searchable text, i.e., you must be able to search for and find words in the document using Adobe Acrobat Reader.
Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.
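The two file rules above can be pre-checked before upload. The following Python sketch is only an illustration: the function name, the accepted extensions, and the reading of "5MB" as 5 * 1024 * 1024 bytes are assumptions, and it does not test whether a PDF has searchable text.

```python
from pathlib import PurePosixPath

# Assumption: "5MB" is read as 5 * 1024 * 1024 bytes; the guideline does not
# say whether it means binary or decimal megabytes.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".html", ".htm", ".pdf"}  # assumed file extensions


def file_problems(name: str, size_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the file passes."""
    problems = []
    suffix = PurePosixPath(name).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"too large: {size_bytes} bytes")
    return problems


print(file_problems("jpdr2009.pdf", 1_200_000))
print(file_problems("thesis.docx", 9_000_000))
```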
2. Browse interface
A browse interface is necessary for the search robots to discover the URLs of your articles. We recommend that the URL of every article be reachable from the homepage by following at most ten simple HTML links. Here are several common ways to organize a website that make it easy for the search robots to find and index all of the articles.
If you're hosting a small collection of publications, such as papers written by a single author or a small group, then we recommend that you list all articles on a single HTML page, such as www.example.edu/~professor/publications.html, and include links to their full text in the PDF format.
If your website has thousands of papers or more, the best way to make sure they're all discovered by the search robots is to provide a way to list them by the date of publication or the date of record entry. Other forms of browse interfaces, such as browse by author or by keyword, often generate more URLs than your website can deliver to the search robots in a reasonable amount of time.
For websites with more than a hundred thousand papers, we recommend that you create an additional browse interface that lists only the articles added in the last two weeks. This smaller set of webpages can be recrawled more frequently than your entire browse interface, which will facilitate timely coverage of your recent papers by the search robots.
Keep in mind that the use of Flash, JavaScript, or form-based navigation makes it hard for our automated system to find your articles. If your website uses these types of navigation, please also add a "browse by date" interface that uses only simple HTML GET links.
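One way to test the "at most ten simple HTML links" recommendation is a breadth-first search over your site's link graph. The sketch below assumes you have already extracted the plain HTML links into a dictionary; the site layout and all paths are hypothetical, and this is not how the search robots themselves are implemented.

```python
from collections import deque

MAX_CLICKS = 10  # "at most ten simple HTML links" from the homepage


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search: fewest link clicks from `home` to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def hard_to_reach(links, home, articles):
    """Articles that are unreachable or more than MAX_CLICKS away."""
    depths = click_depths(links, home)
    return [a for a in articles if depths.get(a, MAX_CLICKS + 1) > MAX_CLICKS]


# Hypothetical site: homepage -> browse-by-year page -> article pages.
site = {
    "/": ["/browse/2009"],
    "/browse/2009": ["/papers/a.html", "/papers/b.html"],
}
print(click_depths(site, "/"))
print(hard_to_reach(site, "/", ["/papers/a.html", "/papers/orphan.html"]))
```

Articles reported by hard_to_reach are candidates for a "browse by date" page that links to them directly.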
3. Website availability
Since Google refers users to your website to read the papers, your webpages must be available to both users and crawlers at all times. The search robots will visit your webpages periodically in order to pick up the updates, as well as to ensure that your URLs are still available. If the search robots are unable to fetch your webpages, e.g., due to server errors, misconfiguration, or an overly slow response from your website, then some or all of your articles could drop out of Google and 考拉学术.
4. Robots exclusion protocol
If your website uses a robots.txt file, e.g., www.example.com/robots.txt, then it must not block Google's search robots from accessing your articles or your browse URLs. Conversely, it should block robots from accessing large dynamically generated spaces that aren't useful in the discovery of your articles, such as shopping carts, comment forms, or results of your own keyword search.
E.g., to let Google's robots access all URLs on your site, add the following section to your robots.txt:
User-agent: Googlebot
Allow: /
Or, to block all robots from adding articles to your shopping cart, add the following:
User-agent: *
Disallow: /add_cart.php
Refer to http://www.robotstxt.org/ for more information about robots.txt files.
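Before deploying a robots.txt file, you can try the two sections above against sample URLs. This sketch uses Python's standard urllib.robotparser module; the helper name and the article URLs are hypothetical, and Python's parser may differ from Google's robots in edge cases, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text in memory and test one URL for one robot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allow_google = "User-agent: Googlebot\nAllow: /\n"
block_cart = "User-agent: *\nDisallow: /add_cart.php\n"

# Hypothetical URLs on www.example.com.
article = "http://www.example.com/papers/jpdr2009.pdf"
cart = "http://www.example.com/add_cart.php"

print(allowed(allow_google, "Googlebot", article))  # article is crawlable
print(allowed(block_cart, "Googlebot", cart))       # shopping cart is blocked
print(allowed(block_cart, "Googlebot", article))    # article is still crawlable
```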
Indexing Guidelines
考拉学术 uses automated software, known as "parsers", to identify bibliographic data of your papers, as well as references between the papers. Incorrect identification of bibliographic data or references will lead to poor indexing of your site. Some documents may not be included at all, some may be included with incorrect author names or titles, and some may rank lower in the search results, because their (incorrect) bibliographic data would not match (correct) references to them from other papers. To avoid such problems, you need to provide bibliographic data and references in a way that automated "parser" software can process.
1. Preparing article URLs
Place each article and each abstract in a separate HTML or PDF file. At this time, we're unable to effectively index multiple abstracts on the same webpage or multiple papers in the same PDF file. Likewise, we're unable to index different sections of the same paper in different files. Each paper must have its own unique URL in order for it to be included in 考拉学术.
2. Configuring the meta-tags
If you're using repository or journal management software, such as Eprints, DSpace, Digital Commons or OJS, please configure it to export bibliographic data in HTML "<meta>" tags. 考拉学术 supports Highwire Press tags (e.g., citation_title), Eprints tags (e.g., eprints.title), BE Press tags (e.g., bepress_citation_title), and PRISM tags (e.g., prism.title). Use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers because Dublin Core doesn't have unambiguous fields for journal title, volume, issue, and page numbers. To check that these tags are present, visit several abstracts and view their HTML source.
The title tag, e.g., citation_title or DC.title, must contain the title of the paper. Don't use it for the title of the journal or a book in which the paper was published, or for the name of your repository. This tag is required for inclusion in 考拉学术.
The author tag, e.g., citation_author or DC.creator, must contain the authors (and only the actual authors) of the paper. Don't use it for the author of the website or for contributors other than authors, e.g., thesis advisors. Author names can be listed either as "Smith, John" or as "John Smith". Put each author name in a separate tag and omit all affiliations, degrees, certifications, etc., from this field. At least one author tag is required for inclusion in 考拉学术.
The publication date tag, e.g., citation_publication_date or DC.issued, must contain the date of publication, i.e., the date that would normally be cited in references to this paper from other papers. Don't use it for the date of entry into the repository - that should go into citation_online_date instead. Provide full dates in the "2010/5/12" format if available; or a year alone otherwise. This tag is required for inclusion in 考拉学术.
For journal and conference papers, provide the remaining bibliographic citation data in the following tags: citation_journal_title or citation_conference_title, citation_issn, citation_isbn, citation_volume, citation_issue, citation_firstpage, and citation_lastpage. Dublin Core equivalents are DC.relation.ispartof for journal and conference titles and the non-standard tags DC.citation.volume, DC.citation.issue, DC.citation.spage (start page), and DC.citation.epage (end page) for the remaining fields. Regardless of the scheme chosen, these fields must contain sufficient information to identify a reference to this paper from another document, which is normally all of: (a) journal or conference name, (b) volume and issue numbers, if applicable, and (c) the number of the first page of the paper in the volume (or issue) in question.
For theses, dissertations, and technical reports, provide the remaining bibliographic citation data in the following tags: citation_dissertation_institution, citation_technical_report_institution or DC.publisher for the name of the institution and citation_technical_report_number for the number of the technical report. As with journal and conference papers, you need to provide sufficient information to recognize a formal citation to this document from another article.
For all document types, the guiding principle is to present your article as it would normally be cited in the "References" section of another paper. E.g., citations to technical reports normally include their assigned numbers, so the number of the report should be present in some appropriate field. Likewise, the name of the journal should be written as "Transactions on Magic Realism" or "Trans. Mag. Real.", not as "Magic Realism, Transactions on" or "T12". Omission or unusual presentation of key bibliographic fields can lead to mis-identification of your articles.
All tag values are HTML attributes, so you must escape special characters appropriately. E.g., <meta name="citation_title" content="&quot;Andar com meus sapatos&quot; - uma análise crítica">. There's no need to escape characters that are written directly in your webpage's character encoding, such as Latin diacritics on a page in ISO-8859-1. However, you must still escape the quotes and the angle brackets.
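If your pages are generated by a script, the escaping can be delegated to a library routine. A minimal Python sketch, assuming a hypothetical helper named meta_tag; html.escape with quote=True escapes the ampersand, the angle brackets, and both kinds of quotes, and leaves diacritics untouched.

```python
from html import escape


def meta_tag(name: str, value: str) -> str:
    """Render one <meta> tag with both attribute values HTML-escaped."""
    return (
        f'<meta name="{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
    )


print(meta_tag("citation_title", '"Andar com meus sapatos" - uma análise crítica'))
```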
The "<meta>" tags normally apply only to the exact page on which they're provided. If this page shows only the abstract of the paper and you have the full text in a separate file, e.g., in the PDF format, please specify the locations of all full text versions using citation_pdf_url or DC.identifier tags. The content of the tag is the absolute URL of the PDF file; for security reasons, it must refer to a file in the same subdirectory as the HTML abstract.
Failure to link the alternate versions together could result in the incorrect indexing of the PDF files, because these files would be processed as separate documents without the information contained in the meta tags.
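The location rule for citation_pdf_url can be checked mechanically. The sketch below takes a strict reading of "the same subdirectory", namely the same scheme, host, and directory path as the abstract page; that reading, the function name, and the URLs are assumptions for illustration.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def same_directory(abstract_url: str, pdf_url: str) -> bool:
    """True if pdf_url is absolute and shares scheme, host, and directory."""
    a, p = urlsplit(abstract_url), urlsplit(pdf_url)
    is_absolute = bool(p.scheme and p.netloc)
    return (
        is_absolute
        and (a.scheme, a.netloc.lower()) == (p.scheme, p.netloc.lower())
        and dirname(a.path) == dirname(p.path)
    )


abstract = "http://www.example.com/papers/jpdr2009.html"  # hypothetical
print(same_directory(abstract, "http://www.example.com/papers/jpdr2009.pdf"))
print(same_directory(abstract, "http://cdn.example.net/papers/jpdr2009.pdf"))
print(same_directory(abstract, "jpdr2009.pdf"))  # relative URL, not absolute
```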
Example:
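As a hedged illustration, the Python sketch below embeds a hypothetical abstract page and checks it for the title, author, and publication date tags that the descriptions above mark as required. Every value in the sample page is made up, only the Highwire Press and Dublin Core tag names quoted in the text are recognized, and this is not an official validator.

```python
from html.parser import HTMLParser

# Accepted tag names per required field, limited to names quoted in the text.
REQUIRED = {
    "title": {"citation_title", "dc.title"},
    "author": {"citation_author", "dc.creator"},
    "date": {"citation_publication_date", "dc.issued"},
}


class MetaCollector(HTMLParser):
    """Collect (lowercased name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") and attr.get("content"):
                self.tags.append((attr["name"].lower(), attr["content"]))


def missing_fields(html_text: str) -> list[str]:
    """Return the required fields that have no matching <meta> tag."""
    collector = MetaCollector()
    collector.feed(html_text)
    present = {name for name, _ in collector.tags}
    return [field for field, names in REQUIRED.items() if not names & present]


# Hypothetical abstract page; all values are placeholders.
page = """
<html><head>
<meta name="citation_title" content="A sample paper title">
<meta name="citation_author" content="Smith, John">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_publication_date" content="2010/5/12">
<meta name="citation_pdf_url" content="http://www.example.com/papers/sample.pdf">
</head><body></body></html>
"""
print(missing_fields(page))
print(missing_fields('<meta name="citation_title" content="Only a title">'))
```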
Keep in mind that, regardless of the meta-tag scheme chosen, you need to provide at least three fields: (1) the title of the article, (2) the full name of at least the first author, and (3) the year of publication. Pages that don't provide any one of these three fields will be processed as if they had no meta tags at all. Likewise, all PDF files will be processed as if they had no meta tags at all, unless they're linked from the corresponding HTML abstracts using citation_pdf_url or DC.identifier tags. It works best to provide the meta-tags for all versions of your paper, not just for one of the versions.
2.a. Indexing of content without the meta-tags
If it's not practical for you to implement the HTML "<meta>" tags, e.g., if your papers are only available in the PDF format, then the document needs to be visually laid out according to the following conventions.
The title of the paper must be the largest chunk of text on top of the page. Either use a font size of at least 24 pt. in PDF, or place the title inside an "<h1>" or an "<h2>" tag in HTML, or use a CSS class named "citation_title". Please use the same font for the entire title. Make sure that all other text on the page, in particular the name of the repository or the journal, is set in a smaller font than the title of the paper - otherwise, this other, larger, text may be incorrectly interpreted as the title of the paper.
The authors of the paper must be listed right before or right after the title, in a slightly smaller font that is still larger than normal text. Either use a 16-23 pt. font in PDF, or place the authors inside an "<h3>" tag in HTML, or wrap them in a CSS class named "citation_author". Please use the same font for all author names. Make sure the names of the repository and the journal, as well as the text of the section headings, are set in a smaller font than the authors of the paper - otherwise, this other, larger, text may be incorrectly interpreted as the authors. Use "Sentence case" as opposed to "Title Case" for section headings and similar text, to avoid confusion with author names. Separate multiple author names with commas or semicolons and omit their affiliations, degrees, and certifications from the author line. Use an explicit format such as "by John Smith" or "Author: John Smith", if appropriate.
Include a bibliographic citation to a published version of the paper on a line by itself, and place it inside the header or the footer of the first page in the PDF file, or next to the title and the authors in HTML. Use an explicit citation format, e.g.: "J. Biol. Chem., vol. 234, no. 8, pp. 1971-1975, August 1959". If the paper is unpublished, include the full date of its present version on a line by itself, e.g., "August 12, 2009".
Avoid use of Type 3 fonts in PDF files, because they're often generated with missing or incorrect font size and character encoding information, which makes it difficult for our parser software to extract the bibliographic data. You can check the types of the fonts under the File -> Properties... menu in Adobe Acrobat Reader. If you're using LaTeX, consider switching to Type 1 fonts, e.g., \usepackage{times}, \usepackage{helvet}, or \usepackage{palatino}.
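The font-size conventions for PDF first pages can be expressed as a small checklist. The sketch below assumes you have already obtained the first-page text spans and their font sizes by some other means (extraction from the PDF is out of scope here); the function, the data layout, and the sample values are hypothetical, and the actual parser software may work quite differently.

```python
TITLE_MIN_PT = 24           # "at least 24 pt." for the title in PDF
AUTHOR_RANGE_PT = (16, 23)  # "16-23 pt." for the author line in PDF


def layout_problems(spans, title, authors):
    """Check first-page spans, given as (text, font size in points) pairs."""
    sizes = dict(spans)
    title_pt, author_pt = sizes.get(title), sizes.get(authors)
    if title_pt is None or author_pt is None:
        return ["title or author line not found on the first page"]
    problems = []
    if title_pt < TITLE_MIN_PT:
        problems.append(f"title is {title_pt} pt, below {TITLE_MIN_PT} pt")
    if not AUTHOR_RANGE_PT[0] <= author_pt <= AUTHOR_RANGE_PT[1]:
        problems.append(f"author line is {author_pt} pt, outside 16-23 pt")
    # All other text should be set smaller than the author line.
    for text, pt in spans:
        if text not in (title, authors) and pt >= author_pt:
            problems.append(f"{text!r} ({pt} pt) is not smaller than the authors")
    return problems


# Hypothetical first pages.
good = [("Journal of Examples", 12), ("A sample paper title", 24),
        ("John Smith, Jane Doe", 18), ("1. Introduction", 14)]
bad = [("Example Repository", 30), ("A sample paper title", 20),
       ("John Smith", 12)]
print(layout_problems(good, "A sample paper title", "John Smith, Jane Doe"))
print(layout_problems(bad, "A sample paper title", "John Smith"))
```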
Please understand that it's not possible for our automated parsers to correctly identify bibliographic data in such loosely defined formats with 100% accuracy; and that failure to correctly identify certain fields can lead to exclusion of your papers from 考拉学术. If you're not satisfied with the accuracy of your 考拉学术 results, you need to create HTML pages with abstracts and add the "<meta>" tags to them, as described above.
3. Marking the references
Mark the section of the paper that contains references to other works with a standard heading, such as "References" or "Bibliography", on a line just by itself. Individual references inside this section should be either numbered "1. - 2. - 3." or "[1] - [2] - [3]" in PDF, or put inside an "<ol>" list in HTML. The text of each reference must be a formal bibliographic citation in a commonly used format, without free-form commentary.
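To see whether a plain-text rendering of your paper follows these conventions, you can look for the heading and the numbering with two regular expressions. This is a rough sketch under stated assumptions: the patterns, the case-insensitive heading match, and the placeholder citations are all hypothetical, and the real parser is certainly more elaborate.

```python
import re

# A standard heading on a line by itself (case-insensitivity is an assumption).
HEADING = re.compile(r"^\s*(References|Bibliography)\s*$", re.IGNORECASE)
# Entries numbered "[1]" or "1." followed by the citation text.
NUMBERED = re.compile(r"^\s*(?:\[(\d+)\]|(\d+)\.)\s+\S")


def numbered_references(text: str) -> list[int]:
    """Return the reference numbers found after the first heading line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if HEADING.match(line):
            numbers = []
            for entry in lines[i + 1:]:
                m = NUMBERED.match(entry)
                if m:
                    numbers.append(int(m.group(1) or m.group(2)))
            return numbers
    return []  # no heading on a line by itself


# Placeholder text; the citations are made up for illustration.
sample = """Concluding remarks end here.
References
[1] J. Smith. A sample paper title. J. Example, vol. 1, no. 2, pp. 3-4, 2009.
[2] J. Doe. Another sample title. Proc. Example Conf., pp. 5-6, 2008.
"""
print(numbered_references(sample))
print(numbered_references("See the references below.\n1. Unmarked entry"))
```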
Please understand that the references are identified automatically by the parser software; they're not entered or corrected by human operators. While we try to support the most common reference formats, it is not possible to guarantee that all references are identified correctly; and incorrect identification of references could lead to exclusion of your papers from 考拉学术 or to low ranking of your papers in the search results.
Troubleshooting
To check if a particular paper is included in 考拉学术, search 考拉学术 for its title. To check the coverage of your website in 考拉学术, search for titles of several dozen papers and see if these papers are included. If you can't find many of the papers in 考拉学术, there's probably a problem with the indexing of your website; please read the troubleshooting tips below.
Keep in mind that changes that you make on your website will usually not be reflected in 考拉学术 search results for some time. New papers are normally added several times a week; however, updates of papers that are already included usually take 6-9 months. Updates of papers on very large websites may take several years, because to update a site, we need to recrawl it - the time it takes to recrawl a large site is usually limited by the speed at which the target website is able to deliver content to the search robots.
Keep in mind that the result count of the "site:" operator is not a good indicator of coverage of your website in 考拉学术. First, this operator currently only searches primary versions of the papers. If you're not the primary publisher, some of the papers that you host may not be counted. Second, the result count is usually estimated based on searching a small fraction of the index; its purpose is to help users refine their queries, not to check coverage. As a result, this estimate may not be accurate. If you're alarmed that the result count for your site is low, please confirm the problem with a more detailed check. We recommend trying to find several dozen sample papers using search by title.
Check that your webpages follow our content guidelines. Only scholarly papers are appropriate for inclusion in 考拉学术; each paper needs to be listed on a separate URL; and at least the full author-written abstract must be clearly visible on the URL that you wish to be included in 考拉学术 search results. Failure to follow these guidelines could lead to exclusion of your content from 考拉学术.
Provided that the content guidelines are met, the most common cause of indexing problems is incorrect extraction of bibliographic data by the automated parser software. To diagnose such issues, find a sample of included papers by searching 考拉学术 for [site:example.com], go to page ten or later of the search results, and check if the titles and the authors of your papers are listed correctly. If you see very few results, and if their listed titles or authors are mostly incorrect, then chances are that this is, indeed, a parsing problem. E.g., if the name of your journal or repository is erroneously listed as the title of your papers, then there's a good chance that many of your papers aren't included at all, because documents with the same title are often considered duplicates.
The best way to fix incorrect bibliographic data is to provide it in a computer-readable form in the meta tags, as described in the indexing guidelines. Keep in mind that, since these papers are already included in 考拉学术, updating their bibliographic data will usually take 6-9 months from the time you provide it on your website.
If the bibliographic data listed in 考拉学术 is mostly correct, but you still can't find many of your papers there, then it could be a problem with the crawl. Check that your website allows indexing by Google search robots. Can you click through from the homepage to the articles using only simple HTML links? Does your robots.txt file block the articles or the browse pages from being crawled? Is your website unusually slow in responding to crawlers? Does it limit the crawl speed or frequently return errors to the search robots? Does it have a lot of navigation, search, shopping and other URLs that aren't papers and don't help with discovering papers? Any of these issues can lead to slow updates of your content or to removal of your papers from 考拉学术.
Please check the crawl guidelines regarding the technical requirements and the possible solutions. Once you update your website, it can take anywhere from a few days to 6-9 months for these changes to be reflected in 考拉学术 search results.
If you believe there's an error in indexing of your own website due to a technical issue on our side, please contact us with the details. Be sure to include specific example URLs of articles that are not included or not indexed correctly.
Keep in mind that we're unable to make exceptions to any of the stated guidelines; or to assist you with indexing of third-party websites; or to offer website management or compatibility testing services. Indexing of a website in 考拉学术 works best when its webmaster or hosting provider implements our technical guidelines and performs the necessary testing.
While we hope that the text of your papers has a substantial element of novelty and interest, we recommend that, when it comes to their indexing, you boldly go where others have gone before. Conventional formatting of documents and their bibliographic data, as well as the use of common journal and repository management software, go a long way towards ensuring that your papers are all covered and ranked appropriately in 考拉学术.
Common Questions
This documentation is long and dense. Do I have to read it?
Probably not; it usually just works. You need to read the documentation if either (a) you're trying to fix an error in indexing of your own website, or (b) you need to make sure that your article hosting product is compatible with Google and 考拉学术 search services.
Can you test my website and then fix whatever problem there is?
Sorry, chances are that this will require changes at your end. We can't change your website - you'll need to ask your webmaster to do that.
Can you test my website and then tell me what I need to fix?
Sorry, we're unable to provide testing services. That's up to your webmaster or the provider of your journal hosting service.
How do I know what I need to fix?
You can read this entire documentation (sigh) and then test that your website meets all of the guidelines - content guidelines, crawl guidelines, and indexing guidelines. See the troubleshooting section for the recommended sequence of tests.
This is too much work! Isn't there an easier way?
You could use a software package or a hosting service that has already implemented these guidelines. See the overview for some suggestions, both paid and free.
My website doesn't meet one of the guidelines. Can you relax this requirement for me?
No, all of the guidelines in this documentation are necessary to index your content effectively. If you need technical assistance with meeting crawl and indexing guidelines, we recommend that you use a software package or a hosting service that has already implemented them. If you can't show abstracts, or if your content is not a good fit for 考拉学术, then sorry, we aren't able to include it.
Which meta tag do I use for the abstract?
Per content guidelines, the abstract needs to be visible to the user. Meta tags are only visible to the search robots, not to the user. You can display the abstract in any reasonable way, e.g., as a paragraph of text with a heading that says "Abstract". Please make sure the abstract is visible to users without requiring them to scroll down, click buttons, dismiss popup advertisements, etc.