Band Assignment Parser
Paul D. Dobson and Ewan W. Blanch
The University of Manchester, UK


1. How do I use this?
2. How does it work?
3. What's with the blue and yellow?
4. Why do we prefer HTML?
5. Why can't you accept a web address for a paper?
6. What are the additional terms I can add?
7. Why are the links dead in the HTML results?
8. Where are the images?
9. How do you convert PDF files?
10. Does the PDF converter work a bit differently?


1. How do I use this?
If you have a manuscript in either HTML or PDF format, upload it using the appropriate box above and let us strip out the sentences that contain band assignments. We can't guarantee to catch them all, or that every sentence we catch contains a band assignment, but it's a lot more efficient than reading the whole thing! We'll give you some statistics to support this soon. (top)

2. How does it work?
The mechanism by which sentences are extracted is very simple. Spectroscopists are reliable when it comes to reporting units, so we make the basic assumption that a sentence with no reference to a wavenumber (which we detect through the presence of its units) is unlikely to contain a band assignment. (top)
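
As a rough illustration of the idea, and not the code the parser actually runs, the filter can be sketched in a few lines of Python. The unit spellings in UNIT_PATTERN, the naive sentence splitter and the function names are all assumptions made for this example.

```python
import re

# Assumed spellings of the wavenumber unit (cm-1, cm −1, cm⁻¹, cm^-1).
# The list of notations the real tool recognises is not documented here.
UNIT_PATTERN = re.compile(r"cm\s?(?:-1|−1|⁻¹|\^-1)")

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def sentences_with_units(text):
    """Keep only the sentences that mention a wavenumber unit."""
    return [s for s in split_sentences(text) if UNIT_PATTERN.search(s)]


if __name__ == "__main__":
    sample = (
        "The amide I band at 1650 cm-1 is assigned to alpha-helix. "
        "Samples were prepared in water. "
        "A band at 1245 cm⁻¹ was also seen."
    )
    for sentence in sentences_with_units(sample):
        print(sentence)
```

Sentences without a unit are simply dropped, which is why a band assignment reported without units would be missed.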

3. What's with the blue and yellow?
The yellow sentence is the one that contains the unit notation. The blue sentences are those that appear before and after the yellow sentence. They are included to help give the hit sentence context. (top)
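
The colouring can be pictured as pairing each hit with its neighbours. The Python sketch below assumes one sentence of context on each side and a hypothetical UNIT_PATTERN; the real parser may choose its context differently.

```python
import re

# Assumed spellings of the wavenumber unit; illustrative only.
UNIT_PATTERN = re.compile(r"cm\s?(?:-1|−1|⁻¹|\^-1)")


def hits_with_context(sentences):
    """Return (before, hit, after) triples for every sentence with a unit.

    'hit' plays the role of the yellow sentence; 'before' and 'after' play
    the role of the blue ones and are None at the start or end of the text.
    """
    results = []
    for i, sentence in enumerate(sentences):
        if UNIT_PATTERN.search(sentence):
            before = sentences[i - 1] if i > 0 else None
            after = sentences[i + 1] if i + 1 < len(sentences) else None
            results.append((before, sentence, after))
    return results


if __name__ == "__main__":
    doc = [
        "Spectra were recorded at room temperature.",
        "The band at 1650 cm-1 is assigned to alpha-helix.",
        "This agrees with earlier work.",
    ]
    for before, hit, after in hits_with_context(doc):
        print("blue:  ", before)
        print("yellow:", hit)
        print("blue:  ", after)
```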

4. Why do we prefer HTML?
HTML is much easier to parse than PDF. PDF files have to be converted to text first, and the conversion can mangle special characters and layout. Your results are likely to be far easier to understand if you send us HTML files, because we can preserve more of the original formatting. This holds so long as the publisher doesn't use images to represent special characters, which can mess things up. Publishers no longer need to do that, so it's their fault, not ours. (top)

5. Why can't you accept a web address for a paper?
We considered implementing this. However, given the access controls publishers place on their papers, we didn't think it wise to let Manchester servers fetch them: this would let anyone from any institution access papers through Manchester's licence, which is bad. (top)

6. What are the additional terms I can add?
These are extra words that can help you qualify your search more closely. For example, you might want to look for sentences that contain "helix" in addition to the wavenumber unit. This would be handy if you wanted to search for band assignments that might have something to do with helices. Search terms are optional. (top)
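
One way to picture the extra terms is as a second filter applied on top of the unit check. The sketch below assumes that every term must appear and that matching is a case-insensitive substring test; neither detail is confirmed for the real tool, and the names are hypothetical.

```python
import re

# Assumed spellings of the wavenumber unit; illustrative only.
UNIT_PATTERN = re.compile(r"cm\s?(?:-1|−1|⁻¹|\^-1)")


def matches_query(sentence, extra_terms=()):
    """True if the sentence has a unit and contains every extra term.

    With no extra terms this reduces to the plain unit check, which
    mirrors the search terms being optional.
    """
    if not UNIT_PATTERN.search(sentence):
        return False
    lowered = sentence.lower()
    return all(term.lower() in lowered for term in extra_terms)


if __name__ == "__main__":
    helix = "The band at 1650 cm-1 is assigned to alpha-helix."
    sheet = "The band at 1245 cm-1 is assigned to beta-sheet."
    print(matches_query(helix, ["helix"]))
    print(matches_query(sheet, ["helix"]))
```

Under these assumptions "helix" would not match a sentence that only says "helices", so a shorter stem such as "heli" would cast a wider net.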

7. Why are the links dead in the HTML results?
We will try to activate them soon. (top)

8. Where are the images?
Images are replaced by the text [IMAGE FILE]. This is a temporary solution to the problem of special characters being represented using images. Using images for this isn't really necessary, as most characters can be represented in HTML and many publishers manage perfectly well without them. (top)
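
A minimal sketch of the substitution, assuming images arrive as ordinary HTML img tags and that a simple regular expression is good enough to find them (the parser's own method is not described here):

```python
import re

# Matches an <img ...> tag in any letter case. A regular expression is a
# simplification; it assumes no '>' appears inside attribute values.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def replace_images(html):
    """Swap every image tag for the literal placeholder text."""
    return IMG_TAG.sub("[IMAGE FILE]", html)


if __name__ == "__main__":
    snippet = 'a band at 1650 cm<img src="minus.gif" alt="-">1'
    print(replace_images(snippet))
```

The sample snippet shows why image characters are a nuisance: once the minus sign is an image, the unit no longer reads as plain text.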

9. How do you convert PDF files?
We use ps2ascii, part of the Ghostscript suite. As you can see, it isn't perfect and sometimes the results are a bit messy. Usually there is sufficient information to determine a band assignment, though HTML is preferable. (top)
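
For the curious, a conversion step of this kind could be driven from Python roughly as follows. This is a sketch that assumes ps2ascii is installed and on the PATH; the file names and function names are hypothetical, and the way the service itself invokes the converter is not documented here.

```python
import subprocess


def ps2ascii_command(pdf_path, txt_path):
    """Build the command line: ps2ascii takes an input file, then an output file."""
    return ["ps2ascii", pdf_path, txt_path]


def convert_pdf(pdf_path, txt_path):
    """Run the conversion; raises CalledProcessError if ps2ascii fails."""
    subprocess.run(ps2ascii_command(pdf_path, txt_path), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown without running the converter.
    print(" ".join(ps2ascii_command("paper.pdf", "paper.txt")))
```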

10. Does the PDF converter work a bit differently?
Well spotted! Once converted to text, the PDF file is scanned for the specified units, and the ~200 characters upstream and downstream of each hit are reported. This avoids some of the problems that arise when determining sentence boundaries. (top)
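
The character-window approach can be sketched like this. The 200-character width comes from the answer above; the unit spellings and function name are assumptions for the example, and overlapping windows are deliberately left unmerged.

```python
import re

# Assumed spellings of the wavenumber unit; illustrative only.
UNIT_PATTERN = re.compile(r"cm\s?(?:-1|−1|⁻¹|\^-1)")


def unit_windows(text, width=200):
    """Return the text surrounding each unit hit, 'width' characters each way.

    Windows are clipped at the start and end of the text. No attempt is
    made to find sentence boundaries, which is the point of the approach.
    """
    windows = []
    for match in UNIT_PATTERN.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        windows.append(text[start:end])
    return windows


if __name__ == "__main__":
    converted = "x" * 300 + " band at 1650 cm-1 assigned to helix " + "y" * 300
    for window in unit_windows(converted):
        print(len(window), window[:40] + "...")
```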