[.net] C# Improved Substring with Escape Characters

Started by
18 comments, last by jpetrie 13 years, 6 months ago
I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters. So I found a way to work around it. Here is the code I use to get a certain substring from HTML. Maybe this will also inspire Microsoft to put this code in their other form of substring.

        public string ImprovedSubstring(string fullString, string startString, string endString)
        {
            string theSubstring = "";
            int i1 = fullString.IndexOf(startString);
            fullString = fullString.Substring(i1);
            int i2 = fullString.IndexOf(endString);
            theSubstring = fullString.Remove(i2 + endString.Length);
            return theSubstring;
        }


LoL there is a plus sign in the last Remove method, some bug on gamedev doesn't show it in preview.

Thank you and hope this helps a little with escape chars.
Quote:
I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters

Are you sure? Which escape characters? Have you a small example, like a unit test, that demonstrates how it fails? You know that the function is actually substring(start, length)?

Your function doesn't appear to have any error checking in the case where the substrings are not located, nor does it handle the case where i2 < i1.
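For illustration, here is a minimal sketch of what those checks might look like. `ExtractInclusive` and `SubstringSketch` are hypothetical names, not from the original post, and returning null when a marker is missing is just one possible policy (throwing would be another). Note that it searches for the end marker only after the start marker, which differs slightly from the original function.

```csharp
using System;

public static class SubstringSketch
{
    // Hypothetical helper: returns the text from startString through endString
    // (inclusive), or null if either marker cannot be found.
    public static string ExtractInclusive(string fullString, string startString, string endString)
    {
        if (fullString == null || startString == null || endString == null)
            return null;

        int i1 = fullString.IndexOf(startString, StringComparison.Ordinal);
        if (i1 < 0)
            return null; // start marker not found

        // Search for the end marker only after the start marker,
        // so the end position can never precede the start position.
        int i2 = fullString.IndexOf(endString, i1 + startString.Length, StringComparison.Ordinal);
        if (i2 < 0)
            return null; // end marker not found after the start marker

        // Substring takes (startIndex, length), so convert the end position to a length.
        return fullString.Substring(i1, i2 + endString.Length - i1);
    }

    public static void Main()
    {
        string html = "<html><head><title>Whatever</title></head></html>";
        Console.WriteLine(ExtractInclusive(html, "<title>", "</title>")); // <title>Whatever</title>
        Console.WriteLine(ExtractInclusive(html, "<h1>", "</h1>") == null); // True
    }
}
```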
Quote:Original post by Flopid
I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters.


What do you mean? Last time I checked it handled them just fine. This is more likely to be a problem with your understanding than the standard library.
Sorry I did not write a more error proof code, just didn't know I need to, since it works if you know what you are doing. I thought some simple code would help people who struggle with this stuff. Also, I used substring(start, to) just to extract the part I need from the HTML code, and it always gave me an error saying that the index is out of range. It seems to ignore all the escape chars like \n, \t, \ etc. So it basically searches in a string that is smaller (i.e. without escape chars). However, Substring(start) does not ignore them. That is why I delete everything after a certain point. You could try that with some website's HTML code and you'll see.
You are completely incorrect. Substring does not ignore escape characters in either overload. More likely is the fact that you simply calculated the length incorrectly.
Mike Popoloski | Journal | SlimDX
Errrr, I am wrong... I put the index of the last element as the length parameter that Substring takes. Rycross was right, I didn't understand Substring completely. Thank you for correcting me.
Quote:
Sorry I did not write a more error proof code, just didn't know I need to, since it works if you know what you are doing.

So does Substring [grin]

But seriously, learning to spot odd boundary conditions is a good habit to get into, and a vital one if you are going to write generic functions like that, which will end up being used all over the place.

In general, if you think you've found a bug in something like System.String, you are almost certainly wrong. This code is independently tested in millions if not billions of lines of code all around the world. If Substring stopped working there would be a deluge of bug reports. Not to say it cannot happen, just that it is vanishingly unlikely.

If you do think you've found such a bug, you should write a minimal example which demonstrates it. Something like this:
    using System;

    public class Example
    {
        public static void Main(string[] args)
        {
            string html = "<html><head><title>Whatever</title></head><body>CONTENT</body></html>";
            int start = html.IndexOf("<title>");
            int end = html.IndexOf("</title>");
            // Expected "Whatever" but is "Whatever</title></head><bod"
            Console.WriteLine(html.Substring(start + "<title>".Length, end));
        }
    }

This immediately demonstrates that you are using the API incorrectly. In other examples, it will focus your mind on the problem and you might be able to solve it yourself. In the case where there is still an apparent bug, it gives us something to try out ourselves to confirm the behaviour diverges from the expected output.
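For comparison, a corrected version of the same test might look like the sketch below. The only real change is that the second argument to Substring is a length (end minus start) rather than an end index. `CorrectedExample` and `ExtractTitle` are illustrative names, and the sketch assumes both tags are present in the input.

```csharp
using System;

public class CorrectedExample
{
    // Illustrative helper: assumes "<title>" and "</title>" both occur in html.
    public static string ExtractTitle(string html)
    {
        int start = html.IndexOf("<title>", StringComparison.Ordinal) + "<title>".Length;
        int end = html.IndexOf("</title>", StringComparison.Ordinal);

        // Substring(startIndex, length): pass the length, not the end index.
        return html.Substring(start, end - start);
    }

    public static void Main(string[] args)
    {
        string html = "<html><head><title>Whatever</title></head><body>CONTENT</body></html>";
        Console.WriteLine(ExtractTitle(html)); // prints "Whatever"
    }
}
```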
Quote:
Errrr, I am wrong... I put the index of the last element as the length parameter that Substring takes. Rycross was right, I didn't understand Substring completely. Thank you for correcting me.

It was right there all along in the very first reply:
Quote:
You know that the function is actually substring(start, length)?
Quote:Original post by tinybronco
If the issue is \n \t - you shouldn't use those for newlines and tabs.

In the place of \n, concat Environment.NewLine
In the place of \t, concat (char)9

you can also try using ASCII code
\r \n would be 0x0D 0x0A
\t would be 9


I'm not really sure what you'd gain by manually inserting the actual characters in there. The compiler inserts the real values at compile time. The runtime doesn't actually parse escape sequences.
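A quick way to check this is to compare the escaped literals against the explicit character codes. This is only a small sketch, and `EscapeDemo` is an illustrative name. One caveat worth keeping in mind: Environment.NewLine is platform-dependent ("\r\n" on Windows, "\n" on Unix-like systems), so it is not a drop-in replacement for "\n".

```csharp
using System;

public static class EscapeDemo
{
    public static void Main()
    {
        // "\t" in source code is already the single tab character (code 9) in the compiled string.
        Console.WriteLine("\t" == ((char)9).ToString());              // True

        // "\r\n" is exactly the two characters 0x0D 0x0A.
        Console.WriteLine("\r\n" == new string(new[] { (char)0x0D, (char)0x0A })); // True

        // The escape counts as one character, not two.
        Console.WriteLine("a\tb".Length);                             // 3
    }
}
```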
Quote:Original post by tinybronco
Escape sequences work only in one direction. If you want to evaluate if a string has a newline in it, you can't go "if(x.indexOf("\n") != -1)" etc. For the sake of clarity and consistency, IMHO it's a better practice to always use char or Environment in the place of escape sequences; that way they appear the same way any place in your code.


You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
Console.WriteLine("blah\nblah".IndexOf("\n"));


Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.
Quote:Original post by Mike.Popoloski
You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
*** Source Snippet Removed ***

Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.

I think he meant: if you want to find the string literal \n you would need to search for \\n.

Although I don't know what he's getting at as the original issue was solved.
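If that is what was meant, the difference can be shown in a few lines. This is just a sketch with made-up variable names: "\n" is one newline character, while "\\n" (or the verbatim literal @"\n") is a backslash followed by the letter n.

```csharp
using System;

public static class LiteralBackslashDemo
{
    public static void Main()
    {
        string withNewline = "blah\nblah";      // 9 chars: contains one newline character
        string withBackslashN = "blah\\nblah";  // 10 chars: contains a backslash followed by 'n'

        Console.WriteLine(withNewline.IndexOf("\n", StringComparison.Ordinal));      // 4
        Console.WriteLine(withNewline.IndexOf("\\n", StringComparison.Ordinal));     // -1
        Console.WriteLine(withBackslashN.IndexOf("\\n", StringComparison.Ordinal));  // 4
        Console.WriteLine(withBackslashN.IndexOf(@"\n", StringComparison.Ordinal));  // 4
    }
}
```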

This topic is closed to new replies.