[.net] C# Improved Substring with Escape Characters

This topic is 2628 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters. So I found a way around it. Here is the code I use to get a certain substring from HTML. Maybe this will also prompt Microsoft to put this code in their other form of Substring.


public string ImprovedSubstring(string fullString, string startString, string endString)
{
    string theSubstring = "";

    // Drop everything before the first occurrence of startString.
    int i1 = fullString.IndexOf(startString);
    fullString = fullString.Substring(i1);

    // Cut everything after the first occurrence of endString.
    int i2 = fullString.IndexOf(endString);
    theSubstring = fullString.Remove(i2 + endString.Length);

    return theSubstring;
}



LoL there is a plus sign in the last Remove method, some bug on gamedev doesn't show it in preview.

Thank you and hope this helps a little with escape chars.

Quote:

I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters

Are you sure? Which escape characters? Have you a small example, like a unit test, that demonstrates how it fails? You know that the function is actually substring(start, length)?

Your function doesn't appear to have any error checking in the case where the substrings are not located, nor does it handle the case where i2 < i1.
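
For what it's worth, a more defensive version might look something like the sketch below. The name TryExtractBetween, the class names, and the choice to return false rather than throw are all my own assumptions, not anything from the original post. It also searches for the end marker only after the start marker, which differs slightly from the original.

```csharp
using System;

public static class SubstringHelpers
{
    // Hypothetical helper (not from the original post): finds the first
    // startString, then the first endString after it, and returns the text
    // from the start marker through the end marker, inclusive.
    public static bool TryExtractBetween(string fullString, string startString,
                                         string endString, out string result)
    {
        result = null;
        if (string.IsNullOrEmpty(fullString) || string.IsNullOrEmpty(startString)
            || string.IsNullOrEmpty(endString))
        {
            return false;
        }

        int i1 = fullString.IndexOf(startString, StringComparison.Ordinal);
        if (i1 < 0) return false; // start marker not found

        // Look for the end marker only after the start marker.
        int i2 = fullString.IndexOf(endString, i1 + startString.Length,
                                    StringComparison.Ordinal);
        if (i2 < 0) return false; // end marker not found

        // Substring takes (startIndex, length), not (startIndex, endIndex).
        result = fullString.Substring(i1, i2 + endString.Length - i1);
        return true;
    }
}

public static class Demo
{
    public static void Main()
    {
        string html = "<p>one</p><b>two</b>";
        string found;
        if (SubstringHelpers.TryExtractBetween(html, "<b>", "</b>", out found))
        {
            Console.WriteLine(found); // <b>two</b>
        }
        // A missing marker is reported instead of throwing.
        Console.WriteLine(SubstringHelpers.TryExtractBetween(html, "<i>", "</i>", out found)); // False
    }
}
```

Whether a Try-style method or an exception is the better fit depends on how the caller wants to handle missing markers.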

Quote:
Original post by Flopid
I've been trying to use the substring(start, end) function on HTML code, and it doesn't work right because it seems as if this particular function doesn't count the escape characters.


What do you mean? Last time I checked it handled them just fine. This is more likely to be a problem with your understanding than the standard library.

Sorry I did not write a more error proof code, just didn't know I need to, since it works if you know what you are doing. I thought some simple code would help people who struggle with this stuff. Also, I used Substring(start, to) just to extract the part I need from the HTML code, and it always gave me an error saying that the index is out of range. It seems to ignore all the escape chars like \n, \t, etc. So it basically searches in a string that is smaller (i.e. without escape chars). However, Substring(start) does not ignore them. That is why I delete everything after a certain point. You could try that with some website's HTML code and you'll see.

Errrr, I am wrong... I put the index of the last element as the length parameter which Substring takes. Rycross was right, I didn't understand Substring completely. Thank you for correcting me.

Quote:

Sorry I did not write a more error proof code, just didn't know I need to, since it works if you know what you are doing.

So does Substring [grin]

But seriously, learning to spot odd boundary conditions is a good habit to get into, and a vital one if you are going to write generic functions like that, which will end up being used all over the place.

In general, if you think you've found a bug in something like System.String, you are almost certainly wrong. This code is independently tested in millions if not billions of lines of code all around the world. If Substring stopped working there would be a deluge of bug reports. Not to say it cannot happen, just that it is vanishingly unlikely.

If you do think you've found such a bug, you should write a minimal example which demonstrates it. Something like this:


using System;

public class Example
{
    public static void Main(string[] args)
    {
        string html = "<html><head><title>Whatever</title></head><body>CONTENT</body></html>";
        int start = html.IndexOf("<title>");
        int end = html.IndexOf("</title>");
        // Expected "Whatever" but is "Whatever</title></head><bod"
        Console.WriteLine(html.Substring(start + "<title>".Length, end));
    }
}


This immediately demonstrates that you are using the API incorrectly. In other examples, it will focus your mind on the problem and you might be able to solve it yourself. In the case where there is still an apparent bug, it gives us something to try out ourselves to confirm the behaviour diverges from the expected output.
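
For comparison, here is a minimal sketch of the same example with the second argument passed as a length rather than an end index. The class name FixedExample and the local names are just illustrative, and it assumes both tags are present (no check for IndexOf returning -1).

```csharp
using System;

public class FixedExample
{
    public static void Main(string[] args)
    {
        string html = "<html><head><title>Whatever</title></head><body>CONTENT</body></html>";
        const string openTag = "<title>";

        // Index of the first character after the opening tag.
        int contentStart = html.IndexOf(openTag, StringComparison.Ordinal) + openTag.Length;
        // Index of the closing tag, searched for after the content start.
        int end = html.IndexOf("</title>", contentStart, StringComparison.Ordinal);

        // Substring(startIndex, length): the length is end minus start.
        Console.WriteLine(html.Substring(contentStart, end - contentStart)); // Whatever
    }
}
```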
Quote:

Errrr, I am wrong... I put the index of the last element as the length parameter which Substring takes. Rycross was right, I didn't understand Substring completely. Thank you for correcting me.

It was right there all along in the very first reply:
Quote:

You know that the function is actually substring(start, length)?

Quote:
Original post by tinybronco
If the issue is \n \t - you shouldn't use those for newlines and tabs.

In the place of \n, concat Environment.NewLine
In the place of \t, concat (char)9

you can also try using ASCII code
\r \n would be 0x0D 0x0A
\t would be 9


I'm not really sure what you'd really gain by manually inserting the actual characters in there. The compiler inserts the real values at compile time. The runtime doesn't actually parse escape sequences.

Quote:
Original post by tinybronco
Quote:
Original post by Flimflam
Quote:
Original post by tinybronco
If the issue is \n \t - you shouldn't use those for newlines and tabs.

In the place of \n, concat Environment.NewLine
In the place of \t, concat (char)9

you can also try using ASCII code
\r \n would be 0x0D 0x0A
\t would be 9


I'm not really sure what you'd really gain by manually inserting the actual characters in there. The compiler inserts the real values at compile time. The runtime doesn't actually parse escape sequences.


Escape sequences work only in one direction. If you want to evaluate if a string has a newline in it, you can't go "if(x.indexOf("\n") != -1)" etc. For the sake of clarity and consistency, IMHO it's a better practice to always use char or Environment in place of escape sequences, so that they appear the same way any place in your code.


You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:

Console.WriteLine("blah\nblah".IndexOf("\n"));



Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.

Quote:
Original post by Mike.Popoloski
You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
*** Source Snippet Removed ***

Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.

I think he meant: if you want to find the string literal \n you would need to search for \\n.
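
To make that distinction concrete, here is a small sketch (the class and variable names are made up for illustration). "\n" in source is a single line-feed character, while "\\n" and the verbatim literal @"\n" are two characters: a backslash followed by the letter n. StringComparison.Ordinal is passed so the searches are plain character comparisons, independent of culture.

```csharp
using System;

public class EscapeDemo
{
    public static void Main()
    {
        string withNewline = "blah\nblah";     // 9 chars, contains a real line feed
        string withBackslashN = "blah\\nblah"; // 10 chars: backslash followed by 'n'
        string verbatim = @"blah\nblah";       // the same 10 chars, written verbatim

        Console.WriteLine(withNewline.Length);         // 9
        Console.WriteLine(withBackslashN.Length);      // 10
        Console.WriteLine(withBackslashN == verbatim); // True

        // Searching for the line feed versus the two-character text.
        Console.WriteLine(withNewline.IndexOf("\n", StringComparison.Ordinal));     // 4
        Console.WriteLine(withNewline.IndexOf("\\n", StringComparison.Ordinal));    // -1
        Console.WriteLine(withBackslashN.IndexOf("\\n", StringComparison.Ordinal)); // 4
    }
}
```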

Although I don't know what he's getting at as the original issue was solved.

Sure, you can't use "/n", but you can certainly use "\n", which is the proper escape sequence. Seriously, did you even try it? There's no difference between the escaped sequence and the actual character code. On most systems Environment.NewLine is simply going to either be "\n" or "\r\n" anyway.
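
A quick way to check this yourself is something like the following sketch (the class name NewLineCheck is arbitrary). The last line assumes a typical Windows or Unix-like platform, so treat that result as an expectation rather than a guarantee.

```csharp
using System;

public class NewLineCheck
{
    public static void Main()
    {
        // The escape sequence and the numeric character code are the same char.
        Console.WriteLine('\n' == (char)10); // True
        Console.WriteLine('\r' == (char)13); // True
        Console.WriteLine('\t' == (char)9);  // True

        // Environment.NewLine is platform-dependent; on common platforms it is
        // one of these two strings.
        bool isKnown = Environment.NewLine == "\n" || Environment.NewLine == "\r\n";
        Console.WriteLine(isKnown);
    }
}
```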

Quote:
Original post by tinybronco
For the sake of clarity and consistency, IMHO it's a better practice to always use char or Environment in place of escape sequences, so that they appear the same way any place in your code.


I'm confused. How is (char)9 more clearly a tab than '\t' ?

I'm also confused how it's more consistent. Or do you use (char)97 instead of 'a' as well?

I think you're confusing two issues here, and also have a misunderstanding about escape sequences in general. First of all, you cannot say that escape sequences "don't work". They have worked, they do work, and will continue to work perfectly fine in all cases. They simply tell the compiler to replace a given escape sequence with a predetermined character code.

Now, what a given control or class decides to do with a certain character code is completely orthogonal to the issue of escape sequences. You could write a control that uses the character 'A' to indicate a newline. You can't say that escape sequences don't work, because the issue has nothing to do with escape sequences. Instead, you could say that "this control only recognizes newlines as \r".

I'm thinking somebody once told you that using Environment.NewLine is a better practice in general than using raw character codes like '\n' (which is certainly true). I'm also thinking you took this the wrong way, and assumed that all escape sequences are better off not used, which is patently false. This is evidenced by your advice of using (char)9 instead of '\t', which is hilariously dumb, since the compiler is going to translate '\t' into (char)9 anyway. Don't believe me? Try running the following program and looking at the results:


Console.WriteLine((int)'\t');
Console.WriteLine((int)((char)9));

Console.WriteLine("'\t'");
Console.WriteLine("'" + (char)9 + "'");



So the real issue here is that "some controls don't accept \n as a newline", not that "Escape sequences don't work in many cases that apply in that scenario." In light of the original topic, that makes your input irrelevant at best, and downright misleading in places.

Quote:
Original post by Mike.Popoloski
You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
*** Source Snippet Removed ***

Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.


I think I got caught in the crossfire here unless you weren't referring to me when you said "both of you". I said basically the same thing you did. There's no reason to use straight chars over escape sequences because they're changed at compile time.

Quote:
Original post by tinybronco
I was very clear when I said "For the sake of clarity and consistency, IMHO it's a better practice to always use char or Environment in place of escape sequences, so that they appear the same way any place in your code."


So you do prefer (char)97 to 'a' ?
I still don't get how this improves clarity.

Quote:
I very specifically said it was because in certain cases (not all, but certain) using an escape sequence for a newline just doesn't work in the way you would want it to.

Replacing those escapes with (char)-cast numbers will never help. You're getting "reamed" because it's at best an irrelevant, unclear, and confusing addition to your post -- and at worst, a blatantly bad idea. We're just addressing and focusing on the confusion to better clarify the whole.

Quote:
So, I'll say it again. Why escape a newline in some places, and use Environment.NewLine in others when it is just more consistent to always use Environment.NewLine?

Convenience. Granted, that's always a terrible reason, so I'll give another: especially with internet protocols, there are situations where you do not care what the local newline convention is. You care about the convention the protocol uses. For example, in HTTP requests, or when implementing the IRC protocol, I want to use "\r\n", because Environment.NewLine will be wrong, and a bug, on Unix platforms.

It's also quite likely that your software may need to read and handle multiple newline conventions, a situation for which Environment.NewLine is also quite useless. Consistency is great -- but Environment.NewLine alone can't make newline handling consistent.
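
As a rough illustration of both points, here is a sketch. SplitLines is a hypothetical helper name, and example.com is just a placeholder host. The request is built with explicit "\r\n" because the protocol dictates it, and the reader accepts all three common newline conventions regardless of what the local platform uses.

```csharp
using System;

public class NewlineDemo
{
    // Hypothetical helper: splits text on any of the three common newline
    // conventions. "\r\n" is listed first so a CRLF pair counts as a single
    // separator rather than producing an empty line between "\r" and "\n".
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }

    public static void Main()
    {
        // Protocol text: the line ending is fixed by the protocol, so it is
        // written out explicitly instead of using Environment.NewLine.
        string request = "GET / HTTP/1.1\r\n" +
                         "Host: example.com\r\n" +
                         "\r\n";
        Console.WriteLine(request.EndsWith("\r\n\r\n", StringComparison.Ordinal)); // True

        // Input with mixed conventions: LF, CRLF, and a lone CR.
        string mixed = "unix\nwindows\r\nold mac\rlast";
        string[] lines = SplitLines(mixed);
        Console.WriteLine(lines.Length); // 4
        Console.WriteLine(lines[1]);     // windows
    }
}
```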

Personally I find using escape sequences a lot easier than Environment. It's just less typing. Maybe it is more useful if you are writing a word processor or something of that sort. Then again you could do something like this if you want to add a tab, for example, when you hit a tab key.

string c = "word processor text";
if (key == tab)
{
    c += "\t";
}

Some people say one way of coding is clearer, but I think it is just a matter of style or personal preference. It really depends on whether you are working with a team or not, since then everyone can settle on one style.

Quote:
Original post by Flimflam
Quote:
Original post by Mike.Popoloski
You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
*** Source Snippet Removed ***

Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.


I think I got caught in the crossfire here unless you weren't referring to me when you said "both of you". I said basically the same thing you did. There's no reason to use straight chars over escape sequences because they're changed at compile time.


Nah, I said both (you and the OP), referring to tinybronco and the original poster. No "of" in there [grin]

Quote:
Original post by Mike.Popoloski
Quote:
Original post by Flimflam
Quote:
Original post by Mike.Popoloski
You are completely mistaken as well. You can most certainly check the index of escape characters in C# code:
*** Source Snippet Removed ***

Outputs 4, as expected. I suggest that both you and the original poster read up on escape sequences; they are simply textual replacements at compile time.


I think I got caught in the crossfire here unless you weren't referring to me when you said "both of you". I said basically the same thing you did. There's no reason to use straight chars over escape sequences because they're changed at compile time.


Nah, I said both (you and the OP), referring to tinybronco and the original poster. No "of" in there [grin]


Haha, my mistake. I entirely misread that :)

N.B. that the user "tinybronco" quoted frequently in this thread has edited the content of his/her original posts into nothingness and/or deleted them. That is why the discussion seems so fragmented.
