DSA Tutorial
🔍

How Strings Work in Memory

Why String Memory Matters

You can write string code without understanding memory. Most beginners do. But without this understanding, you will hit confusing bugs:

  • Why does s1 == s2 return false in Java even when both strings contain "hello"?
  • Why does concatenating strings in a loop suddenly make your program 100x slower at large n?
  • Why can you change arr[0] in an array but not s[0] in a string?
  • Why does Python sometimes tell you two separate string objects are the same object?

These are not random quirks. They all have the same root cause — how strings are physically stored in memory and the design decisions each language made about that storage.

String Storage: Character Array Under the Hood

At the physical level, a string is a contiguous block of memory holding character values one after another — exactly like an array of characters.

String "hello" stored in memory (UTF-16, 2 bytes per char in Java):

Address:   1000  1002  1004  1006  1008
Value:      104   101   108   108   111
Character: 'h'   'e'   'l'   'l'   'o'

address of char[i] = base_address + (i × bytes_per_char)
char[0] = address 1000 → 'h'
char[2] = address 1004 → 'l'

This layout is why character access is O(1) — the same address formula from arrays applies. Knowing the base address and the index is enough to jump directly to any character.

The difference from a plain array: languages wrap this memory block in additional metadata and enforce rules about how it can be modified.

Why Strings Are Immutable in Java, Python, and JavaScript

The decision to make strings immutable is a deliberate design choice, not a limitation. Three reasons drive it:

Safety. If you pass a string to a function and the function could modify it, you can never be certain what the string contains after the call. Immutability guarantees that the string you read is the string that was passed — no hidden mutation.

Hashing. Strings are used as hash map keys constantly. If a string could change after being inserted, its hash value would change, making it unfindable in the map. Immutability guarantees a stable hash.

Sharing (interning). If a string can never change, multiple variables pointing to the same string object are always safe. No one can modify the object through one reference and break another. This enables the string pool optimization described below.

Immutability means:

  String s = "hello";
  s.charAt(0) = 'H';     ← illegal in Java — charAt() returns a value, not a slot

What actually happens when you "change" a string:

  s = s.toUpperCase();   ← creates a NEW string "HELLO", s now points to it
                           the original "hello" object is unchanged
                           if nothing else references "hello", it is garbage collected

  Memory before:        Memory after:
  [s] → "hello"         [s] → "HELLO"
                              "hello" (unreachable, eligible for GC)

C++ std::string made the opposite choice — mutable — for maximum performance control. This is why C++ does not have a string pool and why you can write s[0] = 'H' in C++.

The String Pool (Java)

Java maintains a special region of memory called the String Pool (also called the String Intern Pool or String Constant Pool) inside the heap.

When you write a string literal in Java source code, the JVM checks the pool first:

  • If that exact string already exists in the pool, return a reference to the existing object
  • If not, create a new string object in the pool and return a reference to it
String a = "hello";   ← JVM checks pool: "hello" not found → create, add to pool
String b = "hello";   ← JVM checks pool: "hello" found → return same reference

a and b point to the SAME object in memory.

String c = new String("hello");  ← bypasses the pool entirely
                                    creates a NEW object on the heap (not the pool)
                                    c points to a different object than a and b

This is why == and .equals() give different results in Java:

Java
1public class StringPool { 2 3 public static void main(String[] args) { 4 // String literals — may share the same object from the pool 5 String a = "hello"; 6 String b = "hello"; 7 8 // == checks reference equality (same object in memory) 9 System.out.println("a == b: " + (a == b)); // true — same pool object 10 System.out.println("a.equals(b): " + a.equals(b)); // true — same content 11 12 // new String() bypasses the pool — always creates a new heap object 13 String c = new String("hello"); 14 15 System.out.println("a == c: " + (a == c)); // false — different objects 16 System.out.println("a.equals(c): " + a.equals(c)); // true — same content 17 18 // intern() forces pool lookup — returns the pool version 19 String d = c.intern(); 20 System.out.println("a == d: " + (a == d)); // true — d is pool version 21 System.out.println("a.equals(d): " + a.equals(d)); // true 22 23 // RULE: Always use .equals() for string content comparison in Java 24 // == for strings is almost always wrong 25 } 26}
Output (Java):
a == b:          true
a.equals(b):     true
a == c:          false
a.equals(c):     true
a == d:          true
a.equals(d):     true

Why Concatenation in a Loop Is O(n²)

This is the most important performance fact about strings. In Java (and to a lesser extent Python and JavaScript), each + operation creates a brand new string object. The old strings are not modified — they cannot be, because they are immutable.

Building "abcde" by concatenation:

Step 1: s = "" + "a"    → allocate "a"           (1 char)
Step 2: s = "a" + "b"   → allocate "ab"           (2 chars, copy "a" + add "b")
Step 3: s = "ab" + "c"  → allocate "abc"          (3 chars, copy "ab" + add "c")
Step 4: s = "abc" + "d" → allocate "abcd"         (4 chars, copy "abc" + add "d")
Step 5: s = "abcd" + "e"→ allocate "abcde"        (5 chars, copy "abcd" + add "e")

Characters copied: 0 + 1 + 2 + 3 + 4 = 10 = n(n-1)/2 ≈ O(n²)

For n=1000: 499,500 character copies
For n=10,000: 49,995,000 character copies
For n=100,000: ~5 billion character copies → program becomes unusably slow

The fix is to use a mutable buffer that amortizes copying — exactly the dynamic array strategy applied to strings.

Java
1public class ConcatenationCost { 2 3 // BAD: O(n²) — creates a new String object at every + operation 4 public static String buildBad(int n) { 5 String s = ""; 6 for (int i = 0; i < n; i++) { 7 s = s + (char)('a' + (i % 26)); // NEW string created each iteration 8 } 9 return s; 10 } 11 12 // GOOD: O(n) — StringBuilder uses a mutable char array internally 13 public static String buildGood(int n) { 14 StringBuilder sb = new StringBuilder(); 15 for (int i = 0; i < n; i++) { 16 sb.append((char)('a' + (i % 26))); // Appends to internal buffer 17 } 18 return sb.toString(); // ONE final string creation 19 } 20 21 public static void main(String[] args) { 22 int n = 50_000; 23 24 long start1 = System.nanoTime(); 25 String bad = buildBad(n); 26 long time1 = System.nanoTime() - start1; 27 28 long start2 = System.nanoTime(); 29 String good = buildGood(n); 30 long time2 = System.nanoTime() - start2; 31 32 System.out.println("Bad (+ in loop): " + (time1 / 1_000_000) + " ms length=" + bad.length()); 33 System.out.println("Good (StringBuilder): " + (time2 / 1_000_000) + " ms length=" + good.length()); 34 // Bad may take seconds; Good takes milliseconds — same result 35 } 36}
Output (approximate):
Bad  (+ in loop):    2400 ms  length=50000
Good (StringBuilder):   3 ms  length=50000

Why StringBuilder Is Fast

StringBuilder (Java) and similar mutable buffers use the same doubling strategy as dynamic arrays:

StringBuilder internal buffer:

Initial capacity: 16 characters
  Buffer: [_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]

After 10 appends:
  Buffer: [a, b, c, d, e, f, g, h, i, j, _, _, _, _, _, _]
  size=10, capacity=16

After 16 appends — RESIZE:
  Allocate new buffer of size 32
  Copy 16 existing chars
  Continue appending

After n appends: O(n) total work (geometric copy sum)
toString(): ONE O(n) copy to create the final immutable String
Total: O(n) vs O(n²) for repeated +

Encoding: How Characters Become Bytes

A string in memory is ultimately a sequence of bytes. The encoding determines how many bytes represent each character.

ASCII — 1 byte per character, covers only 128 characters (English alphabet, digits, punctuation).

UTF-8 — 1 to 4 bytes per character. ASCII characters use 1 byte (backward compatible). Common in files, network protocols, Python source code.

UTF-16 — 2 bytes per character for most characters, 4 bytes for rare ones. Used internally by Java and JavaScript strings.

UTF-32 — 4 bytes per character. Fixed-width, simpler but wasteful.

Character encoding sizes:

"hello" in UTF-8:   5 bytes   (5 × 1 byte each — all ASCII)
"hello" in UTF-16: 10 bytes   (5 × 2 bytes each)
"hello" in UTF-32: 20 bytes   (5 × 4 bytes each)

"café" in UTF-8:    5 bytes   (4 chars but 'é' uses 2 bytes)
"café" in UTF-16:   8 bytes   (4 × 2 bytes each)

"😀" in UTF-8:      4 bytes   (emoji needs 4 bytes in UTF-8)
"😀" in UTF-16:     4 bytes   (needs a surrogate pair — two 2-byte units)

For interview problems, this is usually irrelevant — problems assume ASCII input unless stated otherwise. But it explains why s.length() in Java returns 2 for an emoji (two UTF-16 code units), and why you should use s.codePointCount(0, s.length()) for the true character count of Unicode strings.

String Interning in Python

Python interns strings automatically under certain conditions. Understanding when helps you predict is behavior.

Python interns:
  - String literals in source code
  - Strings that look like valid identifiers (only letters, digits, underscores)
    and are short enough (implementation-defined threshold)
  - Strings explicitly interned with sys.intern()

Python does NOT automatically intern:
  - Strings created by runtime concatenation
  - Strings with spaces or special characters
  - Very long strings

Examples:
  "hello" is "hello"         → True  (literals, interned)
  "hello world" is "hello world" → False (contains space, not interned)
  ("hel" + "lo") is "hello"  → True  (compile-time constant folding)
  x = "hel"; (x + "lo") is "hello" → False (runtime concatenation)

The practical rule: never use is to compare string content in Python. It works sometimes but fails on dynamically created strings. Always use ==.

Memory Cost of Strings

Understanding the memory overhead per string matters when you store large numbers of strings in memory — common in interview problems involving hash maps or frequency counts.

Java String object memory layout (approximate, JVM-specific):
  Object header:          16 bytes
  char[] reference:        8 bytes (pointer to char array)
  int hash (cached):       4 bytes
  char[] array header:    16 bytes
  char[] content:       2 × n bytes (UTF-16, n = length)

  Minimum overhead for an empty String: ~40 bytes
  A string of n characters: ~40 + 2n bytes

Python str object memory layout (CPython, approximate):
  Object header + metadata:  49 bytes
  Characters (Latin-1):     1 × n bytes for ASCII strings
  Characters (UCS-2):       2 × n bytes for strings with non-ASCII
  Characters (UCS-4):       4 × n bytes for strings with high code points

  A short ASCII string of n characters: ~49 + n bytes

C++ std::string (typical small string optimization — SSO):
  Stack storage (SSO buffer):  15 characters fit inline, no heap allocation
  For strings > 15 chars:      heap allocation + pointer + size + capacity
  Minimum size:               32 bytes on 64-bit systems

JavaScript:
  Engine-specific — V8 uses various internal representations
  Short strings are often stored inline; longer strings on the heap

For most interview problems this level of detail is unnecessary. What matters: a string of n characters costs O(n) memory — this is reflected in space complexity analysis whenever you create a new string from parts of another.

Comparing Strings: What Actually Happens

String comparison checks characters one by one from left to right until a difference is found or one string ends.

Compare "apple" and "application":

  a == a → continue
  p == p → continue
  p == p → continue
  l == l → continue
  e vs i → 'e' (101) < 'i' (105) → "apple" < "application"

  Steps taken: 5 (up to the first difference)
  Worst case: O(min(len1, len2))
  Best case: O(1) (first characters differ)

This is why string comparison is O(n) in the worst case — not O(1) like integer comparison. Sorting a list of strings costs O(k × n log n) where k is the average string length, not just O(n log n).

Java
1public class StringComparison { 2 3 public static void main(String[] args) { 4 String a = "apple"; 5 String b = "application"; 6 String c = "apple"; 7 8 // equals() — content comparison, O(n) worst case 9 System.out.println("a.equals(c): " + a.equals(c)); // true 10 11 // compareTo() — lexicographic, returns negative/0/positive 12 System.out.println("a.compareTo(b): " + a.compareTo(b)); // negative ('e' < 'i') 13 System.out.println("b.compareTo(a): " + b.compareTo(a)); // positive 14 15 // equalsIgnoreCase() — case-insensitive content comparison 16 System.out.println("APPLE equals apple: " + "APPLE".equalsIgnoreCase("apple")); // true 17 18 // startsWith and endsWith 19 System.out.println("starts 'app': " + a.startsWith("app")); // true 20 System.out.println("ends 'tion': " + b.endsWith("tion")); // true 21 } 22}
Output:
a.equals(c):      true
a.compareTo(b):   negative
a < b:            true
starts 'app':     true
ends 'tion':      true

The Cost of Substring Operations

Creating a substring always involves allocating new memory and copying characters — O(k) where k is the length of the substring.

Original string "Hello, World!" — 13 characters — at address 1000

s.substring(7, 12) in Java → "World"

  Allocate new string object
  Copy 5 characters ('W','o','r','l','d') to new memory
  Cost: O(5) = O(k) where k = 12 - 7 = 5

This is NOT free. Creating n substrings of average length k costs O(n × k) total.
For string problems involving many substrings, this adds up.

Example trap:
  for i in range(n):
      for j in range(i, n):
          s[i:j+1]  ← creates O(n²) substrings of average length O(n)
                       total cost: O(n³) — not O(n²) as it appears

In Python, string slicing s[i:j] creates a new string every time. In Java, s.substring(i, j) creates a new String object every time. Neither is a view into the original — they are independent copies.

This is why algorithms that avoid creating substrings (working with indices instead) are more memory-efficient and often faster.

Interview Questions

Q: Why should you use .equals() instead of == to compare strings in Java?

== checks reference equality — whether two variables point to the same object in memory. String literals may share pool references, making == accidentally work in simple cases. But new String("hello") creates a new heap object, making == return false even when content is identical. .equals() always checks content. In interviews and production code, always use .equals() for content comparison.

Q: Why is building a string with + in a loop O(n²) in Java?

Because Java strings are immutable. Each + creates a brand new String object containing all previous characters plus the new one. Building a string of n characters this way performs 0 + 1 + 2 + ... + (n-1) = n(n-1)/2 total character copies — O(n²). StringBuilder avoids this by maintaining a mutable char array that uses the doubling strategy, making n appends O(n) total.

Q: What is string interning and when does it matter?

String interning is the practice of maintaining a pool of unique strings and reusing existing objects when the same content is needed. Java interns all string literals automatically. Python interns short identifier-like strings. Interning saves memory when many variables hold equal strings — they share one object instead of each having a copy. It matters for performance in code that creates many equal strings, and it matters for understanding why == comparisons between literals can return true.

Q: How long does it take to compare two strings?

O(min(len1, len2)) in the worst case — comparison goes character by character until a difference is found or one string ends. If the first characters differ, it is O(1). Strings that share a long common prefix cost more to compare. This is why sorting a list of strings is O(k × n log n), not O(n log n), where k is the average string length.

FAQs

If Python strings are immutable, how does s += "x" work?

s += "x" is syntactic sugar for s = s + "x". It creates a brand new string containing the concatenated content and rebinds the name s to the new object. The original string is not modified. If nothing else references the original, it becomes eligible for garbage collection. CPython has a special optimization for += in simple cases, but the semantic model is always "create a new string."

Does C++ std::string being mutable cause any problems?

Mutability gives C++ maximum flexibility but removes the safety guarantees Java and Python have. In C++, if two std::string variables both reference logically the same string, modifying one does not affect the other — they are separate objects. There is no sharing/interning in std::string. The tradeoff: no accidental aliasing bugs, but no memory savings from sharing either. Copy-on-write optimization was removed from the C++11 standard due to threading complexity.

Why does String.length() in Java return 2 for some emoji?

Java strings are UTF-16 encoded. Most common characters fit in one 16-bit code unit. Characters outside the Basic Multilingual Plane (code points > 65535) — including most emoji — require two 16-bit code units called a surrogate pair. length() counts code units, not characters. For the true character count, use s.codePointCount(0, s.length()). This is an edge case in most interview problems, but it matters for Unicode-aware applications.

Is a substring always a copy in Java, or can it share memory?

In modern Java (Java 7 update 6 and later), substring() always creates an independent copy. Older Java versions shared the underlying char array (just adjusting offset and length pointers), which avoided copying but caused memory leaks — a small substring would keep a large original string alive. The current behavior (always copy) uses more memory for individual substrings but is simpler and avoids the memory leak problem.

Quick Quiz

Question 1: In Java, String a = "hello"; String b = "hello"; System.out.println(a == b); prints:

  • A) false — always use .equals()
  • B) true — literals share the same pool object
  • C) Compilation error
  • D) Runtime exception

Answer: B) true — literals share the same pool object. Java interns string literals, so both a and b point to the same object in the String Pool. However, this behavior is an implementation detail — never rely on == for content comparison. Always use .equals().

Question 2: What is the time complexity of building a string by concatenating n single characters using + in a Java loop?

  • A) O(n)
  • B) O(n log n)
  • C) O(n²)
  • D) O(1)

Answer: C) O(n²). Each + copies all previous characters into a new String. The i-th iteration copies i-1 characters. Total copies: 0 + 1 + 2 + ... + (n-1) = n(n-1)/2 = O(n²). Use StringBuilder to achieve O(n).

Question 3: A Python function receives a string and does s += "x" inside it. After the function returns, has the caller's string changed?

  • A) Yes — strings are passed by reference in Python
  • B) No — += creates a new string object and rebinds the local variable
  • C) Yes — Python strings are mutable within the same scope
  • D) Depends on the string's length

Answer: B) No. Python strings are immutable. s += "x" creates a new string and rebinds the local variable s to it. The caller's string is unaffected. This is different from passing a list — lists are mutable, so lst.append(x) inside a function does modify the caller's list.

Question 4: Why does creating substrings inside a double loop make the algorithm O(n³) rather than O(n²)?

  • A) Substrings are created using nested loops, which multiplies complexity
  • B) Each substring creation copies up to O(n) characters, adding an extra O(n) factor to the O(n²) pairs
  • C) Substrings require hashing, which is O(n) per operation
  • D) Modern languages cache substrings, so it is actually still O(n²)

Answer: B) Each substring copy is O(k) where k is the substring length. An O(n²) double loop generates O(n²) pairs. Creating a substring for each pair costs O(average length) = O(n). Total: O(n²) pairs × O(n) per copy = O(n³). Working with index pairs (i, j) instead of actual substrings avoids this cost.

Summary

Strings are contiguous character arrays in memory, wrapped with rules that vary by language. Java, Python, and JavaScript make strings immutable — a design choice enabling safe sharing, stable hashing, and string interning. C++ makes strings mutable for maximum performance control.

The key ideas to carry forward:

  • String access is O(1) — same address formula as arrays: base + (i × bytes_per_char)
  • Immutability means every "modification" creates a new string — the original is unchanged
  • Java's String Pool interns literals — "hello" == "hello" is true for literals but false for new String("hello") — always use .equals()
  • + concatenation in a loop is O(n²) in Java — use StringBuilder for O(n)
  • Python's "".join(list) is O(n) — always prefer join over + in loops
  • C++ += on std::string is amortized O(1) — same as vector push_back
  • Substring creation is always O(k) — it copies k characters into new memory
  • String comparison is O(min(len1, len2)) worst case — not O(1) like integer comparison
  • is in Python checks identity, == checks content — always use == for string comparison

In the next topic, you will explore String Traversal — visiting every character in a string systematically, with the patterns and techniques that appear most often in interview problems.

How Strings Work in Memory